Fine-Tuning LLMs to Improve Adverse Drug Event Detection and Reporting
April 7, 2025
The prompt, accurate detection of the severity of adverse drug events (ADEs) remains a serious challenge. Despite technological advancements, data-driven models often fail to identify ADEs precisely, owing to the complexity and inconsistency of patient responses to medications, the range and subtle nature of adverse reactions, and the underreporting of ADEs by health care providers.
Considering these and other challenges, Westat took the initiative in late 2024 to develop a pilot project to find solutions. Sean Chickery, DHSc, a Principal Research Associate for Clinical Research, led the effort and collaborated with data scientists led by Gizem Korkmaz, PhD, Vice President of Data Science and AI for Statistics and Data Science, and Marcelo Simas, PhD, Associate Vice President for Technology and Digital Solutions. Here, Chickery explains the tool and methodology used to develop a proof of concept focused on classifying the severity of adverse drug events and how the team measured success.
Q. What are some other challenges in identifying ADEs and what was Westat’s solution?
A. Medical reports on ADEs are often incomplete or inconsistent, and an enormous amount of unstructured data can conceal signals of an event. These data include millions of rows of adverse events, each containing up to 30 fields of information. Because adverse events can be entered by anyone from medical professionals to consumers, the information is often fragmented.
Westat has extremely talented data scientists and programmers who excel at solving important research questions like this using cutting-edge technology. We developed a proof of concept to investigate fine-tuning large language models (LLMs) for ADE detection, using a subset of drugs with adverse events (e.g., antibiotics, anticoagulants, or diabetes medications) and a small dataset from an established adverse events database. Our pilot project was aimed at classifying the severity of adverse events and identifying whether each event resulted in death.
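A setup like this typically turns each adverse event record into a short text paired with a binary label (death vs. non-death) before any model sees it. Below is a minimal Python sketch of that framing; the field names and example record are hypothetical placeholders, not the actual database schema used in the pilot.

```python
# Minimal sketch: turn one adverse event record into a labeled text example.
# The field names ("drug_name", "reaction", "narrative", "outcome") are
# hypothetical placeholders, not the schema used in the pilot.

def record_to_example(record: dict) -> dict:
    """Flatten an ADE record into text plus a binary death/non-death label."""
    text = (
        f"Drug: {record.get('drug_name', 'unknown')}. "
        f"Reaction: {record.get('reaction', 'unknown')}. "
        f"Report: {record.get('narrative', '')}"
    )
    label = 1 if record.get("outcome") == "death" else 0
    return {"text": text, "label": label}

example = record_to_example({
    "drug_name": "example anticoagulant",
    "reaction": "gastrointestinal bleeding",
    "narrative": "Patient hospitalized after two weeks on the medication.",
    "outcome": "death",
})
print(example)  # {'text': 'Drug: example anticoagulant. ...', 'label': 1}
```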
Q. What LLMs did you use?
A. We used two LLMs, OpenAI's GPT-2 and Llama, which can recognize named entities and can be used to refine and expand the dataset to cover a broad spectrum of datasets and drugs. We then compared several machine learning models (Random Forest, DBSCAN, XGBoost) on the same data and used logistic regression as a baseline.
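As an illustration only, not the team's actual code, a fine-tuning run of this kind can be set up with the Hugging Face transformers library. The sketch below fine-tunes GPT-2 for binary sequence classification and assumes two Hugging Face Datasets, train_ds and val_ds, with "text" and "label" columns like the example above; the hyperparameters are placeholders.

```python
# Sketch: fine-tune GPT-2 as a binary classifier (death vs. non-death).
# Assumes `train_ds` and `val_ds` are Hugging Face Datasets with "text" and
# "label" columns; hyperparameters are illustrative, not the pilot's settings.
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

model_name = "gpt2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 ships without a pad token

model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)
model.config.pad_token_id = tokenizer.pad_token_id

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, padding="max_length", max_length=256)

train_tok = train_ds.map(tokenize, batched=True)
val_tok = val_ds.map(tokenize, batched=True)

args = TrainingArguments(
    output_dir="ade-severity-gpt2",
    num_train_epochs=3,
    per_device_train_batch_size=8,
    learning_rate=2e-5,
)

trainer = Trainer(model=model, args=args, train_dataset=train_tok, eval_dataset=val_tok)
trainer.train()
```

A Llama checkpoint could be dropped into the same pipeline by changing model_name, at the cost of considerably more GPU memory.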
Q. What methodology ensured the dataset was well-suited for training the LLMs?
A. We organized the data in a structured format. Of nearly 400,000 ADEs reported during the second quarter of 2024, we selected 4,000: 3,000 non-death events and 1,000 events resulting in death. We used 70% of these records to train the LLM, 20% to tune and validate it, and 10% to test it. We made sure that the adverse event records were complete, consistently formatted, and correctly labeled for the LLM.
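A 70/20/10 split like this can be reproduced with a two-step stratified split so that the 3:1 ratio of non-death to death events is preserved in every partition. The sketch below is illustrative rather than the pilot's actual code and assumes a pandas DataFrame named ades with a binary death column.

```python
# Sketch: 70/20/10 stratified split of the 4,000 labeled ADE records.
# Assumes `ades` is a pandas DataFrame with a binary "death" column (1 = death).
from sklearn.model_selection import train_test_split

# Carve out 70% for training, stratifying on the label to keep the 3:1 ratio.
train, rest = train_test_split(
    ades, train_size=0.70, stratify=ades["death"], random_state=42
)
# Split the remaining 30% into 20% for tuning and 10% for testing.
tune, test = train_test_split(
    rest, train_size=2 / 3, stratify=rest["death"], random_state=42
)
print(len(train), len(tune), len(test))  # roughly 2800, 800, and 400 records
```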
To evaluate how well the model performed, we measured its accuracy and how effectively it balanced correct predictions (precision) and completeness (recall) using the F1 score and the area under the curve, known as AUC. The F1 score combines precision and recall into a single number, making it useful when you need a balance between the two, especially when the classes are imbalanced and accuracy alone can be misleading. A higher F1 score (closer to 1) means the model is performing well. The AUC measures how well a model can distinguish between categories (such as severe vs. non-severe events).
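Once a model produces predicted labels and class probabilities for the test set, these metrics are straightforward to compute with scikit-learn; the variable names below are placeholders for those outputs.

```python
# Sketch: evaluation metrics for the binary death / non-death classifier.
# `y_true` holds test-set labels, `y_pred` predicted labels, and `y_prob` the
# model's predicted probability of the "death" class (all placeholders).
from sklearn.metrics import (accuracy_score, f1_score, precision_score,
                             recall_score, roc_auc_score)

accuracy = accuracy_score(y_true, y_pred)
precision = precision_score(y_true, y_pred)  # share of predicted deaths that were real
recall = recall_score(y_true, y_pred)        # share of real deaths the model caught
f1 = f1_score(y_true, y_pred)                # harmonic mean of precision and recall
auc = roc_auc_score(y_true, y_prob)          # uses probabilities, not hard labels
print(f"acc={accuracy:.2f} precision={precision:.2f} recall={recall:.2f} "
      f"F1={f1:.2f} AUC={auc:.2f}")
```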
Q. What challenges did you encounter in scaling the model to larger datasets and ensuring generalizability to different drugs and adverse events?
A. This effort required a tremendous amount of computing power. We can address this challenge by using cloud-based GPUs and by scaling the process through integration with more advanced LLMs, such as GPT-4.5, via an API. The team will test the generalizability of the model by scaling it to a broader range of data.
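As a rough illustration of what integrating with a hosted model through an API could look like, the sketch below sends a single report to OpenAI's chat completions endpoint; the model identifier, prompt, and label scheme are assumptions for illustration, not details from the pilot.

```python
# Sketch: classify one ADE report through a hosted LLM API instead of a
# locally fine-tuned model. Model name and prompt are illustrative assumptions.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

report = "Patient on an anticoagulant hospitalized with severe bleeding."  # example text
response = client.chat.completions.create(
    model="gpt-4.5-preview",  # assumed identifier; substitute the model actually available
    messages=[
        {"role": "system",
         "content": "Classify this adverse drug event report as DEATH or NON-DEATH. "
                    "Reply with the label only."},
        {"role": "user", "content": report},
    ],
)
print(response.choices[0].message.content)  # e.g., "NON-DEATH"
```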
Q. What was the outcome of your study?
A. Our initial findings indicated that fine-tuned LLMs effectively classified the severity of adverse drug events with an accuracy of 85-86% and an AUC of 87%, surpassing the performance of traditional machine learning models. These results highlight the utility of LLM-based pharmacovigilance models in efficiently determining event severity, which can significantly enhance regulatory decision-making. However, further validation using real-world datasets is essential to confirm the robustness and generalizability of these models in clinical practice. Leveraging LLMs for severity classification offers a scalable approach, significantly reducing the extensive time and resources typically required by manual severity assessments.
Q. Where can researchers learn more about these findings?
A. An abstract based on this work underwent a blind peer-review process and was accepted as a refereed presentation at the American Statistical Association’s (ASA’s) Symposium on Data Science and Statistics (SDSS) in late April 2025. Westat’s John Riddles will present the paper, titled “Enhancing Public Health Surveillance: Fine-Tuning Large Language Models for Adverse Drug Event Classification.” The full team of authors includes Josh Turner, John Riddles, Julianna Lee, Jeremy Corry, Rashi Saluja, Sean Chickery, Gizem Korkmaz, Marcelo Simas, and Kevin Wilson.
Q. How can this research impact lives?
A. This effort will do much to increase the safety of taking medications and to protect people from ADEs. However, more work needs to be done. We have started expanding into broader datasets and will develop a methodology that indicates the severity of the drug interaction and whether the event was due to a person's age, sex, interaction with another drug, or other variables. The bottom line is that our work will save lives.