PHAROS: 4 agents, 60 seconds, 1 missed drug safety signal away from disaster
Elasticsearch Agent Builder Hackathon

The FDA receives about two million adverse drug event reports every year. Pharmaceutical companies are legally required to detect safety signals within 15 calendar days of a serious report. In practice, pharmacovigilance analysts manually review documents scattered across the FDA Adverse Event Reporting System (FAERS), EudraVigilance, electronic health records (EHRs), and social media. Detection takes weeks to months, and each signal eats 40+ hours of analyst time.
The cost of being slow isn't abstract. Merck's failure to catch cardiac signals from Vioxx cost $4.85 billion in settlements. A single missed signal can trigger fines between $100 million and $1 billion. But the real cost is patients taking drugs that should have been flagged while nobody noticed fast enough.
I'm Prajwal Sutar, an independent developer who's spent the past year pushing real data through large language model (LLM)-based pipelines: ingestion, async orchestration, and multi-agent coordination. I couldn't find a single existing tool that ties together signal detection, report generation, and escalation in one automated pipeline. So I built one during the Elasticsearch Agent Builder Hackathon.
What PHAROS does
PHAROS (Pharmacovigilance Autonomous Reasoning and Oversight System) pulls adverse event reports from the FDA FAERS API, runs WHO-standard statistical analysis to find safety signals, generates the actual regulatory paperwork (e.g., MedWatch 3500A forms, PSUR sections, and case narratives), and pushes alerts to Slack, Jira, and email.
From data ingestion to dispatched alert, it takes under 60 seconds.
Here's what that looks like end to end. Fifty adverse event reports come in for a fictional drug called CARDIVEX, all reporting sudden vision loss and clustered in Japan, Korea, and India. They get indexed. Within a minute, the system has detected a proportional reporting ratio (PRR) of 18.94 for CARDIVEX/vision loss, identified the JP/KR/IN geographic cluster, generated a MedWatch 3500A form and PSUR section, fired a Slack alert to #safety-critical, created a Jira P1 ticket, and emailed the safety officer. Every action is logged to pharos-audit-log — because in pharma, if you didn't log it, it didn't happen.
Four agents handle this, each with a distinct job.
Why four agents, not just one
I split the system because the jobs are different enough that a single agent would be mediocre at all of them. Monitoring for volume spikes is not the same skill as computing statistical ratios, which is not the same as writing regulatory documents, which is not the same as deciding who to page at 2 a.m. Each agent gets a system prompt tuned to its specific task and temperature settings that match: ANALYST runs at 0.0 because you don't want creative PRR numbers. SCRIBE runs at 0.2 for controlled text generation. SENTINEL at 0.1.
The sentinel
SENTINEL watches the pharos-adverse-events index for volume spikes. It uses ES|QL to compare the last 7 days of report volume against a 90-day baseline. If a drug shows a 3x jump, SENTINEL fires an Elastic workflow that kicks off ANALYST. In the CARDIVEX run, it caught a 15x spike.
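The actual check runs as ES|QL against pharos-adverse-events, but the logic is simple enough to sketch in a few lines of Python. Normalizing both windows to reports-per-day before comparing is my assumption about how the 7-day window is made comparable to the 90-day baseline:

```python
from datetime import datetime, timedelta

def spike_ratio(report_dates: list[datetime], now: datetime,
                recent_days: int = 7, baseline_days: int = 90) -> float:
    """Compare recent report volume against a longer baseline window."""
    recent_cutoff = now - timedelta(days=recent_days)
    baseline_cutoff = now - timedelta(days=baseline_days)

    recent = sum(1 for d in report_dates if d >= recent_cutoff)
    baseline = sum(1 for d in report_dates
                   if baseline_cutoff <= d < recent_cutoff)

    # Normalize both windows to reports-per-day so they're comparable.
    recent_rate = recent / recent_days
    baseline_rate = baseline / (baseline_days - recent_days)
    if baseline_rate == 0:
        return float("inf") if recent_rate > 0 else 0.0
    return recent_rate / baseline_rate
```

A 3x jump means `spike_ratio(...) >= 3.0`; the CARDIVEX run returned roughly 15.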
The analyst
ANALYST is where the real detection happens. It runs the WHO PRR calculation entirely in ES|QL — STATS for counts, EVAL for the ratio math, and WHERE for thresholds — across drug-reaction pairs. Then, it runs temporal analysis with BUCKET(report_date, 1 week) to catch weekly clustering, geographic aggregation on geo.country_code, and a hybrid BM25 + dense vector search to find similar historical signals. Severity classification is tiered: PRR ≥ 5.0 with 5+ cases is CRITICAL, PRR ≥ 2.0 with 3+ cases is HIGH, and anything above 1.5 goes to MONITORING. Confirmed signals get written to pharos-signals.
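For reference, the WHO PRR comes from a 2x2 contingency table over drug-reaction pairs. Here's a minimal Python sketch of the formula and the severity tiers described above — PHAROS does this in ES|QL, not Python, so treat this as an illustration of the math rather than the system's code:

```python
def prr(a: int, b: int, c: int, d: int) -> float:
    """WHO proportional reporting ratio for one drug-reaction pair.

    a: reports with the drug AND the reaction
    b: reports with the drug, other reactions
    c: reports with other drugs AND the reaction
    d: reports with other drugs, other reactions
    """
    drug_rate = a / (a + b)     # how often this reaction appears for the drug
    other_rate = c / (c + d)    # how often it appears for everything else
    return drug_rate / other_rate

def classify(prr_value: float, case_count: int) -> str:
    """Severity tiers matching the thresholds described in the text."""
    if prr_value >= 5.0 and case_count >= 5:
        return "CRITICAL"
    if prr_value >= 2.0 and case_count >= 3:
        return "HIGH"
    if prr_value > 1.5:
        return "MONITORING"
    return "NONE"
```

A PRR of 10 means the reaction shows up ten times more often for this drug than across the rest of the database — which is why a PRR of 18.94 with 50 cases lands squarely in CRITICAL.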
The scribe
SCRIBE picks up confirmed signals and generates three document types: MedWatch 3500A, PSUR Section VI, and a case narrative. It pulls up to 100 supporting case reports from the adverse events index, generates the documents, and indexes them into pharos-regulatory-reports.
The herald
HERALD is the action layer. CRITICAL signals get a Slack alert (Block Kit formatting), a Jira P1 ticket, and emails to the safety officer and VP of Safety. HIGH signals get Slack, Jira P2, and email to the safety officer. MONITORING signals accumulate into a weekly digest. A 2-hour escalation timeout re-alerts the VP of Safety if a CRITICAL signal goes unacknowledged.
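The routing rules reduce to a small severity table plus a timeout check. An illustrative sketch — the Slack channel for HIGH signals and the recipient identifiers are my placeholders, not the system's actual values:

```python
# Illustrative routing table; only #safety-critical is named in the text,
# the other channel and recipient names are hypothetical.
ROUTES = {
    "CRITICAL": {"slack": "#safety-critical", "jira_priority": "P1",
                 "email": ["safety-officer", "vp-safety"]},
    "HIGH":     {"slack": "#safety-alerts", "jira_priority": "P2",
                 "email": ["safety-officer"]},
    "MONITORING": {"digest": "weekly"},
}

def route(severity: str) -> dict:
    return ROUTES.get(severity, {})

def needs_escalation(severity: str, acked: bool,
                     hours_since_alert: float,
                     timeout_hours: float = 2.0) -> bool:
    """Re-alert the VP of Safety if a CRITICAL signal sits unacknowledged."""
    return severity == "CRITICAL" and not acked and hours_since_alert >= timeout_hours
```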
The handoffs between agents all run through Elastic workflows — nine workflows total covering agent-to-agent coordination, nightly FAERS ingestion on a cron schedule, Slack/Jira/email dispatch, audit logging, and the escalation timeout.
Keeping the statistics inside Elasticsearch
I made a deliberate choice to keep PRR computation inside ES|QL rather than pulling data into Python. Going in, I assumed I'd need pandas for the statistical work. I was wrong.
The full WHO PRR pipeline (counting, ratio math, thresholds, and temporal bucketing) runs as ES|QL queries. The agents call ES|QL tools, reason over the results, and write back — no pandas, no external compute, and no data transfer bottleneck. The stats scale with the cluster.
ES|QL is less flexible than pandas for arbitrary analysis. But for the WHO formula and weekly BUCKET aggregations, it handles the work cleanly. Cutting out that intermediate Python layer simplified the architecture more than I expected — the agents just query and reason, and there's one fewer place for things to break.
The index design that makes it work
PHAROS runs on four Elasticsearch Serverless indices, and the main one, pharos-adverse-events, is where I spent the most design time.
It has a custom clinical_text_analyzer with snowball stemming for narrative search, a drug_name_analyzer built on the keyword tokenizer for exact drug matching, dense_vector fields (1,536 dimensions) for narrative embeddings, geo_point for geographic clustering, and nested mappings for reactions. Every query the agents need (fuzzy narrative search, exact drug lookup, geographic aggregation, semantic similarity) is supported by the index design. The other three indices are more straightforward: pharos-signals stores detected signals with PRR scores and the analyst's reasoning chain, pharos-regulatory-reports holds generated documents, and pharos-audit-log timestamps every agent action.
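Roughly what that main mapping looks like, as an illustrative subset expressed as the Python dict you'd pass to an index-create call. The analyzer and field names follow the description above; the exact filter chains and field layout are simplified assumptions:

```python
# Illustrative subset of the pharos-adverse-events index definition.
ADVERSE_EVENTS_MAPPING = {
    "settings": {
        "analysis": {
            "analyzer": {
                # Snowball stemming for fuzzy narrative search.
                "clinical_text_analyzer": {
                    "type": "custom",
                    "tokenizer": "standard",
                    "filter": ["lowercase", "snowball"],
                },
                # Keyword tokenizer: the whole drug name is one token.
                "drug_name_analyzer": {
                    "type": "custom",
                    "tokenizer": "keyword",
                    "filter": ["lowercase"],
                },
            }
        }
    },
    "mappings": {
        "properties": {
            "narrative": {"type": "text", "analyzer": "clinical_text_analyzer"},
            "drug_name": {"type": "text", "analyzer": "drug_name_analyzer"},
            "narrative_embedding": {"type": "dense_vector", "dims": 1536},
            "geo": {"properties": {
                "country_code": {"type": "keyword"},
                "location": {"type": "geo_point"},
            }},
            "reactions": {"type": "nested",
                          "properties": {"term": {"type": "keyword"}}},
        }
    },
}
```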
The unglamorous problem that almost broke the pipeline
Getting LLMs to return structured JSON reliably was the fight I didn't anticipate.
Ask an LLM for JSON and you get JSON wrapped in three paragraphs of explanation, or JSON inside markdown code fences, or a conversational preamble followed by JSON followed by a helpful summary. The agents hand structured data to each other, so every response needs to parse cleanly. It doesn't matter how good your signal detection is if the ANALYST's output can't be reliably read by SCRIBE.
I spent a lot of time tuning system prompts and ended up writing a JSON extraction function that handles raw JSON, markdown code fences, and JSON buried inside natural language. It's not interesting work, but it's the kind of thing that determines whether a multi-agent pipeline actually runs or just demos well.
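Not the exact function from PHAROS, but a sketch of the three-case approach described above — raw JSON, fenced JSON, and JSON buried in prose:

```python
import json
import re

def extract_json(text: str) -> dict:
    """Best-effort extraction of a JSON object from an LLM response."""
    # Case 1: the whole response is already valid JSON.
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        pass

    # Case 2: JSON inside a markdown code fence (```json ... ``` or ``` ... ```).
    fence = re.search(r"```(?:json)?\s*(\{.*?\})\s*```", text, re.DOTALL)
    if fence:
        return json.loads(fence.group(1))

    # Case 3: JSON buried in prose; take the first '{' through the last '}'.
    start, end = text.find("{"), text.rfind("}")
    if start != -1 and end > start:
        return json.loads(text[start:end + 1])

    raise ValueError("no JSON object found in response")
```

The first-brace-to-last-brace fallback assumes one object per response, which holds when the system prompt demands a single JSON object — another reason the prompt tuning and the parser have to be designed together.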
What I'd fix first
The PRR calculation is currently a point estimate. A production pharmacovigilance system needs chi-squared confidence bounds and Bayesian IC scoring. The data model already has an ic_score field wired up — it's using an approximation instead of the proper Bayesian calculation. That's the first thing I'd change with more time.
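For context, the standard approximate 95% confidence interval on a PRR is computed on the log scale from the same 2x2 counts as the point estimate. A sketch of what that part of the upgrade could look like:

```python
import math

def prr_with_ci(a: int, b: int, c: int, d: int, z: float = 1.96):
    """PRR point estimate with an approximate 95% confidence interval.

    a/b: reports for the drug with/without the reaction;
    c/d: reports for all other drugs with/without the reaction.
    """
    prr = (a / (a + b)) / (c / (c + d))
    # Standard error of ln(PRR) from the 2x2 contingency counts.
    se = math.sqrt(1 / a - 1 / (a + b) + 1 / c - 1 / (c + d))
    lower = math.exp(math.log(prr) - z * se)
    upper = math.exp(math.log(prr) + z * se)
    return prr, lower, upper
```

A signal whose lower bound stays above 1.0 is far more trustworthy than a bare point estimate, which is exactly what the current system can't distinguish.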
The system also treats "blurred vision" and "vision loss" as separate events. The immediate next step is MedDRA ontology-aware reaction grouping so that the system can catch signals across related terms instead of treating each string as independent. After that, I would pull in EudraVigilance data alongside FAERS for cross-continental correlation.
The broader point
Two million adverse event reports land on someone's desk every year, and the current answer is more analysts running more manual reviews. PHAROS is an argument that the answer is agents that run the WHO statistics, generate the paperwork, and escalate to the right person — all before the analyst has opened their laptop.
PHAROS is open source under MIT. If you work in pharmacovigilance or regulatory affairs and want to run this against real data, I'd like to hear from you.
The release and timing of any features or functionality described in this post remain at Elastic's sole discretion. Any features or functionality not currently available may not be delivered on time or at all.
In this blog post, we may have used or referred to third party generative AI tools, which are owned and operated by their respective owners. Elastic does not have any control over the third party tools and we have no responsibility or liability for their content, operation or use, nor for any loss or damage that may arise from your use of such tools. Please exercise caution when using AI tools with personal, sensitive or confidential information. Any data you submit may be used for AI training or other purposes. There is no guarantee that information you provide will be kept secure or confidential. You should familiarize yourself with the privacy practices and terms of use of any generative AI tools prior to use.
Elastic, Elasticsearch, and associated marks are trademarks, logos or registered trademarks of Elasticsearch B.V. in the United States and other countries. All other company and product names are trademarks, logos or registered trademarks of their respective owners.