Catching invisible errors: How I built a duplicate detection agent for Kenya's HIV program
Elasticsearch Agent Builder Hackathon
In many Kenyan county health departments, monitoring and evaluation (M&E) officers can spend entire days running Excel pivot tables, eyeballing patient names, and cross-referencing sample IDs, and still catch only around 44% of duplicates. The other 56% stay in the system quietly, inflating the US President’s Emergency Plan for AIDS Relief (PEPFAR) dashboards, wasting reagents, and eroding trust in data that clinicians rely on for treatment decisions.
I know this because I see it at work. I'm a solutions architect at a HealthTech firm in Nairobi. We build and maintain health information systems deployed across all 47 counties in Kenya. Duplicate patient records are the kind of problem nobody puts on a slide deck, but everyone feels on the ground.
When the Elasticsearch Agent Builder Hackathon was announced, I didn't have to go looking for a problem. The problem had been sitting on my desk for months.
How duplicates happen (and why they're hard to catch)
Kenya's HIV testing infrastructure runs two critical lab tests: Early Infant Diagnosis (EID) for HIV-exposed infants and Viral Load (VL) monitoring for adults on antiretroviral therapy. The tests are ordered in KenyaEMR, processed at labs, and results flow back through Kenya's health information exchange.
The duplication scenarios are unglamorous and expensive — a mother brings her infant to facility A, then tests again at facility B under a slightly different name; an adult on ART crosses county lines and gets reregistered; someone deliberately uses inconsistent demographics to access services at multiple sites.
Each scenario creates a ghost record. Multiply that across 500+ facilities and the numbers get real: roughly $195,000 per year in duplicated tests, wasted reagents, and inflated reporting. Manual detection takes about two hours per case. At that rate, the backlog only grows.
I wanted something that could scan 1,000 records in seconds and explain its reasoning in language a healthcare worker could understand and act on.
The system: 3 agents, each with a specific job
I built a multi-agent system on Elasticsearch 8.11 (Serverless) using Elastic Agent Builder with Claude 3.7 Sonnet as the reasoning model. Rather than one monolithic agent trying to do everything, I split the work into three agents — the detection agent, the risk assessor, and the action recommender. Each one has a narrow scope, specific inputs, and a defined output format.
The detection agent
The detection agent runs ES|QL queries against the patient index, looking for duplicates through three lenses: cross-facility pattern matching (same patient appearing at multiple facilities), demographic analysis (e.g., name variations, inconsistent sex identifiers, and partial ID matches), and temporal anomaly detection (same-day testing at distant facilities). This is the search layer. It surfaces candidates; it doesn't make judgments.
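To make the cross-facility lens concrete, here is a minimal sketch of the kind of ES|QL query the detection agent might run, plus a pure-Python equivalent of the same filter for local testing. The index and field names (`hiv_lab_tests`, `patient_id`, `facility_id`) are illustrative assumptions, not the actual schema.

```python
from collections import defaultdict

# Hypothetical ES|QL query for the cross-facility lens: count distinct
# facilities per patient and keep only patients seen at more than one.
CROSS_FACILITY_QUERY = """
FROM hiv_lab_tests
| STATS facility_count = COUNT_DISTINCT(facility_id),
        total_tests = COUNT(*) BY patient_id
| WHERE facility_count > 1
| SORT facility_count DESC
"""

def candidate_patients(rows):
    """Apply the same filter in plain Python: return the IDs of patients
    whose test records span more than one facility."""
    facilities = defaultdict(set)
    for row in rows:
        facilities[row["patient_id"]].add(row["facility_id"])
    return {pid for pid, fac in facilities.items() if len(fac) > 1}
```

Note that this surfaces candidates only; as the text says, judgment is deferred to the downstream agents.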
The risk assessor
The risk assessor takes those candidates and scores them 0–100 using weighted signals:
- Cross-facility visits: Up to 40 points
- Demographic inconsistencies: Up to 30 points
- Geographic impossibilities: Up to 20 points
- Timing anomalies: Up to 10 points
Cases land in one of four tiers: CRITICAL, HIGH, MEDIUM, or LOW. I'll explain why I didn't use binary classification in a moment.
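The weighted scoring and tiering can be sketched as follows. Only the maximum weights (40/30/20/10) and the four tier names come from the build; how each signal is quantified and where the tier thresholds sit are illustrative assumptions.

```python
# Maximum points per signal, as described above.
WEIGHTS = {
    "cross_facility": 40,
    "demographic_inconsistency": 30,
    "geographic_impossibility": 20,
    "timing_anomaly": 10,
}

def risk_score(signals):
    """signals: dict mapping signal name -> strength in [0, 1].
    Each signal contributes up to its maximum weight."""
    return round(sum(WEIGHTS[name] * min(max(strength, 0.0), 1.0)
                     for name, strength in signals.items()))

def tier(score):
    """Map a 0-100 score to a tier. Thresholds here are assumptions."""
    if score >= 75:
        return "CRITICAL"
    if score >= 50:
        return "HIGH"
    if score >= 25:
        return "MEDIUM"
    return "LOW"
```

With these illustrative thresholds, the score-87 case discussed later lands in CRITICAL and the score-22 case in LOW, matching the examples in the text.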
The action recommender
The action recommender translates scores into specific next steps calibrated to the Kenyan healthcare context: immediate M&E officer review for CRITICAL cases, flagging for next facility visit for MEDIUM, and staff training recommendations for facilities showing systemic patterns. This agent exists because a risk score alone isn't useful to a health worker. They need to know what to do with it.
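The tier-to-action calibration can be sketched as a simple lookup. The CRITICAL and MEDIUM entries follow the text; the HIGH and LOW wordings are illustrative placeholders, and the real agent generates free-text, evidence-cited recommendations rather than canned strings.

```python
# Illustrative tier-to-action table. Only CRITICAL and MEDIUM reflect
# actions named in the post; HIGH and LOW are hypothetical examples.
ACTIONS = {
    "CRITICAL": "Immediate M&E officer review; verify identity before the next test order.",
    "HIGH": "Review this week; cross-check demographics with both facilities.",
    "MEDIUM": "Flag the record for verification at the patient's next facility visit.",
    "LOW": "No action; file as expected follow-up testing.",
}

def recommend(tier_name):
    return ACTIONS[tier_name]
```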
Why I used multifactor scoring instead of binary classification
Early in the build, I tried a simpler approach: duplicate or nonduplicate. It didn't survive contact with real data.
The problem is that legitimate follow-up testing looks a lot like duplication. A patient on ART is supposed to show up at the same facility every few months. An infant should be tested repeatedly. Binary classification either flags too many legitimate visits (and health workers learn to ignore all flags) or misses the subtle cases where someone tests at two different facilities on the same day under slightly different names.
The tiered approach lets healthcare workers prioritize. A CRITICAL case with a risk score of 87 (same-day testing, different facilities, and inconsistent sex identifiers) gets immediate attention. A LOW case with a score of 22 (same facility and expected follow-up interval) gets filed. The M&E officer makes the final call, but they're working from evidence instead of gut.
Calibrating the weights took many iterations against real data. I'm still not fully confident they're optimal. But the structure is right, and the weights can be tuned as we collect more field data.
The Elasticsearch work that made it possible
I spent more upfront time on index design than on any other part of the system, and it was the best investment I made.
The index mappings include derived fields computed at index time: cross_facility_flag, total_tests, and facility_count per patient. Key demographic fields have both keyword (exact match) and text (analyzed, fuzzy search) subfields, so the detection agent can switch between strict and fuzzy matching depending on the signal it's chasing — strict matching for sample IDs and fuzzy matching for patient names where "Wanjiku" and "Wanjiku Mary" might be the same person.
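A sketch of that mapping, expressed as the dict you would pass when creating the index. The field names are illustrative assumptions; the pattern to note is the analyzed `text` field for names with an exact-match `keyword` subfield, alongside the derived fields computed at index time.

```python
# Hypothetical patient index mapping illustrating the design described above.
PATIENT_MAPPING = {
    "mappings": {
        "properties": {
            "sample_id": {"type": "keyword"},            # strict, exact match only
            "patient_name": {
                "type": "text",                           # analyzed, fuzzy-searchable
                "fields": {"raw": {"type": "keyword"}},   # exact-match subfield
            },
            "facility_id": {"type": "keyword"},
            "test_type": {"type": "keyword"},             # EID or VL
            "test_date": {"type": "date"},
            # Derived fields computed at index time:
            "cross_facility_flag": {"type": "boolean"},
            "total_tests": {"type": "integer"},
            "facility_count": {"type": "integer"},
        }
    }
}
```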
I also leaned hard on Elasticsearch aggregations for candidate prefiltering. The system buckets records by facility, test type, and date range before running pairwise comparisons. This is what keeps detection tractable on larger datasets. You don't need to compare every record against every other record if you can narrow the candidate space first.
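The payoff of prefiltering is easy to show in plain Python: bucket records first, then only compare pairs within a bucket. In production the bucketing is an Elasticsearch aggregation; this sketch (with an assumed test-type-and-week key) just shows why candidate narrowing keeps pairwise comparison tractable.

```python
from collections import defaultdict
from itertools import combinations

def bucketed_pairs(records, key=lambda r: (r["test_type"], r["week"])):
    """Yield only the record pairs that share a bucket, instead of
    all O(n^2) pairs across the whole dataset."""
    buckets = defaultdict(list)
    for rec in records:
        buckets[key(rec)].append(rec)
    for bucket in buckets.values():
        yield from combinations(bucket, 2)
```

With three records, two VL tests in the same week and one EID test, this yields one candidate pair instead of three: the narrower the buckets, the fewer comparisons.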
ES|QL was new to me. I learned it during the hackathon, and it's impressive for real-time analytics at scale. The architecture that worked best for me was pairing ES|QL for pattern detection and aggregations with Python handling the application logic. Even as a newcomer to the language, I found that this separation made the whole system easier to reason about.
What the agents actually found
I tested the system on 1,010 real anonymized patient records from 59 Kenyan healthcare facilities. The scan completed in under 10 seconds.
It identified 131 duplicate patients, including five cases of same-day multifacility testing and four patients with intentionally inconsistent sex identifiers across facilities.
The same-day cases are the ones that surprised me. Manual review would eventually catch name duplicates if someone was patient enough. But spotting that a patient tested at two geographically distant facilities on the same day under slightly different demographics is the kind of pattern that lives in the data invisibly until you specifically look for it. M&E officers told me those cases would have taken weeks to surface manually, if they surfaced at all.
The lesson I didn't expect: Explainability is the product
Early prototypes returned a risk score and a recommendation. I showed them to M&E officers, but they didn't trust the output.
This wasn't a technical failure; the scores were accurate. But a healthcare worker looking at a flagged patient needs to understand why it was flagged before they'll act on it. Is it the name mismatch? The geographic impossibility? The timing? Without that context, the system is a black box, and black boxes get ignored in clinical settings where the stakes are a patient’s treatment.
Building the action recommender to produce specific, evidence-cited explanations was what turned the prototype into something people would actually use. The M&E officer I demoed it to in Nairobi said, "This would have saved me three days last month."
That lesson isn't specific to healthcare. If your AI system's recommendations require a human to act, the explanation is as much the product as the recommendation.
Getting the agent instructions right
Each agent was built in Elastic Agent Builder with custom instructions defining its domain expertise, reasoning steps, and output format. I underestimated how much the quality of those instructions would matter.
Early versions with vague instructions produced inconsistent outputs. The detection agent would sometimes explain its reasoning and sometimes not. The risk assessor would occasionally skip a scoring factor. Getting reliable, evidence-based outputs required being specific about required evidence fields and explicit about the reasoning chain the agent should follow. Treat custom instructions like code: Be precise, test edge cases, and iterate.
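To illustrate "treat custom instructions like code," here is a hypothetical example of the structured style that produced consistent output, as opposed to a vague one-liner. The wording is mine; the actual instructions from the build are not reproduced here.

```python
# Hypothetical structured instructions for the risk assessor. Note the
# explicit scoring steps, required evidence, and fixed output format.
RISK_ASSESSOR_INSTRUCTIONS = """\
You are a risk assessor for duplicate patient detection in Kenyan HIV lab data.

For EVERY candidate, score all four factors and show your work:
1. Cross-facility visits (0-40 points): cite the facility IDs involved.
2. Demographic inconsistencies (0-30 points): quote the conflicting fields.
3. Geographic impossibilities (0-20 points): name the facilities and dates.
4. Timing anomalies (0-10 points): cite the test timestamps.

Output format (always, with no omitted sections):
- total_score: integer 0-100
- tier: CRITICAL | HIGH | MEDIUM | LOW
- evidence: one bullet per factor, citing record fields verbatim
"""
```

A vague alternative like "score each candidate for duplication risk" leaves the reasoning chain and output shape to chance, which is exactly what produced the inconsistent early outputs.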
What's next?
This isn't a hackathon demo that gets archived. The plan is to pilot in five Nairobi County facilities over the next 2–3 months, training M&E officers and collecting real-world performance data to refine the risk weights.
After that, the roadmap includes biometric matching integration and Swahili phonetic name fuzzy matching, which is a real gap in current approaches ("Wanjiku" versus "Wanjiku Mary" is easy; "Njeri" versus "Njery" requires phonetic awareness that standard fuzzy matching doesn't handle well). Eventually, I want the system running in real time during patient registration in HMIS, catching duplicates before they enter the system rather than after.
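The phonetic idea can be illustrated with classic Soundex, which encodes names by sound rather than spelling. This is a sketch, not the planned implementation: Soundex was designed for English names, and tuning a phonetic algorithm for Swahili names is precisely the gap described above.

```python
def soundex(name):
    """Classic Soundex: map a name to a letter plus three digits so that
    similar-sounding names collide (e.g. 'Njeri' and 'Njery')."""
    codes = {**dict.fromkeys("bfpv", "1"), **dict.fromkeys("cgjkqsxz", "2"),
             **dict.fromkeys("dt", "3"), "l": "4",
             **dict.fromkeys("mn", "5"), "r": "6"}
    name = name.lower()
    digits = []
    prev = codes.get(name[0], "")
    for ch in name[1:]:
        if ch in "hw":
            continue  # h and w are transparent: they don't reset the previous code
        code = codes.get(ch, "")
        if code and code != prev:  # skip vowels and collapse adjacent repeats
            digits.append(code)
        prev = code
    return (name[0].upper() + "".join(digits) + "000")[:4]
```

Under this encoding "Njeri" and "Njery" produce the same code, so a phonetic bucket would pair them for review even though strict string matching treats them as different names.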
Longer term, I want to connect to Kenya's Health Information Exchange and scale it to all 47 counties. Elasticsearch's horizontal scaling and the modular agent design mean the core system doesn't need a rebuild; it just needs extensions. The projected impact at national scale: $195,000 in annual savings and a 70% reduction in duplicate testing. More importantly, clinicians can trust the records they're looking at when making treatment decisions.
The takeaway
If you work in a domain where data quality is a quiet, expensive, human-labor problem, Elastic Agent Builder lets you build something that explains the problem rather than just querying it: ES|QL for pattern detection, multiagent orchestration for layered analysis, and custom instructions for domain-specific reasoning. It came together faster than I expected.
The most satisfying part of this build wasn't the placement. It was watching someone who does this work every day recognize in about ten seconds that the tool understood their problem.
The release and timing of any features or functionality described in this post remain at Elastic's sole discretion. Any features or functionality not currently available may not be delivered on time or at all.
In this blog post, we may have used or referred to third party generative AI tools, which are owned and operated by their respective owners. Elastic does not have any control over the third party tools and we have no responsibility or liability for their content, operation or use, nor for any loss or damage that may arise from your use of such tools. Please exercise caution when using AI tools with personal, sensitive or confidential information. Any data you submit may be used for AI training or other purposes. There is no guarantee that information you provide will be kept secure or confidential. You should familiarize yourself with the privacy practices and terms of use of any generative AI tools prior to use.
Elastic, Elasticsearch, and associated marks are trademarks, logos or registered trademarks of Elasticsearch B.V. in the United States and other countries. All other company and product names are trademarks, logos or registered trademarks of their respective owners.