Elasticsearch has native integrations with the industry-leading Gen AI tools and providers. Check out our webinars on going Beyond RAG Basics, or building prod-ready apps with the Elastic vector database.
To build the best search solutions for your use case, start a free cloud trial or try Elastic on your local machine now.
In part 1 and part 2 of this series, we built a complete entity resolution pipeline that included preparing entities with context and indexing them for semantic search, extracting entities from articles using hybrid named entity recognition (NER), and matching entities using semantic search and large language model (LLM) judgment. The results were promising, but JSON parsing errors significantly lowered measured accuracy by causing otherwise valid judgments to be discarded. The system wasn’t failing because it made bad judgments; it was failing because it couldn’t reliably express them.
The root of this problem was our somewhat naive choice of prompt-based JSON generation, in which the LLM generates JSON responses as plain text. If we asked the LLM to judge more than a couple of matches at a time, the generated JSON was often malformed. To mitigate this, we were forced to reduce the processing batch size, which simply won't scale in a production system.
Prompt-based JSON generation helped validate our approach to entity resolution, but we need a more systematic and reliable method. OpenAI function calling provides a better path, guaranteeing structure and type safety while reducing errors and costs. We chose OpenAI function calling for this educational prototype, but other LLM providers typically offer similar functionality (for example, Claude tools).
Note: While we discuss production challenges here, this is still an educational prototype demonstrating optimization techniques. Real production systems would need additional considerations, like monitoring, alerting, fallback strategies, and comprehensive error handling.
Key concepts: Function calling, schema design, and cost benefits
What is function calling? Function calling is OpenAI's structured output API. With it, we can define schemas for LLM responses, so we always know exactly what we're going to get. By enforcing the JSON format rather than trying to define it in the LLM prompt, we should be able to eliminate parsing errors.
Why is it better than prompt-based JSON? LLMs generate nondeterministic output. We hope they'll at least generate content that contains the correct response, but the presentation of that response is unpredictable. With a chatbot, this is often not a problem, but our prototype processes the output programmatically. Computer programs demand consistency: when the LLM generates what we expect, everything is fine, but as soon as it goes off script, so to speak, the code errors out. We could try to account for the different possibilities, but catching everything would be very difficult. We could also try to enforce more consistent behavior with an instruction like "Always return parsable JSON." We tried this exact technique in the prototype's prompt, but we've seen that prompt-based JSON still goes off the rails pretty quickly, particularly when we process a batch of matches.
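To make the failure mode concrete, here's a minimal sketch of the parsing step a prompt-based pipeline relies on. The field names (`candidate_id`, `is_match`) are illustrative, not the exact ones from the prototype; the point is that one stray word or truncated brace from the LLM discards the whole batch.

```python
import json

def parse_prompt_based_json(llm_text):
    """Parse a batch of match judgments returned as free-form text.

    Returns the parsed list, or None when the output is malformed,
    in which case every judgment in the batch is lost.
    """
    try:
        parsed = json.loads(llm_text)
    except json.JSONDecodeError:
        return None
    return parsed if isinstance(parsed, list) else None

# A well-formed response parses cleanly...
good = '[{"candidate_id": "e1", "is_match": true}]'
# ...but a chatty or truncated response loses the entire batch.
bad = 'Sure! Here are the judgments: [{"candidate_id": "e1", "is_match": true'
```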
Function calling makes the LLM generation controllable and predictable, exactly what we need for entity resolution. To aid in the definition of the functions, we’ll also follow minimal schema design principles.
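Here's a sketch of what a function (tool) definition could look like for our match-judging step, in the OpenAI function-calling format. The function name and fields are assumptions for illustration; the companion notebook defines its own schema.

```python
# Illustrative tool definition in the OpenAI function-calling format.
# The model must fill in arguments that conform to this JSON schema,
# rather than free-typing JSON into its text response.
judge_matches_tool = {
    "type": "function",
    "function": {
        "name": "record_match_judgments",
        "description": "Record a yes/no judgment for each candidate entity match.",
        "parameters": {
            "type": "object",
            "properties": {
                "judgments": {
                    "type": "array",
                    "items": {
                        "type": "object",
                        "properties": {
                            "candidate_id": {"type": "string"},
                            "is_match": {"type": "boolean"},
                        },
                        "required": ["candidate_id", "is_match"],
                    },
                }
            },
            "required": ["judgments"],
        },
    },
}
```

With the `openai` Python client, a definition like this is passed as `tools=[judge_matches_tool]` (optionally with `tool_choice` forcing the function), and the structured arguments come back as a JSON string on the response's tool call instead of being buried in free text.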
What are minimal schema design principles? Minimal schema design means defining only the fields you need, using simple types, and avoiding nested structures when possible. This reduces token usage (smaller schemas mean fewer tokens), improves reliability (simpler schemas are easier for the LLM to follow), and lowers costs (fewer tokens mean lower API costs).
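A quick way to see the token impact is to compare the serialized size of a minimal schema against an over-engineered one. Both schemas below are hypothetical; the byte counts stand in for token counts, since schema tokens are sent on every API call.

```python
import json

# Minimal: only the fields the pipeline actually reads, flat structure.
minimal = {
    "type": "object",
    "properties": {
        "candidate_id": {"type": "string"},
        "is_match": {"type": "boolean"},
    },
    "required": ["candidate_id", "is_match"],
}

# Needlessly nested: extra wrapper objects and fields nothing consumes.
verbose = {
    "type": "object",
    "properties": {
        "judgment": {
            "type": "object",
            "properties": {
                "candidate": {
                    "type": "object",
                    "properties": {
                        "id": {"type": "string"},
                        "display_name": {"type": "string"},
                    },
                },
                "decision": {
                    "type": "object",
                    "properties": {
                        "is_match": {"type": "boolean"},
                        "explanation": {"type": "string"},
                    },
                },
            },
        }
    },
}

minimal_size = len(json.dumps(minimal))
verbose_size = len(json.dumps(verbose))
print(minimal_size, verbose_size)  # the minimal schema is far smaller
```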
What are the cost and reliability benefits? Because function calling eliminates parsing errors, match processing succeeds even with large batch sizes, so we don't have to retry judging matches. Eliminating retries reduces token usage and therefore cost, and minimal schemas keep the token count down further. The result is a less expensive and more reliable approach that's much better suited to production use.
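A back-of-the-envelope model shows why retries dominate the cost difference. All the numbers here are illustrative assumptions (not measurements from the prototype), and the model simplifies by assuming each failed batch is retried exactly once and then succeeds.

```python
# Hypothetical cost model for retry overhead.
tokens_per_batch = 2_000
error_rate = 0.30  # roughly the prompt-based error rate we observed

# Prompt-based: ~30% of batches are sent a second time.
expected_tokens_prompt = tokens_per_batch * (1 + error_rate)
# Function calling: no parsing failures, so no retries.
expected_tokens_functions = tokens_per_batch

savings = 1 - expected_tokens_functions / expected_tokens_prompt
print(f"{savings:.0%}")  # ≈ 23% fewer tokens under these assumptions
```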
We need to check one more thing, though. While matches may be getting processed without error, are the errorless results actually correct? How does this new approach compare to the promising results we saw with the prompt-based approach?
Real-world results: Side-by-side comparison
As we did in the previous blog, we ran the function calling approach against the tier 4 dataset, which consists of 206 expected matches across 69 articles. The results demonstrate a dramatic improvement:
| Metric | Prompt-based | Function calling | Improvement |
|---|---|---|---|
| Error rate | 30.2% | 0.0% | 100% elimination |
| Precision | 83.8% | 90.3% | +6.5pp |
| Recall | 62.6% | 90.8% | +28.2pp |
| F1 score | 71.7% | 90.6% | +18.9pp |
| Acceptance rate | 44.8% | 60.2% | +15.4pp |
| True positives | 129 | 187 | +45.0% |
| False negatives | 77 | 19 | -75.3% |
Error elimination: The key differentiator
The most striking difference is the complete elimination of JSON parsing errors. This resulted in a modest precision improvement and a far more dramatic recall improvement. The precision metric captures how often the matches the system accepts were expected in the golden document. So the prototype was decent at judging matches correctly in the prompt-based approach, but function calling does that even better.
Conversely, recall tells us how many of the expected matches were found. When a batch of matches comes back with malformed JSON, the system loses every judgment in that batch. Elasticsearch likely surfaced many of these matches as candidates, but they were discarded when judgment failed. The significant recall improvement confirms this hypothesis: Elasticsearch identifies the potential matches, and function calling reliably verifies which of them are correct.
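As a sanity check, the headline metrics follow directly from the counts in the table. The false-positive count of 20 isn't stated in the table; it's inferred from the reported 90.3% precision.

```python
# Recomputing the function-calling metrics from the table's counts.
# tp and fn come from the table; fp = 20 is inferred from precision.
tp, fn, fp = 187, 19, 20

precision = tp / (tp + fp)  # 187 / 207
recall = tp / (tp + fn)     # 187 / 206 expected matches
f1 = 2 * precision * recall / (precision + recall)

print(round(precision, 3), round(recall, 3), round(f1, 3))
# → 0.903 0.908 0.906, matching the table
```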
Note: It’s expected that Elasticsearch will find some incorrect matches because we look at the top two or three results from hybrid search. Most of the time, hybrid search returns the correct match as the top result, but having the LLM judge the top few hits ensures that we see how the LLM handles incorrect matches. If we move from the educational prototype to a production system, we’ll likely tune the Elasticsearch queries more carefully so that we only send promising matches to the LLM, further optimizing our LLM costs.
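One way that query tuning could look is a score cutoff on the candidate retrieval, so weak candidates never reach the LLM judge. This is a simplified lexical sketch, not the prototype's actual hybrid query, and the index and field names are made up for illustration.

```python
# Hypothetical candidate query: keep the top few hits, but drop anything
# below a relevance threshold before spending LLM tokens on judgment.
candidate_query = {
    "size": 3,         # top few hits, as in the prototype
    "min_score": 0.7,  # illustrative cutoff; would be tuned empirically
    "query": {
        "match": {"entity_context": "Acme Corporation, US software vendor"}
    },
}
```

Note that a plain `min_score` applies to lexical queries like this one; a production hybrid (lexical plus vector) query would need its threshold expressed differently depending on how the scores are combined.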
What's next: The ultimate challenge
Now that we've optimized our LLM integration with function calling, we have a complete entity resolution pipeline with improved reliability and cost efficiency. However, can it handle the ultimate challenge? In the next post, we'll explore how the system handles diverse entity resolution scenarios across 50 different challenge types, including cultural naming conventions, business relationships, titles, and multilingual variations.
Try it yourself
Want to see function calling optimization in action? Check out the Function Calling Optimization notebook for a complete walkthrough with real implementations, detailed explanations, and hands-on examples. The notebook shows you exactly how to use function calling for structured output, compare it with prompt-based JSON, and analyze cost and reliability benefits.
Remember: This is an educational prototype designed to teach optimization concepts. When building production systems, consider additional factors, like multi-provider support, advanced caching strategies, monitoring and alerting, comprehensive error handling, and compliance requirements that aren't covered in this learning-focused prototype.