By Samarth Patel, Senior Software Engineer
30th December, 2025
At Bizom, we’ve deeply integrated AI agents across our products. These agents help users by summarizing information, extracting insights, and responding with context-aware intelligence. To support this, we generate thousands of summaries every single day.
But as we scaled, a very real problem surfaced: how do we ensure these agents aren’t hallucinating? When an AI confidently produces an incorrect summary, the impact isn’t just a bad user experience. It can mislead users, distort business understanding, and slowly erode trust. We needed a reliable way to continuously evaluate, validate, and keep these summaries in check.
LLMs are not databases. They do not “look up” information. Instead, they generate responses by predicting the most probable next token based on patterns they have seen during training. As Andrej Karpathy famously describes it, they are essentially a compressed representation of the internet, not a factual source of truth.
This works incredibly well when the model has clear, complete context. But real-world systems rarely operate in ideal conditions.
Hallucinations usually creep in when the context the model receives is incomplete, ambiguous, or missing altogether.
At a small scale this might feel acceptable, or even ignorable. But once you integrate it into a production-grade system, even a small hallucination rate becomes a serious reliability problem. A one percent failure rate is no longer harmless: it means dozens of wrong outputs every day, which directly impacts users and business workflows.
Once we acknowledged hallucinations as a real risk, the next question seemed deceptively simple: how do we check whether a summary is correct?
Very quickly, we realized there is no straightforward answer.
Unlike deterministic systems, LLMs are non-deterministic by nature, and their output has no single “golden truth.” The same context can legitimately produce multiple valid summaries depending on phrasing, emphasis, and representation. So rule-based validation did not work. Manually reviewing summaries clearly did not scale. And asking another LLM to “judge” correctness introduced new uncertainty, because now we were trusting one probabilistic system to evaluate another.
In short, correctness was subjective, dynamic, and highly contextual. That was the real challenge.
So instead of trying to define correctness explicitly, we reframed the problem.
Our agents operate inside well-defined contexts within our product. If ten summaries are generated under the same context, they may differ in wording, but semantically they should broadly say the same thing. If one summary suddenly looks very different from the rest, that is a strong signal that something is off.
This thought led us to treat hallucination not as a correctness problem, but as an outlier detection problem.
We start by taking a random sample of the generated summaries from audit logs. These summaries are then transformed into vector representations using embedding models, turning each summary into a point in a high-dimensional space that captures its meaning rather than its exact wording. In this space, summaries originating from the same context should naturally form tight clusters.
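To make the embedding step concrete, here is a minimal sketch using the open-source sentence-transformers library; the library, the model name, and the sample summaries below are illustrative assumptions rather than the exact production setup.

```python
from sentence_transformers import SentenceTransformer

# Hypothetical model choice; the post doesn't prescribe a specific embedding model.
model = SentenceTransformer("all-MiniLM-L6-v2")

# In practice, `summaries` would be a random sample pulled from audit logs
# for a single, well-defined context.
summaries = [
    "Sales in the north region grew 12% week over week.",
    "North region sales rose roughly 12% compared to last week.",
    "The north region saw a ~12% weekly increase in sales.",
]

# Each summary becomes a point in a high-dimensional space. Normalizing the
# vectors lets us treat a plain dot product as cosine similarity later on.
embeddings = model.encode(summaries, normalize_embeddings=True)
print(embeddings.shape)  # (num_summaries, embedding_dim)
```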
A naïve approach would be to compute pairwise similarity across all sampled logs (N×N comparisons), but at our scale this is computationally expensive and impractical.
Instead, we take all the vectors for a context and compute their centroid, which represents the “expected” behavior for that context.
For each summary, we then measure its distance from the centroid.
If the distance exceeds a defined threshold, we flag the summary as an outlier. These are strong candidates for hallucination or abnormal behavior.
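Here is a sketch of the centroid-and-threshold check itself, continuing from the `embeddings` array in the snippet above. The function name and the default threshold are illustrative; in practice the threshold would be tuned per context.

```python
import numpy as np

def flag_outliers(embeddings: np.ndarray, threshold: float = 0.25):
    """Flag summaries whose cosine distance from the context centroid
    exceeds `threshold` (illustrative default)."""
    # The centroid stands in for the "expected" summary of this context.
    centroid = embeddings.mean(axis=0)
    centroid /= np.linalg.norm(centroid)

    # With unit-length embeddings, cosine distance = 1 - dot product.
    distances = 1.0 - embeddings @ centroid
    return np.where(distances > threshold)[0], distances

outlier_idx, distances = flag_outliers(embeddings)
for i in outlier_idx:
    print(f"Summary {i} looks abnormal (distance={distances[i]:.3f})")
```

Because each summary is compared against a single centroid rather than against every other summary, the check stays linear in the number of sampled summaries instead of growing quadratically.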
This approach is:
- scalable, since each summary is compared against a single centroid rather than every other summary;
- reference-free, since it does not require a “golden truth” summary to compare against;
- context-aware, since every check happens only against summaries generated under the same context.
As AI becomes more deeply integrated into deterministic systems, there is a growing need for observability platforms that monitor and measure the non-deterministic behavior of models. In this post, we walked through one such approach: using observability to detect and control hallucinations so that AI outputs remain reliable and trustworthy.