Scaling AI Agents Without Sacrificing Accuracy

By Samarth Patel, Senior Software Engineer

30th December, 2025

At Bizom, we’ve deeply integrated AI agents across our products. These agents help users by summarizing information, extracting insights, and responding with context-aware intelligence. To support this, we generate thousands of summaries every single day.

 

But as we scaled, a very real problem surfaced: how do we ensure these agents aren’t hallucinating? When an AI confidently produces an incorrect summary, the impact isn’t just a bad user experience. It can mislead users, distort business understanding, and slowly erode trust. We needed a reliable way to continuously evaluate, validate, and keep these summaries in check.

Why Hallucinations Happen (and Why They Get Worse at Scale)

LLMs are not databases. They do not “look up” information. Instead, they generate responses by predicting the most probable next token based on patterns they have seen during training. As Andrej Karpathy famously describes it, they are essentially a compressed representation of the internet, not a factual source of truth.

 

This works incredibly well when the model has clear, complete context. But real-world systems rarely operate in ideal conditions.

 

Hallucinations usually creep in when:

  • the model does not have enough reliable grounding data
  • the prompt leaves ambiguity
  • the underlying input is noisy or incomplete
  • the model is forced to “fill gaps” instead of admitting it does not know

 

At a small scale this might feel acceptable, or even ignorable. But once these agents are integrated into a production-grade system generating thousands of summaries a day, even a small hallucination rate becomes a serious reliability problem. A one percent failure rate is no longer harmless: it means dozens of wrong outputs every day, directly impacting users and business workflows.

How Do You Measure “Correctness”?

Once we acknowledged hallucinations as a real risk, the next question sounded deceptively simple: how do we check if a summary is correct?

 

Very quickly, we realized there is no straightforward answer.

 

Unlike deterministic systems, LLMs are non-deterministic by nature, and their output has no single “golden truth.” The same context can legitimately produce multiple valid summaries depending on phrasing, emphasis, and representation. So rule-based validation did not work. Manually reviewing summaries clearly did not scale. And asking another LLM to “judge” correctness introduced new uncertainty, because now we were trusting one probabilistic system to evaluate another.

 

In short, correctness was subjective, dynamic, and highly contextual. That was the real challenge.

 

So instead of trying to define correctness explicitly, we reframed the problem.

 

Our agents operate inside well-defined contexts within our product. If ten summaries are generated under the same context, they may differ in wording, but semantically they should broadly say the same thing. If one summary suddenly looks very different from the rest, that is a strong signal that something is off.

 

This thought led us to treat hallucination not as a correctness problem, but as an outlier detection problem.

Solution: Detecting Hallucinations Using Vector Outliers

We start by taking a random sample of the generated summaries from audit logs. These summaries are then transformed into vector representations using embedding models, turning each summary into a point in a high-dimensional space that captures its meaning rather than its exact wording. In this space, summaries originating from the same context should naturally form tight clusters.
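As a rough illustration of this step, here is a minimal sketch using the sentence-transformers library; the model name and the sample summaries are placeholders for illustration, not necessarily the embedding stack we run in production.

```python
# Sketch: turn sampled summaries into normalized embedding vectors.
# Assumes the sentence-transformers library; the model name is illustrative.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

# Summaries sampled from audit logs for one product context (placeholder data).
summaries = [
    "Sales in the North region grew 12% week over week.",
    "North region sales rose about 12% compared to last week.",
    "The North region saw roughly 12% weekly sales growth.",
]

# normalize_embeddings=True puts every summary on the unit sphere,
# so cosine similarity later reduces to a simple dot product.
embeddings = model.encode(summaries, normalize_embeddings=True)
print(embeddings.shape)  # (num_summaries, embedding_dim)
```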

 

A naïve approach would be pairwise similarity across all logs (N×N comparisons), but at our scale this is computationally expensive and impractical.

Instead, we compute the centroid of all vectors for a given context, which represents the “expected” behavior for that context.
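A minimal sketch of the centroid step, continuing from the normalized embeddings produced above (a NumPy array with one row per summary):

```python
import numpy as np

# embeddings: (num_summaries, embedding_dim) array of unit-length vectors
# for a single context, as produced in the previous sketch.
centroid = embeddings.mean(axis=0)

# Re-normalize so the centroid also lies on the unit sphere and
# cosine similarity against it is again just a dot product.
centroid = centroid / np.linalg.norm(centroid)
```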

[Figure: vector representation of two words in embedding space]

For each summary, we then measure its distance from the centroid.

 

If a summary’s distance exceeds a defined threshold, we flag it as an outlier. These outliers are strong candidates for hallucination or other abnormal behavior.
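Continuing the sketch, the flagging itself is a single pass over the vectors; the threshold value below is a made-up placeholder, since a real cutoff would be tuned per context.

```python
# Cosine distance of each summary from the centroid (vectors are unit-length,
# so the dot product is the cosine similarity).
distances = 1.0 - embeddings @ centroid

# Illustrative threshold; in practice this would be tuned per context,
# e.g. from the distribution of distances observed in audit logs.
THRESHOLD = 0.35

for summary, dist in zip(summaries, distances):
    if dist > THRESHOLD:
        print(f"Possible hallucination (distance {dist:.2f}): {summary}")
```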

 

This approach is:

  • scalable (single pass instead of N² comparisons)
  • mathematically explainable
  • context-preserving 
  • effective in highlighting truly suspicious summaries

Conclusion

As AI becomes more deeply integrated into deterministic systems, there’s a growing need for observability platforms that monitor and measure the non-deterministic behavior of models. We have outlined one such approach: treating hallucination detection as outlier detection in embedding space, so that AI outputs remain reliable and trustworthy.
