Deterministic, But Not Reliable
In September 2025, a blog post from Thinking Machines Lab, an AI research group, began circulating widely in tech circles. The post highlighted a subtle but familiar problem in large-scale AI deployments: even when you send the exact same prompt to the exact same model, you might get two different answers. According to the authors, this inconsistency doesn’t come from anything inherent to the model itself, but from something far more mundane. To keep up with global demand, AI systems group incoming requests into batches so they can be processed in parallel. When the composition of those batches changes, the numerical operations involved in generating a completion change too. Over long outputs, those small floating-point differences can accumulate into visibly different text, even when the model is run with nominally deterministic settings: fixed decoding parameters and a temperature of zero.
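The underlying mechanism can be demonstrated without any model at all. Floating-point addition is not associative, so summing the same values in a different grouping gives slightly different results. The short Python sketch below, using made-up data, mimics how a change in reduction order, of the kind a different batch composition can trigger, perturbs a computation:

```python
import random

# Floating-point addition is not associative: reducing the same values
# in a different order can produce slightly different results.
values = [random.gauss(0.0, 1.0) for _ in range(10_000)]

forward = sum(values)
backward = sum(reversed(values))
print(forward == backward)      # usually False
print(abs(forward - backward))  # tiny, but nonzero

# A kernel whose reduction tree depends on batch size behaves like a
# chunked sum whose chunk width changes with the batch: same inputs,
# different grouping, slightly different answer.
def chunked_sum(xs, chunk):
    partials = [sum(xs[i:i + chunk]) for i in range(0, len(xs), chunk)]
    return sum(partials)

print(chunked_sum(values, 32) == chunked_sum(values, 128))  # usually False
```

Each discrepancy is on the order of rounding error, but in an autoregressive model a single flipped token early in a generation can steer everything that follows.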
It’s a clear explanation, and one that resonates with anyone who has tried to reproduce an output from ChatGPT or Gemini only to find that it has shifted slightly since the last run. The authors also show that with carefully designed batch-invariant kernels, these inconsistencies can be eliminated almost entirely. In this setup, the system performs the same numerical operations in the same order regardless of the mix of requests arriving at that moment, so the generation process no longer depends on how inputs happen to be grouped. If the hardware follows an identical sequence of steps, the output becomes reproducible: same prompt, same answer, even under heavy load.
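As a rough illustration of the batch-invariance idea (a toy NumPy sketch, not the authors’ actual kernels): if each request’s reductions run in a fixed per-row order that never depends on batch size, the result for a given request is bitwise identical whether it arrives alone or alongside other traffic.

```python
import numpy as np

def rowwise_logsumexp(x: np.ndarray) -> np.ndarray:
    """Reduce each row independently, left to right, in an order that
    never depends on how many other rows share the batch."""
    out = np.empty(x.shape[0])
    for i, row in enumerate(x):
        m = row.max()
        acc = 0.0
        for v in row:  # fixed, batch-independent reduction order
            acc += np.exp(v - m)
        out[i] = m + np.log(acc)
    return out

rng = np.random.default_rng(0)
request = rng.normal(size=(1, 512))   # the request we care about
others = rng.normal(size=(7, 512))    # unrelated traffic sharing the batch

alone = rowwise_logsumexp(request)[0]
batched = rowwise_logsumexp(np.vstack([request, others]))[0]
print(alone == batched)  # True: bitwise identical regardless of batch mix
```

The real engineering, per the post, lives in making the model’s matrix multiplications, attention, and normalization kernels behave this way; the loop above only illustrates the invariance property itself, not how production kernels achieve it.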
The explanation caught on for understandable reasons. The appeal of a deterministic AI, one that behaves like a traditional piece of software rather than a system that might produce different answers from one run to the next, taps into a long-standing public unease. Many people are unsettled by the idea that AI systems could behave unpredictably in important situations. If infrastructure-level nondeterminism is the cause, then perhaps better engineering is the cure. Fix the batching, fix the randomness, fix the unpredictability.
But as the conversation spread, the post was sometimes read as addressing a much broader set of problems than it actually does. The nondeterminism described by the authors is real, but it is not the source of the “unpredictability” that most people encounter when they use AI systems. It explains why a system might produce two different long-form answers to the same question. It does not explain why the answer itself is sometimes wrong, illogical, or unsafe. And even if every cloud provider adopted batch-invariant kernels tomorrow, the concerns people raise about hallucinations, inconsistent refusals, or reasoning failures would remain exactly where they are.
This distinction matters because the term unpredictable has quietly become a catch-all in the public conversation about AI. When a model invents a citation or misinterprets a seemingly straightforward instruction, people describe the behavior as unpredictable. When a guardrail that appears firm one moment fails under a slightly different phrasing the next, that too is unpredictable. But these failures are not caused by floating-point jitter. They are artifacts of the model’s training objective, which rewards plausibility rather than accuracy and generalization rather than verification.
A transformer with fixed weights is, in principle, a deterministic function. If you control every source of variability—sampling parameters, kernel execution order, hardware differences—it will produce the same internal activations and the same tokens every time. The nondeterminism that Thinking Machines Lab addressed comes from the infrastructure around the model, not from the model’s reasoning itself. And because it is a property of infrastructure, it can be fixed with infrastructure. Hallucinations, on the other hand, arise from the model’s attempt to predict the most statistically likely continuation of a sequence, whether or not that continuation happens to be true. Even when the temperature is set to zero and every numerical operation is controlled, a model can still be confidently wrong.
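A toy example makes the gap concrete. With greedy, temperature-zero decoding, a fixed distribution always yields the same token, and nothing about that guarantees the token is true. The probabilities below are invented for illustration:

```python
# Hypothetical next-token probabilities a model might assign after the
# prompt "The capital of Australia is" (invented numbers for illustration).
probs = {
    "Sydney": 0.55,     # plausible and common in training data, but wrong
    "Canberra": 0.40,   # correct, but less probable under this toy model
    "Melbourne": 0.05,
}

def greedy(dist):
    # Temperature-zero decoding: always pick the argmax. Same input,
    # same output, every single run.
    return max(dist, key=dist.get)

print(greedy(probs))  # "Sydney" every run: perfectly reproducible, and wrong
```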
This is why relying on determinism as a proxy for reliability is risky. A deterministic output can still be incoherent or harmful, and determinism alone does not guarantee correctness. In fact, while determinism makes failures easier to reproduce and measure, it can also make incorrect outputs appear more authoritative than they are. When a model’s mistakes vary from run to run, users are more likely to notice something is wrong. When the model makes the same mistake every time, that failure can blend into the background of expected behavior. Safety vulnerabilities can show the same pattern; a jailbreak that works intermittently is harder for an attacker to weaponize than one that works with perfect reproducibility. Determinism helps engineers diagnose issues and regulators audit deployments, but it does not address the behavioral instabilities that motivate concerns about AI in the first place.
None of this is an argument against deterministic inference. Reproducibility remains essential for scientific research, enterprise deployments, and regulatory oversight. The Thinking Machines Lab work is a real contribution; it shows that infrastructure-induced variability can be designed away, and that systems built on top of LLMs do not need to depend on the incidental behavior of the underlying hardware. But it is only one layer of the problem. The unpredictability that frustrates users and alarms policymakers lives in a different layer entirely: the model’s internal representation of language, knowledge, and reasoning, where small changes in phrasing or context can produce large changes in behavior.
As AI continues to move deeper into domains that require trust, the distinction between these layers will become more important. People do not care whether a model’s outputs line up identically when requests are grouped differently; they care whether the model gives grounded and stable answers to questions that matter. Safety researchers care less about numerical consistency than about whether a refusal boundary holds under pressure, whether a model’s outputs degrade under distribution shift or long-context load, and whether adversarial prompts can reliably subvert intended behavior. These limitations are not resolved by deterministic infrastructure, and focusing too heavily on one form of unpredictability risks obscuring the more consequential ones.
The recent attention on nondeterminism is a useful reminder that AI systems are shaped as much by the engineering beneath them as by the models themselves. But reproducibility addresses only a small and relatively straightforward slice of unpredictable behavior. A deterministic AI system is easier to test and audit, but that does not make it dependable. A lack of reproducibility is often an infrastructure problem, and reproducible outputs are evidence that the infrastructure has been fixed; what reproducibility cannot fix are the behaviors that make these systems unreliable or unsafe. The unpredictability introduced by batching can be removed with better engineering; the unpredictability that comes from how the model generalizes, reasons, and fails cannot. As these systems take on more consequential roles, the challenge is not eliminating numerical variability, but building models that behave reliably in the ways people actually care about. Determinism helps engineers, but reliability helps everyone.