Why non-deterministic agents belong in creative work

There's a quiet consensus forming inside enterprise AI teams that goes something like this: language models are unreliable, the way to make them reliable is to lock down the parameters, and the path to enterprise-grade output runs through temperature 0, low top-p, and a forty-page prompt. This is correct as far as it goes. For deterministic tasks (code generation, structured data extraction, regulated text) you want the model to pick the highest-probability token every time and never surprise you. The bug is treating this as the universal stance.

For creative work, it's the wrong stance. And "creative work" inside an enterprise is broader than people realize. It includes campaign concepting, product naming, customer-facing copy variations, design exploration, hypothesis generation in research, even the framing of strategy memos. Anywhere the right answer isn't a known answer but a discovered one, the property you want from the agent is exactly the property the deterministic posture engineers out: the willingness to suggest something the team wouldn't have suggested on its own.

What temperature actually controls

A language model doesn't pick "the next word." It produces a probability distribution over thousands of possible next tokens, and a sampling method picks one. Temperature is the parameter that reshapes that distribution before sampling. At temperature 0, the model picks whichever token has the highest probability, every time, deterministically.¹ Between 0 and 1, the distribution is sharpened toward the most likely tokens. At temperature 1, the model samples from the distribution as it produced it. Above 1, the distribution flattens, giving lower-probability tokens a better shot. Above about 1.5, it is usually flat enough that the output starts to lose coherence. The interesting range for creative work is roughly 0.9 to 1.3.²

The mechanical effect: at low temperature the model produces text that looks like the average of similar training examples. At higher temperature, it produces text that combines training examples in less typical ways. For factual or structural tasks the average is what you want. For creative tasks the less typical combinations are exactly the point.
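To make the reshaping concrete, here is a minimal sketch of temperature applied to a softmax. The four scores are made up for illustration, and treating temperature 0 as "put all the weight on the top token" is a common convention rather than something any particular provider is guaranteed to do.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Turn raw token scores into probabilities after dividing by the temperature."""
    if temperature <= 0:
        # Assumption: temperature 0 is treated as greedy decoding,
        # so all probability mass goes to the highest-scoring token.
        top = max(range(len(logits)), key=lambda i: logits[i])
        return [1.0 if i == top else 0.0 for i in range(len(logits))]
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical scores for four candidate next tokens, most likely first.
logits = [4.0, 3.0, 2.0, 0.5]

for t in (0, 0.3, 1.0, 1.3):
    probs = softmax_with_temperature(logits, t)
    print(f"T={t}: {[round(p, 3) for p in probs]}")
```

Running it shows the top token's share shrinking and the last token's share growing as the temperature rises, which is the flattening described above.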

0 – 0.3: Deterministic. Code, extraction, factual answers, regulated content.

0.7 – 1.0: Default. Most assistant work, conversational, summary tasks.

1.0 – 1.3: Divergent. Brainstorming, naming, copy variation, conceptual exploration.
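One way to keep these bands from living only in people's heads is a small lookup that every agent config goes through. This is a sketch: the band values come from the ranges above, while the task labels, the `temperature_for` helper, and the midpoint default are assumptions made for illustration.

```python
# Bands copied from the ranges above.
TEMPERATURE_BANDS = {
    "deterministic": (0.0, 0.3),
    "default": (0.7, 1.0),
    "divergent": (1.0, 1.3),
}

# Hypothetical task labels mapped to a band.
TASK_MODE = {
    "code": "deterministic",
    "extraction": "deterministic",
    "regulated_content": "deterministic",
    "summary": "default",
    "conversation": "default",
    "brainstorm": "divergent",
    "naming": "divergent",
    "copy_variation": "divergent",
}

def temperature_for(task, position=0.5):
    """Return a temperature inside the task's band.

    position 0.0 is the low end of the band, 1.0 the high end.
    """
    if task not in TASK_MODE:
        raise KeyError(f"no temperature band configured for task {task!r}")
    if not 0.0 <= position <= 1.0:
        raise ValueError("position must be between 0 and 1")
    low, high = TEMPERATURE_BANDS[TASK_MODE[task]]
    return round(low + position * (high - low), 2)

print(temperature_for("extraction", position=0.0))
print(temperature_for("naming"))
```

Failing loudly on an unknown task is deliberate: it forces someone to decide which mode a new agent is in instead of inheriting whatever the last agent used.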

The brainstorm analogy

Think about the actual cognitive job of a brainstorm. You're trying to surface options the group wouldn't otherwise consider, to get past the obvious and the safe, to make weird connections that turn out to be useful. The worst person to invite to a brainstorm is the most senior expert in the room, because their high prior probability on the "correct" answer crowds out the lower-probability candidates that might actually be more interesting. Good facilitators know this. They explicitly invite outsiders. They run rounds where bad ideas are mandatory. They use prompts like "what would the worst possible solution look like" to flatten the team's distribution over what's allowed.

That last move, deliberately flattening the distribution, is exactly what raising the temperature does to a language model. It's not a bug. It's the thing you wanted.

The same model can do both. But only if you know which mode you're asking for.

The teams getting this wrong are the ones running their brainstorm prompts at temperature 0.2 because that's what worked for their data extraction agent, and then concluding that the model "can't do creative work." It can. They're asking for the boring answer.

What the research actually shows

The literature on AI in creative work has a wrinkle that's worth knowing about. AI accelerates the divergent phase, when you want many ideas, but it can hurt convergence, when you want to pick a good one. A 2025 study comparing AI-assisted and human-only design teams found that AI-assisted teams generated more ideas faster, but that the AI assistance also encouraged what the researchers called "premature convergence," narrowing exploration early and producing less functionally refined solutions.³ Human-only teams iterated more and produced higher-quality final designs.

The lesson isn't "don't use AI for creative work." It's "use AI for the part it's good at, and don't let it run the part it's bad at." For divergence, fluency, and breadth: high temperature, AI in the loop, generate aggressively. For convergence, evaluation, and craft: low temperature or no AI at all, and let humans pick. The teams winning at this have separated those two modes deliberately. The teams losing at it are using a single agent at a single temperature for both.

A working pattern: the wake/dream/judge architecture

A research framework called ReMIND, published earlier this year, formalizes this separation in a way that's actually useful for production teams.⁴ The architecture breaks creative ideation into three roles, each implemented as a separate agent or a separate model call:

Wake

Low-temperature baseline. Produces the safe, expected, on-brand answer. This is the "if we did this in the obvious way, here's what we'd ship" output. It anchors the rest of the process and gives the team a reference point.

Dream

High-temperature divergence. Produces unconventional combinations that explicitly depart from the wake baseline. Cranked-up temperature, possibly with techniques like prompt-level constraint relaxation. Most of the dream output will be unusable. The point isn't usability, it's discovery.

Judge

Low-temperature evaluation. Reads both the wake baseline and the dream variations and identifies which dream outputs are interestingly different from baseline in ways that meet the brief. The judge isn't picking the winner, it's filtering the long tail down to a shortlist for a human to pick from.

The architecture matters because it stops a single agent from trying to do incompatible jobs. A single agent at temperature 0.7 produces work that's neither safely on-brand nor usefully weird. It's mush. Three agents at three different settings produce a baseline, a set of departures, and a defensible shortlist.
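The three roles can be sketched as three calls to the same model at different temperatures. This is an illustration of the pattern as described here, not the ReMIND authors' code. The `generate(prompt, temperature)` callable, the prompt wording, the default temperatures, and the comma-separated reply format for the judge are all assumptions, and `fake_generate` is an offline stand-in so the sketch runs without a model.

```python
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical signature for a model call: (prompt, temperature) -> text.
Generate = Callable[[str, float], str]

@dataclass
class IdeationResult:
    baseline: str
    departures: List[str]
    shortlist: List[str]

def ideate(brief: str, generate: Generate, n_dreams: int = 5,
           wake_temp: float = 0.2, dream_temp: float = 1.2,
           judge_temp: float = 0.2, shortlist_size: int = 3) -> IdeationResult:
    # Wake: one low-temperature call for the safe, on-brand baseline.
    baseline = generate(f"[wake] Give the obvious, on-brand answer.\nBrief: {brief}", wake_temp)

    # Dream: several high-temperature calls that must depart from the baseline.
    departures = []
    for i in range(n_dreams):
        prompt = (f"[dream] variation {i + 1}\nBrief: {brief}\n"
                  f"Baseline to depart from: {baseline}")
        departures.append(generate(prompt, dream_temp))

    # Judge: one low-temperature call that filters, but does not pick a winner.
    listing = "\n".join(f"{i}: {d}" for i, d in enumerate(departures))
    verdict = generate(
        f"[judge] Brief: {brief}\nBaseline: {baseline}\nCandidates:\n{listing}\n"
        f"Reply with up to {shortlist_size} candidate numbers, comma-separated.",
        judge_temp,
    )
    picked = []
    for token in verdict.replace(",", " ").split():
        if token.isdecimal() and int(token) < len(departures) and int(token) not in picked:
            picked.append(int(token))
    shortlist = [departures[i] for i in picked[:shortlist_size]]
    return IdeationResult(baseline, departures, shortlist)

def fake_generate(prompt: str, temperature: float) -> str:
    """Offline stand-in for a real model client."""
    first_line = prompt.splitlines()[0]
    if first_line.startswith("[judge]"):
        return "0, 2"
    if first_line.startswith("[dream]"):
        return f"dream idea {first_line.split()[-1]} (T={temperature})"
    return f"baseline idea (T={temperature})"

result = ideate("Name a budgeting app", fake_generate)
print(result.baseline)
print(result.shortlist)
```

The shortlist goes to a human. The only thing the judge decides is what is worth a person's attention.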

Where this changes how you build

If you're deploying agents into a workflow that includes creative judgment, the implementation note is that you almost certainly want several agents, each with its own configuration, orchestrated together. Concretely:

Naming and copy variation: 5 to 10 candidates at temperature 1.1, deduplicated and ranked by a low-temperature judge, surfaced to a human as a shortlist of 3 with rationale for each.
Campaign concepting: a wake agent producing the on-brand baseline, a dream agent producing departures along axes the brief specifies (more emotional, more technical, more contrarian), a judge agent doing the filtering.
Research hypothesis generation: high temperature for the hypothesis, low temperature for the literature search and the falsifiability check.
Strategy memo framing: wake for the standard frame, dream for three contrarian frames, human picks which to develop.

None of this is exotic. It's just deliberate use of the parameter that's already there.
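As an example, the deduplicate-rank-shortlist step from the naming pattern might look like the sketch below. The `judge` callable stands in for a low-temperature model call that returns a score and a rationale. The toy judge, the normalization rule, and the candidate names are all made up for illustration.

```python
from typing import Callable, List, Tuple

# Hypothetical judge signature: name -> (score, rationale).
Judge = Callable[[str], Tuple[float, str]]

def normalize(name: str) -> str:
    """Collapse case, spacing, and punctuation so near-identical names match."""
    return "".join(ch for ch in name.lower() if ch.isalnum())

def shortlist(candidates: List[str], judge: Judge, size: int = 3) -> List[Tuple[str, float, str]]:
    """Deduplicate candidates, score each once, and return the top few with rationale."""
    seen = set()
    unique = []
    for name in candidates:
        key = normalize(name)
        if key and key not in seen:
            seen.add(key)
            unique.append(name.strip())
    judged = [(name, *judge(name)) for name in unique]
    # sort is stable, so ties keep their original generation order
    judged.sort(key=lambda row: row[1], reverse=True)
    return judged[:size]

def toy_judge(name: str) -> Tuple[float, str]:
    """Stand-in judge that simply prefers shorter names."""
    length = len(normalize(name))
    return (-float(length), f"{length} characters, easy to say and type")

# Pretend these came back from several calls at temperature 1.1.
raw = ["Ledgerly", "ledgerly ", "Coinleaf", "Tallybird", "Ledger-ly", "Sumo", "Quillfund"]

for name, score, rationale in shortlist(raw, toy_judge):
    print(f"{name}: {rationale}")
```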

The governance footnote

If you read the governance piece, you'll have noticed a tension. Higher-temperature agents are by definition less predictable, which makes them harder to govern. The resolution is straightforward: high-temperature agents should not have authority to act. They generate options. A human, or a low-temperature judge that a human reviews, picks. The governance posture becomes "the creative agent has zero action authority and produces only candidate outputs that a separate review step turns into actions." That's a defensible posture for any auditor. It's also the architecture that produces the best creative output anyway, so this is one of the rare cases where the safe answer and the right answer are the same.
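The "zero action authority" posture can be enforced in code instead of by convention. In this minimal sketch the creative agent can only produce `Candidate` objects, and the only function that acts accepts an `ApprovedAction`, which exists only after a named reviewer signs off. All type and function names here are hypothetical.

```python
from dataclasses import dataclass
from typing import List

@dataclass(frozen=True)
class Candidate:
    """Output of a high-temperature agent. Carries no authority to act."""
    text: str
    source_temperature: float

@dataclass(frozen=True)
class ApprovedAction:
    """Created only by the review step, with the reviewer recorded for audit."""
    text: str
    approved_by: str

def review(candidates: List[Candidate], chosen_index: int, reviewer: str) -> ApprovedAction:
    """Turn one candidate into an action once a named human has picked it."""
    if not reviewer.strip():
        raise PermissionError("a named reviewer is required to approve an action")
    chosen = candidates[chosen_index]
    return ApprovedAction(text=chosen.text, approved_by=reviewer)

def execute(action: ApprovedAction) -> str:
    """The only place anything happens. Refuses raw candidates."""
    if not isinstance(action, ApprovedAction):
        raise PermissionError("only reviewed actions can be executed")
    return f"EXECUTED: {action.text} (approved by {action.approved_by})"

candidates = [
    Candidate("Run the 'quiet money' campaign concept", 1.2),
    Candidate("Run the 'loud savings' campaign concept", 1.2),
]
approved = review(candidates, chosen_index=0, reviewer="j.doe")
print(execute(approved))
```

Passing a `Candidate` straight to `execute` raises `PermissionError`, which is the property an auditor would want to see tested.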

What to take from this

If your team is shipping agents and the default temperature is 0 or 0.2 across the board, you're optimizing for one half of the work. The other half, the half that involves discovery and unexpected connections, is being left on the table. The fix isn't to raise the temperature on your customer-service agent. It's to recognize that "deploying agents" is plural, and the right configuration depends on whether you're asking the agent to be reliable or to be interesting.

Pick which one each agent is for. Set the parameters accordingly. And separate the divergent and convergent phases architecturally, so a single agent isn't trying to do both jobs and failing at both.

Sources

1. Temperature parameter mechanics and probability distribution shaping. WaterCrawl, October 2025.
2. Practical temperature ranges for creative versus deterministic work. Field Guide to AI, February 2026.
3. AI-assisted vs human-only creative teams in design tasks. PMC, AI-assisted design synthesis study.
4. ReMIND framework for separating exploration, evaluation, and consolidation across multiple LLM instances. arXiv preprint, 2026.
5. Top-p, top-k, and temperature interactions for output control. Codefinity, generative model sampling guide.
