LLM Integration in Production: What Nobody Tells You

Every developer has shipped an AI demo that impressed the room. The model responded brilliantly to the prepared questions, the latency was acceptable, and the outputs looked exactly right. Then real users showed up.

They asked questions nobody anticipated. They pasted in inputs ten times longer than expected. They triggered the model in ways that produced confident nonsense. The latency that was "fine" in testing became unacceptable at 3am when the infrastructure was under load.

This is the gap between LLM demos and LLM products. Here's what fills it.

Prompts Are Code. Treat Them That Way.

The biggest operational mistake teams make is treating prompts as strings. They live in a config file, maybe a database — somewhere that isn't version-controlled, isn't tested, and isn't reviewed.

When a prompt changes and quality degrades, nobody knows what changed or when. When a new model version behaves differently, there's no baseline to compare against.

Fix this by:

Storing prompts in version control, reviewed like code
Writing evaluation tests that run on every prompt change
Tagging prompt versions so you can correlate model behaviour with specific prompt revisions
Never deploying a prompt change without running evals first

This sounds bureaucratic until you've spent two days debugging a regression that turned out to be a single word change in a system prompt.

The Context Window Is Not Free

Every token you send to the model costs money and adds latency. Most implementations are profligate with context — they shove in everything that might be relevant and hope for the best.

A disciplined approach to context management:

Summarise conversation history. Don't append every prior message. Maintain a rolling summary and only include recent exchanges verbatim.
Retrieve, don't stuff. If you're providing documents, use retrieval (RAG) to select the most relevant passages. Don't send the whole document.
Trim the system prompt. System prompts bloat over time as teams keep adding instructions. Audit yours regularly. Every sentence should earn its place.
Set hard limits. Implement maximum input lengths with graceful degradation, not errors.

Latency Is a Product Decision

The p99 latency of a streaming LLM call can be 10–30x the median. Your 500ms median response becomes a 10-second response for a meaningful percentage of users. If your product can't tolerate that, you need an architecture that doesn't depend on synchronous model calls.

Patterns for latency-sensitive applications:

Streaming responses. Start rendering output as it arrives. Users perceive streamed responses as faster even when total time is identical.
Speculative caching. Pre-generate likely responses for common queries. Cache at the semantic level, not the exact string level.
Async with callbacks. For non-interactive workloads, process asynchronously and notify when complete.
Model tiering. Use fast, cheap models for simple classification and routing. Reserve heavy models for reasoning.

Failure Modes You Need to Handle

Hallucination Under Pressure

Models hallucinate more when context is noisy, queries are ambiguous, or the model is asked about things outside its knowledge. In production:

For factual applications, always ground responses in retrieved documents and instruct the model to decline if the answer isn't in the context
Implement output validation — if the response doesn't match expected structure or contains signals of uncertainty, flag it for review or fall back gracefully
Never let the model generate critical information (prices, legal terms, medical advice) without a validation step

Prompt Injection

If your application takes user input and incorporates it into prompts, you're vulnerable to prompt injection — users crafting inputs designed to override your instructions.

Mitigations:

Separate user content from system instructions clearly, using model-specific delimiters
Validate and sanitise user inputs before they reach the model
Use models with built-in injection resistance where available
Monitor outputs for signs of instruction override

Vendor Outages

Every major LLM provider has had multi-hour outages. If your product is entirely dependent on a single provider, your product is down when they're down.

Implement:

Fallback providers configured and tested (not just configured)
Circuit breakers that detect degraded performance and switch automatically
Graceful degradation — what does your product do when AI is unavailable?

Evaluation: The Part Everyone Skips

Evaluation is the discipline that separates teams that improve reliably from teams that guess. A minimal production eval suite includes:

Regression tests. A curated set of inputs with known-good outputs. Run automatically on every change. Fail the build if quality drops.
Adversarial tests. Inputs designed to break the model — edge cases, unusual formats, malicious inputs. Your model should handle them gracefully.
Human evaluation cadence. A regular process (weekly, fortnightly) where a human reviews a sample of real production outputs and rates quality. This is where you catch slow quality drift.

Most teams skip evaluation because it feels slow. It's actually the fastest path to a reliable product — the time you invest upfront is far less than the time you'll spend debugging regressions.

The Reliability Checklist

Before you call an LLM integration production-ready:

[ ] Prompts are version-controlled and tested
[ ] Context inputs are bounded with hard limits
[ ] Response streaming is implemented for interactive use cases
[ ] At least one fallback provider is configured and tested
[ ] Output validation is in place for safety-critical paths
[ ] A regression eval suite exists and runs in CI
[ ] Latency and cost are monitored with alerting
[ ] The product degrades gracefully when the model is unavailable

None of these are complex. Collectively, they're the difference between a prototype and a product.

Framz helps engineering teams build reliable, production-grade AI systems. Start a conversation if you're navigating these decisions.