As an engineer who's built several LLM applications from prototype to production, I've noticed something interesting: everyone talks about model selection and prompt engineering, but hardly anyone discusses the true complexity of LLM-powered systems.
After months of building, breaking, and rebuilding these systems, I've come to a realization:
The hardest part of working with LLMs isn't the model or the prompt—it's integrating unpredictable components into systems that expect predictability.
Welcome to Probability Land
Traditional software development gives us the comfort of determinism. Functions return consistent outputs for the same inputs. Edge cases can be mapped and handled. Testing is straightforward.
LLMs shatter this paradigm completely.
The first time I saw this in action was during a customer support automation project. We had a carefully engineered prompt that worked beautifully in testing. Then we deployed to production, and responses slowly began to drift. Same inputs produced increasingly different outputs. The system that passed all our tests was now telling customers to "contact support"... while acting as support.
Why? Because we were treating a statistical system as if it were deterministic.
The Hidden Complexities
Based on my experience building and shipping LLM applications, here are the challenges that no one adequately prepares you for:
1. Systems Design Meets Chaos Theory
Each LLM call introduces variability. Chain multiple LLMs together (like in a typical agent architecture), and uncertainties compound. Your system doesn't just have edge cases—it has edge dimensions.
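A rough back-of-the-envelope illustrates the compounding: if each of five chained calls behaves as intended 98% of the time, the whole chain does so only about 0.98^5 ≈ 90% of the time, and that's before you account for correlated failures.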
An e-commerce chatbot I built would occasionally go from "Here are some product recommendations" to complex philosophical musings about consumerism. Not because the prompt was flawed, but because probability distributions occasionally produce outlier responses.
2. Testing What Cannot Be Tested
How do you unit test a component with built-in randomness?
In a financial analysis tool, our test suite would pass even when the LLM produced substantially wrong answers that merely "looked right." We eventually built an evaluation framework with:
Reference-based testing (comparing to human-written exemplars)
Constraint validation (checking if outputs satisfied business rules)
Statistical confidence measurements across multiple runs
Supervised spot-checking of edge cases
None of these approaches fully solved the problem.
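Still, to ground two of those layers, here's a minimal sketch of constraint validation combined with statistical measurement across runs. Note that `call_llm` and the specific rules are hypothetical stand-ins, not our production code:

```python
def call_llm(prompt: str) -> str:
    """Hypothetical stand-in for the provider SDK call."""
    raise NotImplementedError

def satisfies_constraints(output: str) -> bool:
    """Deterministic business-rule checks (illustrative rules only)."""
    return (
        "guaranteed return" not in output.lower()  # compliance phrasing ban
        and len(output) < 2000                     # response length budget
    )

def constraint_pass_rate(prompt: str, runs: int = 20) -> float:
    """Estimate how often a prompt yields a rule-satisfying output.
    One passing run tells you almost nothing about a stochastic system,
    so we measure a rate over repeated calls instead."""
    passes = sum(satisfies_constraints(call_llm(prompt)) for _ in range(runs))
    return passes / runs
```

The pass rate then becomes a regression signal: a prompt change that drops it from, say, 0.95 to 0.80 fails the build, even when any single output looks plausible.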
3. The Invisible Infrastructure
Building production LLM applications requires an entire ecosystem that's rarely discussed:
Caching layers to reduce costs and latency
Fallback mechanisms when models fail or timeout
Observability systems to track performance drift
Prompt versioning to manage changes across environments
Evaluation pipelines for continuous quality monitoring
For a document processing application, this "invisible infrastructure" was 3x larger than the actual application code—yet it's almost never mentioned in LLM tutorials.
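To give a flavor of that ecosystem, here's a stripped-down sketch of the caching and fallback layer. `call_llm` is a hypothetical wrapper around the provider SDK, and exact-match caching like this only pays off when identical prompts recur:

```python
import hashlib
import time

_cache: dict[str, str] = {}

def call_llm(prompt: str) -> str:
    """Hypothetical provider SDK call; raises on timeout or API error."""
    raise NotImplementedError

def resilient_llm_call(prompt: str, retries: int = 2) -> str:
    """Serve repeated prompts from cache; back off and retry on failure;
    degrade to a canned response instead of crashing the request."""
    key = hashlib.sha256(prompt.encode()).hexdigest()
    if key in _cache:
        return _cache[key]
    for attempt in range(retries + 1):
        try:
            result = call_llm(prompt)
            _cache[key] = result
            return result
        except Exception:
            time.sleep(2 ** attempt)  # exponential backoff between attempts
    return "Sorry, I can't answer that right now."  # explicit degraded mode
```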
4. The Psychological Barrier
Users approach AI with fundamentally different expectations than other software.
When Google Maps gives bad directions, users blame the software. When an LLM gives bad information, users often feel personally misled. The psychological contract is different.
We built a financial records assistant that was technically correct 95% of the time—better than our previous rule-based system. Yet user satisfaction dropped because the 5% of errors felt like betrayals rather than bugs.
Embracing Probabilistic Design
After much trial and error, I've found that successful LLM applications embrace their probabilistic nature rather than fighting it:
1. Design for Uncertainty
Present multiple options instead of single answers
Include confidence scores when possible
Create explicit feedback loops for correction
Set clear expectations about capabilities and limitations
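One way to bake the first two of those ideas into the code itself is to make the response type carry uncertainty rather than returning a bare string. This `AssistantResponse` contract is a hypothetical illustration, not a standard API:

```python
from dataclasses import dataclass, field

@dataclass
class AssistantResponse:
    """A response contract that surfaces uncertainty instead of hiding it."""
    options: list[str]                 # several candidate answers, not one
    confidence: float                  # 0.0-1.0, e.g. from a self-rating pass
    caveats: list[str] = field(default_factory=list)

def render(resp: AssistantResponse) -> str:
    """Phrase the answer differently depending on confidence."""
    header = ("Here's what I found:" if resp.confidence >= 0.7
              else "I'm not certain, but here are some possibilities:")
    lines = [header]
    lines += [f"  {i}. {opt}" for i, opt in enumerate(resp.options, start=1)]
    lines += [f"  Note: {c}" for c in resp.caveats]
    return "\n".join(lines)
```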
2. Build Safety Nets, Not Guardrails
Instead of trying to prevent every possible error through prompting (an impossible task), build systems that (sketched in code below):
Detect when outputs drift from expected patterns
Gracefully handle uncertain responses
Have clear escalation paths for edge cases
Maintain human-in-the-loop options for critical decisions
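Here's a minimal sketch of the first two behaviors: a cheap vocabulary-overlap check as a drift detector, with `escalate_to_human` standing in for whatever your team's review workflow actually looks like:

```python
import re

def escalate_to_human(output: str) -> str:
    """Stand-in: route to a review queue and tell the user what happened."""
    return "I've passed this to a human colleague who will follow up."

def looks_on_topic(output: str, domain_terms: set[str]) -> bool:
    """Cheap drift detector: does the response use domain vocabulary?
    Crude, but it catches the 'philosophical musings' failure mode."""
    words = set(re.findall(r"[a-z']+", output.lower()))
    return len(words & domain_terms) >= 2

def dispatch(output: str, domain_terms: set[str], critical: bool) -> str:
    if not looks_on_topic(output, domain_terms):
        return escalate_to_human(output)  # drifted output never reaches users
    if critical:
        return escalate_to_human(output)  # human-in-the-loop for high stakes
    return output
```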
3. Think in Systems, Not Components
The most successful LLM applications I've built treat each model call as part of a broader system, with:
Multiple validation layers
Complementary deterministic components
Continuous evaluation feedback loops
Graceful degradation paths
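Put together, that composition can look something like the sketch below, where every helper (`build_prompt`, `schema_valid`, `rule_based_answer`, `log_for_evaluation`) is a placeholder for a component your own system would define:

```python
def call_llm(prompt: str) -> str: ...          # provider SDK call (stub)
def build_prompt(query: str) -> str: ...       # prompt template (stub)
def schema_valid(output: str) -> bool: ...     # deterministic validator (stub)
def rule_based_answer(query: str) -> str: ...  # non-LLM fallback (stub)
def log_for_evaluation(q: str, a: str) -> None: ...  # eval pipeline hook (stub)

def answer(query: str) -> str:
    """One probabilistic step wrapped in deterministic checks and fallbacks."""
    draft = call_llm(build_prompt(query))      # probabilistic component
    if not schema_valid(draft):
        draft = call_llm(build_prompt(query))  # one retry before degrading
    if not schema_valid(draft):
        return rule_based_answer(query)        # graceful degradation path
    log_for_evaluation(query, draft)           # feeds continuous evaluation
    return draft
```

The important property is that the LLM call is never the last line of defense: a deterministic validator sits after it, and a non-LLM fallback sits after that.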
A New Development Paradigm
Building with LLMs requires a fundamental shift in how we approach software development:
From correctness to acceptability ranges
Rather than "Is this output correct?" ask "Is this output within acceptable parameters?"From testing to ongoing evaluation
Continuous monitoring matters more than pre-deployment testingFrom features to capabilities
Focus on the capability space rather than specific feature implementationsFrom linear pipelines to adaptive systems
Create systems that detect and correct their own shortcomings
Moving Forward
The teams that will succeed with LLM applications aren't those with the best prompts or the most expensive models—they're the ones building robust systems around uncertainty.
This isn't just a technical challenge. It's a fundamental rethinking of how we build software for an era where capabilities and correctness exist on probability curves rather than boolean flags.
The future belongs not to prompt engineers, but to uncertainty engineers—those who can design, build, and maintain systems that deliver value despite their inherently probabilistic nature.
What challenges have you faced building with LLMs? Share your experiences in the comments below.