Three-Tier AI Architecture — Deterministic Rules, ML Models, and LLM Agents
A practical framework for deciding when to use hard-coded rules, trained ML models, or LLM agents in production AI systems — with real examples from enterprise SaaS.
For production AI features, I’ve developed a three-tier approach that balances reliability with intelligence. Too many teams jump straight to LLMs for everything, when simpler and more predictable approaches would serve better. Conversely, teams that avoid AI entirely miss enormous opportunities. The key is knowing which tier to apply where.
After building AI-powered features across multiple products — from enterprise SaaS to health tech — this framework has consistently guided good architectural decisions.
Tier 1: Deterministic Rules
When to use: Business logic that must be exact, auditable, and 100% predictable.
Compliance checks, validation rules, regulatory calculations, data transformations with known schemas, threshold-based alerts. These should never be delegated to probabilistic models. They’re fast, testable, auditable, and predictable. When a regulator asks “why did you do X?”, the answer needs to be a clear rule, not “the model thought so.”
Characteristics:
- Zero ambiguity in inputs and outputs
- Must be explainable to auditors and regulators
- Performance critical (sub-millisecond)
- Behavior must be identical across runs
- Changes require explicit human approval
Real Examples (Order Management):
- Price calculation: Line item quantity * unit price = expected total. No AI needed, no AI wanted.
- Duplicate detection (exact match): Same order number + same customer + same amount = flag as duplicate. Deterministic hash comparison.
- Order validation: Requested quantity must not exceed available inventory, and the confirmed allocation must fall within the configured tolerance. Pure arithmetic.
- Compliance checks: Is this customer on the sanctions list? Is the transaction within the approved threshold for this approver? Binary lookups.
- Format validation: Does this order have all required fields? Is the date parseable? Is the currency code valid?
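In code, Tier 1 checks are plain functions. A minimal sketch of the duplicate-detection and format-validation examples above (the field names and currency list are illustrative, not from a real schema):

```python
import hashlib
from datetime import datetime

REQUIRED_FIELDS = {"order_number", "customer_id", "amount", "currency", "date"}
VALID_CURRENCIES = {"USD", "EUR", "GBP"}  # illustrative subset

def validate_format(order: dict) -> list[str]:
    """Return a list of deterministic validation errors (empty list = valid)."""
    errors = [f"missing field: {f}" for f in REQUIRED_FIELDS - order.keys()]
    if "currency" in order and order["currency"] not in VALID_CURRENCIES:
        errors.append(f"invalid currency: {order['currency']}")
    if "date" in order:
        try:
            datetime.fromisoformat(order["date"])
        except ValueError:
            errors.append(f"unparseable date: {order['date']}")
    return errors

def duplicate_key(order: dict) -> str:
    """Deterministic hash over order number + customer + amount for exact-match dedup."""
    raw = f"{order['order_number']}|{order['customer_id']}|{order['amount']}"
    return hashlib.sha256(raw.encode()).hexdigest()
```

Every behavior here is exact and repeatable: the same order always produces the same errors and the same duplicate key, which is precisely what an auditor wants to hear.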
Implementation:
- Standard business logic in your application code
- Rule engines (Drools, custom DSL) for complex rule sets that business users configure
- Database constraints and triggers for data integrity
- Configuration-driven (not code-deployed) where business users need to adjust thresholds
Cost: Essentially zero per transaction. Development cost is the rule logic itself.
Tier 2: Trained ML Models
When to use: Pattern recognition on structured data where you have historical ground truth and the problem has a measurable, bounded outcome.
Classification, scoring, anomaly detection, recommendation, forecasting. These models are trained on your specific data, evaluated with standard ML metrics, and deployed with monitoring. They’re more capable than rules but still bounded and measurable.
Characteristics:
- You have labeled training data (or can generate it)
- The output is a classification, score, or prediction — not free text
- You can define and measure accuracy, precision, recall
- The model needs to improve over time as it sees more data
- Latency requirements are moderate (milliseconds to low seconds)
Real Examples (Order Management):
- Fraud scoring: This transaction has a 94% probability of being fraudulent based on 47 features (amount anomaly, customer behavior change, timing patterns, device fingerprint irregularities). Trained on historically confirmed fraud cases.
- Document classification: Is this a new order, return request, support ticket, or contract amendment? Trained classifier on document structure and content features.
- Customer risk scoring: Composite score based on payment history, account activity patterns, geographic risk factors. Gradient boosted model on structured customer data.
- Amount anomaly detection: This order amount is 3.7 standard deviations from the customer’s historical mean. Statistical model, not LLM.
- Demand forecasting: Predicted order volume by category for the next 30 days. Time series model for capacity planning.
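The amount-anomaly example above is nothing more than a z-score against the customer's history; a minimal sketch using only the standard library (the 3.0 threshold is illustrative):

```python
from statistics import mean, stdev

def amount_zscore(history: list[float], new_amount: float) -> float:
    """How many standard deviations the new amount sits from the historical mean."""
    mu, sigma = mean(history), stdev(history)
    return (new_amount - mu) / sigma

def is_anomalous(history: list[float], new_amount: float, threshold: float = 3.0) -> bool:
    """Flag amounts beyond the configured number of standard deviations."""
    return abs(amount_zscore(history, new_amount)) > threshold
```

This is the "statistical model, not LLM" point in miniature: a few lines, sub-millisecond, and trivially explainable.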
Implementation:
- XGBoost/LightGBM for tabular data (still the best choice in 2026 for structured features)
- PyTorch for complex patterns (image-based document classification, sequence models)
- Feature stores (Feast, Tecton) for consistent feature serving
- MLflow or W&B for experiment tracking and model registry
- Monitoring: Evidently AI or custom dashboards for drift detection
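For the drift monitoring mentioned above, a population stability index over score buckets is often enough to start. A pure-Python sketch (the bucket count, the 1e-6 floor, and the rule-of-thumb thresholds are conventional choices, not tied to any specific library):

```python
from math import log

def psi(expected: list[float], actual: list[float], buckets: int = 10) -> float:
    """Population Stability Index between a baseline and a live score distribution.
    Rule of thumb: < 0.1 stable, 0.1-0.25 moderate drift, > 0.25 significant drift."""
    lo, hi = min(expected), max(expected)
    edges = [lo + (hi - lo) * i / buckets for i in range(1, buckets)]

    def proportions(values: list[float]) -> list[float]:
        counts = [0] * buckets
        for v in values:
            counts[sum(v > e for e in edges)] += 1  # bucket index via edge count
        return [max(c / len(values), 1e-6) for c in counts]  # floor avoids log(0)

    return sum((a - e) * log(a / e)
               for e, a in zip(proportions(expected), proportions(actual)))
```

Run it daily over the model's input features and output scores; a sustained jump is the signal to investigate and retrain.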
Cost: Training compute (periodic), inference compute (low per prediction), and ongoing data labeling.
Tier 3: LLM Agents
When to use: Natural language understanding, complex reasoning, multi-step workflows, and problems where the input space is too broad or unstructured for the first two tiers.
LLMs shine when you need flexibility, creativity, or the ability to handle novel inputs gracefully. But they require guardrails, evaluation frameworks, and human-in-the-loop fallbacks.
Characteristics:
- Inputs are unstructured (natural language, varied document formats)
- The task requires reasoning, not just pattern matching
- Novel inputs are expected and must be handled gracefully
- Quality can be evaluated but not reduced to a simple metric
- Cost and latency tolerance is higher
Real Examples (Order Management):
- Document data extraction: Read an unstructured PDF contract and extract customer name, reference number, date, line items, pricing, terms. The variation in document formats makes this impractical with rules or traditional ML. LLMs handle novel formats zero-shot.
- Exception resolution assistance: “This order from Acme Corp has a 15% price discrepancy on line item 3. Here’s the contract, the rate card, and the customer’s pricing agreement. What’s the likely explanation and recommended action?” Multi-document reasoning.
- Customer communication: Draft an email to a customer explaining a billing adjustment, referencing the specific discrepancy and relevant order numbers. Natural language generation with domain context.
- Category assignment for ad-hoc orders: “This is an order for ‘strategic advisory services Q4 2025.’ Based on historical categorization patterns and the product catalog, recommend the category and cost center.” Requires understanding both the order content and the organizational structure.
- Support query resolution: “Why was my order flagged for review?” Requires understanding the specific order’s history, the applicable business rules, and explaining in plain language.
Implementation:
- Claude Opus/Sonnet via API for complex reasoning tasks
- GPT-4o or Gemini 2.5 Flash for high-volume, lower-complexity tasks
- MCP servers for tool integration (database lookups, API queries, email sending)
- Structured output (JSON schema) for reliable data extraction
- RAG pipeline for grounding responses in organizational data
- Guardrails: output validation, confidence thresholds, human-in-the-loop for high-stakes decisions
- Evaluation: automated test suites, LLM-as-judge, human sampling
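The structured-output and guardrail bullets combine naturally into a validation gate sitting between the LLM and the rest of the system. A sketch, assuming a hypothetical extraction payload shape with a model-reported confidence field (the required fields and 0.85 floor are illustrative):

```python
REQUIRED = {"customer_name", "reference_number", "date", "line_items"}
CONFIDENCE_FLOOR = 0.85  # illustrative threshold; tune per task

def gate_extraction(payload: dict) -> tuple[str, list[str]]:
    """Decide whether an LLM extraction is auto-accepted or routed to human review.
    Returns (decision, reasons)."""
    reasons = [f"missing field: {f}" for f in sorted(REQUIRED - payload.keys())]
    if payload.get("confidence", 0.0) < CONFIDENCE_FLOOR:
        reasons.append("confidence below floor")
    for i, item in enumerate(payload.get("line_items", [])):
        if item.get("quantity", 0) <= 0 or item.get("unit_price", 0) < 0:
            reasons.append(f"implausible values in line item {i}")
    return ("human_review" if reasons else "auto_accept", reasons)
```

Note that the gate itself is Tier 1 code: deterministic checks wrapped around a probabilistic component, so the failure mode is always "a human looks at it," never "bad data flows downstream."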
Cost: Significantly higher per transaction than Tiers 1 and 2. Token costs, latency (seconds, not milliseconds), and evaluation infrastructure.
The Decision Framework
The hierarchy is intentional: always start at Tier 1 and only move up when the problem genuinely requires it. Each tier adds capability but also adds complexity, cost, latency, and unpredictability.
Can the problem be solved with deterministic rules?
├── YES → Tier 1. Stop here. Don't over-engineer.
└── NO → Is there structured data with labeled outcomes?
    ├── YES → Tier 2. Train a model. Monitor it.
    └── NO → Does it require language understanding or reasoning?
        ├── YES → Tier 3. Use an LLM with guardrails.
        └── NO → Rethink the problem. Maybe it doesn't need AI.
Common Mistakes
Over-tiering: Using an LLM for something a regex could handle. I’ve seen teams use GPT-4 to validate email formats. Don’t.
Under-tiering: Writing 5,000 rules to handle something that’s inherently a pattern recognition problem. If your rule set has grown to hundreds of if/else branches with diminishing accuracy, it’s time for Tier 2.
Skipping Tier 2: Going straight from rules to LLMs because ML “seems harder.” Trained models are dramatically cheaper, faster, and more predictable than LLMs for classification and scoring tasks. The investment in training data and MLOps pays back quickly.
No fallback chain: The best production systems use tiers as a fallback chain. Tier 1 handles the easy cases (60% of volume). Tier 2 handles the pattern-matchable cases (30%). Tier 3 handles the complex remainder (10%). The LLM only sees the cases that genuinely need its capabilities, which keeps costs manageable and quality high.
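The fallback chain itself is a few lines of orchestration. A sketch with stubbed tiers (the function signatures and the 0.9 confidence threshold are illustrative assumptions, not a fixed interface):

```python
def fallback_chain(order, tier1_rules, tier2_model, tier3_llm,
                   confidence_threshold: float = 0.9):
    """Cascade: rules first, then the ML model when it is confident,
    and the LLM only for the remainder. Returns (decision, tier_used).
    tier1_rules returns a decision or None; tier2_model returns (decision, confidence)."""
    decision = tier1_rules(order)
    if decision is not None:
        return decision, "tier1"
    decision, confidence = tier2_model(order)
    if confidence >= confidence_threshold:
        return decision, "tier2"
    return tier3_llm(order), "tier3"
```

With volume split roughly 60/30/10 across the tiers as described above, the expensive LLM call only runs for the last slice, which is what keeps per-transaction cost flat as volume grows.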
Tier Interactions in Practice
The tiers don’t operate in isolation. In a real system:
- Order arrives → Tier 1 validates format, checks for exact-match duplicates
- Data extraction → Tier 3 (LLM) extracts structured data from unstructured document
- Fraud scoring → Tier 2 (ML model) scores the extracted data against historical patterns
- Data validation → Tier 1 performs cross-reference checks with deterministic rules
- Exception routing → Tier 1 (rules) routes based on exception type and configured workflows
- Exception resolution → Tier 3 (LLM) assists human reviewers with analysis and recommendations
- Approval → Tier 1 (rules) enforces approval matrix based on amount and entity
Each tier does what it’s best at. The orchestration layer is deterministic (Tier 1) — you always want predictable control flow, even when individual steps use ML or LLMs.
Cost and Performance Comparison
| | Tier 1: Rules | Tier 2: ML | Tier 3: LLM |
|---|---|---|---|
| Latency | <1ms | 1-100ms | 1-30s |
| Cost/transaction | ~$0 | ~$0.001 | $0.01-$0.50 |
| Accuracy | 100% (for defined cases) | 90-99% (measurable) | 85-98% (harder to measure) |
| Novel input handling | Fails on undefined cases | Degrades gracefully | Handles novel inputs well |
| Explainability | Perfect | Moderate (SHAP, LIME) | Low (can explain, but may confabulate) |
| Setup cost | Low | Medium (data + training) | Low (API call) |
| Maintenance | Rule updates | Retraining, monitoring | Prompt versioning, eval |
The best production AI systems use all three tiers working together, with clear boundaries and handoff points between them. Start simple, add intelligence only where it’s needed, and always have a deterministic fallback.