LLM Integration Best Practices: From API to Production System
The Challenge: From Playground to Production
The leap from initial ChatGPT experiments to a production LLM system is bigger than many expect. At innFactory, we have implemented numerous LLM integrations for clients and identified proven patterns along the way.
1. Prompt Engineering for Production
Structuring System Prompts Correctly
A production system prompt should contain:
```
[Role & Context]
You are a customer service assistant for [Company].

[Task]
Answer customer inquiries based on the knowledge base.

[Constraints]
- Answer only in English
- Do not invent information
- Refer to support when uncertain

[Format]
- Use short paragraphs
- Use bullet points for lists
- Maximum 200 words
```
Few-Shot Examples
For consistent outputs:
```json
{
  "examples": [
    {
      "input": "How can I change my password?",
      "output": "You can change your password in the account settings: 1. Click on 'Profile'..."
    }
  ]
}
```
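At call time, these examples are typically prepended to the conversation as alternating user/assistant turns. A minimal sketch, assuming the examples live in a local examples.json file with the structure shown above:

```python
import json

def build_messages(system_prompt: str, user_input: str, path: str = "examples.json") -> list[dict]:
    # Few-shot examples become alternating user/assistant turns before the real question
    with open(path) as f:
        examples = json.load(f)["examples"]
    messages = [{"role": "system", "content": system_prompt}]
    for example in examples:
        messages.append({"role": "user", "content": example["input"]})
        messages.append({"role": "assistant", "content": example["output"]})
    messages.append({"role": "user", "content": user_input})
    return messages
```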
2. Robust Error Handling
LLM APIs are not always reliable. Implement:
Retry Logic with Exponential Backoff
```python
import openai
from tenacity import retry, stop_after_attempt, wait_exponential

@retry(
    stop=stop_after_attempt(3),
    wait=wait_exponential(multiplier=1, min=2, max=10)
)
def call_llm(prompt: str) -> str:
    return openai.chat.completions.create(...)
```
Fallback Strategies
```
Primary: GPT-4 Turbo
    ↓ (on error)
Fallback 1: GPT-3.5 Turbo
    ↓ (on error)
Fallback 2: Predefined response + Escalation
```
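A minimal sketch of this escalation ladder; the model list, the predefined text, and the escalate_to_support hand-off are placeholders for your own setup:

```python
import openai

FALLBACK_TEXT = "We cannot answer this automatically right now. A support agent will follow up."

def escalate_to_support(prompt: str) -> None:
    # Placeholder: in a real system this would open a ticket or notify an agent
    pass

def complete_with(model: str, prompt: str) -> str:
    response = openai.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

def answer_with_fallback(prompt: str) -> str:
    for model in ("gpt-4-turbo", "gpt-3.5-turbo"):
        try:
            return complete_with(model, prompt)
        except Exception:
            continue  # fall through to the next rung of the ladder
    escalate_to_support(prompt)
    return FALLBACK_TEXT
```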
Timeout Management
- API timeout: 30-60 seconds
- User-facing timeout: 10-15 seconds
- Use streaming for long responses
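A minimal sketch of how both budgets could be enforced with the official openai Python client; the concrete values, the helper names, and the holding message are assumptions:

```python
import asyncio
from openai import OpenAI

# Hard API timeout: the HTTP request is aborted after 30 seconds (client-level setting)
client = OpenAI(timeout=30.0)

def call_model(prompt: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4-turbo",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

async def answer_with_deadline(prompt: str) -> str:
    try:
        # User-facing timeout: stop waiting after 15 seconds and show a holding message
        return await asyncio.wait_for(asyncio.to_thread(call_model, prompt), timeout=15.0)
    except asyncio.TimeoutError:
        # Note: only the wait is abandoned; the underlying request finishes in its thread
        return "This is taking longer than usual. Please try again in a moment."
```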
3. Cost Optimization
LLM costs can escalate quickly. Our strategies:
Token Management
```python
import tiktoken

# Limit prompt length
def truncate_context(context: str, max_tokens: int = 3000) -> str:
    encoder = tiktoken.encoding_for_model("gpt-4")
    tokens = encoder.encode(context)
    if len(tokens) > max_tokens:
        return encoder.decode(tokens[:max_tokens])
    return context
```
Caching
```
Query → Hash → Cache Lookup
                   ↓ (Miss)
               LLM Call → Cache Store → Response
```
At innFactory, we use Redis or DynamoDB for semantic caching: identical questions are answered only once.
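A minimal sketch of the hash-based lookup with Redis, assuming a local redis-py client and a call_llm(prompt) helper like the retry-wrapped one above that returns the response text; key prefix and TTL are illustrative:

```python
import hashlib
import redis

cache = redis.Redis(host="localhost", port=6379, decode_responses=True)

def cached_llm_call(prompt: str, ttl_seconds: int = 86400) -> str:
    # Hash the normalized prompt to get a stable cache key
    key = "llm:" + hashlib.sha256(prompt.strip().lower().encode()).hexdigest()
    hit = cache.get(key)
    if hit is not None:
        return hit  # cache hit: no LLM call, no token cost
    answer = call_llm(prompt)  # cache miss: one LLM call, then store the result
    cache.set(key, answer, ex=ttl_seconds)
    return answer
```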
Model Routing
Not every request needs GPT-4:
| Request Type | Recommended Model | Relative Cost |
|---|---|---|
| Simple FAQ | GPT-3.5 Turbo | €€ |
| Complex Analysis | GPT-4 Turbo | €€€€ |
| Classification | Fine-tuned GPT-3.5 | € |
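A routing function mirroring this matrix might look like the sketch below; the request-type labels and the fine-tuned model ID are placeholders:

```python
def select_model(request_type: str) -> str:
    # Route cheap, well-understood request types away from the most expensive model
    routing = {
        "faq": "gpt-3.5-turbo",
        "analysis": "gpt-4-turbo",
        "classification": "ft:gpt-3.5-turbo:acme:support:0001",  # placeholder fine-tune ID
    }
    return routing.get(request_type, "gpt-3.5-turbo")
```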
4. Security & Compliance
Prompt Injection Prevention
```python
import re

def sanitize_user_input(text: str) -> str:
    # Remove known injection patterns
    patterns = [
        r"ignore.*instructions",
        r"disregard.*above",
        r"system:.*"
    ]
    for pattern in patterns:
        text = re.sub(pattern, "", text, flags=re.IGNORECASE)
    return text
```
PII Handling
For GDPR compliance:
- Input Filtering: Mask PII before the API call
- Output Filtering: Detect unwanted data in responses
- Logging: No personal data in logs
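A sketch of regex-based input masking; the patterns below are deliberately simplistic placeholders, and production setups usually rely on a dedicated PII detection library or service:

```python
import re

# Illustrative patterns only - real deployments need broader coverage
PII_PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "iban": re.compile(r"\b[A-Z]{2}\d{2}[A-Z0-9]{11,30}\b"),
}

def mask_pii(text: str) -> str:
    # Replace detected PII with a label before the text is sent to the API
    for label, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"[{label.upper()}]", text)
    return text
```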
Audit Trail
```json
{
  "timestamp": "2025-01-15T10:30:00Z",
  "user_id": "user_123",
  "model": "gpt-4-turbo",
  "prompt_hash": "abc123...",
  "tokens_used": 1500,
  "response_hash": "def456..."
}
```
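One way such a record could be produced without persisting raw prompts or responses; the logger name and hashing choice are assumptions:

```python
import hashlib
import json
import logging
from datetime import datetime, timezone

logger = logging.getLogger("llm_audit")

def write_audit_record(user_id: str, model: str, prompt: str, response: str, tokens_used: int) -> None:
    # Only hashes are stored, so the audit trail itself contains no prompt or response text
    record = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "user_id": user_id,
        "model": model,
        "prompt_hash": hashlib.sha256(prompt.encode()).hexdigest(),
        "tokens_used": tokens_used,
        "response_hash": hashlib.sha256(response.encode()).hexdigest(),
    }
    logger.info(json.dumps(record))
```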
5. Monitoring & Observability
Important Metrics
- Latency: P50, P95, P99 of response times
- Token Consumption: Per request and aggregated
- Error Rate: By error type (Rate Limit, Timeout, etc.)
- Quality: User feedback, thumbs up/down
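A sketch of capturing the first three with the prometheus_client library; the metric and label names are illustrative:

```python
from prometheus_client import Counter, Histogram

# Illustrative metric names - adapt them to your own conventions
LLM_LATENCY = Histogram("llm_request_seconds", "LLM response time in seconds", ["model"])
LLM_TOKENS = Counter("llm_tokens_total", "Tokens consumed", ["model"])
LLM_ERRORS = Counter("llm_errors_total", "LLM errors by type", ["model", "error_type"])

def record_call(model: str, duration_seconds: float, tokens: int) -> None:
    # P50/P95/P99 latencies are derived from the histogram buckets at query time
    LLM_LATENCY.labels(model=model).observe(duration_seconds)
    LLM_TOKENS.labels(model=model).inc(tokens)

def record_error(model: str, error_type: str) -> None:
    LLM_ERRORS.labels(model=model, error_type=error_type).inc()
```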
Alerting
```yaml
alerts:
  - name: LLM Error Rate High
    condition: error_rate > 5%
    window: 5 minutes
  - name: Cost Spike
    condition: daily_cost > 1.5 * avg_daily_cost
  - name: Latency Degradation
    condition: p95_latency > 10s
```
6. Architecture Patterns
Async Processing
For non-interactive applications:
```
User Request → Queue (SQS/RabbitMQ) → Worker → LLM → Result Store
      ↓
Immediate Response: "Your request is being processed"
```
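With SQS, for example, the enqueue step might look like the sketch below; the queue URL is a placeholder and boto3 with standard AWS credentials is assumed:

```python
import json
import uuid
import boto3

sqs = boto3.client("sqs")
QUEUE_URL = "https://sqs.eu-central-1.amazonaws.com/123456789012/llm-requests"  # placeholder

def submit_request(user_id: str, prompt: str) -> dict:
    job_id = str(uuid.uuid4())
    # Hand the prompt to a background worker; the actual LLM call happens asynchronously
    sqs.send_message(
        QueueUrl=QUEUE_URL,
        MessageBody=json.dumps({"job_id": job_id, "user_id": user_id, "prompt": prompt}),
    )
    return {"job_id": job_id, "status": "Your request is being processed"}
```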
Streaming for UX
```python
from openai import AsyncOpenAI

client = AsyncOpenAI()

async def stream_response(prompt: str):
    stream = await client.chat.completions.create(
        model="gpt-4-turbo",
        messages=[{"role": "user", "content": prompt}],
        stream=True
    )
    async for chunk in stream:
        # The final chunk carries no content, so guard against None
        if chunk.choices[0].delta.content is not None:
            yield chunk.choices[0].delta.content
```
Conclusion
A successful LLM integration requires more than just API calls. Plan from the start for:
- Robust error handling
- Cost monitoring and control
- Security measures
- Quality measurement
At innFactory, we support you throughout the entire journey, from proof of concept to a scalable production system.
Planning an LLM project? Talk to us about your requirements.

Tobias Jonas


