LLM Integration Best Practices: From API to Production System

Tobias Jonas · 3 min read

The Challenge: From Playground to Production

The leap from initial ChatGPT experiments to a production LLM system is bigger than many expect. At innFactory, we have implemented numerous LLM integrations for clients and identified proven patterns along the way.

1. Prompt Engineering for Production

Structuring System Prompts Correctly

A production system prompt should contain:

[Role & Context]
You are a customer service assistant for [Company]. 

[Task]
Answer customer inquiries based on the knowledge base.

[Constraints]
- Answer only in English
- Do not invent information
- Refer to support when uncertain

[Format]
- Use short paragraphs
- Use bullet points for lists
- Maximum 200 words

Few-Shot Examples

For consistent outputs:

{
  "examples": [
    {
      "input": "How can I change my password?",
      "output": "You can change your password in the account settings: 1. Click on 'Profile'..."
    }
  ]
}
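
One way to apply such stored examples is to prepend them to the message list as simulated user/assistant turns before the actual question. A minimal sketch; build_messages is an illustrative helper, not part of the OpenAI SDK:

def build_messages(system_prompt: str, examples: list[dict], user_input: str) -> list[dict]:
    messages = [{"role": "system", "content": system_prompt}]
    for example in examples:
        # Each example becomes a simulated user/assistant exchange
        messages.append({"role": "user", "content": example["input"]})
        messages.append({"role": "assistant", "content": example["output"]})
    messages.append({"role": "user", "content": user_input})
    return messages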

2. Robust Error Handling

LLM APIs are not always reliable. Implement the following safeguards:

Retry Logic with Exponential Backoff

import openai
from tenacity import retry, stop_after_attempt, wait_exponential

@retry(
    stop=stop_after_attempt(3),
    wait=wait_exponential(multiplier=1, min=2, max=10)
)
def call_llm(prompt: str) -> str:
    # Up to 3 attempts with exponential backoff between 2 and 10 seconds
    response = openai.chat.completions.create(...)
    return response.choices[0].message.content

Fallback Strategies

Primary: GPT-4 Turbo
  ↓ (on error)
Fallback 1: GPT-3.5 Turbo
  ↓ (on error)
Fallback 2: Predefined response + Escalation
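
A sketch of such a chain. The model names are examples; escalate_to_support is a placeholder, and call_llm is assumed to accept a model parameter:

FALLBACK_CHAIN = ["gpt-4-turbo", "gpt-3.5-turbo"]

def escalate_to_support(prompt: str) -> None:
    # Placeholder: open a ticket / notify a human agent
    ...

def answer_with_fallback(prompt: str) -> str:
    for model in FALLBACK_CHAIN:
        try:
            # call_llm is the retry-wrapped function from above
            return call_llm(prompt, model=model)
        except Exception:
            continue  # try the next model in the chain
    # Last resort: predefined response plus escalation
    escalate_to_support(prompt)
    return "We are currently unable to answer your request automatically. Our support team has been notified."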

Timeout Management

  • API timeout: 30-60 seconds
  • User-facing timeout: 10-15 seconds
  • Use streaming for long responses
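
Assuming the OpenAI Python SDK v1, both timeouts can be configured on the client while retries stay with tenacity from above; a minimal sketch:

from openai import OpenAI

# Hard limit on the API side (30-60 s); retries are handled by tenacity
client = OpenAI(timeout=60.0, max_retries=0)

# Tighter limit for interactive, user-facing calls
fast_client = client.with_options(timeout=15.0)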

3. Cost Optimization

LLM costs can escalate quickly. Our strategies:

Token Management

import tiktoken

# Limit prompt length
def truncate_context(context: str, max_tokens: int = 3000) -> str:
    encoder = tiktoken.encoding_for_model("gpt-4")
    tokens = encoder.encode(context)
    if len(tokens) > max_tokens:
        return encoder.decode(tokens[:max_tokens])
    return context

Caching

Query → Hash → Cache Lookup
          ↓ (Miss)
       LLM Call → Cache Store → Response

At innFactory, we use Redis or DynamoDB for semantic caching - identical questions are only answered once.
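
A minimal sketch of the exact-match variant from the diagram, using redis-py; the key prefix and TTL are illustrative, and call_llm is the retry-wrapped function from section 2:

import hashlib

import redis

cache = redis.Redis(host="localhost", port=6379, decode_responses=True)

def cached_llm_call(prompt: str, ttl_seconds: int = 3600) -> str:
    key = "llm:" + hashlib.sha256(prompt.encode("utf-8")).hexdigest()
    cached = cache.get(key)
    if cached is not None:
        return cached                          # cache hit: no LLM call needed
    response = call_llm(prompt)                # cache miss: call the model
    cache.set(key, response, ex=ttl_seconds)   # store with expiry
    return response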

Model Routing

Not every request needs GPT-4:

Request Type       | Recommended Model    | Cost
Simple FAQ         | GPT-3.5 Turbo        | €€
Complex Analysis   | GPT-4 Turbo          | €€€€
Classification     | Fine-tuned GPT-3.5   |
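
A simple router along these lines; the request-type labels and the fine-tune ID are placeholders, and call_llm is assumed to accept a model parameter:

MODEL_BY_REQUEST_TYPE = {
    "faq": "gpt-3.5-turbo",
    "analysis": "gpt-4-turbo",
    "classification": "ft:gpt-3.5-turbo:your-org::abc123",  # placeholder fine-tune ID
}

def route_request(request_type: str, prompt: str) -> str:
    # Unknown request types fall back to the cheaper default model
    model = MODEL_BY_REQUEST_TYPE.get(request_type, "gpt-3.5-turbo")
    return call_llm(prompt, model=model)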

4. Security & Compliance

Prompt Injection Prevention

import re

def sanitize_user_input(text: str) -> str:
    # Remove known injection patterns
    patterns = [
        r"ignore.*instructions",
        r"disregard.*above",
        r"system:.*"
    ]
    for pattern in patterns:
        text = re.sub(pattern, "", text, flags=re.IGNORECASE)
    return text

PII Handling

For GDPR compliance:

  1. Input Filtering: Mask PII before the API call (see the sketch after this list)
  2. Output Filtering: Detect unwanted data in responses
  3. Logging: No personal data in logs
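
A minimal input-filtering sketch that masks e-mail addresses and phone numbers before the API call; the patterns are deliberately simplified, and a dedicated PII library is usually the better choice in production:

import re

PII_PATTERNS = {
    "EMAIL": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "PHONE": re.compile(r"\+?\d[\d \-/()]{6,}\d"),
}

def mask_pii(text: str) -> str:
    # Replace detected PII with a placeholder label before the prompt is built
    for label, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text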

Audit Trail

{
  "timestamp": "2025-01-15T10:30:00Z",
  "user_id": "user_123",
  "model": "gpt-4-turbo",
  "prompt_hash": "abc123...",
  "tokens_used": 1500,
  "response_hash": "def456..."
}
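
Such a record can be built with hashes instead of raw text so that no prompt content ends up in the log; a sketch:

import hashlib
import json
from datetime import datetime, timezone

def audit_record(user_id: str, model: str, prompt: str, response: str, tokens_used: int) -> str:
    record = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "user_id": user_id,
        "model": model,
        # Hashes instead of raw text keep personal data out of the audit log
        "prompt_hash": hashlib.sha256(prompt.encode("utf-8")).hexdigest(),
        "tokens_used": tokens_used,
        "response_hash": hashlib.sha256(response.encode("utf-8")).hexdigest(),
    }
    return json.dumps(record)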

5. Monitoring & Observability

Important Metrics

  • Latency: P50, P95, P99 of response times
  • Token Consumption: Per request and aggregated
  • Error Rate: By error type (Rate Limit, Timeout, etc.)
  • Quality: User feedback, thumbs up/down
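
With prometheus_client, the first three metrics can be captured in a few lines; the metric names are illustrative:

import openai
from prometheus_client import Counter, Histogram

LLM_LATENCY = Histogram("llm_request_latency_seconds", "LLM response time")
LLM_TOKENS = Counter("llm_tokens_total", "Total tokens consumed")
LLM_ERRORS = Counter("llm_errors_total", "LLM errors by type", ["error_type"])

def observed_llm_call(prompt: str) -> str:
    with LLM_LATENCY.time():   # P50/P95/P99 are derived from the histogram buckets
        try:
            response = openai.chat.completions.create(
                model="gpt-4-turbo",
                messages=[{"role": "user", "content": prompt}],
            )
        except Exception as exc:
            LLM_ERRORS.labels(error_type=type(exc).__name__).inc()
            raise
    LLM_TOKENS.inc(response.usage.total_tokens)   # token count reported by the API
    return response.choices[0].message.content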

Alerting

alerts:
  - name: LLM Error Rate High
    condition: error_rate > 5%
    window: 5 minutes
    
  - name: Cost Spike
    condition: daily_cost > 1.5 * avg_daily_cost
    
  - name: Latency Degradation
    condition: p95_latency > 10s

6. Architecture Patterns

Async Processing

For non-interactive applications:

User Request → Queue (SQS/RabbitMQ) → Worker → LLM → Result Store
  Immediate Response: "Your request is being processed"
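
A simplified worker for this pattern, sketched with boto3 and SQS; the queue URL and store_result are placeholders, and call_llm is the retry-wrapped function from section 2:

import json

import boto3

sqs = boto3.client("sqs")
QUEUE_URL = "https://sqs.eu-central-1.amazonaws.com/123456789012/llm-requests"  # placeholder

def worker_loop() -> None:
    while True:
        # Long polling: wait up to 20 s for new jobs
        response = sqs.receive_message(
            QueueUrl=QUEUE_URL, MaxNumberOfMessages=1, WaitTimeSeconds=20
        )
        for message in response.get("Messages", []):
            job = json.loads(message["Body"])
            result = call_llm(job["prompt"])
            store_result(job["request_id"], result)   # placeholder: write to the result store
            sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=message["ReceiptHandle"])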

Streaming for UX

from openai import AsyncOpenAI

client = AsyncOpenAI()

async def stream_response(prompt: str):
    stream = await client.chat.completions.create(
        model="gpt-4-turbo",
        messages=[{"role": "user", "content": prompt}],
        stream=True
    )
    async for chunk in stream:
        # The final chunk carries no content, so guard against None
        if chunk.choices[0].delta.content is not None:
            yield chunk.choices[0].delta.content
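
Served via FastAPI, the generator above can be returned directly as a streaming response; a minimal sketch:

from fastapi import FastAPI
from fastapi.responses import StreamingResponse

app = FastAPI()

@app.get("/chat")
async def chat(prompt: str):
    # Tokens are flushed to the client as they arrive instead of after the full completion
    return StreamingResponse(stream_response(prompt), media_type="text/plain")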

Conclusion

A successful LLM integration requires more than just API calls. Plan from the start for:

  • Robust error handling
  • Cost monitoring and control
  • Security measures
  • Quality measurement

At innFactory, we support you throughout the entire journey - from proof of concept to a scalable production system.

Do you have an LLM project coming up? Talk to us about your requirements.

Written by Tobias Jonas, CEO

Cloud architect and expert in AWS, Google Cloud, Azure, and STACKIT. Before founding innFactory, he worked at Siemens and BMW.

LinkedIn