LLM Integration Best Practices: From API to Production System

Tobias Jonas · 3 min read

The Challenge: From Playground to Production

The leap from initial ChatGPT experiments to a production LLM system is bigger than many expect. At innFactory, we have implemented numerous LLM integrations for clients and identified proven patterns along the way.

1. Prompt Engineering for Production

Structuring System Prompts Correctly

A production system prompt should contain:

[Role & Context]
You are a customer service assistant for [Company]. 

[Task]
Answer customer inquiries based on the knowledge base.

[Constraints]
- Answer only in English
- Do not invent information
- Refer to support when uncertain

[Format]
- Use short paragraphs
- Use bullet points for lists
- Maximum 200 words

Few-Shot Examples

For consistent outputs:

{
  "examples": [
    {
      "input": "How can I change my password?",
      "output": "You can change your password in the account settings: 1. Click on 'Profile'..."
    }
  ]
}
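
One way to apply such stored examples is to prepend them to the message list as simulated user/assistant turns before the actual question. A minimal sketch; build_messages is an illustrative helper, not part of the OpenAI SDK:

def build_messages(system_prompt: str, examples: list[dict], user_input: str) -> list[dict]:
    messages = [{"role": "system", "content": system_prompt}]
    for example in examples:
        # Each example becomes a simulated user/assistant exchange
        messages.append({"role": "user", "content": example["input"]})
        messages.append({"role": "assistant", "content": example["output"]})
    messages.append({"role": "user", "content": user_input})
    return messages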

2. Robust Error Handling

LLM APIs are not always reliable. Implement the following safeguards:

Retry Logic with Exponential Backoff

import openai
from tenacity import retry, stop_after_attempt, wait_exponential

@retry(
    stop=stop_after_attempt(3),
    wait=wait_exponential(multiplier=1, min=2, max=10)
)
def call_llm(prompt: str) -> str:
    # Up to 3 attempts with exponential backoff between 2 and 10 seconds
    response = openai.chat.completions.create(...)
    return response.choices[0].message.content

Fallback Strategies

Primary: GPT-4 Turbo
  ↓ (on error)
Fallback 1: GPT-3.5 Turbo
  ↓ (on error)
Fallback 2: Predefined response + Escalation
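
A sketch of such a chain. The model names are examples; escalate_to_support is a placeholder, and call_llm is assumed to accept a model parameter:

FALLBACK_CHAIN = ["gpt-4-turbo", "gpt-3.5-turbo"]

def escalate_to_support(prompt: str) -> None:
    # Placeholder: open a ticket / notify a human agent
    ...

def answer_with_fallback(prompt: str) -> str:
    for model in FALLBACK_CHAIN:
        try:
            # call_llm is the retry-wrapped function from above
            return call_llm(prompt, model=model)
        except Exception:
            continue  # try the next model in the chain
    # Last resort: predefined response plus escalation
    escalate_to_support(prompt)
    return "We are currently unable to answer your request automatically. Our support team has been notified."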

Timeout Management

  • API timeout: 30-60 seconds
  • User-facing timeout: 10-15 seconds
  • Use streaming for long responses
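
Assuming the OpenAI Python SDK v1, both timeouts can be configured on the client while retries stay with tenacity from above; a minimal sketch:

from openai import OpenAI

# Hard limit on the API side (30-60 s); retries are handled by tenacity
client = OpenAI(timeout=60.0, max_retries=0)

# Tighter limit for interactive, user-facing calls
fast_client = client.with_options(timeout=15.0)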

3. Cost Optimization

LLM costs can escalate quickly. Our strategies:

Token Management

import tiktoken

# Limit prompt length
def truncate_context(context: str, max_tokens: int = 3000) -> str:
    encoder = tiktoken.encoding_for_model("gpt-4")
    tokens = encoder.encode(context)
    if len(tokens) > max_tokens:
        return encoder.decode(tokens[:max_tokens])
    return context

Caching

Query → Hash → Cache Lookup
          ↓ (Miss)
       LLM Call → Cache Store → Response

At innFactory, we use Redis or DynamoDB for semantic caching - identical questions are only answered once.
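
A minimal sketch of the exact-match variant from the diagram, using redis-py; the key prefix and TTL are illustrative, and call_llm is the retry-wrapped function from section 2:

import hashlib

import redis

cache = redis.Redis(host="localhost", port=6379, decode_responses=True)

def cached_llm_call(prompt: str, ttl_seconds: int = 3600) -> str:
    key = "llm:" + hashlib.sha256(prompt.encode("utf-8")).hexdigest()
    cached = cache.get(key)
    if cached is not None:
        return cached                          # cache hit: no LLM call needed
    response = call_llm(prompt)                # cache miss: call the model
    cache.set(key, response, ex=ttl_seconds)   # store with expiry
    return response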

Model Routing

Not every request needs GPT-4:

Request Type       | Recommended Model    | Cost
Simple FAQ         | GPT-3.5 Turbo        | €€
Complex Analysis   | GPT-4 Turbo          | €€€€
Classification     | Fine-tuned GPT-3.5   |
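
A simple router along these lines; the request-type labels and the fine-tune ID are placeholders, and call_llm is assumed to accept a model parameter:

MODEL_BY_REQUEST_TYPE = {
    "faq": "gpt-3.5-turbo",
    "analysis": "gpt-4-turbo",
    "classification": "ft:gpt-3.5-turbo:your-org::abc123",  # placeholder fine-tune ID
}

def route_request(request_type: str, prompt: str) -> str:
    # Unknown request types fall back to the cheaper default model
    model = MODEL_BY_REQUEST_TYPE.get(request_type, "gpt-3.5-turbo")
    return call_llm(prompt, model=model)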

4. Security & Compliance

Prompt Injection Prevention

import re

def sanitize_user_input(text: str) -> str:
    # Remove known injection patterns
    patterns = [
        r"ignore.*instructions",
        r"disregard.*above",
        r"system:.*"
    ]
    for pattern in patterns:
        text = re.sub(pattern, "", text, flags=re.IGNORECASE)
    return text

PII Handling

For GDPR compliance:

  1. Input Filtering: Mask PII before the API call (see the sketch after this list)
  2. Output Filtering: Detect unwanted data in responses
  3. Logging: No personal data in logs
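
A minimal input-filtering sketch that masks e-mail addresses and phone numbers before the API call; the patterns are deliberately simplified, and a dedicated PII library is usually the better choice in production:

import re

PII_PATTERNS = {
    "EMAIL": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "PHONE": re.compile(r"\+?\d[\d \-/()]{6,}\d"),
}

def mask_pii(text: str) -> str:
    # Replace detected PII with a placeholder label before the prompt is built
    for label, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text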

Audit Trail

{
  "timestamp": "2025-01-15T10:30:00Z",
  "user_id": "user_123",
  "model": "gpt-4-turbo",
  "prompt_hash": "abc123...",
  "tokens_used": 1500,
  "response_hash": "def456..."
}
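
Such a record can be built with hashes instead of raw text so that no prompt content ends up in the log; a sketch:

import hashlib
import json
from datetime import datetime, timezone

def audit_record(user_id: str, model: str, prompt: str, response: str, tokens_used: int) -> str:
    record = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "user_id": user_id,
        "model": model,
        # Hashes instead of raw text keep personal data out of the audit log
        "prompt_hash": hashlib.sha256(prompt.encode("utf-8")).hexdigest(),
        "tokens_used": tokens_used,
        "response_hash": hashlib.sha256(response.encode("utf-8")).hexdigest(),
    }
    return json.dumps(record)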

5. Monitoring & Observability

Important Metrics

  • Latency: P50, P95, P99 of response times
  • Token Consumption: Per request and aggregated
  • Error Rate: By error type (Rate Limit, Timeout, etc.)
  • Quality: User feedback, thumbs up/down
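
With prometheus_client, the first three metrics can be captured in a few lines; the metric names are illustrative:

import openai
from prometheus_client import Counter, Histogram

LLM_LATENCY = Histogram("llm_request_latency_seconds", "LLM response time")
LLM_TOKENS = Counter("llm_tokens_total", "Total tokens consumed")
LLM_ERRORS = Counter("llm_errors_total", "LLM errors by type", ["error_type"])

def observed_llm_call(prompt: str) -> str:
    with LLM_LATENCY.time():   # P50/P95/P99 are derived from the histogram buckets
        try:
            response = openai.chat.completions.create(
                model="gpt-4-turbo",
                messages=[{"role": "user", "content": prompt}],
            )
        except Exception as exc:
            LLM_ERRORS.labels(error_type=type(exc).__name__).inc()
            raise
    LLM_TOKENS.inc(response.usage.total_tokens)   # token count reported by the API
    return response.choices[0].message.content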

Alerting

alerts:
  - name: LLM Error Rate High
    condition: error_rate > 5%
    window: 5 minutes
    
  - name: Cost Spike
    condition: daily_cost > 1.5 * avg_daily_cost
    
  - name: Latency Degradation
    condition: p95_latency > 10s

6. Architecture Patterns

Async Processing

For non-interactive applications:

User Request → Queue (SQS/RabbitMQ) → Worker → LLM → Result Store
  Immediate Response: "Your request is being processed"
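
A simplified worker for this pattern, sketched with boto3 and SQS; the queue URL and store_result are placeholders, and call_llm is the retry-wrapped function from section 2:

import json

import boto3

sqs = boto3.client("sqs")
QUEUE_URL = "https://sqs.eu-central-1.amazonaws.com/123456789012/llm-requests"  # placeholder

def worker_loop() -> None:
    while True:
        # Long polling: wait up to 20 s for new jobs
        response = sqs.receive_message(
            QueueUrl=QUEUE_URL, MaxNumberOfMessages=1, WaitTimeSeconds=20
        )
        for message in response.get("Messages", []):
            job = json.loads(message["Body"])
            result = call_llm(job["prompt"])
            store_result(job["request_id"], result)   # placeholder: write to the result store
            sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=message["ReceiptHandle"])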

Streaming for UX

from openai import AsyncOpenAI

client = AsyncOpenAI()

async def stream_response(prompt: str):
    stream = await client.chat.completions.create(
        model="gpt-4-turbo",
        messages=[{"role": "user", "content": prompt}],
        stream=True
    )
    async for chunk in stream:
        # The final chunk carries no content, so guard against None
        if chunk.choices[0].delta.content is not None:
            yield chunk.choices[0].delta.content
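
Served via FastAPI, the generator above can be returned directly as a streaming response; a minimal sketch:

from fastapi import FastAPI
from fastapi.responses import StreamingResponse

app = FastAPI()

@app.get("/chat")
async def chat(prompt: str):
    # Tokens are flushed to the client as they arrive instead of after the full completion
    return StreamingResponse(stream_response(prompt), media_type="text/plain")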

Conclusion

A successful LLM integration requires more than just API calls. Plan from the start for:

  • Robust error handling
  • Cost monitoring and control
  • Security measures
  • Quality measurement

At innFactory, we support you throughout the entire journey - from proof of concept to a scalable production system.

Do you have an LLM project coming up? Talk to us about your requirements.

Written by Tobias Jonas, CEO

Cloud architect and expert in AWS, Google Cloud, Azure, and STACKIT. Before founding innFactory, he worked at Siemens and BMW.

LinkedIn