GenAI Observability Implementation Best Practices
Overview
This guide provides tactical, implementation-specific best practices for building a production-ready GenAI observability solution. These practices are based on real-world deployments and lessons learned.
OpenTelemetry Instrumentation
Metric Naming Conventions
Use consistent, descriptive names:
# ✅ Good - Clear, hierarchical naming
"genai.token.input.count"
"genai.token.output.count"
"genai.request.duration"
"genai.request.error.count"
# ❌ Bad - Ambiguous, inconsistent
"tokens"
"input_tok"
"req_time"
"errors"
Required Dimensions
Always include these dimensions:
dimensions = {
    "model": "anthropic.claude-3-haiku-20240307-v1:0",
    "cloud_provider": "aws",  # aws, gcp, azure, on-prem
    "application": "chatbot",
    "environment": "production",
    "region": "us-east-1"
}
Optional but Recommended Dimensions
optional_dimensions = {
    "user_id": "hashed_user_id",  # Hash for privacy
    "session_id": "session_123",
    "prompt_template": "customer_support_v2",
    "model_version": "2024-03-07",
    "feature_flag": "new_prompt_enabled"
}
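The user_id dimension should never carry a raw identifier. A minimal sketch of one way to derive a stable, privacy-preserving value (the USER_ID_SALT variable and helper name are illustrative, not part of any standard API):
import hashlib
import os

def hash_user_id(raw_user_id: str) -> str:
    """Derive a stable, non-reversible user dimension value.

    USER_ID_SALT is a hypothetical deployment secret; without a salt,
    common IDs could be reversed by brute force.
    """
    salt = os.environ.get("USER_ID_SALT", "")
    digest = hashlib.sha256((salt + raw_user_id).encode("utf-8")).hexdigest()
    return f"hashed_{digest[:16]}"  # Truncate to keep dimension values short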
Instrumentation Code Example
from opentelemetry import metrics
from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.sdk.metrics.export import PeriodicExportingMetricReader
from opentelemetry.exporter.otlp.proto.grpc.metric_exporter import OTLPMetricExporter

# Wire the SDK to an OTLP exporter before creating instruments;
# without this, get_meter() returns a no-op meter.
exporter = OTLPMetricExporter()  # Defaults to localhost:4317
reader = PeriodicExportingMetricReader(exporter)
metrics.set_meter_provider(MeterProvider(metric_readers=[reader]))

# Initialize meter
meter = metrics.get_meter(__name__)

# Create instruments
token_input_counter = meter.create_counter(
    name="genai.token.input.count",
    description="Number of input tokens consumed",
    unit="tokens"
)

token_output_counter = meter.create_counter(
    name="genai.token.output.count",
    description="Number of output tokens generated",
    unit="tokens"
)

latency_histogram = meter.create_histogram(
    name="genai.request.duration",
    description="Request duration in milliseconds",
    unit="ms"
)

# Record metrics
def record_llm_metrics(model, provider, input_tokens, output_tokens, latency_ms):
    dimensions = {
        "model": model,
        "cloud_provider": provider
    }
    token_input_counter.add(input_tokens, dimensions)
    token_output_counter.add(output_tokens, dimensions)
    latency_histogram.record(latency_ms, dimensions)
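A sketch of where record_llm_metrics fits around a model call (call_model and its response fields are placeholders for whatever provider SDK your application uses):
import time

def handle_prompt(prompt: str):
    start = time.monotonic()
    # call_model is a placeholder for your provider SDK call
    response = call_model(prompt)
    latency_ms = (time.monotonic() - start) * 1000
    record_llm_metrics(
        model="anthropic.claude-3-haiku-20240307-v1:0",
        provider="aws",
        input_tokens=response.usage.input_tokens,   # Field names vary by SDK
        output_tokens=response.usage.output_tokens,
        latency_ms=latency_ms,
    )
    return response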
CloudWatch Configuration
Metric Namespace Strategy
Use hierarchical namespaces. CloudWatch namespaces are flat strings, so the hierarchy is encoded in slash-delimited names such as AIObservability/Production/Chatbot:
AIObservability # Root namespace
├── Production # Environment-specific
│ ├── Chatbot # Application-specific
│ └── SearchAssistant
└── Development
Metric Period Selection
Choose appropriate periods:
- High-frequency metrics (request count, errors): 60 seconds
- Medium-frequency metrics (latency, tokens): 300 seconds (5 min)
- Low-frequency metrics (daily costs): 3600 seconds (1 hour)
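A minimal boto3 sketch of publishing a datapoint into a slash-delimited namespace; the metric and dimension names follow the conventions above, and in practice you would batch multiple MetricData entries per call to reduce API costs:
import boto3
from datetime import datetime, timezone

cloudwatch = boto3.client("cloudwatch")

def publish_token_metric(model: str, input_tokens: int):
    cloudwatch.put_metric_data(
        Namespace="AIObservability/Production/Chatbot",
        MetricData=[{
            "MetricName": "InputTokens",
            "Dimensions": [
                {"Name": "Model", "Value": model},
                {"Name": "CloudProvider", "Value": "aws"},
            ],
            "Timestamp": datetime.now(timezone.utc),
            "Value": float(input_tokens),
            "Unit": "Count",
        }],
    )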
CloudWatch Logs Structure
Use structured logging:
{
  "timestamp": "2026-03-04T10:30:00Z",
  "level": "INFO",
  "model": "gpt-4o",
  "cloud_provider": "azure",
  "input_tokens": 45,
  "output_tokens": 234,
  "latency_ms": 1523,
  "cost_usd": 0.0234,
  "user_id": "hashed_abc123",
  "prompt_template": "summarization_v3",
  "success": true
}
Log Group Organization
/genai-observability/
├── application-logs # Application-level logs
├── model-invocations # LLM request/response logs
├── errors # Error logs only
└── audit # Compliance audit trail
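Log groups can be created up front with explicit retention so logs do not accumulate indefinitely; a sketch using boto3 (the retention values here are illustrative):
import boto3

logs = boto3.client("logs")

RETENTION_DAYS = {
    "/genai-observability/application-logs": 30,
    "/genai-observability/model-invocations": 90,
    "/genai-observability/errors": 90,
    "/genai-observability/audit": 365,  # Compliance trails usually keep longest
}

for group, days in RETENTION_DAYS.items():
    try:
        logs.create_log_group(logGroupName=group)
    except logs.exceptions.ResourceAlreadyExistsException:
        pass  # Idempotent re-runs
    logs.put_retention_policy(logGroupName=group, retentionInDays=days)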
Grafana Dashboard Design
Dashboard Hierarchy
Create dashboards for different audiences:
1. Executive Dashboard - High-level KPIs
   - Total daily cost
   - Request volume trends
   - Error rate
   - Top models by usage
2. Operations Dashboard - Real-time monitoring
   - Current request rate
   - Active errors
   - Latency percentiles
   - Provider health status
3. Developer Dashboard - Debugging and optimization
   - Request traces
   - Token usage by feature
   - Latency breakdown
   - Error details
4. FinOps Dashboard - Cost management
   - Cost by model
   - Cost by team/project
   - Cost trends and forecasts
   - Optimization opportunities
Panel Best Practices
Time Series Panels:
{
  "type": "timeseries",
  "title": "Token Usage by Model",
  "targets": [{
    "expr": "sum by (model) (rate(genai_token_input_count[5m]))",
    "legendFormat": "{{model}}"
  }],
  "options": {
    "legend": {"displayMode": "table", "placement": "right"},
    "tooltip": {"mode": "multi"}
  }
}
Stat Panels for KPIs:
{
  "type": "stat",
  "title": "Total Requests (24h)",
  "targets": [{
    "expr": "sum(increase(genai_request_count[24h]))"
  }],
  "options": {
    "colorMode": "background",
    "graphMode": "area"
  },
  "fieldConfig": {
    "defaults": {
      "thresholds": {
        "mode": "absolute",
        "steps": [
          {"value": null, "color": "green"},
          {"value": 10000, "color": "yellow"},
          {"value": 50000, "color": "red"}
        ]
      }
    }
  }
}
Variable Templates
Use dashboard variables for filtering:
{
  "templating": {
    "list": [
      {
        "name": "cloud_provider",
        "type": "query",
        "query": "label_values(genai_token_input_count, cloud_provider)",
        "multi": true,
        "includeAll": true
      },
      {
        "name": "model",
        "type": "query",
        "query": "label_values(genai_token_input_count{cloud_provider=~\"$cloud_provider\"}, model)",
        "multi": true,
        "includeAll": true
      }
    ]
  }
}
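Variables only take effect when panel queries reference them; for example, the token-usage panel above becomes filterable by rewriting its expression as:
sum by (model) (rate(genai_token_input_count{cloud_provider=~"$cloud_provider", model=~"$model"}[5m]))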
Alert Configuration
Alert Threshold Recommendations
Error Rate Alerts:
# Critical - Page immediately
- alert: HighErrorRate
  # Divide by the total request rate so the threshold is a fraction,
  # not errors per second
  expr: rate(genai_request_error_count[5m]) / rate(genai_request_count[5m]) > 0.05
  for: 5m
  labels:
    severity: critical
  annotations:
    summary: "Error rate above 5% for 5 minutes"

# Warning - Investigate during business hours
- alert: ElevatedErrorRate
  expr: rate(genai_request_error_count[5m]) / rate(genai_request_count[5m]) > 0.02
  for: 15m
  labels:
    severity: warning
Latency Alerts:
- alert: HighLatency
  expr: histogram_quantile(0.95, sum by (le) (rate(genai_request_duration_bucket[5m]))) > 10000
  for: 10m
  labels:
    severity: warning
  annotations:
    summary: "P95 latency above 10 seconds"
Cost Alerts:
- alert: HourlyCostSpike
  expr: sum(increase(genai_cost_usd[1h])) > 100
  labels:
    severity: warning
  annotations:
    summary: "Hourly cost exceeds $100"

- alert: MonthlyBudgetExceeded
  expr: sum(increase(genai_cost_usd[30d])) > 10000
  labels:
    severity: critical
  annotations:
    summary: "Monthly budget of $10,000 exceeded"
Alert Routing
Route alerts to appropriate channels:
route:
  group_by: ['alertname', 'cloud_provider']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  receiver: 'default'
  routes:
    # Most specific route first: MonthlyBudgetExceeded carries
    # severity: critical, so it would otherwise be captured by the
    # PagerDuty route below and never reach FinOps.
    - match:
        alertname: MonthlyBudgetExceeded
      receiver: 'slack-finops'
    - match:
        severity: critical
      receiver: 'pagerduty'
    - match:
        severity: warning
      receiver: 'slack-ops'
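The route references receivers that must be defined elsewhere in the Alertmanager config; a minimal sketch with placeholder credentials and illustrative channel names:
receivers:
  - name: 'default'
    # Catch-all; often a low-traffic Slack channel or email
  - name: 'pagerduty'
    pagerduty_configs:
      - routing_key: '<PAGERDUTY_INTEGRATION_KEY>'  # Placeholder
  - name: 'slack-ops'
    slack_configs:
      - api_url: '<SLACK_WEBHOOK_URL>'  # Placeholder
        channel: '#genai-ops'
  - name: 'slack-finops'
    slack_configs:
      - api_url: '<SLACK_WEBHOOK_URL>'  # Placeholder
        channel: '#genai-finops'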
MCP Server Deployment
Server Configuration
Production-ready MCP server setup:
# mcp_server_config.py
import os
CONFIG = {
    "server": {
        "host": "0.0.0.0",
        "port": int(os.getenv("MCP_PORT", "8080")),
        "workers": int(os.getenv("MCP_WORKERS", "4"))
    },
    "cloudwatch": {
        "region": os.getenv("AWS_REGION", "us-east-1"),
        "namespace": "AIObservability",
        "log_group": "/genai-observability/mcp-server"
    },
    "cache": {
        "enabled": True,
        "ttl_seconds": 300,  # 5 minutes
        "max_size": 1000
    },
    "rate_limiting": {
        "enabled": True,
        "requests_per_minute": 60
    }
}
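A sketch of enforcing the rate_limiting block with a simple sliding-window limiter; the class and integration point are illustrative, and production servers often delegate this to a gateway or middleware:
import time
from collections import deque

class SlidingWindowLimiter:
    def __init__(self, requests_per_minute: int):
        self.limit = requests_per_minute
        self.timestamps = deque()

    def allow(self) -> bool:
        now = time.monotonic()
        # Drop timestamps older than the 60-second window
        while self.timestamps and now - self.timestamps[0] > 60:
            self.timestamps.popleft()
        if len(self.timestamps) >= self.limit:
            return False
        self.timestamps.append(now)
        return True

limiter = SlidingWindowLimiter(CONFIG["rate_limiting"]["requests_per_minute"])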
Query Optimization
Optimize CloudWatch queries:
import boto3
from datetime import datetime, timedelta

cloudwatch = boto3.client('cloudwatch')

# ✅ Good - Use specific time ranges and dimensions
def get_token_usage(model, hours=1):
    end_time = datetime.utcnow()
    start_time = end_time - timedelta(hours=hours)
    response = cloudwatch.get_metric_statistics(
        Namespace='AIObservability',
        MetricName='InputTokens',
        Dimensions=[
            {'Name': 'Model', 'Value': model},
            {'Name': 'CloudProvider', 'Value': 'aws'}
        ],
        StartTime=start_time,
        EndTime=end_time,
        Period=300,  # 5 minutes
        Statistics=['Sum']
    )
    return response

# ❌ Bad - Querying all data without filters
def get_all_metrics():
    response = cloudwatch.get_metric_statistics(
        Namespace='AIObservability',
        MetricName='InputTokens',
        StartTime=datetime(2020, 1, 1),  # Too broad
        EndTime=datetime.utcnow(),
        Period=60,  # Too granular
        Statistics=['Sum', 'Average', 'Minimum', 'Maximum']  # Unnecessary stats
    )
    return response
Caching Strategy
Implement intelligent caching:
from functools import lru_cache
from datetime import datetime

@lru_cache(maxsize=100)
def get_cached_metrics(model, cloud_provider, time_bucket):
    """Cache metrics by 5-minute time buckets.

    time_bucket is intentionally unused in the body: it only varies the
    cache key, so entries effectively expire every 5 minutes.
    """
    # fetch_metrics_from_cloudwatch is a placeholder for the CloudWatch
    # query, e.g. get_token_usage above.
    return fetch_metrics_from_cloudwatch(model, cloud_provider)

def get_metrics_with_cache(model, cloud_provider):
    current_time = int(datetime.utcnow().timestamp())
    time_bucket = current_time // 300  # 5-minute buckets
    return get_cached_metrics(model, cloud_provider, time_bucket)
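An alternative that matches the CONFIG cache block more directly is a real TTL cache; a sketch using the third-party cachetools package, assuming it is acceptable as a dependency:
from cachetools import TTLCache

# Mirrors CONFIG["cache"]: entries expire 300 s after insertion
metrics_cache = TTLCache(maxsize=1000, ttl=300)

def get_metrics_ttl(model, cloud_provider):
    key = (model, cloud_provider)
    if key not in metrics_cache:
        metrics_cache[key] = fetch_metrics_from_cloudwatch(model, cloud_provider)
    return metrics_cache[key]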
Cost Optimization
Metric Sampling
Sample high-volume metrics:
import random

def should_sample(sample_rate=0.1):
    """Sample 10% of requests"""
    return random.random() < sample_rate

def record_metrics(model, tokens, latency):
    # Always record critical metrics
    record_error_metrics()
    record_request_count()
    # Sample detailed metrics
    if should_sample(sample_rate=0.1):
        record_token_metrics(model, tokens)
        record_latency_histogram(latency)
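Sampled counters understate true volume by the sampling factor, so either scale recorded values up by 1/sample_rate or record the sample rate as a dimension so dashboards can correct for it. A sketch of the first approach for counter-style metrics (latency histograms should not be value-scaled this way):
SAMPLE_RATE = 0.1

def record_token_metrics_sampled(model, tokens):
    # Scale by 1/SAMPLE_RATE so aggregate token counts (and any cost
    # derived from them) remain approximately correct.
    if random.random() < SAMPLE_RATE:
        record_token_metrics(model, tokens * (1 / SAMPLE_RATE))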