Production Deployment Guide
Comprehensive guide for deploying the webhook service in production environments with enterprise-grade reliability, security, and monitoring.
🎯 Production Readiness Overview
Deployment Checklist
□ Security hardening complete
□ SSL certificates configured and auto-renewing
□ Monitoring and alerting implemented
□ Backup and disaster recovery tested
□ Performance optimization validated
□ Documentation complete and accessible
□ Team training and runbooks prepared
Production vs Development Differences
| Aspect | Development | Production |
|---|---|---|
| Security | Basic auth, HTTP allowed | Full security stack, HTTPS only |
| Logging | Console output | Structured logging, centralized |
| Monitoring | Manual checks | Automated monitoring/alerting |
| Scaling | Single instance | Auto-scaling, load balancing |
| Data | Test data | Real customer data, GDPR compliance |
| Uptime | Best effort | 99.9% SLA target |
🏗️ Infrastructure Requirements
Server Specifications
Minimum Requirements:
CPU: 2 cores (x86_64)
RAM: 4GB
Storage: 50GB SSD
Network: 100Mbps
OS: Ubuntu 20.04 LTS or newer
Recommended Production:
CPU: 4 cores (x86_64)
RAM: 8GB
Storage: 100GB NVMe SSD
Network: 1Gbps
OS: Ubuntu 22.04 LTS
Backup: Automated daily backups
High Availability Setup:
Load Balancer: 2x instances
Application Servers: 3x instances
Database: Primary + Read Replica
Storage: RAID 1 or cloud block storage
Network: Redundant connections
Network Architecture
┌─────────────────────────────────────────────────────────────────┐
│                       PRODUCTION NETWORK                        │
├─────────────────────────────────────────────────────────────────┤
│  Internet ──▶ CDN/WAF ──▶ Load Balancer ──▶ Application         │
│                  │              │                │              │
│                  ▼              ▼                ▼              │
│         DDoS Protection  Health Checks     Auto Scaling         │
│         Rate Limiting    SSL Termination   Multiple Instances   │
│         Geo Filtering    Session Affinity  Container Restart    │
└─────────────────────────────────────────────────────────────────┘
🔒 Security Hardening
Operating System Security
System Hardening Checklist:
# 1. Update system packages
sudo apt update && sudo apt upgrade -y
# 2. Configure automatic security updates
sudo apt install unattended-upgrades
sudo dpkg-reconfigure -plow unattended-upgrades
# 3. Configure UFW firewall
sudo ufw default deny incoming
sudo ufw default allow outgoing
sudo ufw allow ssh
sudo ufw allow 80/tcp
sudo ufw allow 443/tcp
sudo ufw enable
# 4. Install and configure fail2ban
sudo apt install fail2ban
sudo systemctl enable fail2ban
sudo systemctl start fail2ban
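# 5. Disable root login and password authentication
# The patterns also match commented-out defaults such as "#PermitRootLogin prohibit-password".
# Files under /etc/ssh/sshd_config.d/ may override these values; check them as well.
sudo sed -i -E 's/^#?PermitRootLogin.*/PermitRootLogin no/' /etc/ssh/sshd_config
sudo sed -i -E 's/^#?PasswordAuthentication.*/PasswordAuthentication no/' /etc/ssh/sshd_config
# Validate the configuration before restarting so a typo cannot lock you out
sudo sshd -t && sudo systemctl restart ssh
# 6. Enable automatic reboots after unattended upgrades
echo 'Unattended-Upgrade::Automatic-Reboot "true";' | sudo tee -a /etc/apt/apt.conf.d/50unattended-upgrades
echo 'Unattended-Upgrade::Automatic-Reboot-Time "02:00";' | sudo tee -a /etc/apt/apt.conf.d/50unattended-upgrades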
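After step 5 it can be useful to confirm that the SSH settings actually took effect. The following is a minimal sketch of such a check, not part of the original service: the function names are hypothetical, and it deliberately ignores `Match` blocks and `Include` directives, so treat a clean result as a hint rather than proof.

```python
# Hypothetical audit helper: checks sshd_config text for the two settings from step 5.
REQUIRED = {
    "permitrootlogin": "no",
    "passwordauthentication": "no",
}


def effective_settings(config_text: str) -> dict:
    """Return the first value seen for each keyword, skipping comments and blank lines.

    sshd uses the first value it obtains for a keyword, so later duplicates are ignored.
    """
    settings = {}
    for raw_line in config_text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#"):
            continue
        parts = line.split(None, 1)
        if len(parts) != 2:
            continue
        keyword, value = parts[0].lower(), parts[1].strip().lower()
        settings.setdefault(keyword, value)
    return settings


def audit(config_text: str) -> list:
    """Return human-readable problems; an empty list means both settings are hardened."""
    settings = effective_settings(config_text)
    problems = []
    for keyword, expected in REQUIRED.items():
        actual = settings.get(keyword)
        if actual != expected:
            problems.append(f"{keyword}: expected {expected!r}, found {actual!r}")
    return problems
```

A typical use would be to read `/etc/ssh/sshd_config`, pass its contents to `audit()`, and fail a provisioning step if the returned list is not empty.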
Docker Security Configuration
Production Docker Daemon Config (/etc/docker/daemon.json):
{
  "live-restore": true,
  "userland-proxy": false,
  "no-new-privileges": true,
  "seccomp-profile": "/etc/docker/seccomp.json",
  "log-driver": "json-file",
  "log-opts": {
    "max-size": "10m",
    "max-file": "3"
  },
  "storage-driver": "overlay2",
  "storage-opts": [
    "overlay2.override_kernel_check=true"
  ]
}
Security Hardened docker-compose.yml:
version: '3.8'
services:
  webhook-service:
    build: .
    container_name: webhook-service-prod
    restart: unless-stopped
    # Security configurations
    read_only: true
    security_opt:
      - no-new-privileges:true
      # Docker's default seccomp profile stays active. Do not add seccomp:unconfined,
      # which disables syscall filtering and undoes the hardening.
    cap_drop:
      - ALL
    cap_add:
      - NET_BIND_SERVICE
    # Resource limits
    deploy:
      resources:
        limits:
          cpus: '0.5'
          memory: 512M
        reservations:
          cpus: '0.1'
          memory: 256M
    # Temporary filesystems for read-only container
    tmpfs:
      - /tmp:size=100M,noexec,nosuid,nodev
      - /var/run:size=100M,noexec,nosuid,nodev
    environment:
      - FLASK_ENV=production
      - FLASK_SECRET_KEY=${FLASK_SECRET_KEY}
      - WEBHOOK_SECRET=${WEBHOOK_SECRET}
      - PARTICLE_WEBHOOK_SECRET=${PARTICLE_WEBHOOK_SECRET}
      - SMTP_EMAIL=${SMTP_EMAIL}
      - SMTP_PASSWORD=${SMTP_PASSWORD}
      - RECIPIENT_EMAIL=${RECIPIENT_EMAIL}
    networks:
      - traefik
      - internal
    labels:
      - "traefik.enable=true"
      - "traefik.http.routers.webhook-prod.rule=Host(`webhook.yourdomain.com`)"
      - "traefik.http.routers.webhook-prod.entrypoints=websecure"
      - "traefik.http.routers.webhook-prod.tls.certresolver=letsencrypt"
      - "traefik.http.services.webhook-prod.loadbalancer.server.port=5000"
      # Production security middleware
      - "traefik.http.routers.webhook-prod.middlewares=webhook-prod-security,webhook-prod-ratelimit"
      # Enhanced security headers
      - "traefik.http.middlewares.webhook-prod-security.headers.customrequestheaders.X-Forwarded-Proto=https"
      - "traefik.http.middlewares.webhook-prod-security.headers.customresponseheaders.X-Content-Type-Options=nosniff"
      - "traefik.http.middlewares.webhook-prod-security.headers.customresponseheaders.X-Frame-Options=DENY"
      - "traefik.http.middlewares.webhook-prod-security.headers.customresponseheaders.X-XSS-Protection=1; mode=block"
      - "traefik.http.middlewares.webhook-prod-security.headers.customresponseheaders.Referrer-Policy=strict-origin-when-cross-origin"
      - "traefik.http.middlewares.webhook-prod-security.headers.customresponseheaders.Strict-Transport-Security=max-age=31536000; includeSubDomains"
      - "traefik.http.middlewares.webhook-prod-security.headers.customresponseheaders.Content-Security-Policy=default-src 'self'"
      # Production rate limiting (20 requests per minute on average, bursts of up to 50)
      - "traefik.http.middlewares.webhook-prod-ratelimit.ratelimit.average=20"
      - "traefik.http.middlewares.webhook-prod-ratelimit.ratelimit.burst=50"
      - "traefik.http.middlewares.webhook-prod-ratelimit.ratelimit.period=1m"
      # Health check configuration
      - "traefik.http.services.webhook-prod.loadbalancer.healthcheck.path=/health"
      - "traefik.http.services.webhook-prod.loadbalancer.healthcheck.interval=30s"
networks:
  traefik:
    external: true
  internal:
    internal: true
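The rate-limit labels above (`average=20`, `burst=50`, `period=1m`) describe token-bucket behavior: tokens refill at `average / period` per second and the bucket holds at most `burst` tokens. The sketch below illustrates that model so the numbers are easier to reason about. It is an illustration under those assumptions, not Traefik's implementation, and the class name and injectable `clock` parameter are hypothetical.

```python
import time


class TokenBucket:
    """Illustrative token bucket: refills at average/period tokens per second, caps at burst."""

    def __init__(self, average: float, period_seconds: float, burst: int, clock=time.monotonic):
        self.rate = average / period_seconds  # tokens added per second
        self.capacity = burst                 # maximum tokens the bucket can hold
        self.tokens = float(burst)            # start full, so an initial burst is allowed
        self.clock = clock                    # injectable clock, useful for testing
        self.updated = clock()

    def allow(self) -> bool:
        """Consume one token if available; return False when the request should be rejected."""
        now = self.clock()
        elapsed = now - self.updated
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


# Mirrors the labels: 20 requests per 60 seconds on average, bursts of up to 50.
bucket = TokenBucket(average=20, period_seconds=60, burst=50)
```

With these values a client can send 50 requests at once, after which roughly one new request is admitted every three seconds.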
SSL/TLS Configuration
Production Traefik SSL Configuration:
# traefik.yml (static configuration)
certificatesResolvers:
  letsencrypt:
    acme:
      email: admin@yourdomain.com
      storage: /acme.json
      httpChallenge:
        entryPoint: web
      # Production Let's Encrypt endpoint
      caServer: https://acme-v02.api.letsencrypt.org/directory
# Enhanced TLS configuration
# TLS options are dynamic configuration in Traefik; load them through a dynamic
# configuration source such as the file provider rather than the static file.
tls:
  options:
    default:
      minVersion: "VersionTLS12"
      maxVersion: "VersionTLS13"
      cipherSuites:
        - "TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384"
        - "TLS_ECDHE_ECDSA_WITH_AES_256_GCM_SHA384"
        - "TLS_ECDHE_RSA_WITH_CHACHA20_POLY1305"
        - "TLS_ECDHE_ECDSA_WITH_CHACHA20_POLY1305"
      curvePreferences:
        - "CurveP521"
        - "CurveP384"
      sniStrict: true
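Once the TLS options are live, a quick client-side probe can confirm what the endpoint actually negotiates. The sketch below uses only the Python standard library. The function names are hypothetical, and the host you pass in (for example the `webhook.yourdomain.com` placeholder used in this guide) is an assumption about your deployment.

```python
import socket
import ssl


def strict_client_context() -> ssl.SSLContext:
    """Client context that verifies certificates and refuses anything below TLS 1.2."""
    ctx = ssl.create_default_context()
    ctx.minimum_version = ssl.TLSVersion.TLSv1_2
    return ctx


def negotiated_tls(host: str, port: int = 443, timeout: float = 5.0) -> tuple:
    """Connect to host:port and return (protocol version, cipher name) as negotiated."""
    ctx = strict_client_context()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with ctx.wrap_socket(sock, server_hostname=host) as tls:
            cipher_name = tls.cipher()[0]  # cipher() returns (name, protocol, secret_bits)
            return tls.version(), cipher_name
```

Calling `negotiated_tls("webhook.yourdomain.com")` against a deployed instance should return a TLS 1.2 or TLS 1.3 version string; a handshake error would suggest the certificate or TLS options need another look.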
📊 Monitoring and Observability
Production Monitoring Stack
Monitoring Architecture:
┌─────────────────────────────────────────────────────────────────┐
│                        MONITORING STACK                         │
├─────────────────────────────────────────────────────────────────┤
│                       Application Metrics                       │
│  ┌─────────────┐  ┌─────────────┐  ┌─────────────────────────┐  │
│  │Prometheus   │  │Grafana      │  │AlertManager             │  │
│  │Metrics      │  │Dashboards   │  │Notifications            │  │
│  └─────────────┘  └─────────────┘  └─────────────────────────┘  │
├─────────────────────────────────────────────────────────────────┤
│                         Log Management                          │
│  ┌─────────────┐  ┌─────────────┐  ┌─────────────────────────┐  │
│  │Loki         │  │Log          │  │Error                    │  │
│  │Aggregation  │  │Analysis     │  │Tracking                 │  │
│  └─────────────┘  └─────────────┘  └─────────────────────────┘  │
├─────────────────────────────────────────────────────────────────┤
│                    Infrastructure Monitoring                    │
│  ┌─────────────┐  ┌─────────────┐  ┌─────────────────────────┐  │
│  │Node         │  │Docker       │  │Network                  │  │
│  │Exporter     │  │Stats        │  │Monitoring               │  │
│  └─────────────┘  └─────────────┘  └─────────────────────────┘  │
└─────────────────────────────────────────────────────────────────┘
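The log management tier above works best when the application emits one structured record per line, which Loki and similar aggregators can then filter by field. The following is a minimal sketch of a JSON formatter built on the standard `logging` module. The logger name `webhook` and the `extra={"context": {...}}` convention are assumptions made for illustration, not part of the original service.

```python
import json
import logging
import sys
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each log record as a single-line JSON object."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "timestamp": datetime.fromtimestamp(record.created, tz=timezone.utc).isoformat(),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # Hypothetical convention: fields passed as extra={"context": {...}} are merged in.
        context = getattr(record, "context", None)
        if isinstance(context, dict):
            payload.update(context)
        if record.exc_info:
            payload["exception"] = self.formatException(record.exc_info)
        return json.dumps(payload, default=str)


def configure_logging(level: int = logging.INFO, stream=sys.stdout) -> logging.Logger:
    """Attach the JSON formatter to a dedicated logger that writes to the given stream."""
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger = logging.getLogger("webhook")
    logger.handlers = [handler]
    logger.setLevel(level)
    logger.propagate = False  # avoid duplicate lines from the root logger
    return logger
```

A call such as `configure_logging().info("delivered", extra={"context": {"source_type": "particle"}})` then produces one JSON line on stdout, which the container log driver can forward unchanged.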
Prometheus Metrics Integration
Enhanced webhook_app.py with metrics:
from functools import wraps
import time

from flask import request
from prometheus_client import Counter, Histogram, Gauge, start_http_server

# Metrics definitions
webhook_requests_total = Counter(
    'webhook_requests_total',
    'Total webhook requests',
    ['method', 'endpoint', 'status_code', 'source_type']
)
webhook_request_duration = Histogram(
    'webhook_request_duration_seconds',
    'Webhook request duration',
    ['endpoint', 'source_type']
)
webhook_auth_failures = Counter(
    'webhook_auth_failures_total',
    'Total authentication failures',
    ['source_type', 'failure_reason']
)
notification_delivery_total = Counter(
    'notification_delivery_total',
    'Total notification delivery attempts',
    ['delivery_method', 'status']
)
active_connections = Gauge(
    'webhook_active_connections',
    'Number of active connections'
)

# Middleware for metrics collection
def metrics_middleware():
    def decorator(f):
        @wraps(f)  # keep the view's name so Flask endpoint names stay unique
        def wrapper(*args, **kwargs):
            start_time = time.time()
            source_type = 'particle' if 'ParticleBot' in request.headers.get('User-Agent', '') else 'generic'
            try:
                result = f(*args, **kwargs)
                # Views may return (body, status) tuples or Response objects
                if isinstance(result, tuple) and len(result) > 1 and isinstance(result[1], int):
                    status_code = result[1]
                else:
                    status_code = getattr(result, 'status_code', 200)
                webhook_requests_total.labels(
                    method=request.method,
                    endpoint=request.endpoint,
                    status_code=status_code,
                    source_type=source_type
                ).inc()
                return result
            except Exception:
                # Count unhandled errors as 500s, then let Flask handle the exception
                webhook_requests_total.labels(
                    method=request.method,
                    endpoint=request.endpoint,
                    status_code=500,
                    source_type=source_type
                ).inc()
                raise
            finally:
                duration = time.time() - start_time
                webhook_request_duration.labels(
                    endpoint=request.endpoint,
                    source_type=source_type
                ).observe(duration)
        return wrapper
    return decorator

# Add metrics endpoint (`app` is the Flask application defined elsewhere in webhook_app.py)
@app.route('/metrics')
def metrics():
    """Prometheus metrics endpoint"""
    from prometheus_client import generate_latest, CONTENT_TYPE_LATEST
    return generate_latest(), 200, {'Content-Type': CONTENT_TYPE_LATEST}

# Start metrics server
if __name__ == '__main__':
    start_http_server(8000)  # Prometheus metrics on port 8000
    # app.run() starts Flask's development server; in production, run the app under a WSGI server
    app.run(host='0.0.0.0', port=5000)
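The `webhook_auth_failures` counter above is defined with a `failure_reason` label but the listing does not show where it is incremented. One plausible place is a signature check on incoming requests. The sketch below assumes an HMAC-SHA256 scheme with a `sha256=<hex>` header value. Both the scheme and the function names are assumptions for illustration, since the guide does not specify how `WEBHOOK_SECRET` is used.

```python
import hashlib
import hmac
from typing import Optional


def verify_signature(secret: str, body: bytes, signature_header: str) -> bool:
    """Compare the provided signature with an HMAC-SHA256 of the raw body in constant time."""
    expected = hmac.new(secret.encode("utf-8"), body, hashlib.sha256).hexdigest()
    provided = signature_header.strip().removeprefix("sha256=")
    # Compare as bytes so non-ASCII input cannot raise inside compare_digest
    return hmac.compare_digest(expected.encode("utf-8"), provided.encode("utf-8"))


def check_request(secret: str, body: bytes, signature_header: Optional[str]) -> Optional[str]:
    """Return None when the request is authentic, otherwise a failure_reason label."""
    if not signature_header:
        return "missing_signature"
    if not verify_signature(secret, body, signature_header):
        return "invalid_signature"
    return None
```

In a request handler, a non-`None` result could be passed to `webhook_auth_failures.labels(source_type=..., failure_reason=reason).inc()` before returning a 401 response.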
Grafana Dashboard Configuration
Production Dashboard JSON:
{
  "dashboard": {
    "title": "Webhook Service Production Dashboard",
    "panels": [
      {
        "title": "Request Rate",
        "type": "graph",
        "targets": [
          {
            "expr": "rate(webhook_requests_total[5m])",
            "legendFormat": "{{source_type}} - {{status_code}}"
          }
        ]
      },
      {
        "title": "Response Time",
        "type": "graph",
        "targets": [
          {
            "expr": "histogram_quantile(0.95, rate(webhook_request_duration_seconds_bucket[5m]))",
            "legendFormat": "95th percentile"
          },
          {
            "expr": "histogram_quantile(0.50, rate(webhook_request_duration_seconds_bucket[5m]))",
            "legendFormat": "50th percentile"
          }
        ]
      },
      {
        "title": "Authentication Failures",
        "type": "singlestat",
        "targets": [
          {
            "expr": "increase(webhook_auth_failures_total[1h])",
            "legendFormat": "Last Hour"
          }
        ]
      },
      {
        "title": "Notification Success Rate",
        "type": "graph",
        "targets": [
          {
            "expr": "rate(notification_delivery_total{status=\"success\"}[5m]) / rate(notification_delivery_total[5m]) * 100",
            "legendFormat": "Success Rate %"
          }
        ]
      }
    ]
  }
}
Alerting Rules
AlertManager Configuration:
# alertmanager.yml
global:
  smtp_smarthost: 'smtp.gmail.com:587'
  smtp_from: 'alerts@yourdomain.com'
  smtp_auth_username: 'alerts@yourdomain.com'
  smtp_auth_password: 'your-app-password'
route:
  group_by: ['alertname']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 1h
  receiver: 'webhook-alerts'
receivers:
  - name: 'webhook-alerts'
    email_configs:
      - to: 'admin@yourdomain.com'
        # Alertmanager sets the subject through headers and the plain-text body through text
        headers:
          Subject: 'Webhook Service Alert - {{ .GroupLabels.alertname }}'
        text: |
          {{ range .Alerts }}
          Alert: {{ .Annotations.summary }}
          Description: {{ .Annotations.description }}
          Instance: {{ .Labels.instance }}
          Severity: {{ .Labels.severity }}
          {{ end }}
# Prometheus alerting rules
# These belong in a separate rules file referenced from rule_files in prometheus.yml.
groups:
  - name: webhook-service
    rules:
      - alert: WebhookServiceDown
        expr: up{job="webhook-service"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Webhook service is down"
          description: "Webhook service has been down for more than 1 minute"
      - alert: HighErrorRate
        expr: rate(webhook_requests_total{status_code=~"5.."}[5m]) > 0.1
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "High error rate detected"
          description: "Error rate is {{ $value }} requests per second"
      - alert: HighResponseTime
        expr: histogram_quantile(0.95, rate(webhook_request_duration_seconds_bucket[5m])) > 1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High response time"
          description: "95th percentile response time is {{ $value }} seconds"
      - alert: AuthenticationFailures
        expr: increase(webhook_auth_failures_total[15m]) > 10
        for: 0m
        labels:
          severity: critical
        annotations:
          summary: "Multiple authentication failures"
          description: "{{ $value }} authentication failures in the last 15 minutes"
🎯 Production Success Metrics
Service Level Objectives (SLOs)
Availability SLO: 99.9% uptime
- Measurement: HTTP 200 responses / Total HTTP requests
- Error Budget: 43.2 minutes of downtime per 30-day month
- Alerting: Alert if availability drops below 99.5% over 1 hour
Latency SLO: 95% of requests < 500ms
- Measurement: Response time distribution
- Alerting: Alert if 95th percentile > 500ms for 5 minutes
Error Rate SLO: <0.1% error rate
- Measurement: HTTP 5xx responses / Total HTTP requests
- Alerting: Alert if error rate > 0.5% over 5 minutes
Security SLO: <10 authentication failures per day
- Measurement: Failed authentication attempts
- Alerting: Alert if >50 failures in 1 hour
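The 43.2-minute error budget follows directly from the availability target: 0.1% of a 30-day month (43,200 minutes). The sketch below reproduces that arithmetic and adds a request-based view of how much budget remains. The function names and the 30-day window are illustrative assumptions.

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Minutes of allowed downtime for an availability SLO over the window."""
    return (1.0 - slo) * window_days * 24 * 60


def budget_remaining(slo: float, total_requests: int, failed_requests: int) -> float:
    """Fraction of the request-based error budget still unspent (negative when overspent)."""
    allowed_failures = (1.0 - slo) * total_requests
    if allowed_failures == 0:
        return 0.0
    return 1.0 - failed_requests / allowed_failures


# 99.9% over 30 days: 0.001 * 43,200 minutes, about 43.2 minutes
print(round(error_budget_minutes(0.999), 1))
# 250 failures out of 1,000,000 requests spends a quarter of a 0.1% budget
print(round(budget_remaining(0.999, 1_000_000, 250), 2))
```

Tracking the remaining fraction rather than raw error counts makes it easier to decide when to pause risky deployments.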
Key Performance Indicators
Business Metrics:
□ Total webhook events processed per day
□ Notification delivery success rate (target: >99%)
□ Average response time (target: <100ms)
□ Cost per webhook processed
□ Mean time to detection (MTTD) for issues
□ Mean time to resolution (MTTR) for incidents
□ Infrastructure utilization efficiency
□ Customer satisfaction score
📞 Production Support
Incident Response
Severity Levels:
SEVERITY 1 - Critical (Service Down)
- Response Time: 15 minutes
- Resolution Time: 1 hour
- Actions: Immediate escalation, war room, customer communication
SEVERITY 2 - High (Degraded Performance)
- Response Time: 30 minutes
- Resolution Time: 4 hours
- Actions: Team lead notification, monitoring increase
SEVERITY 3 - Medium (Minor Issues)
- Response Time: 2 hours
- Resolution Time: 24 hours
- Actions: Standard troubleshooting, ticket tracking
SEVERITY 4 - Low (Enhancement Requests)
- Response Time: Next business day
- Resolution Time: Per roadmap
- Actions: Backlog prioritization
On-Call Procedures
24/7 Support Structure:
- Primary On-Call: Initial response and triage
- Secondary On-Call: Backup coverage and escalation
- Engineering Manager: Resource coordination
- Senior Leadership: Business impact decisions
Escalation Timeline:
- 15 minutes: Auto-escalate if no response
- 30 minutes: Escalate to secondary on-call
- 1 hour: Escalate to engineering manager
- 2 hours: Escalate to senior leadership
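The escalation timeline above is a simple threshold lookup, which makes it easy to encode in paging automation. The following is a minimal sketch of that lookup. The labels and function name are hypothetical, and it assumes the clock starts when the incident is first paged to the primary on-call.

```python
from bisect import bisect_right

# (minutes without acknowledgement, who is engaged at that point), taken from the timeline
ESCALATION_STEPS = [
    (15, "auto-escalate (no response)"),
    (30, "secondary on-call"),
    (60, "engineering manager"),
    (120, "senior leadership"),
]


def escalation_level(minutes_unacknowledged: float) -> str:
    """Return the highest escalation step reached after the given number of minutes."""
    thresholds = [minutes for minutes, _ in ESCALATION_STEPS]
    index = bisect_right(thresholds, minutes_unacknowledged)
    if index == 0:
        return "primary on-call"  # no threshold reached yet
    return ESCALATION_STEPS[index - 1][1]
```

For example, `escalation_level(45)` returns `"secondary on-call"`, because the 30-minute threshold has passed but the 1-hour threshold has not.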
🚀 Production Deployment Summary
This production deployment guide provides enterprise-grade reliability with:
✅ 99.9% Uptime Target - Comprehensive monitoring and alerting
✅ Enterprise Security - Multi-layer security hardening
✅ Auto-scaling - Dynamic resource allocation
✅ Disaster Recovery - Automated backup and recovery procedures
✅ 24/7 Support - Structured incident response and on-call coverage
✅ Performance Optimization - Sub-500ms response times