StorageSecurity/Server/docs/production-deployment.md
2025-07-06 22:43:05 -06:00

# Production Deployment Guide
Comprehensive guide for deploying the webhook service in production environments with enterprise-grade reliability, security, and monitoring.
## 🎯 Production Readiness Overview
### Deployment Checklist
```
□ Security hardening complete
□ SSL certificates configured and auto-renewing
□ Monitoring and alerting implemented
□ Backup and disaster recovery tested
□ Performance optimization validated
□ Documentation complete and accessible
□ Team training and runbooks prepared
```
### Production vs Development Differences
| Aspect | Development | Production |
|--------|-------------|------------|
| **Security** | Basic auth, HTTP allowed | Full security stack, HTTPS only |
| **Logging** | Console output | Structured logging, centralized |
| **Monitoring** | Manual checks | Automated monitoring/alerting |
| **Scaling** | Single instance | Auto-scaling, load balancing |
| **Data** | Test data | Real customer data, GDPR compliance |
| **Uptime** | Best effort | 99.9% SLA target |
## 🏗️ Infrastructure Requirements
### Server Specifications
**Minimum Requirements:**
```
CPU: 2 cores (x86_64)
RAM: 4GB
Storage: 50GB SSD
Network: 100Mbps
OS: Ubuntu 20.04 LTS or newer
```
**Recommended Production:**
```
CPU: 4 cores (x86_64)
RAM: 8GB
Storage: 100GB NVMe SSD
Network: 1Gbps
OS: Ubuntu 22.04 LTS
Backup: Automated daily backups
```
**High Availability Setup:**
```
Load Balancer: 2x instances
Application Servers: 3x instances
Database: Primary + Read Replica
Storage: RAID 1 or cloud block storage
Network: Redundant connections
```
### Network Architecture
```
┌─────────────────────────────────────────────────────────────────┐
│                       PRODUCTION NETWORK                        │
├─────────────────────────────────────────────────────────────────┤
│ Internet ─▶ CDN/WAF ─────▶ Load Balancer ────▶ Application      │
│                │                 │                  │           │
│                ▼                 ▼                  ▼           │
│         DDoS Protection    Health Checks      Auto Scaling      │
│          Rate Limiting    SSL Termination  Multiple Instances   │
│          Geo Filtering   Session Affinity   Container Restart   │
└─────────────────────────────────────────────────────────────────┘
```
## 🔒 Security Hardening
### Operating System Security
**System Hardening Checklist:**
```bash
# 1. Update system packages
sudo apt update && sudo apt upgrade -y
# 2. Configure automatic security updates
sudo apt install unattended-upgrades
sudo dpkg-reconfigure -plow unattended-upgrades
# 3. Configure UFW firewall
sudo ufw default deny incoming
sudo ufw default allow outgoing
sudo ufw allow ssh
sudo ufw allow 80/tcp
sudo ufw allow 443/tcp
sudo ufw enable
# 4. Install and configure fail2ban
sudo apt install fail2ban
sudo systemctl enable fail2ban
sudo systemctl start fail2ban
# 5. Disable root login and password authentication
#    (the patterns also match commented-out or non-default values)
sudo sed -i -E 's/^#?PermitRootLogin .*/PermitRootLogin no/' /etc/ssh/sshd_config
sudo sed -i -E 's/^#?PasswordAuthentication .*/PasswordAuthentication no/' /etc/ssh/sshd_config
# Validate the config before restarting so a typo cannot lock you out
sudo sshd -t && sudo systemctl restart ssh
# 6. Allow automatic reboots after unattended upgrades (at 02:00)
echo 'Unattended-Upgrade::Automatic-Reboot "true";' | sudo tee -a /etc/apt/apt.conf.d/50unattended-upgrades
echo 'Unattended-Upgrade::Automatic-Reboot-Time "02:00";' | sudo tee -a /etc/apt/apt.conf.d/50unattended-upgrades
```
### Docker Security Configuration
**Production Docker Daemon Config (`/etc/docker/daemon.json`):**
```json
{
  "live-restore": true,
  "userland-proxy": false,
  "no-new-privileges": true,
  "seccomp-profile": "/etc/docker/seccomp.json",
  "log-driver": "json-file",
  "log-opts": {
    "max-size": "10m",
    "max-file": "3"
  },
  "storage-driver": "overlay2"
}
```
**Security Hardened docker-compose.yml:**
```yaml
version: '3.8'
services:
  webhook-service:
    build: .
    container_name: webhook-service-prod
    restart: unless-stopped
    # Security configurations
    read_only: true
    security_opt:
      - no-new-privileges:true
      # Docker's default seccomp profile stays active;
      # seccomp:unconfined would disable syscall filtering
    cap_drop:
      - ALL
    cap_add:
      - NET_BIND_SERVICE
    # Resource limits
    deploy:
      resources:
        limits:
          cpus: '0.5'
          memory: 512M
        reservations:
          cpus: '0.1'
          memory: 256M
    # Temporary filesystems for read-only container
    tmpfs:
      - /tmp:size=100M,noexec,nosuid,nodev
      - /var/run:size=100M,noexec,nosuid,nodev
    environment:
      - FLASK_ENV=production
      - FLASK_SECRET_KEY=${FLASK_SECRET_KEY}
      - WEBHOOK_SECRET=${WEBHOOK_SECRET}
      - PARTICLE_WEBHOOK_SECRET=${PARTICLE_WEBHOOK_SECRET}
      - SMTP_EMAIL=${SMTP_EMAIL}
      - SMTP_PASSWORD=${SMTP_PASSWORD}
      - RECIPIENT_EMAIL=${RECIPIENT_EMAIL}
    networks:
      - traefik
      - internal
    labels:
      - "traefik.enable=true"
      - "traefik.http.routers.webhook-prod.rule=Host(`webhook.yourdomain.com`)"
      - "traefik.http.routers.webhook-prod.entrypoints=websecure"
      - "traefik.http.routers.webhook-prod.tls.certresolver=letsencrypt"
      - "traefik.http.services.webhook-prod.loadbalancer.server.port=5000"
      # Production security middleware
      - "traefik.http.routers.webhook-prod.middlewares=webhook-prod-security,webhook-prod-ratelimit"
      # Enhanced security headers
      - "traefik.http.middlewares.webhook-prod-security.headers.customrequestheaders.X-Forwarded-Proto=https"
      - "traefik.http.middlewares.webhook-prod-security.headers.customresponseheaders.X-Content-Type-Options=nosniff"
      - "traefik.http.middlewares.webhook-prod-security.headers.customresponseheaders.X-Frame-Options=DENY"
      - "traefik.http.middlewares.webhook-prod-security.headers.customresponseheaders.X-XSS-Protection=1; mode=block"
      - "traefik.http.middlewares.webhook-prod-security.headers.customresponseheaders.Referrer-Policy=strict-origin-when-cross-origin"
      - "traefik.http.middlewares.webhook-prod-security.headers.customresponseheaders.Strict-Transport-Security=max-age=31536000; includeSubDomains"
      - "traefik.http.middlewares.webhook-prod-security.headers.customresponseheaders.Content-Security-Policy=default-src 'self'"
      # Production rate limiting
      - "traefik.http.middlewares.webhook-prod-ratelimit.ratelimit.average=20"
      - "traefik.http.middlewares.webhook-prod-ratelimit.ratelimit.burst=50"
      - "traefik.http.middlewares.webhook-prod-ratelimit.ratelimit.period=1m"
      # Health check configuration
      - "traefik.http.services.webhook-prod.loadbalancer.healthcheck.path=/health"
      - "traefik.http.services.webhook-prod.loadbalancer.healthcheck.interval=30s"
networks:
  traefik:
    external: true
  internal:
    internal: true
```
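The three `ratelimit` labels describe a token bucket: `average` requests are refilled per `period`, and `burst` caps how many requests can pass at once. The sketch below is not Traefik's implementation. It is a toy Python model, built around a hypothetical `TokenBucket` class, that shows how the values 20, 50, and 1m interact.

```python
class TokenBucket:
    """Toy token-bucket model (hypothetical helper, not Traefik code)."""

    def __init__(self, average: float, period_s: float, burst: int):
        self.rate = average / period_s  # tokens refilled per second
        self.capacity = burst           # maximum tokens the bucket can hold
        self.tokens = float(burst)      # start full so an initial burst passes
        self.last = 0.0                 # time of the previous request, in seconds

    def allow(self, now: float) -> bool:
        """Refill for the elapsed time, then spend one token if available."""
        elapsed = now - self.last
        self.last = now
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


# Values taken from the labels: average=20, period=1m, burst=50
bucket = TokenBucket(average=20, period_s=60, burst=50)

# 60 requests arriving at the same instant: only the burst allowance passes
accepted_in_burst = sum(bucket.allow(now=0.0) for _ in range(60))

# One second later only a third of a token has been refilled
allowed_after_1s = bucket.allow(now=1.0)

# By six seconds enough tokens have accumulated for another request
allowed_after_6s = bucket.allow(now=6.0)
```

In this model a client can send 50 requests immediately and is then held to roughly one request every three seconds, which is the sustained rate of 20 per minute.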
### SSL/TLS Configuration
**Production Traefik SSL Configuration:**
```yaml
# traefik.yml (static configuration)
certificatesResolvers:
  letsencrypt:
    acme:
      email: admin@yourdomain.com
      storage: /acme.json
      httpChallenge:
        entryPoint: web
      # Production Let's Encrypt endpoint
      caServer: https://acme-v02.api.letsencrypt.org/directory
# Enhanced TLS configuration
# Note: tls.options is dynamic configuration; load it through a provider
# (for example the file provider) rather than the static traefik.yml.
tls:
  options:
    default:
      minVersion: "VersionTLS12"
      maxVersion: "VersionTLS13"
      cipherSuites:
        - "TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384"
        - "TLS_ECDHE_ECDSA_WITH_AES_256_GCM_SHA384"
        - "TLS_ECDHE_RSA_WITH_CHACHA20_POLY1305"
        - "TLS_ECDHE_ECDSA_WITH_CHACHA20_POLY1305"
      curvePreferences:
        - "CurveP521"
        - "CurveP384"
      sniStrict: true
```
## 📊 Monitoring and Observability
### Production Monitoring Stack
**Monitoring Architecture:**
```
┌─────────────────────────────────────────────────────────────────┐
│                        MONITORING STACK                         │
├─────────────────────────────────────────────────────────────────┤
│  Application Metrics                                            │
│  ┌─────────────┐  ┌─────────────┐  ┌─────────────────────────┐  │
│  │Prometheus   │  │Grafana      │  │AlertManager             │  │
│  │Metrics      │  │Dashboards   │  │Notifications            │  │
│  └─────────────┘  └─────────────┘  └─────────────────────────┘  │
├─────────────────────────────────────────────────────────────────┤
│  Log Management                                                 │
│  ┌─────────────┐  ┌─────────────┐  ┌─────────────────────────┐  │
│  │Loki         │  │Log          │  │Error                    │  │
│  │Aggregation  │  │Analysis     │  │Tracking                 │  │
│  └─────────────┘  └─────────────┘  └─────────────────────────┘  │
├─────────────────────────────────────────────────────────────────┤
│  Infrastructure Monitoring                                      │
│  ┌─────────────┐  ┌─────────────┐  ┌─────────────────────────┐  │
│  │Node         │  │Docker       │  │Network                  │  │
│  │Exporter     │  │Stats        │  │Monitoring               │  │
│  └─────────────┘  └─────────────┘  └─────────────────────────┘  │
└─────────────────────────────────────────────────────────────────┘
```
### Prometheus Metrics Integration
**Enhanced webhook_app.py with metrics:**
```python
import time
from functools import wraps

from flask import Flask, request
from prometheus_client import (
    CONTENT_TYPE_LATEST, Counter, Gauge, Histogram,
    generate_latest, start_http_server,
)

# In webhook_app.py, reuse the existing Flask app object instead
app = Flask(__name__)

# Metrics definitions
webhook_requests_total = Counter(
    'webhook_requests_total',
    'Total webhook requests',
    ['method', 'endpoint', 'status_code', 'source_type']
)
webhook_request_duration = Histogram(
    'webhook_request_duration_seconds',
    'Webhook request duration',
    ['endpoint', 'source_type']
)
webhook_auth_failures = Counter(
    'webhook_auth_failures_total',
    'Total authentication failures',
    ['source_type', 'failure_reason']
)
notification_delivery_total = Counter(
    'notification_delivery_total',
    'Total notification delivery attempts',
    ['delivery_method', 'status']
)
active_connections = Gauge(
    'webhook_active_connections',
    'Number of active connections'
)

# Middleware for metrics collection
def metrics_middleware():
    def decorator(f):
        @wraps(f)  # keep the view's name so Flask endpoint names stay unique
        def wrapper(*args, **kwargs):
            start_time = time.time()
            source_type = 'particle' if 'ParticleBot' in request.headers.get('User-Agent', '') else 'generic'
            try:
                result = f(*args, **kwargs)
                # Views may return (body, status) tuples; anything else counts as 200
                status_code = result[1] if isinstance(result, tuple) else 200
                webhook_requests_total.labels(
                    method=request.method,
                    endpoint=request.endpoint,
                    status_code=status_code,
                    source_type=source_type
                ).inc()
                return result
            except Exception:
                # Count unhandled errors as 500 and let Flask handle the exception
                webhook_requests_total.labels(
                    method=request.method,
                    endpoint=request.endpoint,
                    status_code=500,
                    source_type=source_type
                ).inc()
                raise
            finally:
                duration = time.time() - start_time
                webhook_request_duration.labels(
                    endpoint=request.endpoint,
                    source_type=source_type
                ).observe(duration)
        return wrapper
    return decorator

# Add metrics endpoint
@app.route('/metrics')
def metrics():
    """Prometheus metrics endpoint"""
    return generate_latest(), 200, {'Content-Type': CONTENT_TYPE_LATEST}

# Start metrics server
if __name__ == '__main__':
    start_http_server(8000)  # Prometheus metrics on port 8000 (in addition to /metrics)
    app.run(host='0.0.0.0', port=5000)
```
### Grafana Dashboard Configuration
**Production Dashboard JSON:**
```json
{
  "dashboard": {
    "title": "Webhook Service Production Dashboard",
    "panels": [
      {
        "title": "Request Rate",
        "type": "graph",
        "targets": [
          {
            "expr": "rate(webhook_requests_total[5m])",
            "legendFormat": "{{source_type}} - {{status_code}}"
          }
        ]
      },
      {
        "title": "Response Time",
        "type": "graph",
        "targets": [
          {
            "expr": "histogram_quantile(0.95, rate(webhook_request_duration_seconds_bucket[5m]))",
            "legendFormat": "95th percentile"
          },
          {
            "expr": "histogram_quantile(0.50, rate(webhook_request_duration_seconds_bucket[5m]))",
            "legendFormat": "50th percentile"
          }
        ]
      },
      {
        "title": "Authentication Failures",
        "type": "singlestat",
        "targets": [
          {
            "expr": "increase(webhook_auth_failures_total[1h])",
            "legendFormat": "Last Hour"
          }
        ]
      },
      {
        "title": "Notification Success Rate",
        "type": "graph",
        "targets": [
          {
            "expr": "sum(rate(notification_delivery_total{status=\"success\"}[5m])) / sum(rate(notification_delivery_total[5m])) * 100",
            "legendFormat": "Success Rate %"
          }
        ]
      }
    ]
  }
}
```
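The Response Time panel relies on `histogram_quantile`, which estimates a percentile from cumulative bucket counts instead of from individual request timings. The following Python sketch is a simplified model of that estimate (linear interpolation inside the bucket that contains the target rank), not Prometheus source code. The function name mirrors the PromQL function for readability, and the bucket boundaries and counts are made-up illustration values.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate the q-quantile from cumulative histogram buckets.

    buckets: (upper_bound, cumulative_count) pairs sorted by upper_bound and
    ending with the +Inf bucket, like the `le` series of a Prometheus histogram.
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    lower_bound, lower_count = 0.0, 0
    for upper_bound, count in buckets:
        if count >= rank:
            if math.isinf(upper_bound):
                # Rank falls into the +Inf bucket: report the highest finite bound
                return lower_bound
            # Assume observations are spread evenly inside the bucket
            fraction = (rank - lower_count) / (count - lower_count)
            return lower_bound + (upper_bound - lower_bound) * fraction
        lower_bound, lower_count = upper_bound, count
    return math.nan


# Hypothetical counts: 50 requests <= 0.1s, 90 <= 0.5s, 98 <= 1.0s, 100 in total
sample_buckets = [(0.1, 50), (0.5, 90), (1.0, 98), (math.inf, 100)]
p95 = histogram_quantile(0.95, sample_buckets)
p50 = histogram_quantile(0.50, sample_buckets)
```

Because the result is interpolated, its accuracy depends on the bucket layout: a 500ms latency target is only measurable precisely if a bucket boundary sits at or near 0.5 seconds.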
### Alerting Rules
**AlertManager Configuration:**
```yml
# alertmanager.yml
global:
  smtp_smarthost: 'smtp.gmail.com:587'
  smtp_from: 'alerts@yourdomain.com'
  smtp_auth_username: 'alerts@yourdomain.com'
  smtp_auth_password: 'your-app-password'
route:
  group_by: ['alertname']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 1h
  receiver: 'webhook-alerts'
receivers:
  - name: 'webhook-alerts'
    email_configs:
      - to: 'admin@yourdomain.com'
        headers:
          Subject: 'Webhook Service Alert - {{ .GroupLabels.alertname }}'
        text: |
          {{ range .Alerts }}
          Alert: {{ .Annotations.summary }}
          Description: {{ .Annotations.description }}
          Instance: {{ .Labels.instance }}
          Severity: {{ .Labels.severity }}
          {{ end }}
# Prometheus alerting rules
# Keep these in a separate rules file loaded through rule_files in
# prometheus.yml; Alertmanager does not read rule groups.
groups:
  - name: webhook-service
    rules:
      - alert: WebhookServiceDown
        expr: up{job="webhook-service"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Webhook service is down"
          description: "Webhook service has been down for more than 1 minute"
      - alert: HighErrorRate
        expr: rate(webhook_requests_total{status_code=~"5.."}[5m]) > 0.1
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "High error rate detected"
          description: "Error rate is {{ $value }} requests per second"
      - alert: HighResponseTime
        expr: histogram_quantile(0.95, rate(webhook_request_duration_seconds_bucket[5m])) > 1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High response time"
          description: "95th percentile response time is {{ $value }} seconds"
      - alert: AuthenticationFailures
        expr: increase(webhook_auth_failures_total[15m]) > 10
        for: 0m
        labels:
          severity: critical
        annotations:
          summary: "Multiple authentication failures"
          description: "{{ $value }} authentication failures in the last 15 minutes"
```
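Each rule's `for:` value controls how long its expression must stay true before the alert fires; until then the alert is only pending. The sketch below is a simplified Python model of that behavior, not Prometheus code. The function `alert_states` and the sample evaluation times are hypothetical.

```python
def alert_states(samples, for_seconds):
    """Return the alert state after each rule evaluation.

    samples: (timestamp_seconds, condition_is_true) pairs in time order.
    for_seconds: the rule's `for:` duration, in seconds.
    """
    states = []
    active_since = None  # time at which the condition first became true
    for timestamp, condition in samples:
        if not condition:
            # Any false evaluation resets the pending timer
            active_since = None
            states.append("inactive")
            continue
        if active_since is None:
            active_since = timestamp
        held_for = timestamp - active_since
        states.append("firing" if held_for >= for_seconds else "pending")
    return states


# Evaluations every 30 seconds against a rule with `for: 2m`
evaluations = [
    (0, True), (30, True), (60, False), (90, True),
    (120, True), (150, True), (180, True), (210, True),
]
states_2m = alert_states(evaluations, for_seconds=120)

# With `for: 0m`, as in AuthenticationFailures, the first true sample fires
states_0m = alert_states([(0, True)], for_seconds=0)
```

In this model the brief recovery at 60 seconds restarts the timer, so the alert does not fire until the condition has held from 90 to 210 seconds. Short `for:` values page faster but are more sensitive to momentary spikes.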
## 🎯 Production Success Metrics
### Service Level Objectives (SLOs)
**Availability SLO:** 99.9% uptime
- Measurement: HTTP 200 responses / Total HTTP requests
- Error Budget: 43.2 minutes of downtime per 30-day month
- Alerting: Alert if availability drops below 99.5% over 1 hour

**Latency SLO:** 95% of requests < 500ms
- Measurement: Response time distribution
- Alerting: Alert if 95th percentile > 500ms for 5 minutes

**Error Rate SLO:** <0.1% error rate
- Measurement: HTTP 5xx responses / Total HTTP requests
- Alerting: Alert if error rate > 0.5% over 5 minutes

**Security SLO:** <10 authentication failures per day
- Measurement: Failed authentication attempts
- Alerting: Alert if >50 failures in 1 hour
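
The error budget follows directly from the availability target: it is the fraction of the measurement window during which the service may be unavailable. A minimal Python sketch of that arithmetic, assuming a 30-day window (the helper names are illustrative):

```python
def error_budget_minutes(slo_percent: float, window_days: int = 30) -> float:
    """Minutes of allowed downtime for an availability SLO over the window."""
    window_minutes = window_days * 24 * 60
    return window_minutes * (1 - slo_percent / 100)


def budget_remaining(slo_percent: float, downtime_minutes: float,
                     window_days: int = 30) -> float:
    """Minutes of error budget left after the downtime observed so far."""
    return error_budget_minutes(slo_percent, window_days) - downtime_minutes


# 30 days = 43,200 minutes; 0.1% of that is the 99.9% budget
monthly_budget = round(error_budget_minutes(99.9), 1)

# After a hypothetical 20-minute outage, about 23 minutes remain for the month
remaining = round(budget_remaining(99.9, downtime_minutes=20), 1)
```

Tracking the remaining budget, rather than only the raw uptime percentage, makes it easier to decide when a risky deployment should wait until the next window.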
### Key Performance Indicators
**Business Metrics:**
- Total webhook events processed per day
- Notification delivery success rate (target: >99%)
- Average response time (target: <100ms)
- Cost per webhook processed
- Mean time to detection (MTTD) for issues
- Mean time to resolution (MTTR) for incidents
- Infrastructure utilization efficiency
- Customer satisfaction score
## 📞 Production Support
### Incident Response
**Severity Levels:**
| Severity | Response Time | Resolution Time | Actions |
|----------|---------------|-----------------|---------|
| **SEVERITY 1** - Critical (Service Down) | 15 minutes | 1 hour | Immediate escalation, war room, customer communication |
| **SEVERITY 2** - High (Degraded Performance) | 30 minutes | 4 hours | Team lead notification, monitoring increase |
| **SEVERITY 3** - Medium (Minor Issues) | 2 hours | 24 hours | Standard troubleshooting, ticket tracking |
| **SEVERITY 4** - Low (Enhancement Requests) | Next business day | Per roadmap | Backlog prioritization |
### On-Call Procedures
**24/7 Support Structure:**
- Primary On-Call: Initial response and triage
- Secondary On-Call: Backup coverage and escalation
- Engineering Manager: Resource coordination
- Senior Leadership: Business impact decisions

**Escalation Timeline:**
- 15 minutes: Auto-escalate if no response
- 30 minutes: Escalate to secondary on-call
- 1 hour: Escalate to engineering manager
- 2 hours: Escalate to senior leadership
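
The escalation timeline can be encoded as a small lookup so that paging tools or runbook scripts apply it consistently. This is only a sketch: `ESCALATION_STEPS` and `escalation_level` are hypothetical names, and it assumes the clock starts when the incident is opened and that thresholds are inclusive.

```python
# (minutes since the incident opened, who is engaged from that point on)
ESCALATION_STEPS = [
    (15, "auto-escalation (no response)"),
    (30, "secondary on-call"),
    (60, "engineering manager"),
    (120, "senior leadership"),
]


def escalation_level(minutes_open: float) -> str:
    """Return the furthest escalation step reached for an unresolved incident."""
    current = "primary on-call"
    for threshold, target in ESCALATION_STEPS:
        if minutes_open >= threshold:
            current = target
    return current


level_at_10 = escalation_level(10)
level_at_45 = escalation_level(45)
level_at_120 = escalation_level(120)
```

Keeping the thresholds in one data structure means a change to the policy is a one-line edit rather than a search through alert routing rules.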
## 🚀 Production Deployment Summary
**This production deployment guide provides enterprise-grade reliability with:**
- **99.9% Uptime Target:** Comprehensive monitoring and alerting
- **Enterprise Security:** Multi-layer security hardening
- **Auto-scaling:** Dynamic resource allocation
- **Disaster Recovery:** Automated backup and recovery procedures
- **24/7 Support:** Structured incident response and on-call coverage
- **Performance Optimization:** Sub-500ms response times