# Production Deployment Guide

Comprehensive guide for deploying the webhook service in production environments with enterprise-grade reliability, security, and monitoring.

## 🎯 Production Readiness Overview

### Deployment Checklist

```
□ Security hardening complete
□ SSL certificates configured and auto-renewing
□ Monitoring and alerting implemented
□ Backup and disaster recovery tested
□ Performance optimization validated
□ Documentation complete and accessible
□ Team training and runbooks prepared
```

### Production vs Development Differences

| Aspect | Development | Production |
|--------|-------------|------------|
| **Security** | Basic auth, HTTP allowed | Full security stack, HTTPS only |
| **Logging** | Console output | Structured logging, centralized |
| **Monitoring** | Manual checks | Automated monitoring/alerting |
| **Scaling** | Single instance | Auto-scaling, load balancing |
| **Data** | Test data | Real customer data, GDPR compliance |
| **Uptime** | Best effort | 99.9% SLA target |

## 🏗️ Infrastructure Requirements

### Server Specifications

**Minimum Requirements:**
```
CPU: 2 cores (x86_64)
RAM: 4GB
Storage: 50GB SSD
Network: 100Mbps
OS: Ubuntu 20.04 LTS or newer
```

**Recommended Production:**
```
CPU: 4 cores (x86_64)
RAM: 8GB
Storage: 100GB NVMe SSD
Network: 1Gbps
OS: Ubuntu 22.04 LTS
Backup: Automated daily backups
```

**High Availability Setup:**
```
Load Balancer: 2x instances
Application Servers: 3x instances
Database: Primary + Read Replica
Storage: RAID 1 or cloud block storage
Network: Redundant connections
```

### Network Architecture

```
┌─────────────────────────────────────────────────────────────────┐
│                       PRODUCTION NETWORK                        │
├─────────────────────────────────────────────────────────────────┤
│  Internet ──▶ CDN/WAF ──▶ Load Balancer ──▶ Application         │
│                  │              │                │              │
│                  ▼              ▼                ▼              │
│           DDoS Protection Health Checks    Auto Scaling         │
│           Rate Limiting   SSL Termination  Multiple Instances   │
│           Geo Filtering   Session Affinity Container Restart    │
└─────────────────────────────────────────────────────────────────┘
```

## 🔒 Security Hardening

### Operating System Security

**System Hardening Checklist:**
```bash
# 1. Update system packages
sudo apt update && sudo apt upgrade -y

# 2. Configure automatic security updates
sudo apt install -y unattended-upgrades
sudo dpkg-reconfigure -plow unattended-upgrades

# 3. Configure UFW firewall
sudo ufw default deny incoming
sudo ufw default allow outgoing
sudo ufw allow ssh
sudo ufw allow 80/tcp
sudo ufw allow 443/tcp
sudo ufw enable

# 4. Install and configure fail2ban
sudo apt install -y fail2ban
sudo systemctl enable fail2ban
sudo systemctl start fail2ban

# 5. Disable root login and password authentication
#    (the patterns match the directive whether or not it is commented out)
sudo sed -i -E 's/^#?PermitRootLogin\s+.*/PermitRootLogin no/' /etc/ssh/sshd_config
sudo sed -i -E 's/^#?PasswordAuthentication\s+.*/PasswordAuthentication no/' /etc/ssh/sshd_config
# Validate the configuration before restarting so a typo cannot lock you out
sudo sshd -t && sudo systemctl restart ssh

# 6. Allow automatic reboots after security updates
echo 'Unattended-Upgrade::Automatic-Reboot "true";' | sudo tee -a /etc/apt/apt.conf.d/50unattended-upgrades
echo 'Unattended-Upgrade::Automatic-Reboot-Time "02:00";' | sudo tee -a /etc/apt/apt.conf.d/50unattended-upgrades
```

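After step 5 it can be worth confirming that the SSH configuration really ended up in the hardened state. The following is a minimal sketch, not part of the original checklist: `audit_sshd_config` is a hypothetical helper that inspects the text of `sshd_config` and reports directives that differ from the values set above. It assumes that no `Include`d drop-in files or `Match` blocks override these directives.

```python
from typing import Dict

# Hardened values expected after step 5 of the checklist.
EXPECTED = {
    "permitrootlogin": "no",
    "passwordauthentication": "no",
}


def audit_sshd_config(text: str) -> Dict[str, str]:
    """Return {directive: problem} for directives that are not hardened.

    sshd uses the first value it finds for a keyword, so later
    duplicates are ignored here as well.
    """
    seen: Dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        parts = line.split(None, 1)
        key = parts[0].lower()  # keywords are case-insensitive
        value = parts[1].strip().lower() if len(parts) > 1 else ""
        seen.setdefault(key, value)  # first value wins

    problems: Dict[str, str] = {}
    for key, wanted in EXPECTED.items():
        actual = seen.get(key)
        if actual is None:
            problems[key] = "not set explicitly (built-in default applies)"
        elif actual != wanted:
            problems[key] = f"is '{actual}', expected '{wanted}'"
    return problems
```

A typical call would read `/etc/ssh/sshd_config` and pass its contents to `audit_sshd_config`; an empty result means both directives are set to `no`.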
### Docker Security Configuration

**Production Docker Daemon Config (`/etc/docker/daemon.json`):**
```json
{
  "live-restore": true,
  "userland-proxy": false,
  "no-new-privileges": true,
  "seccomp-profile": "/etc/docker/seccomp.json",
  "log-driver": "json-file",
  "log-opts": {
    "max-size": "10m",
    "max-file": "3"
  },
  "storage-driver": "overlay2"
}
```

`daemon.json` must be strict JSON, so it cannot contain comments. The `seccomp-profile` path must point to an existing profile file; omit the key to use Docker's built-in default profile. The `overlay2.override_kernel_check` storage option is not set because Docker has deprecated it since 19.03.

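Because a malformed `daemon.json` prevents the Docker daemon from starting, a quick check before restarting Docker can help. This is a sketch under assumptions: `check_daemon_config` and `REQUIRED_SETTINGS` are hypothetical names, and the list of required keys simply mirrors the settings shown above.

```python
import json
from typing import Any, Dict, List

# Settings this guide's daemon.json is expected to carry (illustrative subset).
REQUIRED_SETTINGS: Dict[str, Any] = {
    "live-restore": True,
    "userland-proxy": False,
    "no-new-privileges": True,
    "log-driver": "json-file",
}


def check_daemon_config(raw: str) -> List[str]:
    """Return human-readable findings; an empty list means no problems."""
    try:
        config = json.loads(raw)
    except json.JSONDecodeError as exc:
        # Comments or trailing commas make the file invalid JSON.
        return [f"invalid JSON: {exc.msg} (line {exc.lineno})"]
    if not isinstance(config, dict):
        return ["top-level value must be a JSON object"]

    findings: List[str] = []
    for key, wanted in REQUIRED_SETTINGS.items():
        if key not in config:
            findings.append(f"missing key: {key}")
        elif config[key] != wanted:
            findings.append(f"{key} is {config[key]!r}, expected {wanted!r}")
    return findings
```

Passing the contents of `/etc/docker/daemon.json` to `check_daemon_config` returns an empty list when the file parses and carries the expected values.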
**Security Hardened docker-compose.yml:**
```yaml
version: '3.8'

services:
  webhook-service:
    build: .
    container_name: webhook-service-prod
    restart: unless-stopped

    # Security configurations
    # Docker's default seccomp profile stays active;
    # setting seccomp:unconfined would disable syscall filtering.
    read_only: true
    security_opt:
      - no-new-privileges:true
    cap_drop:
      - ALL
    cap_add:
      - NET_BIND_SERVICE

    # Resource limits
    deploy:
      resources:
        limits:
          cpus: '0.5'
          memory: 512M
        reservations:
          cpus: '0.1'
          memory: 256M

    # Temporary filesystems for read-only container
    tmpfs:
      - /tmp:size=100M,noexec,nosuid,nodev
      - /var/run:size=100M,noexec,nosuid,nodev

    environment:
      - FLASK_ENV=production
      - FLASK_SECRET_KEY=${FLASK_SECRET_KEY}
      - WEBHOOK_SECRET=${WEBHOOK_SECRET}
      - PARTICLE_WEBHOOK_SECRET=${PARTICLE_WEBHOOK_SECRET}
      - SMTP_EMAIL=${SMTP_EMAIL}
      - SMTP_PASSWORD=${SMTP_PASSWORD}
      - RECIPIENT_EMAIL=${RECIPIENT_EMAIL}

    networks:
      - traefik
      - internal

    labels:
      - "traefik.enable=true"
      - "traefik.http.routers.webhook-prod.rule=Host(`webhook.yourdomain.com`)"
      - "traefik.http.routers.webhook-prod.entrypoints=websecure"
      - "traefik.http.routers.webhook-prod.tls.certresolver=letsencrypt"
      - "traefik.http.services.webhook-prod.loadbalancer.server.port=5000"

      # Production security middleware
      - "traefik.http.routers.webhook-prod.middlewares=webhook-prod-security,webhook-prod-ratelimit"

      # Enhanced security headers
      - "traefik.http.middlewares.webhook-prod-security.headers.customrequestheaders.X-Forwarded-Proto=https"
      - "traefik.http.middlewares.webhook-prod-security.headers.customresponseheaders.X-Content-Type-Options=nosniff"
      - "traefik.http.middlewares.webhook-prod-security.headers.customresponseheaders.X-Frame-Options=DENY"
      - "traefik.http.middlewares.webhook-prod-security.headers.customresponseheaders.X-XSS-Protection=1; mode=block"
      - "traefik.http.middlewares.webhook-prod-security.headers.customresponseheaders.Referrer-Policy=strict-origin-when-cross-origin"
      - "traefik.http.middlewares.webhook-prod-security.headers.customresponseheaders.Strict-Transport-Security=max-age=31536000; includeSubDomains"
      - "traefik.http.middlewares.webhook-prod-security.headers.customresponseheaders.Content-Security-Policy=default-src 'self'"

      # Production rate limiting
      - "traefik.http.middlewares.webhook-prod-ratelimit.ratelimit.average=20"
      - "traefik.http.middlewares.webhook-prod-ratelimit.ratelimit.burst=50"
      - "traefik.http.middlewares.webhook-prod-ratelimit.ratelimit.period=1m"

      # Health check configuration
      - "traefik.http.services.webhook-prod.loadbalancer.healthcheck.path=/health"
      - "traefik.http.services.webhook-prod.loadbalancer.healthcheck.interval=30s"

networks:
  traefik:
    external: true
  internal:
    internal: true
```

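The rate-limiting labels above configure Traefik's RateLimit middleware, which behaves like a token bucket: `average` requests per `period` refill the bucket, and `burst` caps how many requests can pass at once. The sketch below illustrates that idea in Python. It is a simplified model for reasoning about the numbers, not Traefik's implementation, and `TokenBucket` is a hypothetical name.

```python
class TokenBucket:
    """Simplified token bucket: `average` tokens per `period_s` seconds,
    holding at most `burst` tokens. Time is passed in explicitly so the
    behaviour is easy to follow and to test."""

    def __init__(self, average: float, period_s: float, burst: int, now: float = 0.0):
        self.rate = average / period_s   # tokens added per second
        self.capacity = float(burst)     # maximum stored tokens
        self.tokens = float(burst)       # the bucket starts full
        self.updated = now

    def allow(self, now: float) -> bool:
        """Return True if a request arriving at time `now` is admitted."""
        elapsed = max(0.0, now - self.updated)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


# Values from the labels above: average=20, period=1m, burst=50.
bucket = TokenBucket(average=20, period_s=60, burst=50)
admitted = sum(bucket.allow(now=0.0) for _ in range(60))
print(admitted)  # 50: the burst is served, the remaining 10 are rejected
```

With these settings a single source can send a burst of 50 requests, after which it is held to roughly one request every 3 seconds (20 per minute) until the bucket refills.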
### SSL/TLS Configuration

**Production Traefik SSL Configuration:**
```yaml
# traefik.yml (static configuration)
certificatesResolvers:
  letsencrypt:
    acme:
      email: admin@yourdomain.com
      storage: /acme.json
      # Production Let's Encrypt endpoint
      caServer: https://acme-v02.api.letsencrypt.org/directory
      httpChallenge:
        entryPoint: web

# Enhanced TLS configuration
# TLS options are dynamic configuration: load this block through
# the file provider rather than the static traefik.yml.
tls:
  options:
    default:
      minVersion: "VersionTLS12"
      maxVersion: "VersionTLS13"
      # cipherSuites only affects TLS 1.2 connections
      cipherSuites:
        - "TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384"
        - "TLS_ECDHE_ECDSA_WITH_AES_256_GCM_SHA384"
        - "TLS_ECDHE_RSA_WITH_CHACHA20_POLY1305"
        - "TLS_ECDHE_ECDSA_WITH_CHACHA20_POLY1305"
      curvePreferences:
        - "CurveP521"
        - "CurveP384"
      sniStrict: true
```

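The deployment checklist asks for certificates that are "configured and auto-renewing", and a periodic expiry check is one way to notice when renewal silently stops working. The following is a hedged sketch using only the Python standard library. The helper names `days_until_expiry` and `fetch_not_after` are hypothetical and not part of the service.

```python
import socket
import ssl
from datetime import datetime, timezone


def days_until_expiry(not_after: str, now: datetime) -> float:
    """Days between `now` and a certificate's notAfter timestamp.

    `not_after` uses the format returned by SSLSocket.getpeercert(),
    for example "Jan  5 09:34:43 2030 GMT".
    """
    expires = datetime.fromtimestamp(
        ssl.cert_time_to_seconds(not_after), tz=timezone.utc
    )
    return (expires - now).total_seconds() / 86400


def fetch_not_after(host: str, port: int = 443, timeout: float = 5.0) -> str:
    """Connect with TLS 1.2 or newer and return the certificate's notAfter."""
    context = ssl.create_default_context()
    context.minimum_version = ssl.TLSVersion.TLSv1_2  # mirrors minVersion above
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert()["notAfter"]
```

A monitoring job could call `days_until_expiry(fetch_not_after("webhook.yourdomain.com"), datetime.now(timezone.utc))` and alert when the result falls well below the renewal window. The right threshold depends on the certificate lifetime and on when your ACME client renews.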
## 📊 Monitoring and Observability

### Production Monitoring Stack

**Monitoring Architecture:**
```
┌─────────────────────────────────────────────────────────────────┐
│                        MONITORING STACK                         │
├─────────────────────────────────────────────────────────────────┤
│  Application Metrics                                            │
│  ┌─────────────┐  ┌─────────────┐  ┌─────────────────────────┐  │
│  │Prometheus   │  │Grafana      │  │AlertManager             │  │
│  │Metrics      │  │Dashboards   │  │Notifications            │  │
│  └─────────────┘  └─────────────┘  └─────────────────────────┘  │
├─────────────────────────────────────────────────────────────────┤
│  Log Management                                                 │
│  ┌─────────────┐  ┌─────────────┐  ┌─────────────────────────┐  │
│  │Loki         │  │Log          │  │Error                    │  │
│  │Aggregation  │  │Analysis     │  │Tracking                 │  │
│  └─────────────┘  └─────────────┘  └─────────────────────────┘  │
├─────────────────────────────────────────────────────────────────┤
│  Infrastructure Monitoring                                      │
│  ┌─────────────┐  ┌─────────────┐  ┌─────────────────────────┐  │
│  │Node         │  │Docker       │  │Network                  │  │
│  │Exporter     │  │Stats        │  │Monitoring               │  │
│  └─────────────┘  └─────────────┘  └─────────────────────────┘  │
└─────────────────────────────────────────────────────────────────┘
```

### Prometheus Metrics Integration

**Enhanced webhook_app.py with metrics:**
```python
import time
from functools import wraps

from flask import request
from prometheus_client import (
    CONTENT_TYPE_LATEST, Counter, Gauge, Histogram,
    generate_latest, start_http_server,
)

# `app` is the Flask application object already defined in webhook_app.py.

# Metrics definitions
webhook_requests_total = Counter(
    'webhook_requests_total',
    'Total webhook requests',
    ['method', 'endpoint', 'status_code', 'source_type']
)

webhook_request_duration = Histogram(
    'webhook_request_duration_seconds',
    'Webhook request duration',
    ['endpoint', 'source_type']
)

webhook_auth_failures = Counter(
    'webhook_auth_failures_total',
    'Total authentication failures',
    ['source_type', 'failure_reason']
)

notification_delivery_total = Counter(
    'notification_delivery_total',
    'Total notification delivery attempts',
    ['delivery_method', 'status']
)

active_connections = Gauge(
    'webhook_active_connections',
    'Number of active connections'
)

# Middleware for metrics collection
def metrics_middleware():
    def decorator(f):
        @wraps(f)  # keep the view's name so Flask endpoint names stay unique
        def wrapper(*args, **kwargs):
            start_time = time.time()
            source_type = 'particle' if 'ParticleBot' in request.headers.get('User-Agent', '') else 'generic'

            try:
                result = f(*args, **kwargs)
                # Views returning (body, status) carry the status in position 1
                status_code = result[1] if isinstance(result, tuple) else 200

                webhook_requests_total.labels(
                    method=request.method,
                    endpoint=request.endpoint,
                    status_code=status_code,
                    source_type=source_type
                ).inc()

                return result

            except Exception:
                # Count unhandled exceptions as 500s, then re-raise
                webhook_requests_total.labels(
                    method=request.method,
                    endpoint=request.endpoint,
                    status_code=500,
                    source_type=source_type
                ).inc()
                raise

            finally:
                # Record the duration for successes and failures alike
                duration = time.time() - start_time
                webhook_request_duration.labels(
                    endpoint=request.endpoint,
                    source_type=source_type
                ).observe(duration)

        return wrapper
    return decorator

# Add metrics endpoint
@app.route('/metrics')
def metrics():
    """Prometheus metrics endpoint"""
    return generate_latest(), 200, {'Content-Type': CONTENT_TYPE_LATEST}

# Start metrics server
if __name__ == '__main__':
    start_http_server(8000)  # Prometheus metrics on port 8000
    # app.run() starts Flask's built-in server; in production the app is
    # normally served by a WSGI server instead.
    app.run(host='0.0.0.0', port=5000)
```

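The decorator above derives the status code with `result[1] if isinstance(result, tuple) else 200`, which covers `(body, status)` tuples but not response objects or `(body, headers)` tuples, both of which Flask views may also return. A more tolerant helper could look like the sketch below. `extract_status` is a hypothetical name, and the function relies on duck typing (a `status_code` attribute) instead of importing Flask.

```python
from typing import Any


def extract_status(result: Any, default: int = 200) -> int:
    """Best-effort HTTP status for a Flask-style view return value."""
    # Response-like objects expose the status directly.
    status = getattr(result, "status_code", None)
    if isinstance(status, int):
        return status

    if isinstance(result, tuple) and result:
        second = result[1] if len(result) >= 2 else None
        # (body, status) or (body, status, headers)
        if isinstance(second, int):
            return second
        # Status given as a string such as "410 GONE"
        if isinstance(second, str) and second[:3].isdigit():
            return int(second[:3])
        # (response, headers): fall back to the response's own status
        body_status = getattr(result[0], "status_code", None)
        if isinstance(body_status, int):
            return body_status

    return default
```

Inside the wrapper, `status_code = extract_status(result)` would replace the tuple check, so the `status_code` label no longer ends up holding a headers dict when a view returns `(body, headers)`.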
### Grafana Dashboard Configuration

**Production Dashboard JSON:**
```json
{
  "dashboard": {
    "title": "Webhook Service Production Dashboard",
    "panels": [
      {
        "title": "Request Rate",
        "type": "graph",
        "targets": [
          {
            "expr": "rate(webhook_requests_total[5m])",
            "legendFormat": "{{source_type}} - {{status_code}}"
          }
        ]
      },
      {
        "title": "Response Time",
        "type": "graph",
        "targets": [
          {
            "expr": "histogram_quantile(0.95, rate(webhook_request_duration_seconds_bucket[5m]))",
            "legendFormat": "95th percentile"
          },
          {
            "expr": "histogram_quantile(0.50, rate(webhook_request_duration_seconds_bucket[5m]))",
            "legendFormat": "50th percentile"
          }
        ]
      },
      {
        "title": "Authentication Failures",
        "type": "singlestat",
        "targets": [
          {
            "expr": "increase(webhook_auth_failures_total[1h])",
            "legendFormat": "Last Hour"
          }
        ]
      },
      {
        "title": "Notification Success Rate",
        "type": "graph",
        "targets": [
          {
            "expr": "sum(rate(notification_delivery_total{status=\"success\"}[5m])) / sum(rate(notification_delivery_total[5m])) * 100",
            "legendFormat": "Success Rate %"
          }
        ]
      }
    ]
  }
}
```

The success-rate query aggregates both sides with `sum()`. Without it, PromQL divides only series whose label sets match exactly, so the `status="success"` series would be divided by itself and the panel would always show 100%.

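The Response Time panel relies on `histogram_quantile`, which does not see individual request durations. It estimates the quantile from cumulative bucket counts by interpolating linearly inside the bucket that contains the target rank. The sketch below reproduces that estimation in plain Python to make the behaviour concrete. The bucket values in the example are made up for illustration.

```python
import math
from typing import List, Tuple


def histogram_quantile(q: float, buckets: List[Tuple[float, float]]) -> float:
    """Estimate the q-quantile (0 < q <= 1) from cumulative buckets.

    `buckets` holds (upper_bound, cumulative_count) pairs sorted by
    upper bound and ending with the +Inf bucket, like `le` series.
    """
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0  # first bucket is assumed to start at 0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound) or count == prev_count:
                # Rank falls in the +Inf bucket: report the highest finite bound.
                return prev_bound
            # Assume observations are spread evenly inside the bucket.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return prev_bound


# Illustrative data: 100 requests, cumulative counts per upper bound (seconds).
example = [(0.1, 50), (0.25, 80), (0.5, 95), (1.0, 99), (math.inf, 100)]
p90 = histogram_quantile(0.90, example)  # lands two thirds into the 0.25-0.5 bucket
```

The estimate can never be more precise than the bucket boundaries, so bucket bounds close to the latency threshold you alert on give more meaningful percentiles.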
### Alerting Rules

**AlertManager Configuration:**
```yaml
# alertmanager.yml
global:
  smtp_smarthost: 'smtp.gmail.com:587'
  smtp_from: 'alerts@yourdomain.com'
  smtp_auth_username: 'alerts@yourdomain.com'
  smtp_auth_password: 'your-app-password'

route:
  group_by: ['alertname']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 1h
  receiver: 'webhook-alerts'

receivers:
  - name: 'webhook-alerts'
    email_configs:
      - to: 'admin@yourdomain.com'
        # The subject is set through headers; the plain-text body uses `text`
        headers:
          Subject: 'Webhook Service Alert - {{ .GroupLabels.alertname }}'
        text: |
          {{ range .Alerts }}
          Alert: {{ .Annotations.summary }}
          Description: {{ .Annotations.description }}
          Instance: {{ .Labels.instance }}
          Severity: {{ .Labels.severity }}
          {{ end }}

# Prometheus alerting rules
# These belong in a separate rules file loaded through `rule_files`
# in prometheus.yml, not in alertmanager.yml.
groups:
  - name: webhook-service
    rules:
      - alert: WebhookServiceDown
        expr: up{job="webhook-service"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Webhook service is down"
          description: "Webhook service has been down for more than 1 minute"

      - alert: HighErrorRate
        expr: rate(webhook_requests_total{status_code=~"5.."}[5m]) > 0.1
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "High error rate detected"
          description: "Error rate is {{ $value }} requests per second"

      - alert: HighResponseTime
        expr: histogram_quantile(0.95, rate(webhook_request_duration_seconds_bucket[5m])) > 1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High response time"
          description: "95th percentile response time is {{ $value }} seconds"

      - alert: AuthenticationFailures
        expr: increase(webhook_auth_failures_total[15m]) > 10
        for: 0m
        labels:
          severity: critical
        annotations:
          summary: "Multiple authentication failures"
          description: "{{ $value }} authentication failures in the last 15 minutes"
```

## 🎯 Production Success Metrics

### Service Level Objectives (SLOs)

**Availability SLO: 99.9% uptime**
- Measurement: HTTP 200 responses / Total HTTP requests
- Error Budget: 43.2 minutes of downtime per 30-day month
- Alerting: Alert if availability drops below 99.5% over 1 hour

**Latency SLO: 95% of requests < 500ms**
- Measurement: Response time distribution
- Alerting: Alert if 95th percentile > 500ms for 5 minutes

**Error Rate SLO: <0.1% error rate**
- Measurement: HTTP 5xx responses / Total HTTP requests
- Alerting: Alert if error rate > 0.5% over 5 minutes

**Security SLO: <10 authentication failures per day**
- Measurement: Failed authentication attempts
- Alerting: Alert if >50 failures in 1 hour

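The 43.2-minute error budget follows directly from the availability target: 0.1% of a 30-day month (43,200 minutes). The sketch below shows that arithmetic and a request-based view of how much of the budget has been used. The function names are illustrative, and the request counts in the example are made up.

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Allowed downtime in minutes for an availability SLO over the window."""
    return (1.0 - slo) * window_days * 24 * 60


def budget_consumed(good: int, total: int, slo: float) -> float:
    """Fraction of the error budget used, computed from request counts.

    1.0 means the budget is exactly spent; above 1.0 the SLO is violated.
    """
    if total == 0:
        return 0.0
    allowed_failures = (1.0 - slo) * total
    return (total - good) / allowed_failures


print(round(error_budget_minutes(0.999), 1))                 # 43.2
print(round(budget_consumed(999_500, 1_000_000, 0.999), 2))  # 0.5
```

In the second call, 500 failed requests out of one million use half of the 1,000 failures that a 99.9% target allows.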
### Key Performance Indicators

**Business Metrics:**
- Total webhook events processed per day
- Notification delivery success rate (target: >99%)
- Average response time (target: <100ms)
- Cost per webhook processed
- Mean time to detection (MTTD) for issues
- Mean time to resolution (MTTR) for incidents
- Infrastructure utilization efficiency
- Customer satisfaction score

## 📞 Production Support

### Incident Response

**Severity Levels:**

| Severity | Description | Response Time | Resolution Time | Actions |
|----------|-------------|---------------|-----------------|---------|
| **SEVERITY 1 - Critical** | Service Down | 15 minutes | 1 hour | Immediate escalation, war room, customer communication |
| **SEVERITY 2 - High** | Degraded Performance | 30 minutes | 4 hours | Team lead notification, monitoring increase |
| **SEVERITY 3 - Medium** | Minor Issues | 2 hours | 24 hours | Standard troubleshooting, ticket tracking |
| **SEVERITY 4 - Low** | Enhancement Requests | Next business day | Per roadmap | Backlog prioritization |

### On-Call Procedures

**24/7 Support Structure:**
- Primary On-Call: Initial response and triage
- Secondary On-Call: Backup coverage and escalation
- Engineering Manager: Resource coordination
- Senior Leadership: Business impact decisions

**Escalation Timeline:**
- 15 minutes: Auto-escalate if no response
- 30 minutes: Escalate to secondary on-call
- 1 hour: Escalate to engineering manager
- 2 hours: Escalate to senior leadership

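The escalation timeline can be encoded as data so that paging tooling and runbooks stay in sync. The sketch below is one possible encoding, not an existing part of the service. The names are hypothetical, and the 15-minute step is interpreted here as an automatic re-page of the primary on-call, which the timeline does not state explicitly.

```python
from typing import List, Tuple

# (minutes without acknowledgement, who is paged), taken from the timeline above.
# Assumption: the 15-minute auto-escalation re-pages the primary on-call.
ESCALATION_STEPS: List[Tuple[int, str]] = [
    (0, "primary on-call"),
    (15, "primary on-call (automatic re-page)"),
    (30, "secondary on-call"),
    (60, "engineering manager"),
    (120, "senior leadership"),
]


def escalation_target(minutes_unacknowledged: float) -> str:
    """Return the furthest escalation step reached after the given wait."""
    target = ESCALATION_STEPS[0][1]
    for threshold, who in ESCALATION_STEPS:
        if minutes_unacknowledged >= threshold:
            target = who  # steps are sorted, so the last match wins
    return target
```

For example, `escalation_target(45)` returns `"secondary on-call"`, because the 30-minute threshold has passed and the 1-hour threshold has not.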
## 🚀 Production Deployment Summary

**This production deployment guide provides enterprise-grade reliability with:**
- ✅ 99.9% Uptime Target - Comprehensive monitoring and alerting
- ✅ Enterprise Security - Multi-layer security hardening
- ✅ Auto-scaling - Dynamic resource allocation
- ✅ Disaster Recovery - Automated backup and recovery procedures
- ✅ 24/7 Support - Structured incident response and on-call coverage
- ✅ Performance Optimization - Sub-500ms response times