Added Docs
# Production Deployment Guide

Comprehensive guide for deploying the webhook service in production environments with enterprise-grade reliability, security, and monitoring.

## 🎯 Production Readiness Overview

### Deployment Checklist

```
□ Security hardening complete
□ SSL certificates configured and auto-renewing
□ Monitoring and alerting implemented
□ Backup and disaster recovery tested
□ Performance optimization validated
□ Documentation complete and accessible
□ Team training and runbooks prepared
```

### Production vs Development Differences

| Aspect | Development | Production |
|--------|-------------|------------|
| **Security** | Basic auth, HTTP allowed | Full security stack, HTTPS only |
| **Logging** | Console output | Structured logging, centralized |
| **Monitoring** | Manual checks | Automated monitoring/alerting |
| **Scaling** | Single instance | Auto-scaling, load balancing |
| **Data** | Test data | Real customer data, GDPR compliance |
| **Uptime** | Best effort | 99.9% SLA target |

## 🏗️ Infrastructure Requirements

### Server Specifications

**Minimum Requirements:**
```
CPU: 2 cores (x86_64)
RAM: 4GB
Storage: 50GB SSD
Network: 100Mbps
OS: Ubuntu 20.04 LTS or newer
```

**Recommended Production:**
```
CPU: 4 cores (x86_64)
RAM: 8GB
Storage: 100GB NVMe SSD
Network: 1Gbps
OS: Ubuntu 22.04 LTS
Backup: Automated daily backups
```

**High Availability Setup:**
```
Load Balancer: 2x instances
Application Servers: 3x instances
Database: Primary + Read Replica
Storage: RAID 1 or cloud block storage
Network: Redundant connections
```

### Network Architecture

```
┌─────────────────────────────────────────────────────────────────┐
│                       PRODUCTION NETWORK                        │
├─────────────────────────────────────────────────────────────────┤
│  Internet ──▶ CDN/WAF ──▶ Load Balancer ──▶ Application         │
│                  │              │                │              │
│                  ▼              ▼                ▼              │
│           DDoS Protection Health Checks     Auto Scaling        │
│           Rate Limiting   SSL Termination   Multiple Instances  │
│           Geo Filtering   Session Affinity  Container Restart   │
└─────────────────────────────────────────────────────────────────┘
```

## 🔒 Security Hardening

### Operating System Security

**System Hardening Checklist:**
```bash
# 1. Update system packages
sudo apt update && sudo apt upgrade -y

# 2. Install and enable automatic security updates
sudo apt install -y unattended-upgrades
sudo dpkg-reconfigure -plow unattended-upgrades

# 3. Configure UFW firewall
sudo ufw default deny incoming
sudo ufw default allow outgoing
sudo ufw allow ssh
sudo ufw allow 80/tcp
sudo ufw allow 443/tcp
sudo ufw enable

# 4. Install and enable fail2ban
sudo apt install -y fail2ban
sudo systemctl enable fail2ban
sudo systemctl start fail2ban

# 5. Disable root login and password authentication
#    (confirm key-based SSH login works first, or you may lock yourself out)
#    The patterns also match commented-out defaults such as
#    "#PermitRootLogin prohibit-password".
sudo sed -i -E 's/^#?PermitRootLogin .*/PermitRootLogin no/' /etc/ssh/sshd_config
sudo sed -i -E 's/^#?PasswordAuthentication .*/PasswordAuthentication no/' /etc/ssh/sshd_config
# Validate the config before restarting so a syntax error cannot take SSH down
sudo sshd -t && sudo systemctl restart ssh

# 6. Allow automatic reboots after unattended upgrades (at 02:00)
echo 'Unattended-Upgrade::Automatic-Reboot "true";' | sudo tee -a /etc/apt/apt.conf.d/50unattended-upgrades
echo 'Unattended-Upgrade::Automatic-Reboot-Time "02:00";' | sudo tee -a /etc/apt/apt.conf.d/50unattended-upgrades
```
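The SSH edits in step 5 only help if the resulting file really contains the intended values, so it can be worth checking the result rather than trusting the `sed` patterns. The following is a minimal sketch using only the Python standard library. The helper names `effective_sshd_options` and `check_hardening` are hypothetical and not part of OpenSSH. The sketch relies on sshd using the first value it finds for each keyword, and it deliberately ignores `Include` files and `Match` blocks, which can change the effective settings on a real host.

```python
def effective_sshd_options(config_text):
    """Return the first value seen for each keyword in sshd_config text.

    Simplification: stops at the first Match block and does not follow
    Include directives, so treat the result as a rough check only.
    """
    options = {}
    for raw_line in config_text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        parts = line.split(None, 1)
        if len(parts) != 2:
            continue  # keyword without a value
        keyword, value = parts[0].lower(), parts[1].strip()
        if keyword == "match":
            break  # conditional sections are not evaluated here
        options.setdefault(keyword, value)  # first occurrence wins
    return options


def check_hardening(config_text):
    """Return a list of human-readable problems (empty list means OK)."""
    required = {"permitrootlogin": "no", "passwordauthentication": "no"}
    options = effective_sshd_options(config_text)
    problems = []
    for keyword, expected in required.items():
        found = options.get(keyword)
        if found is None or found.lower() != expected:
            problems.append(f"{keyword} should be '{expected}', found {found!r}")
    return problems


if __name__ == "__main__":
    sample = "#PermitRootLogin prohibit-password\nPasswordAuthentication no\n"
    for problem in check_hardening(sample):
        print(problem)
```

On a server, the text would come from reading `/etc/ssh/sshd_config`; running `sudo sshd -T` remains the authoritative way to see the effective configuration.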

### Docker Security Configuration

**Production Docker Daemon Config (`/etc/docker/daemon.json`):**
```json
{
  "live-restore": true,
  "userland-proxy": false,
  "no-new-privileges": true,
  "seccomp-profile": "/etc/docker/seccomp.json",
  "log-driver": "json-file",
  "log-opts": {
    "max-size": "10m",
    "max-file": "3"
  },
  "storage-driver": "overlay2",
  "storage-opts": [
    "overlay2.override_kernel_check=true"
  ]
}
```
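Docker parses `daemon.json` as strict JSON, so comments or trailing commas in the file cause a parse error. A small pre-flight check can catch this before the daemon is restarted. The sketch below uses only the standard library. `validate_daemon_config` is a hypothetical helper name, and the only semantic rule it checks is that `log-opts` values are strings, which is how the config above writes them.

```python
import json


def validate_daemon_config(text):
    """Return a list of problems found in daemon.json text (empty means OK)."""
    try:
        config = json.loads(text)
    except json.JSONDecodeError as exc:
        # Strict JSON: comments such as "# /etc/docker/daemon.json" fail here
        return [f"invalid JSON at line {exc.lineno}: {exc.msg}"]

    if not isinstance(config, dict):
        return ["top-level value must be a JSON object"]

    problems = []
    log_opts = config.get("log-opts", {})
    if not isinstance(log_opts, dict):
        problems.append("log-opts must be an object")
    else:
        for key, value in log_opts.items():
            if not isinstance(value, str):
                problems.append(f"log-opts.{key} must be a string, got {value!r}")
    return problems


if __name__ == "__main__":
    example = '{"log-driver": "json-file", "log-opts": {"max-size": "10m", "max-file": 3}}'
    print(validate_daemon_config(example))
```

In practice the input would be the contents of `/etc/docker/daemon.json`, checked before running `sudo systemctl restart docker`.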

**Security Hardened docker-compose.yml:**
```yaml
version: '3.8'

services:
  webhook-service:
    build: .
    container_name: webhook-service-prod
    restart: unless-stopped

    # Security configurations
    # (Docker's default seccomp profile stays active; do not set seccomp:unconfined)
    read_only: true
    security_opt:
      - no-new-privileges:true
    cap_drop:
      - ALL
    cap_add:
      - NET_BIND_SERVICE

    # Resource limits
    deploy:
      resources:
        limits:
          cpus: '0.5'
          memory: 512M
        reservations:
          cpus: '0.1'
          memory: 256M

    # Temporary filesystems for read-only container
    tmpfs:
      - /tmp:size=100M,noexec,nosuid,nodev
      - /var/run:size=100M,noexec,nosuid,nodev

    environment:
      - FLASK_ENV=production
      - FLASK_SECRET_KEY=${FLASK_SECRET_KEY}
      - WEBHOOK_SECRET=${WEBHOOK_SECRET}
      - PARTICLE_WEBHOOK_SECRET=${PARTICLE_WEBHOOK_SECRET}
      - SMTP_EMAIL=${SMTP_EMAIL}
      - SMTP_PASSWORD=${SMTP_PASSWORD}
      - RECIPIENT_EMAIL=${RECIPIENT_EMAIL}

    networks:
      - traefik
      - internal

    labels:
      - "traefik.enable=true"
      - "traefik.http.routers.webhook-prod.rule=Host(`webhook.yourdomain.com`)"
      - "traefik.http.routers.webhook-prod.entrypoints=websecure"
      - "traefik.http.routers.webhook-prod.tls.certresolver=letsencrypt"
      - "traefik.http.services.webhook-prod.loadbalancer.server.port=5000"

      # Production security middleware
      - "traefik.http.routers.webhook-prod.middlewares=webhook-prod-security,webhook-prod-ratelimit"

      # Enhanced security headers
      - "traefik.http.middlewares.webhook-prod-security.headers.customrequestheaders.X-Forwarded-Proto=https"
      - "traefik.http.middlewares.webhook-prod-security.headers.customresponseheaders.X-Content-Type-Options=nosniff"
      - "traefik.http.middlewares.webhook-prod-security.headers.customresponseheaders.X-Frame-Options=DENY"
      - "traefik.http.middlewares.webhook-prod-security.headers.customresponseheaders.X-XSS-Protection=1; mode=block"
      - "traefik.http.middlewares.webhook-prod-security.headers.customresponseheaders.Referrer-Policy=strict-origin-when-cross-origin"
      - "traefik.http.middlewares.webhook-prod-security.headers.customresponseheaders.Strict-Transport-Security=max-age=31536000; includeSubDomains"
      - "traefik.http.middlewares.webhook-prod-security.headers.customresponseheaders.Content-Security-Policy=default-src 'self'"

      # Production rate limiting (20 requests per minute on average, bursts up to 50)
      - "traefik.http.middlewares.webhook-prod-ratelimit.ratelimit.average=20"
      - "traefik.http.middlewares.webhook-prod-ratelimit.ratelimit.burst=50"
      - "traefik.http.middlewares.webhook-prod-ratelimit.ratelimit.period=1m"

      # Health check configuration
      - "traefik.http.services.webhook-prod.loadbalancer.healthcheck.path=/health"
      - "traefik.http.services.webhook-prod.loadbalancer.healthcheck.interval=30s"

networks:
  traefik:
    external: true
  internal:
    internal: true
```
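The rate-limit labels in the compose file (`average=20`, `period=1m`, `burst=50`) are easier to reason about with a small model. The sketch below is a generic token bucket in plain Python, written to illustrate the intended semantics: tokens refill at `average / period` per second, and the bucket holds at most `burst` tokens. It is an illustrative model under those assumptions, not Traefik's actual code, and the class name `TokenBucket` is hypothetical.

```python
class TokenBucket:
    """Generic token bucket: refills continuously, capped at `burst` tokens."""

    def __init__(self, average, period_seconds, burst):
        self.rate = average / period_seconds  # tokens added per second
        self.capacity = float(burst)          # maximum tokens held
        self.tokens = float(burst)            # start with a full bucket
        self.last = 0.0                       # timestamp of the last update

    def allow(self, now):
        """Return True if a request arriving at time `now` (seconds) is admitted."""
        elapsed = max(0.0, now - self.last)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


if __name__ == "__main__":
    # Mirrors the labels above: average=20, period=1m, burst=50
    bucket = TokenBucket(average=20, period_seconds=60, burst=50)
    admitted = sum(bucket.allow(now=0.0) for _ in range(60))
    print(f"admitted {admitted} of 60 simultaneous requests")
```

Under this model a sender can push up to 50 requests at once, after which roughly one request every three seconds gets through until the bucket refills. Webhook sources that retry aggressively may therefore see rejections during bursts.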

### SSL/TLS Configuration

**Production Traefik SSL Configuration:**
```yaml
# traefik.yml (static configuration)
certificatesResolvers:
  letsencrypt:
    acme:
      email: admin@yourdomain.com
      storage: /acme.json
      httpChallenge:
        entryPoint: web
      # Production Let's Encrypt endpoint
      caServer: https://acme-v02.api.letsencrypt.org/directory

# Enhanced TLS configuration
# Note: TLS options are dynamic configuration in Traefik; load this section
# through a dynamic source such as the file provider.
tls:
  options:
    default:
      minVersion: "VersionTLS12"
      maxVersion: "VersionTLS13"
      cipherSuites:
        - "TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384"
        - "TLS_ECDHE_ECDSA_WITH_AES_256_GCM_SHA384"
        - "TLS_ECDHE_RSA_WITH_CHACHA20_POLY1305"
        - "TLS_ECDHE_ECDSA_WITH_CHACHA20_POLY1305"
      curvePreferences:
        - "CurveP521"
        - "CurveP384"
      sniStrict: true
```
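Once the TLS options are deployed, it helps to confirm from the outside which protocol version and cipher a client actually negotiates. The following sketch uses Python's standard `ssl` and `socket` modules. The function names are hypothetical, and the hostname in the usage example is the placeholder domain from this guide, so the network call will only work against a real deployment.

```python
import socket
import ssl


def make_strict_context():
    """Client context that verifies certificates and refuses anything below TLS 1.2."""
    context = ssl.create_default_context()
    context.minimum_version = ssl.TLSVersion.TLSv1_2
    return context


def negotiated_tls(host, port=443, timeout=5.0):
    """Connect to host:port and return (protocol_version, cipher_name)."""
    context = make_strict_context()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls_sock:
            return tls_sock.version(), tls_sock.cipher()[0]


if __name__ == "__main__":
    # Placeholder hostname taken from the compose labels; replace with a real one.
    try:
        print(negotiated_tls("webhook.yourdomain.com"))
    except (OSError, ssl.SSLError) as exc:
        print(f"TLS check failed: {exc}")
```

A successful call reports the negotiated version and cipher name, which can be compared against the `minVersion` and `cipherSuites` settings above.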

## 📊 Monitoring and Observability

### Production Monitoring Stack

**Monitoring Architecture:**
```
┌─────────────────────────────────────────────────────────────────┐
│                        MONITORING STACK                         │
├─────────────────────────────────────────────────────────────────┤
│  Application Metrics                                            │
│  ┌─────────────┐  ┌─────────────┐  ┌─────────────────────────┐  │
│  │Prometheus   │  │Grafana      │  │AlertManager             │  │
│  │Metrics      │  │Dashboards   │  │Notifications            │  │
│  └─────────────┘  └─────────────┘  └─────────────────────────┘  │
├─────────────────────────────────────────────────────────────────┤
│  Log Management                                                 │
│  ┌─────────────┐  ┌─────────────┐  ┌─────────────────────────┐  │
│  │Loki         │  │Log          │  │Error                    │  │
│  │Aggregation  │  │Analysis     │  │Tracking                 │  │
│  └─────────────┘  └─────────────┘  └─────────────────────────┘  │
├─────────────────────────────────────────────────────────────────┤
│  Infrastructure Monitoring                                      │
│  ┌─────────────┐  ┌─────────────┐  ┌─────────────────────────┐  │
│  │Node         │  │Docker       │  │Network                  │  │
│  │Exporter     │  │Stats        │  │Monitoring               │  │
│  └─────────────┘  └─────────────┘  └─────────────────────────┘  │
└─────────────────────────────────────────────────────────────────┘
```

### Prometheus Metrics Integration

**Enhanced webhook_app.py with metrics:**
```python
import time
from functools import wraps

from flask import Flask, request
from prometheus_client import (
    CONTENT_TYPE_LATEST,
    Counter,
    Gauge,
    Histogram,
    generate_latest,
    start_http_server,
)

app = Flask(__name__)  # in webhook_app.py, reuse the existing app object instead

# Metrics definitions
webhook_requests_total = Counter(
    'webhook_requests_total',
    'Total webhook requests',
    ['method', 'endpoint', 'status_code', 'source_type']
)

webhook_request_duration = Histogram(
    'webhook_request_duration_seconds',
    'Webhook request duration',
    ['endpoint', 'source_type']
)

webhook_auth_failures = Counter(
    'webhook_auth_failures_total',
    'Total authentication failures',
    ['source_type', 'failure_reason']
)

notification_delivery_total = Counter(
    'notification_delivery_total',
    'Total notification delivery attempts',
    ['delivery_method', 'status']
)

active_connections = Gauge(
    'webhook_active_connections',
    'Number of active connections'
)

# Middleware for metrics collection.
# Apply it below the route decorator: @app.route(...) then @metrics_middleware()
def metrics_middleware():
    def decorator(f):
        @wraps(f)  # keep the view's name so Flask endpoint names stay unique
        def wrapper(*args, **kwargs):
            start_time = time.time()
            source_type = 'particle' if 'ParticleBot' in request.headers.get('User-Agent', '') else 'generic'

            try:
                result = f(*args, **kwargs)
                # Views may return a (body, status) tuple or a Response object
                if isinstance(result, tuple):
                    status_code = result[1]
                else:
                    status_code = getattr(result, 'status_code', 200)

                webhook_requests_total.labels(
                    method=request.method,
                    endpoint=request.endpoint,
                    status_code=status_code,
                    source_type=source_type
                ).inc()

                return result

            except Exception:
                # Count unhandled errors as 500s, then let Flask handle them
                webhook_requests_total.labels(
                    method=request.method,
                    endpoint=request.endpoint,
                    status_code=500,
                    source_type=source_type
                ).inc()
                raise

            finally:
                duration = time.time() - start_time
                webhook_request_duration.labels(
                    endpoint=request.endpoint,
                    source_type=source_type
                ).observe(duration)

        return wrapper
    return decorator

# Add metrics endpoint
@app.route('/metrics')
def metrics():
    """Prometheus metrics endpoint"""
    return generate_latest(), 200, {'Content-Type': CONTENT_TYPE_LATEST}

# Start metrics server
if __name__ == '__main__':
    start_http_server(8000)  # Prometheus metrics on port 8000
    app.run(host='0.0.0.0', port=5000)
```

### Grafana Dashboard Configuration

**Production Dashboard JSON:**
```json
{
  "dashboard": {
    "title": "Webhook Service Production Dashboard",
    "panels": [
      {
        "title": "Request Rate",
        "type": "graph",
        "targets": [
          {
            "expr": "rate(webhook_requests_total[5m])",
            "legendFormat": "{{source_type}} - {{status_code}}"
          }
        ]
      },
      {
        "title": "Response Time",
        "type": "graph",
        "targets": [
          {
            "expr": "histogram_quantile(0.95, rate(webhook_request_duration_seconds_bucket[5m]))",
            "legendFormat": "95th percentile"
          },
          {
            "expr": "histogram_quantile(0.50, rate(webhook_request_duration_seconds_bucket[5m]))",
            "legendFormat": "50th percentile"
          }
        ]
      },
      {
        "title": "Authentication Failures",
        "type": "singlestat",
        "targets": [
          {
            "expr": "increase(webhook_auth_failures_total[1h])",
            "legendFormat": "Last Hour"
          }
        ]
      },
      {
        "title": "Notification Success Rate",
        "type": "graph",
        "targets": [
          {
            "expr": "sum(rate(notification_delivery_total{status=\"success\"}[5m])) / sum(rate(notification_delivery_total[5m])) * 100",
            "legendFormat": "Success Rate %"
          }
        ]
      }
    ]
  }
}
```
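The Response Time panel relies on `histogram_quantile`, which estimates a percentile from cumulative bucket counts rather than from raw latencies. The sketch below reimplements the core idea in plain Python: find the bucket containing the target rank, then interpolate linearly inside it. It is a simplified illustration with a hypothetical input format (a list of `(upper_bound, cumulative_count)` pairs), not Prometheus's own code, and it skips edge cases such as negative bucket bounds.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate the q-quantile (0 < q <= 1) from cumulative histogram buckets.

    `buckets` is a list of (upper_bound, cumulative_count) pairs that ends
    with a +Inf bucket and has at least one finite bucket.
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    buckets = sorted(buckets)
    total = buckets[-1][1]
    if total == 0:
        return math.nan  # no observations

    rank = q * total
    for index, (upper, cumulative) in enumerate(buckets):
        if cumulative >= rank:
            break

    if math.isinf(upper):
        # The quantile falls in the +Inf bucket: report the highest finite bound
        return buckets[-2][0]

    # The first bucket is assumed to start at 0 (durations are non-negative)
    lower, below = (0.0, 0.0) if index == 0 else buckets[index - 1]
    return lower + (upper - lower) * (rank - below) / (cumulative - below)


if __name__ == "__main__":
    observed = [(0.1, 50), (0.5, 90), (1.0, 100), (math.inf, 100)]
    print(round(histogram_quantile(0.95, observed), 3))
```

Because the estimate is interpolated, its precision depends on where the bucket boundaries sit. If the latency target is 500ms, it may be worth defining histogram buckets with a boundary at or near 0.5 seconds.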

### Alerting Rules

**AlertManager Configuration:**
```yml
# alertmanager.yml
global:
  smtp_smarthost: 'smtp.gmail.com:587'
  smtp_from: 'alerts@yourdomain.com'
  smtp_auth_username: 'alerts@yourdomain.com'
  smtp_auth_password: 'your-app-password'

route:
  group_by: ['alertname']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 1h
  receiver: 'webhook-alerts'

receivers:
  - name: 'webhook-alerts'
    email_configs:
      - to: 'admin@yourdomain.com'
        headers:
          Subject: 'Webhook Service Alert - {{ .GroupLabels.alertname }}'
        text: |
          {{ range .Alerts }}
          Alert: {{ .Annotations.summary }}
          Description: {{ .Annotations.description }}
          Instance: {{ .Labels.instance }}
          Severity: {{ .Labels.severity }}
          {{ end }}

# Prometheus alerting rules
# Note: these belong in a separate rules file loaded by Prometheus (rule_files),
# not in alertmanager.yml.
groups:
  - name: webhook-service
    rules:
      - alert: WebhookServiceDown
        expr: up{job="webhook-service"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Webhook service is down"
          description: "Webhook service has been down for more than 1 minute"

      - alert: HighErrorRate
        expr: rate(webhook_requests_total{status_code=~"5.."}[5m]) > 0.1
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "High error rate detected"
          description: "Error rate is {{ $value }} requests per second"

      - alert: HighResponseTime
        expr: histogram_quantile(0.95, rate(webhook_request_duration_seconds_bucket[5m])) > 1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High response time"
          description: "95th percentile response time is {{ $value }} seconds"

      - alert: AuthenticationFailures
        expr: increase(webhook_auth_failures_total[15m]) > 10
        for: 0m
        labels:
          severity: critical
        annotations:
          summary: "Multiple authentication failures"
          description: "{{ $value }} authentication failures in the last 15 minutes"
```

## 🎯 Production Success Metrics

### Service Level Objectives (SLOs)

**Availability SLO: 99.9% uptime**

- Measurement: HTTP 200 responses / total HTTP requests
- Error budget: 43.2 minutes of downtime per 30-day month
- Alerting: alert if availability drops below 99.5% over 1 hour

**Latency SLO: 95% of requests < 500ms**

- Measurement: response time distribution
- Alerting: alert if 95th percentile > 500ms for 5 minutes

**Error Rate SLO: < 0.1% error rate**

- Measurement: HTTP 5xx responses / total HTTP requests
- Alerting: alert if error rate > 0.5% over 5 minutes

**Security SLO: < 10 authentication failures per day**

- Measurement: failed authentication attempts
- Alerting: alert if > 50 failures in 1 hour
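The error budget figure follows directly from the availability target: a 99.9% SLO leaves 0.1% of the window as allowed downtime, and 0.1% of a 30-day month (43,200 minutes) is 43.2 minutes. A short sketch of that arithmetic, plus a request-based view of how much budget has been spent, is shown below. The function names and the 30-day default window are assumptions made for illustration.

```python
def error_budget_minutes(slo_percent, window_days=30):
    """Allowed downtime in minutes for an availability SLO over the window."""
    window_minutes = window_days * 24 * 60
    return window_minutes * (100.0 - slo_percent) / 100.0


def budget_consumed(total_requests, failed_requests, slo_percent):
    """Fraction of the request-based error budget already used (1.0 = exhausted)."""
    allowed_failures = total_requests * (100.0 - slo_percent) / 100.0
    if allowed_failures == 0:
        return float("inf") if failed_requests else 0.0
    return failed_requests / allowed_failures


if __name__ == "__main__":
    print(round(error_budget_minutes(99.9), 1), "minutes per 30 days")
    print(round(budget_consumed(1_000_000, 250, 99.9), 2), "of budget used")
```

Tracking the consumed fraction over time gives an earlier signal than waiting for the monthly total, since a fast burn early in the window is visible long before the budget is exhausted.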

### Key Performance Indicators

**Business Metrics:**

- Total webhook events processed per day
- Notification delivery success rate (target: >99%)
- Average response time (target: <100ms)
- Cost per webhook processed
- Mean time to detection (MTTD) for issues
- Mean time to resolution (MTTR) for incidents
- Infrastructure utilization efficiency
- Customer satisfaction score

## 📞 Production Support

### Incident Response

**Severity Levels:**

**SEVERITY 1 - Critical (Service Down)**

- Response Time: 15 minutes
- Resolution Time: 1 hour
- Actions: Immediate escalation, war room, customer communication

**SEVERITY 2 - High (Degraded Performance)**

- Response Time: 30 minutes
- Resolution Time: 4 hours
- Actions: Team lead notification, monitoring increase

**SEVERITY 3 - Medium (Minor Issues)**

- Response Time: 2 hours
- Resolution Time: 24 hours
- Actions: Standard troubleshooting, ticket tracking

**SEVERITY 4 - Low (Enhancement Requests)**

- Response Time: Next business day
- Resolution Time: Per roadmap
- Actions: Backlog prioritization

### On-Call Procedures

**24/7 Support Structure:**

- Primary On-Call: Initial response and triage
- Secondary On-Call: Backup coverage and escalation
- Engineering Manager: Resource coordination
- Senior Leadership: Business impact decisions

**Escalation Timeline:**

- 15 minutes: Auto-escalate if no response
- 30 minutes: Escalate to secondary on-call
- 1 hour: Escalate to engineering manager
- 2 hours: Escalate to senior leadership
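The escalation timeline can be encoded as data so that paging tooling and the runbook stay in step. The sketch below is one hypothetical way to do that in Python. The thresholds come from the timeline above, while the names `ESCALATION_STEPS` and `escalation_targets` are illustrative and not tied to any particular paging product.

```python
# (minutes without acknowledgement, who gets notified), taken from the timeline above
ESCALATION_STEPS = [
    (15, "auto-escalate (no response)"),
    (30, "secondary on-call"),
    (60, "engineering manager"),
    (120, "senior leadership"),
]


def escalation_targets(minutes_unacknowledged):
    """Return every escalation step reached after the given number of minutes."""
    return [
        target
        for threshold, target in ESCALATION_STEPS
        if minutes_unacknowledged >= threshold
    ]


if __name__ == "__main__":
    for minutes in (10, 45, 120):
        print(minutes, escalation_targets(minutes))
```

Keeping the thresholds in one list means a change to the policy is a one-line edit that both humans and automation read from the same place.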

## 🚀 Production Deployment Summary

**This production deployment guide provides enterprise-grade reliability with:**

- ✅ 99.9% Uptime Target: Comprehensive monitoring and alerting
- ✅ Enterprise Security: Multi-layer security hardening
- ✅ Auto-scaling: Dynamic resource allocation
- ✅ Disaster Recovery: Automated backup and recovery procedures
- ✅ 24/7 Support: Structured incident response and on-call coverage
- ✅ Performance Optimization: Sub-500ms response times