stephenminakian 7aaec00d8f Added Docs

2025-07-06 22:43:05 -06:00

19 KiB

Production Deployment Guide

Comprehensive guide for deploying the webhook service in production environments with enterprise-grade reliability, security, and monitoring.

🎯 Production Readiness Overview

Deployment Checklist

□ Security hardening complete
□ SSL certificates configured and auto-renewing
□ Monitoring and alerting implemented
□ Backup and disaster recovery tested
□ Performance optimization validated
□ Documentation complete and accessible
□ Team training and runbooks prepared

Production vs Development Differences

Aspect	Development	Production
Security	Basic auth, HTTP allowed	Full security stack, HTTPS only
Logging	Console output	Structured logging, centralized
Monitoring	Manual checks	Automated monitoring/alerting
Scaling	Single instance	Auto-scaling, load balancing
Data	Test data	Real customer data, GDPR compliance
Uptime	Best effort	99.9% SLA target

🏗️ Infrastructure Requirements

Server Specifications

Minimum Requirements:

CPU: 2 cores (x86_64)
RAM: 4GB
Storage: 50GB SSD
Network: 100Mbps
OS: Ubuntu 20.04 LTS or newer

Recommended Production:

CPU: 4 cores (x86_64)
RAM: 8GB
Storage: 100GB NVMe SSD
Network: 1Gbps
OS: Ubuntu 22.04 LTS
Backup: Automated daily backups

High Availability Setup:

Load Balancer: 2x instances
Application Servers: 3x instances  
Database: Primary + Read Replica
Storage: RAID 1 or cloud block storage
Network: Redundant connections

Network Architecture

┌─────────────────────────────────────────────────────────────────┐
│                      PRODUCTION NETWORK                        │
├─────────────────────────────────────────────────────────────────┤
│  Internet ──▶ CDN/WAF ──▶ Load Balancer ──▶ Application       │
│               │           │                  │                  │
│               ▼           ▼                  ▼                  │
│          DDoS Protection  Health Checks   Auto Scaling          │
│          Rate Limiting    SSL Termination Multiple Instances    │
│          Geo Filtering    Session Affinity Container Restart    │
└─────────────────────────────────────────────────────────────────┘

🔒 Security Hardening

Operating System Security

System Hardening Checklist:

# 1. Update system packages
sudo apt update && sudo apt upgrade -y

# 2. Configure automatic security updates
sudo apt install unattended-upgrades
sudo dpkg-reconfigure -plow unattended-upgrades

# 3. Configure UFW firewall
sudo ufw default deny incoming
sudo ufw default allow outgoing
sudo ufw allow ssh
sudo ufw allow 80/tcp
sudo ufw allow 443/tcp
sudo ufw enable

# 4. Install and configure fail2ban
sudo apt install fail2ban
sudo systemctl enable fail2ban
sudo systemctl start fail2ban

# 5. Disable root login and password authentication
# (the patterns match each directive whether or not it is commented out, whatever its current value)
sudo sed -i -E 's/^#?\s*PermitRootLogin\s+.*/PermitRootLogin no/' /etc/ssh/sshd_config
sudo sed -i -E 's/^#?\s*PasswordAuthentication\s+.*/PasswordAuthentication no/' /etc/ssh/sshd_config
# Validate the config before restarting so a typo cannot lock you out
sudo sshd -t && sudo systemctl restart ssh

# 6. Allow automatic reboots after security updates
echo 'Unattended-Upgrade::Automatic-Reboot "true";' | sudo tee -a /etc/apt/apt.conf.d/50unattended-upgrades
echo 'Unattended-Upgrade::Automatic-Reboot-Time "02:00";' | sudo tee -a /etc/apt/apt.conf.d/50unattended-upgrades
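
To confirm that step 5 took effect, the effective SSH settings can be read back from the file. The following is a minimal sketch, not part of the service: `effective_sshd_options` and `is_hardened` are hypothetical helper names. It assumes sshd's rule that the first value found for a keyword wins, and it ignores `Match` blocks and `Include` files.

```python
def effective_sshd_options(config_text: str) -> dict[str, str]:
    """Return the first value seen for each keyword (keywords lower-cased)."""
    options: dict[str, str] = {}
    for raw_line in config_text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and commented-out defaults
        parts = line.split(None, 1)
        if len(parts) != 2:
            continue
        keyword, value = parts[0].lower(), parts[1].strip()
        if keyword == "match":
            break  # conditional blocks are out of scope for this sketch
        options.setdefault(keyword, value)  # first occurrence wins
    return options


def is_hardened(config_text: str) -> bool:
    """True only if both directives are explicitly set to 'no'."""
    opts = effective_sshd_options(config_text)
    return (opts.get("permitrootlogin") == "no"
            and opts.get("passwordauthentication") == "no")
```

A typical call would be `is_hardened(open("/etc/ssh/sshd_config").read())`. A missing directive counts as not hardened, because the sketch cannot know the compiled-in default.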

Docker Security Configuration

Production Docker Daemon Config:

# /etc/docker/daemon.json
{
  "live-restore": true,
  "userland-proxy": false,
  "no-new-privileges": true,
  "seccomp-profile": "/etc/docker/seccomp.json",
  "log-driver": "json-file",
  "log-opts": {
    "max-size": "10m",
    "max-file": "3"
  },
  "storage-driver": "overlay2"
}

Security Hardened docker-compose.yml:

version: '3.8'

services:
  webhook-service:
    build: .
    container_name: webhook-service-prod
    restart: unless-stopped
    
    # Security configurations
    read_only: true
    security_opt:
      - no-new-privileges:true
      # Docker's default seccomp profile stays active; seccomp:unconfined would disable syscall filtering
    cap_drop:
      - ALL
    cap_add:
      - NET_BIND_SERVICE
    
    # Resource limits
    deploy:
      resources:
        limits:
          cpus: '0.5'
          memory: 512M
        reservations:
          cpus: '0.1'
          memory: 256M
    
    # Temporary filesystems for read-only container
    tmpfs:
      - /tmp:size=100M,noexec,nosuid,nodev
      - /var/run:size=100M,noexec,nosuid,nodev
    
    environment:
      - FLASK_ENV=production
      - FLASK_SECRET_KEY=${FLASK_SECRET_KEY}
      - WEBHOOK_SECRET=${WEBHOOK_SECRET}
      - PARTICLE_WEBHOOK_SECRET=${PARTICLE_WEBHOOK_SECRET}
      - SMTP_EMAIL=${SMTP_EMAIL}
      - SMTP_PASSWORD=${SMTP_PASSWORD}
      - RECIPIENT_EMAIL=${RECIPIENT_EMAIL}
      
    networks:
      - traefik
      - internal
    
    labels:
      - "traefik.enable=true"
      - "traefik.http.routers.webhook-prod.rule=Host(`webhook.yourdomain.com`)"
      - "traefik.http.routers.webhook-prod.entrypoints=websecure"
      - "traefik.http.routers.webhook-prod.tls.certresolver=letsencrypt"
      - "traefik.http.services.webhook-prod.loadbalancer.server.port=5000"
      
      # Production security middleware
      - "traefik.http.routers.webhook-prod.middlewares=webhook-prod-security,webhook-prod-ratelimit"
      
      # Enhanced security headers
      - "traefik.http.middlewares.webhook-prod-security.headers.customrequestheaders.X-Forwarded-Proto=https"
      - "traefik.http.middlewares.webhook-prod-security.headers.customresponseheaders.X-Content-Type-Options=nosniff"
      - "traefik.http.middlewares.webhook-prod-security.headers.customresponseheaders.X-Frame-Options=DENY"
      - "traefik.http.middlewares.webhook-prod-security.headers.customresponseheaders.X-XSS-Protection=1; mode=block"
      - "traefik.http.middlewares.webhook-prod-security.headers.customresponseheaders.Referrer-Policy=strict-origin-when-cross-origin"
      - "traefik.http.middlewares.webhook-prod-security.headers.customresponseheaders.Strict-Transport-Security=max-age=31536000; includeSubDomains"
      - "traefik.http.middlewares.webhook-prod-security.headers.customresponseheaders.Content-Security-Policy=default-src 'self'"
      
      # Production rate limiting
      - "traefik.http.middlewares.webhook-prod-ratelimit.ratelimit.average=20"
      - "traefik.http.middlewares.webhook-prod-ratelimit.ratelimit.burst=50"
      - "traefik.http.middlewares.webhook-prod-ratelimit.ratelimit.period=1m"
      
      # Health check configuration
      - "traefik.http.services.webhook-prod.loadbalancer.healthcheck.path=/health"
      - "traefik.http.services.webhook-prod.loadbalancer.healthcheck.interval=30s"

networks:
  traefik:
    external: true
  internal:
    internal: true
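
The rate-limit labels above (`average=20`, `period=1m`, `burst=50`) describe a token bucket: tokens refill at 20 per minute and up to 50 can be held for a burst. The sketch below models that behaviour for intuition only. It is an illustration under those assumptions, not Traefik's implementation, and the `TokenBucket` class is hypothetical.

```python
from dataclasses import dataclass


@dataclass
class TokenBucket:
    average: float       # requests allowed per period
    period_s: float      # period length in seconds
    burst: int           # bucket capacity
    tokens: float = 0.0  # current token count
    last: float = 0.0    # timestamp of the previous call

    def __post_init__(self) -> None:
        self.tokens = float(self.burst)  # the bucket starts full

    def allow(self, now: float) -> bool:
        """Refill for the elapsed time, then spend one token if available."""
        rate = self.average / self.period_s  # tokens per second
        self.tokens = min(self.burst, self.tokens + (now - self.last) * rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


bucket = TokenBucket(average=20, period_s=60, burst=50)
accepted = sum(bucket.allow(0.0) for _ in range(60))  # 60 requests at once
```

With these settings a client can send 50 requests at once, after which it is held to roughly one request every three seconds.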

SSL/TLS Configuration

Production Traefik SSL Configuration:

# traefik.yml
certificatesResolvers:
  letsencrypt:
    acme:
      email: admin@yourdomain.com
      storage: /acme.json
      httpChallenge:
        entryPoint: web
      # Production Let's Encrypt endpoint
      caServer: https://acme-v02.api.letsencrypt.org/directory

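# Enhanced TLS configuration
# (TLS options are dynamic configuration: load them through the file provider rather than the static traefik.yml)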
tls:
  options:
    default:
      minVersion: "VersionTLS12"
      maxVersion: "VersionTLS13"
      cipherSuites:
        - "TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384"
        - "TLS_ECDHE_ECDSA_WITH_AES_256_GCM_SHA384"
        - "TLS_ECDHE_RSA_WITH_CHACHA20_POLY1305"
        - "TLS_ECDHE_ECDSA_WITH_CHACHA20_POLY1305"
      curvePreferences:
        - "CurveP521"
        - "CurveP384"
      sniStrict: true
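
Automatic renewal is worth verifying independently of Traefik. The sketch below computes the days left on a certificate from its `notAfter` field, in the format returned by `ssl.SSLSocket.getpeercert()`. The `days_until_expiry` helper is hypothetical, and any alert threshold applied to its result is an assumption to tune per environment.

```python
import ssl
import time
from typing import Optional


def days_until_expiry(not_after: str, now: Optional[float] = None) -> float:
    """Days remaining before a certificate's notAfter timestamp.

    `not_after` looks like "Jan  5 09:34:43 2018 GMT".
    `now` is a Unix timestamp; it defaults to the current time.
    """
    expires = ssl.cert_time_to_seconds(not_after)
    current = time.time() if now is None else now
    return (expires - current) / 86400
```

A monitoring job could fetch the live certificate with `ssl.create_default_context()` and `getpeercert()`, pass `cert["notAfter"]` to this function, and alert if the result falls below the chosen threshold.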

📊 Monitoring and Observability

Production Monitoring Stack

Monitoring Architecture:

┌─────────────────────────────────────────────────────────────────┐
│                    MONITORING STACK                            │
├─────────────────────────────────────────────────────────────────┤
│  Application Metrics                                           │
│  ┌─────────────┐  ┌─────────────┐  ┌─────────────────────────┐  │
│  │Prometheus   │  │Grafana      │  │AlertManager             │  │
│  │Metrics      │  │Dashboards   │  │Notifications            │  │
│  └─────────────┘  └─────────────┘  └─────────────────────────┘  │
├─────────────────────────────────────────────────────────────────┤
│  Log Management                                                │
│  ┌─────────────┐  ┌─────────────┐  ┌─────────────────────────┐  │
│  │Loki         │  │Log          │  │Error                    │  │
│  │Aggregation  │  │Analysis     │  │Tracking                 │  │
│  └─────────────┘  └─────────────┘  └─────────────────────────┘  │
├─────────────────────────────────────────────────────────────────┤
│  Infrastructure Monitoring                                    │
│  ┌─────────────┐  ┌─────────────┐  ┌─────────────────────────┐  │
│  │Node         │  │Docker       │  │Network                  │  │
│  │Exporter     │  │Stats        │  │Monitoring               │  │
│  └─────────────┘  └─────────────┘  └─────────────────────────┘  │
└─────────────────────────────────────────────────────────────────┘

Prometheus Metrics Integration

Enhanced webhook_app.py with metrics:

from functools import wraps
import time

from flask import Flask, request
from prometheus_client import Counter, Histogram, Gauge, start_http_server

# In the real service, reuse the existing Flask app object instead of creating one here
app = Flask(__name__)

# Metrics definitions
webhook_requests_total = Counter(
    'webhook_requests_total',
    'Total webhook requests',
    ['method', 'endpoint', 'status_code', 'source_type']
)

webhook_request_duration = Histogram(
    'webhook_request_duration_seconds',
    'Webhook request duration',
    ['endpoint', 'source_type']
)

webhook_auth_failures = Counter(
    'webhook_auth_failures_total',
    'Total authentication failures',
    ['source_type', 'failure_reason']
)

notification_delivery_total = Counter(
    'notification_delivery_total',
    'Total notification delivery attempts',
    ['delivery_method', 'status']
)

active_connections = Gauge(
    'webhook_active_connections',
    'Number of active connections'
)

# Middleware for metrics collection
def metrics_middleware():
    def decorator(f):
        @wraps(f)  # keep the view's name so Flask endpoint names stay unique
        def wrapper(*args, **kwargs):
            start_time = time.perf_counter()
            source_type = 'particle' if 'ParticleBot' in request.headers.get('User-Agent', '') else 'generic'
            status_code = 500  # reported if the view raises

            try:
                result = f(*args, **kwargs)
                if isinstance(result, tuple) and len(result) > 1 and isinstance(result[1], int):
                    status_code = result[1]
                else:
                    # Response objects carry their own status; bare bodies default to 200
                    status_code = getattr(result, 'status_code', 200)
                return result

            finally:
                # Runs on success and on error, so each request is counted exactly once
                webhook_requests_total.labels(
                    method=request.method,
                    endpoint=request.endpoint,
                    status_code=status_code,
                    source_type=source_type
                ).inc()
                webhook_request_duration.labels(
                    endpoint=request.endpoint,
                    source_type=source_type
                ).observe(time.perf_counter() - start_time)

        return wrapper
    return decorator

# Add metrics endpoint
@app.route('/metrics')
def metrics():
    """Prometheus metrics endpoint"""
    from prometheus_client import generate_latest, CONTENT_TYPE_LATEST
    return generate_latest(), 200, {'Content-Type': CONTENT_TYPE_LATEST}

# Start metrics server
if __name__ == '__main__':
    # Port 8000 serves the same registry as /metrics; keep whichever your scrape config targets
    start_http_server(8000)  # Prometheus metrics on port 8000
    # app.run() starts Flask's development server; run behind a WSGI server in production
    app.run(host='0.0.0.0', port=5000)

Grafana Dashboard Configuration

Production Dashboard JSON:

{
  "dashboard": {
    "title": "Webhook Service Production Dashboard",
    "panels": [
      {
        "title": "Request Rate",
        "type": "graph",
        "targets": [
          {
            "expr": "rate(webhook_requests_total[5m])",
            "legendFormat": "{{source_type}} - {{status_code}}"
          }
        ]
      },
      {
        "title": "Response Time",
        "type": "graph", 
        "targets": [
          {
            "expr": "histogram_quantile(0.95, rate(webhook_request_duration_seconds_bucket[5m]))",
            "legendFormat": "95th percentile"
          },
          {
            "expr": "histogram_quantile(0.50, rate(webhook_request_duration_seconds_bucket[5m]))",
            "legendFormat": "50th percentile"
          }
        ]
      },
      {
        "title": "Authentication Failures",
        "type": "singlestat",
        "targets": [
          {
            "expr": "increase(webhook_auth_failures_total[1h])",
            "legendFormat": "Last Hour"
          }
        ]
      },
      {
        "title": "Notification Success Rate",
        "type": "graph",
        "targets": [
          {
            "expr": "sum(rate(notification_delivery_total{status=\"success\"}[5m])) / sum(rate(notification_delivery_total[5m])) * 100",
            "legendFormat": "Success Rate %"
          }
        ]
      }
    ]
  }
}
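
The `histogram_quantile` expressions in the Response Time panel estimate percentiles from cumulative bucket counts, interpolating linearly inside the bucket that contains the target rank. The sketch below reproduces that estimate for intuition. It is a simplified model of the documented behaviour, and the bucket boundaries in the example are made up.

```python
import math


def histogram_quantile(q: float, buckets: list[tuple[float, float]]) -> float:
    """Estimate quantile q from cumulative (upper_bound, count) buckets.

    Buckets must be sorted by upper bound and end with (math.inf, total).
    """
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for i, (bound, count) in enumerate(buckets):
        if count >= rank:
            if math.isinf(bound):
                # Rank lands in the +Inf bucket: report the highest finite bound
                return buckets[i - 1][0]
            in_bucket = count - prev_count
            if in_bucket == 0:
                return prev_bound
            # Linear interpolation between the bucket's lower and upper bound
            return prev_bound + (bound - prev_bound) * (rank - prev_count) / in_bucket
        prev_bound, prev_count = bound, count
    return math.nan


# Hypothetical bucket counts for 100 requests (upper bounds in seconds)
example = [(0.1, 50), (0.5, 90), (1.0, 98), (math.inf, 100)]
p95 = histogram_quantile(0.95, example)  # falls in the 0.5s to 1.0s bucket
```

Because the result is interpolated, its accuracy depends on bucket boundaries: a 500 ms latency target is only measurable if a bucket edge sits at or near 0.5 seconds.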

Alerting Rules

AlertManager Configuration:

# alertmanager.yml
global:
  smtp_smarthost: 'smtp.gmail.com:587'
  smtp_from: 'alerts@yourdomain.com'
  smtp_auth_username: 'alerts@yourdomain.com'
  smtp_auth_password: 'your-app-password'

route:
  group_by: ['alertname']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 1h
  receiver: 'webhook-alerts'

receivers:
- name: 'webhook-alerts'
  email_configs:
  - to: 'admin@yourdomain.com'
    headers:
      Subject: 'Webhook Service Alert - {{ .GroupLabels.alertname }}'
    text: |
      {{ range .Alerts }}
      Alert: {{ .Annotations.summary }}
      Description: {{ .Annotations.description }}
      Instance: {{ .Labels.instance }}
      Severity: {{ .Labels.severity }}
      {{ end }}

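# Prometheus alerting rules
# (these belong in a separate rules file referenced by rule_files in prometheus.yml, not in alertmanager.yml)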
groups:
- name: webhook-service
  rules:
  - alert: WebhookServiceDown
    expr: up{job="webhook-service"} == 0
    for: 1m
    labels:
      severity: critical
    annotations:
      summary: "Webhook service is down"
      description: "Webhook service has been down for more than 1 minute"

  - alert: HighErrorRate
    expr: rate(webhook_requests_total{status_code=~"5.."}[5m]) > 0.1
    for: 2m
    labels:
      severity: warning
    annotations:
      summary: "High error rate detected"
      description: "Error rate is {{ $value }} requests per second"

  - alert: HighResponseTime
    expr: histogram_quantile(0.95, rate(webhook_request_duration_seconds_bucket[5m])) > 1
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "High response time"
      description: "95th percentile response time is {{ $value }} seconds"

  - alert: AuthenticationFailures
    expr: increase(webhook_auth_failures_total[15m]) > 10
    for: 0m
    labels:
      severity: critical
    annotations:
      summary: "Multiple authentication failures"
      description: "{{ $value }} authentication failures in the last 15 minutes"

🎯 Production Success Metrics

Service Level Objectives (SLOs)

Availability SLO: 99.9% uptime

Measurement: HTTP 200 responses / Total HTTP requests
Error Budget: 43.2 minutes downtime per month
Alerting: Alert if availability drops below 99.5% over 1 hour
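
The 43.2-minute error budget follows directly from the 99.9% target over a 30-day month. A short sketch of the arithmetic, using hypothetical helper names:

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Allowed downtime, in minutes, for an availability SLO over a window."""
    return (1.0 - slo) * window_days * 24 * 60


def availability(ok_requests: int, total_requests: int) -> float:
    """Successful responses divided by total requests (1.0 when idle)."""
    return ok_requests / total_requests if total_requests else 1.0


budget = error_budget_minutes(0.999)   # about 43.2 minutes per 30 days
measured = availability(9990, 10000)   # 0.999, exactly on target
```

The same function shows how quickly the budget shrinks: a 99.99% target over the same window leaves only about 4.3 minutes.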

Latency SLO: 95% of requests < 500ms

Measurement: Response time distribution
Alerting: Alert if 95th percentile > 500ms for 5 minutes

Error Rate SLO: <0.1% error rate

Measurement: HTTP 5xx responses / Total HTTP requests
Alerting: Alert if error rate > 0.5% over 5 minutes

Security SLO: <10 authentication failures per day

Measurement: Failed authentication attempts
Alerting: Alert if >50 failures in 1 hour

Key Performance Indicators

Business Metrics:

□ Total webhook events processed per day
□ Notification delivery success rate (target: >99%)
□ Average response time (target: <100ms)
□ Cost per webhook processed
□ Mean time to detection (MTTD) for issues
□ Mean time to resolution (MTTR) for incidents
□ Infrastructure utilization efficiency
□ Customer satisfaction score

📞 Production Support

Incident Response

Severity Levels:

SEVERITY 1 - Critical (Service Down)
Response Time: 15 minutes
Resolution Time: 1 hour
Actions: Immediate escalation, war room, customer communication

SEVERITY 2 - High (Degraded Performance)
Response Time: 30 minutes
Resolution Time: 4 hours
Actions: Team lead notification, monitoring increase

SEVERITY 3 - Medium (Minor Issues)
Response Time: 2 hours
Resolution Time: 24 hours
Actions: Standard troubleshooting, ticket tracking

SEVERITY 4 - Low (Enhancement Requests)
Response Time: Next business day
Resolution Time: Per roadmap
Actions: Backlog prioritization

On-Call Procedures

24/7 Support Structure:

Primary On-Call: Initial response and triage
Secondary On-Call: Backup coverage and escalation
Engineering Manager: Resource coordination
Senior Leadership: Business impact decisions

Escalation Timeline:

15 minutes: Auto-escalate if no response
30 minutes: Escalate to secondary on-call
1 hour: Escalate to engineering manager
2 hours: Escalate to senior leadership
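
The escalation timeline can be encoded as a lookup so that paging tools and runbooks share one definition. The sketch below is illustrative: the step labels mirror the timeline above, and `escalation_step` is a hypothetical helper rather than part of any paging product.

```python
# (minutes without response, escalation step), highest threshold first
ESCALATION_STEPS = [
    (120, "senior leadership"),
    (60, "engineering manager"),
    (30, "secondary on-call"),
    (15, "auto-escalate"),
]


def escalation_step(minutes_without_response: float) -> str:
    """Return the furthest escalation step reached after the given delay."""
    for threshold, step in ESCALATION_STEPS:
        if minutes_without_response >= threshold:
            return step
    return "primary on-call"  # still within the initial response window


current = escalation_step(45)  # 45 minutes unanswered: secondary on-call
```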

🚀 Production Deployment Summary:

This production deployment guide provides enterprise-grade reliability with:

✅ 99.9% Uptime Target - Comprehensive monitoring and alerting
✅ Enterprise Security - Multi-layer security hardening
✅ Auto-scaling - Dynamic resource allocation
✅ Disaster Recovery - Automated backup and recovery procedures
✅ 24/7 Support - Structured incident response and on-call coverage
✅ Performance Optimization - Sub-500ms response times
