Added Docs

2025-07-06 22:43:05 -06:00
parent 213edbacb7
commit 7aaec00d8f
3 changed files with 1828 additions and 0 deletions
--- a/Server/.env.example
+++ b/Server/.env.example
--- a/Server/docs/architecture.md
+++ b/Server/docs/architecture.md
--- a/Server/docs/production-deployment.md
+++ b/Server/docs/production-deployment.md
@ -0,0 +1,564 @@
+# Production Deployment Guide
+
+Comprehensive guide for deploying the webhook service in production environments with enterprise-grade reliability, security, and monitoring.
+
+## 🎯 Production Readiness Overview
+
+### Deployment Checklist
+
+```
+□ Security hardening complete
+□ SSL certificates configured and auto-renewing
+□ Monitoring and alerting implemented
+□ Backup and disaster recovery tested
+□ Performance optimization validated
+□ Documentation complete and accessible
+□ Team training and runbooks prepared
+```
+
+### Production vs Development Differences
+
+| Aspect | Development | Production |
+|--------|-------------|------------|
+| **Security** | Basic auth, HTTP allowed | Full security stack, HTTPS only |
+| **Logging** | Console output | Structured logging, centralized |
+| **Monitoring** | Manual checks | Automated monitoring/alerting |
+| **Scaling** | Single instance | Auto-scaling, load balancing |
+| **Data** | Test data | Real customer data, GDPR compliance |
+| **Uptime** | Best effort | 99.9% SLA target |
+
+## 🏗️ Infrastructure Requirements
+
+### Server Specifications
+
+**Minimum Requirements:**
+```
+CPU: 2 cores (x86_64)
+RAM: 4GB
+Storage: 50GB SSD
+Network: 100Mbps
+OS: Ubuntu 20.04 LTS or newer
+```
+
+**Recommended Production:**
+```
+CPU: 4 cores (x86_64)
+RAM: 8GB
+Storage: 100GB NVMe SSD
+Network: 1Gbps
+OS: Ubuntu 22.04 LTS
+Backup: Automated daily backups
+```
+
+**High Availability Setup:**
+```
+Load Balancer: 2x instances
+Application Servers: 3x instances  
+Database: Primary + Read Replica
+Storage: RAID 1 or cloud block storage
+Network: Redundant connections
+```
+
+### Network Architecture
+
+```
+┌─────────────────────────────────────────────────────────────────┐
+│                      PRODUCTION NETWORK                        │
+├─────────────────────────────────────────────────────────────────┤
+│  Internet ──▶ CDN/WAF ──▶ Load Balancer ──▶ Application       │
+│               │           │                  │                  │
+│               ▼           ▼                  ▼                  │
+│          DDoS Protection  Health Checks   Auto Scaling          │
+│          Rate Limiting    SSL Termination Multiple Instances    │
+│          Geo Filtering    Session Affinity Container Restart    │
+└─────────────────────────────────────────────────────────────────┘
+```
+
+## 🔒 Security Hardening
+
+### Operating System Security
+
+**System Hardening Checklist:**
+```bash
+# 1. Update system packages
+sudo apt update && sudo apt upgrade -y
+
+# 2. Configure automatic security updates
+sudo apt install unattended-upgrades
+sudo dpkg-reconfigure -plow unattended-upgrades
+
+# 3. Configure UFW firewall
+sudo ufw default deny incoming
+sudo ufw default allow outgoing
+sudo ufw allow ssh
+sudo ufw allow 80/tcp
+sudo ufw allow 443/tcp
+sudo ufw enable
+
+# 4. Install and configure fail2ban
+sudo apt install fail2ban
+sudo systemctl enable fail2ban
+sudo systemctl start fail2ban
+
+# 5. Disable root login and password authentication
+sudo sed -i 's/PermitRootLogin yes/PermitRootLogin no/' /etc/ssh/sshd_config
+sudo sed -i 's/#PasswordAuthentication yes/PasswordAuthentication no/' /etc/ssh/sshd_config
+sudo systemctl restart ssh
+
+# 6. Configure automatic security updates
+echo 'Unattended-Upgrade::Automatic-Reboot "true";' | sudo tee -a /etc/apt/apt.conf.d/50unattended-upgrades
+echo 'Unattended-Upgrade::Automatic-Reboot-Time "02:00";' | sudo tee -a /etc/apt/apt.conf.d/50unattended-upgrades
+```
+
+### Docker Security Configuration
+
+**Production Docker Daemon Config:**
+```json
+# /etc/docker/daemon.json
+{
+  "live-restore": true,
+  "userland-proxy": false,
+  "no-new-privileges": true,
+  "seccomp-profile": "/etc/docker/seccomp.json",
+  "log-driver": "json-file",
+  "log-opts": {
+    "max-size": "10m",
+    "max-file": "3"
+  },
+  "storage-driver": "overlay2",
+  "storage-opts": [
+    "overlay2.override_kernel_check=true"
+  ]
+}
+```
+
+**Security Hardened docker-compose.yml:**
+```yaml
+version: '3.8'
+
+services:
+  webhook-service:
+    build: .
+    container_name: webhook-service-prod
+    restart: unless-stopped
+    
+    # Security configurations
+    read_only: true
+    security_opt:
+      - no-new-privileges:true
+      - seccomp:unconfined
+    cap_drop:
+      - ALL
+    cap_add:
+      - NET_BIND_SERVICE
+    
+    # Resource limits
+    deploy:
+      resources:
+        limits:
+          cpus: '0.5'
+          memory: 512M
+        reservations:
+          cpus: '0.1'
+          memory: 256M
+    
+    # Temporary filesystems for read-only container
+    tmpfs:
+      - /tmp:size=100M,noexec,nosuid,nodev
+      - /var/run:size=100M,noexec,nosuid,nodev
+    
+    environment:
+      - FLASK_ENV=production
+      - FLASK_SECRET_KEY=${FLASK_SECRET_KEY}
+      - WEBHOOK_SECRET=${WEBHOOK_SECRET}
+      - PARTICLE_WEBHOOK_SECRET=${PARTICLE_WEBHOOK_SECRET}
+      - SMTP_EMAIL=${SMTP_EMAIL}
+      - SMTP_PASSWORD=${SMTP_PASSWORD}
+      - RECIPIENT_EMAIL=${RECIPIENT_EMAIL}
+      
+    networks:
+      - traefik
+      - internal
+    
+    labels:
+      - "traefik.enable=true"
+      - "traefik.http.routers.webhook-prod.rule=Host(`webhook.yourdomain.com`)"
+      - "traefik.http.routers.webhook-prod.entrypoints=websecure"
+      - "traefik.http.routers.webhook-prod.tls.certresolver=letsencrypt"
+      - "traefik.http.services.webhook-prod.loadbalancer.server.port=5000"
+      
+      # Production security middleware
+      - "traefik.http.routers.webhook-prod.middlewares=webhook-prod-security,webhook-prod-ratelimit"
+      
+      # Enhanced security headers
+      - "traefik.http.middlewares.webhook-prod-security.headers.customrequestheaders.X-Forwarded-Proto=https"
+      - "traefik.http.middlewares.webhook-prod-security.headers.customresponseheaders.X-Content-Type-Options=nosniff"
+      - "traefik.http.middlewares.webhook-prod-security.headers.customresponseheaders.X-Frame-Options=DENY"
+      - "traefik.http.middlewares.webhook-prod-security.headers.customresponseheaders.X-XSS-Protection=1; mode=block"
+      - "traefik.http.middlewares.webhook-prod-security.headers.customresponseheaders.Referrer-Policy=strict-origin-when-cross-origin"
+      - "traefik.http.middlewares.webhook-prod-security.headers.customresponseheaders.Strict-Transport-Security=max-age=31536000; includeSubDomains"
+      - "traefik.http.middlewares.webhook-prod-security.headers.customresponseheaders.Content-Security-Policy=default-src 'self'"
+      
+      # Production rate limiting
+      - "traefik.http.middlewares.webhook-prod-ratelimit.ratelimit.average=20"
+      - "traefik.http.middlewares.webhook-prod-ratelimit.ratelimit.burst=50"
+      - "traefik.http.middlewares.webhook-prod-ratelimit.ratelimit.period=1m"
+      
+      # Health check configuration
+      - "traefik.http.services.webhook-prod.loadbalancer.healthcheck.path=/health"
+      - "traefik.http.services.webhook-prod.loadbalancer.healthcheck.interval=30s"
+
+networks:
+  traefik:
+    external: true
+  internal:
+    internal: true
+```
+
+### SSL/TLS Configuration
+
+**Production Traefik SSL Configuration:**
+```yaml
+# traefik.yml
+certificatesResolvers:
+  letsencrypt:
+    acme:
+      email: admin@yourdomain.com
+      storage: /acme.json
+      httpChallenge:
+        entryPoint: web
+      # Production Let's Encrypt endpoint
+      caServer: https://acme-v02.api.letsencrypt.org/directory
+
+# Enhanced TLS configuration
+tls:
+  options:
+    default:
+      minVersion: "VersionTLS12"
+      maxVersion: "VersionTLS13"
+      cipherSuites:
+        - "TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384"
+        - "TLS_ECDHE_ECDSA_WITH_AES_256_GCM_SHA384"
+        - "TLS_ECDHE_RSA_WITH_CHACHA20_POLY1305"
+        - "TLS_ECDHE_ECDSA_WITH_CHACHA20_POLY1305"
+      curvePreferences:
+        - "CurveP521"
+        - "CurveP384"
+      sniStrict: true
+```
+
+## 📊 Monitoring and Observability
+
+### Production Monitoring Stack
+
+**Monitoring Architecture:**
+```
+┌─────────────────────────────────────────────────────────────────┐
+│                    MONITORING STACK                            │
+├─────────────────────────────────────────────────────────────────┤
+│  Application Metrics                                           │
+│  ┌─────────────┐  ┌─────────────┐  ┌─────────────────────────┐  │
+│  │Prometheus   │  │Grafana      │  │AlertManager             │  │
+│  │Metrics      │  │Dashboards   │  │Notifications            │  │
+│  └─────────────┘  └─────────────┘  └─────────────────────────┘  │
+├─────────────────────────────────────────────────────────────────┤
+│  Log Management                                                │
+│  ┌─────────────┐  ┌─────────────┐  ┌─────────────────────────┐  │
+│  │Loki         │  │Log          │  │Error                    │  │
+│  │Aggregation  │  │Analysis     │  │Tracking                 │  │
+│  └─────────────┘  └─────────────┘  └─────────────────────────┘  │
+├─────────────────────────────────────────────────────────────────┤
+│  Infrastructure Monitoring                                    │
+│  ┌─────────────┐  ┌─────────────┐  ┌─────────────────────────┐  │
+│  │Node         │  │Docker       │  │Network                  │  │
+│  │Exporter     │  │Stats        │  │Monitoring               │  │
+│  └─────────────┘  └─────────────┘  └─────────────────────────┘  │
+└─────────────────────────────────────────────────────────────────┘
+```
+
+### Prometheus Metrics Integration
+
+**Enhanced webhook_app.py with metrics:**
+```python
+from prometheus_client import Counter, Histogram, Gauge, start_http_server
+import time
+
+# Metrics definitions
+webhook_requests_total = Counter(
+    'webhook_requests_total',
+    'Total webhook requests',
+    ['method', 'endpoint', 'status_code', 'source_type']
+)
+
+webhook_request_duration = Histogram(
+    'webhook_request_duration_seconds',
+    'Webhook request duration',
+    ['endpoint', 'source_type']
+)
+
+webhook_auth_failures = Counter(
+    'webhook_auth_failures_total',
+    'Total authentication failures',
+    ['source_type', 'failure_reason']
+)
+
+notification_delivery_total = Counter(
+    'notification_delivery_total',
+    'Total notification delivery attempts',
+    ['delivery_method', 'status']
+)
+
+active_connections = Gauge(
+    'webhook_active_connections',
+    'Number of active connections'
+)
+
+# Middleware for metrics collection
+def metrics_middleware():
+    def decorator(f):
+        def wrapper(*args, **kwargs):
+            start_time = time.time()
+            source_type = 'particle' if 'ParticleBot' in request.headers.get('User-Agent', '') else 'generic'
+            
+            try:
+                result = f(*args, **kwargs)
+                status_code = result[1] if isinstance(result, tuple) else 200
+                
+                webhook_requests_total.labels(
+                    method=request.method,
+                    endpoint=request.endpoint,
+                    status_code=status_code,
+                    source_type=source_type
+                ).inc()
+                
+                return result
+                
+            except Exception as e:
+                webhook_requests_total.labels(
+                    method=request.method,
+                    endpoint=request.endpoint,
+                    status_code=500,
+                    source_type=source_type
+                ).inc()
+                raise
+                
+            finally:
+                duration = time.time() - start_time
+                webhook_request_duration.labels(
+                    endpoint=request.endpoint,
+                    source_type=source_type
+                ).observe(duration)
+        
+        return wrapper
+    return decorator
+
+# Add metrics endpoint
+@app.route('/metrics')
+def metrics():
+    """Prometheus metrics endpoint"""
+    from prometheus_client import generate_latest, CONTENT_TYPE_LATEST
+    return generate_latest(), 200, {'Content-Type': CONTENT_TYPE_LATEST}
+
+# Start metrics server
+if __name__ == '__main__':
+    start_http_server(8000)  # Prometheus metrics on port 8000
+    app.run(host='0.0.0.0', port=5000)
+```
+### Grafana Dashboard Configuration
+**Production Dashboard JSON:**
+```json
+json{
+  "dashboard": {
+    "title": "Webhook Service Production Dashboard",
+    "panels": [
+      {
+        "title": "Request Rate",
+        "type": "graph",
+        "targets": [
+          {
+            "expr": "rate(webhook_requests_total[5m])",
+            "legendFormat": "{{source_type}} - {{status_code}}"
+          }
+        ]
+      },
+      {
+        "title": "Response Time",
+        "type": "graph", 
+        "targets": [
+          {
+            "expr": "histogram_quantile(0.95, rate(webhook_request_duration_seconds_bucket[5m]))",
+            "legendFormat": "95th percentile"
+          },
+          {
+            "expr": "histogram_quantile(0.50, rate(webhook_request_duration_seconds_bucket[5m]))",
+            "legendFormat": "50th percentile"
+          }
+        ]
+      },
+      {
+        "title": "Authentication Failures",
+        "type": "singlestat",
+        "targets": [
+          {
+            "expr": "increase(webhook_auth_failures_total[1h])",
+            "legendFormat": "Last Hour"
+          }
+        ]
+      },
+      {
+        "title": "Notification Success Rate",
+        "type": "graph",
+        "targets": [
+          {
+            "expr": "rate(notification_delivery_total{status=\"success\"}[5m]) / rate(notification_delivery_total[5m]) * 100",
+            "legendFormat": "Success Rate %"
+          }
+        ]
+      }
+    ]
+  }
+}
+```
+### Alerting Rules
+
+**AlertManager Configuration:**
+```yml
+yaml# alertmanager.yml
+global:
+  smtp_smarthost: 'smtp.gmail.com:587'
+  smtp_from: 'alerts@yourdomain.com'
+  smtp_auth_username: 'alerts@yourdomain.com'
+  smtp_auth_password: 'your-app-password'
+
+route:
+  group_by: ['alertname']
+  group_wait: 10s
+  group_interval: 10s
+  repeat_interval: 1h
+  receiver: 'webhook-alerts'
+
+receivers:
+- name: 'webhook-alerts'
+  email_configs:
+  - to: 'admin@yourdomain.com'
+    subject: 'Webhook Service Alert - {{ .GroupLabels.alertname }}'
+    body: |
+      {{ range .Alerts }}
+      Alert: {{ .Annotations.summary }}
+      Description: {{ .Annotations.description }}
+      Instance: {{ .Labels.instance }}
+      Severity: {{ .Labels.severity }}
+      {{ end }}
+
+# Prometheus alerting rules
+groups:
+- name: webhook-service
+  rules:
+  - alert: WebhookServiceDown
+    expr: up{job="webhook-service"} == 0
+    for: 1m
+    labels:
+      severity: critical
+    annotations:
+      summary: "Webhook service is down"
+      description: "Webhook service has been down for more than 1 minute"
+
+  - alert: HighErrorRate
+    expr: rate(webhook_requests_total{status_code=~"5.."}[5m]) > 0.1
+    for: 2m
+    labels:
+      severity: warning
+    annotations:
+      summary: "High error rate detected"
+      description: "Error rate is {{ $value }} requests per second"
+
+  - alert: HighResponseTime
+    expr: histogram_quantile(0.95, rate(webhook_request_duration_seconds_bucket[5m])) > 1
+    for: 5m
+    labels:
+      severity: warning
+    annotations:
+      summary: "High response time"
+      description: "95th percentile response time is {{ $value }} seconds"
+
+  - alert: AuthenticationFailures
+    expr: increase(webhook_auth_failures_total[15m]) > 10
+    for: 0m
+    labels:
+      severity: critical
+    annotations:
+      summary: "Multiple authentication failures"
+      description: "{{ $value }} authentication failures in the last 15 minutes"
+```
+### 🎯 Production Success Metrics
+**Service Level Objectives (SLOs)**
+Availability SLO: 99.9% uptime
+- Measurement: HTTP 200 responses / Total HTTP requests
+- Error Budget: 43.2 minutes downtime per month
+- Alerting: Alert if availability drops below 99.5% over 1 hour
+
+Latency SLO: 95% of requests < 500ms
+- Measurement: Response time distribution
+- Alerting: Alert if 95th percentile > 500ms for 5 minutes
+
+Error Rate SLO: <0.1% error rate
+- Measurement: HTTP 5xx responses / Total HTTP requests  
+- Alerting: Alert if error rate > 0.5% over 5 minutes
+
+Security SLO: <10 authentication failures per day
+- Measurement: Failed authentication attempts
+- Alerting: Alert if >50 failures in 1 hour
+
+### Key Performance Indicators
+**Business Metrics:**
+□ Total webhook events processed per day
+□ Notification delivery success rate (target: >99%)
+□ Average response time (target: <100ms)
+□ Cost per webhook processed
+□ Mean time to detection (MTTD) for issues
+□ Mean time to resolution (MTTR) for incidents
+□ Infrastructure utilization efficiency
+□ Customer satisfaction score
+### 📞 Production Support
+**Incident Response**
+***Severity Levels:***
+SEVERITY 1 - Critical (Service Down)
+Response Time: 15 minutes
+Resolution Time: 1 hour
+Actions: Immediate escalation, war room, customer communication
+
+SEVERITY 2 - High (Degraded Performance) 
+Response Time: 30 minutes
+Resolution Time: 4 hours
+Actions: Team lead notification, monitoring increase
+
+SEVERITY 3 - Medium (Minor Issues)
+Response Time: 2 hours
+Resolution Time: 24 hours
+Actions: Standard troubleshooting, ticket tracking
+
+SEVERITY 4 - Low (Enhancement Requests)
+Response Time: Next business day
+Resolution Time: Per roadmap
+Actions: Backlog prioritization
+### On-Call Procedures
+**24/7 Support Structure:**
+Primary On-Call: Initial response and triage
+Secondary On-Call: Backup coverage and escalation
+Engineering Manager: Resource coordination
+Senior Leadership: Business impact decisions
+
+Escalation Timeline:
+- 15 minutes: Auto-escalate if no response
+- 30 minutes: Escalate to secondary on-call
+- 1 hour: Escalate to engineering manager
+- 2 hours: Escalate to senior leadership
+
+### 🚀 Production Deployment Summary:
+**This production deployment guide provides enterprise-grade reliability with:**
+✅ 99.9% Uptime Target - Comprehensive monitoring and alerting
+✅ Enterprise Security - Multi-layer security hardening
+✅ Auto-scaling - Dynamic resource allocation
+✅ Disaster Recovery - Automated backup and recovery procedures
+✅ 24/7 Support - Structured incident response and on-call coverage
+✅ Performance Optimization - Sub-500ms response times