Ops — Runbooks¶

Operational runbooks for SalamBot. Each procedure follows the structure: Context ► Prerequisites ► Steps ► Checks ► Rollback/Workaround ► SLO/Metrics ► Communication ► Postmortem

Purpose & scope¶

This document centralises the operational procedures for SalamBot:

Critical incidents (P0/P1)
Preventive maintenance
Escalation and communication
Emergency rollback

SLO/SLA note: error budgets, burn rates and PromQL formulas are maintained in Ops/slo-sla.md. This document does not duplicate them.

Timestamps & format (mandatory)¶

All timestamps in examples and scripts must use ISO 8601 in UTC with a Z suffix (e.g. 2025-08-14T10:30:00Z).
Local-time display (UTC+1, etc.) is a UI option only; correlation is always done in UTC (Z).
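
A minimal sketch of producing and checking such timestamps in shell. The function names `now_utc_z` and `is_utc_z` are illustrative and not part of the SalamBot tooling; the check validates the shape of the string only, not calendar validity.

```shell
# Print the current time as ISO 8601 UTC with a Z suffix.
now_utc_z() {
  date -u +%Y-%m-%dT%H:%M:%SZ
}

# Return 0 if the argument looks like YYYY-MM-DDTHH:MM:SSZ, 1 otherwise.
is_utc_z() {
  printf '%s\n' "$1" |
    grep -Eq '^[0-9]{4}-[0-9]{2}-[0-9]{2}T[0-9]{2}:[0-9]{2}:[0-9]{2}Z$'
}
```

For instance, `is_utc_z "2025-08-14T11:30:00+01:00"` fails because the value carries a local offset instead of `Z`.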

Reference SLO budgets¶

Full reference: see Ops/slo-sla.md for error budgets, burn rates and detailed PromQL formulas.

Metric	SLO budget	Alert threshold
Ingestion	≤150ms	>120ms
NLU	≤200ms	>160ms
Retrieval	≤500ms	>400ms
TTFB	≤600ms	>480ms
E2E (p95)	≤2500ms	>2000ms

SLO budgets¶

The authoritative, up-to-date values are kept in Ops/slo-sla.md. The figures in the table above are an indicative snapshot; if they differ, Ops/slo-sla.md prevails.

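Error budget calculation¶

# Availability target: 99.9% (monthly, 30-day window)
# Shell arithmetic is integer-only, so the computation is delegated to awk.
uptime_target=99.9
downtime_allowed=$(awk -v t="$uptime_target" 'BEGIN { printf "%.1f", 30*24*60 * (100 - t) / 100 }')  # minutes
echo "Monthly budget: ${downtime_allowed} min"  # 43.2 min for 99.9%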

Monitoring and alerting¶

The following diagram shows the monitoring architecture and the alerting flows:

flowchart TD
    METRICS[📊 Metrics collection] --> PROM[🔍 Prometheus]
    LOGS[📝 Log collection] --> LOKI[📚 Loki]
    TRACES[🔗 Trace collection] --> JAEGER[🕸️ Jaeger]

    PROM --> GRAFANA[📈 Grafana Dashboards]
    LOKI --> GRAFANA
    JAEGER --> GRAFANA

    PROM --> ALERT_RULES[⚠️ Alert rules]

    ALERT_RULES --> SLO_BREACH{SLO breach?}
    ALERT_RULES --> ERROR_RATE{High error rate?}
    ALERT_RULES --> LATENCY_HIGH{High latency?}
    ALERT_RULES --> RESOURCE_LOW{Low resources?}

    SLO_BREACH -->|Yes| P0_ALERT[🚨 P0 alert]
    ERROR_RATE -->|>5%| P1_ALERT[⚠️ P1 alert]
    LATENCY_HIGH -->|>SLO| P1_ALERT
    RESOURCE_LOW -->|CPU>80%| P2_ALERT[⚡ P2 alert]

    P0_ALERT --> PAGER[📱 PagerDuty]
    P1_ALERT --> SLACK[💬 Slack #incidents]
    P2_ALERT --> SLACK

    PAGER --> ONCALL[👨‍💻 On-call Engineer]
    SLACK --> PLATFORM_TEAM[👥 Platform Team]

    ONCALL --> INCIDENT_MGMT[🎯 Incident management]
    PLATFORM_TEAM --> INCIDENT_MGMT

    INCIDENT_MGMT --> TRIAGE[🔍 Triage P0/P1/P2]
    TRIAGE --> INVESTIGATION[🕵️ Investigation]
    INVESTIGATION --> MITIGATION[🛠️ Mitigation]
    MITIGATION --> RESOLUTION[✅ Resolution]

    RESOLUTION --> POSTMORTEM[📋 Postmortem]
    POSTMORTEM --> IMPROVEMENTS[🔄 Improvements]

    subgraph "SLO metrics"
        SLO_INGESTION["Ingestion ≤150ms"]
        SLO_NLU["NLU ≤200ms"]
        SLO_RAG["RAG ≤500ms"]
        SLO_TTFB["TTFB ≤600ms"]
        SLO_E2E["E2E ≤2500ms"]
    end

    subgraph "Alerts per component"
        GATEWAY_ALERTS["Gateway: latency, errors"]
        NLU_ALERTS["NLU: accuracy, latency"]
        RAG_ALERTS["RAG: recall, latency"]
        LLM_ALERTS["LLM: tokens/s, errors"]
        INFRA_ALERTS["Infra: CPU, memory, disk"]
    end

    METRICS --> SLO_INGESTION
    METRICS --> SLO_NLU
    METRICS --> SLO_RAG
    METRICS --> SLO_TTFB
    METRICS --> SLO_E2E

    ALERT_RULES --> GATEWAY_ALERTS
    ALERT_RULES --> NLU_ALERTS
    ALERT_RULES --> RAG_ALERTS
    ALERT_RULES --> LLM_ALERTS
    ALERT_RULES --> INFRA_ALERTS

    style P0_ALERT fill:#f44336
    style P1_ALERT fill:#ff9800
    style P2_ALERT fill:#2196f3
    style PAGER fill:#e91e63
    style ONCALL fill:#4caf50
    style RESOLUTION fill:#4caf50
    style IMPROVEMENTS fill:#9c27b0

Incident triage & severity (P0/P1/P2)¶

Context¶

Incidents are classified by business impact, with response times suited to the Morocco time zone (UTC+1).

Severity criteria¶

Severity	Impact	Initial comms deadline	Target resolution time
P0	Service unavailable	≤15 min	≤2h
P1	Major degradation	≤30 min	≤4h
P2	Minor degradation	≤1h	≤24h
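
The communication deadlines in the table can be turned into concrete UTC timestamps. The sketch below assumes GNU `date` (for the `-d` option); the function name `comms_deadline` is hypothetical.

```shell
# Print the deadline for the first communication, given a severity and a
# detection time in ISO 8601 UTC. Requires GNU date (-d).
comms_deadline() {
  severity=$1
  detected=$2
  case "$severity" in
    P0) minutes=15 ;;
    P1) minutes=30 ;;
    P2) minutes=60 ;;
    *) echo "unknown severity: $severity" >&2; return 1 ;;
  esac
  # Convert to epoch seconds first so the offset is plain arithmetic.
  epoch=$(date -u -d "$detected" +%s) || return 1
  date -u -d "@$((epoch + minutes * 60))" +%Y-%m-%dT%H:%M:%SZ
}
```

For a P0 detected at 2025-08-14T10:30:00Z, `comms_deadline P0 2025-08-14T10:30:00Z` prints 2025-08-14T10:45:00Z.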

Steps¶

Initial triage
[ ] Assess impact (number of users, affected channels)
[ ] Assign severity P0/P1/P2
[ ] Create the incident in the tracking system
Initial communication
[ ] Notify #incidents-salambot
[ ] Alert the on-call engineer per the escalation policy
[ ] Publish the initial status (status page)
Escalation
P0: Platform Team + Product Owner immediately
P1: Platform Team + Product Owner within 30 min
P2: Platform Team only

Incident message template¶

🚨 [P0/P1/P2] Incident #INC-YYYY-NNNN
📍 Impact: [Short description]
🕐 Detected: YYYY-MM-DDTHH:MM:SSZ (UTC)
👥 Assigned: @username
📊 Status: Investigation in progress
🔗 Tracking: [incident link]

Incident management workflow¶

The following diagram shows the full incident management procedure:

flowchart TD
    A[🚨 Incident detected] --> B{Impact assessment}

    B -->|Service down<br/>Users blocked| C[P0 - Critical]
    B -->|Major degradation<br/>Limited functionality| D[P1 - Major]
    B -->|Minor degradation<br/>Limited impact| E[P2 - Minor]

    C --> F[Comms ≤15min<br/>Immediate escalation]
    D --> G[Comms ≤30min<br/>Escalation within 30min]
    E --> H[Comms ≤1h<br/>Platform Team]

    F --> I[Investigation P0]
    G --> J[Investigation P1]
    H --> K[Investigation P2]

    I --> L{Cause identified?}
    J --> L
    K --> L

    L -->|No| M[Expert escalation<br/>Detailed logs]
    L -->|Yes| N[Implement fix]

    M --> L
    N --> O{Fix validated?}

    O -->|No| P[Rollback<br/>Plan B]
    O -->|Yes| Q[Post-fix monitoring]

    P --> Q
    Q --> R[Resolution communication]
    R --> S[Postmortem<br/>Preventive actions]

    style C fill:#ff6b6b
    style D fill:#ffa726
    style E fill:#66bb6a
    style F fill:#ff6b6b
    style G fill:#ffa726
    style H fill:#66bb6a

Checks¶

[ ] Incident tracked with correlation_id
[ ] Stakeholders notified per the RACI matrix
[ ] Public status updated

Restarting / scaling services¶

Context¶

Targeted restart or scaling of SalamBot components when service degrades.

Prerequisites¶

Docker Compose or kubectl access
Active monitoring (Prometheus/Grafana)

Docker Compose steps¶

# Check service status
docker compose ps

# Follow logs in real time
docker compose logs -f gateway orchestrateur nlu rag

# Targeted restart
docker compose restart gateway
docker compose restart orchestrateur
docker compose restart nlu
docker compose restart rag

# Rebuild if needed
docker compose up -d --no-deps --build orchestrateur

# Flush the Redis cache if applicable.
# Caution: FLUSHALL deletes every key in every Redis database, including any
# queue or other state kept in the same instance.
docker compose exec redis redis-cli FLUSHALL

Kubernetes steps¶

# Pod status
kubectl get pods -n salambot

# Rolling restart
kubectl rollout restart deploy/gateway -n salambot
kubectl rollout restart deploy/orchestrateur -n salambot
kubectl rollout restart deploy/nlu -n salambot
kubectl rollout restart deploy/rag -n salambot

# Horizontal scaling
kubectl scale deploy/llm-router -n salambot --replicas=3
kubectl scale deploy/rag -n salambot --replicas=2

# Check the rollout
kubectl rollout status deploy/gateway -n salambot

Scaling workflow¶

The following diagram shows the service scaling procedure:

flowchart TD
    A[📊 Metric alerts] --> B{Load type?}

    B -->|CPU > 80%| C[Horizontal scaling]
    B -->|Memory > 85%| D[Vertical scaling]
    B -->|Latency > SLO| E[Bottleneck analysis]

    C --> F[kubectl scale<br/>replicas +1]
    D --> G[Raise limits<br/>requests]
    E --> H{Slow component?}

    H -->|NLU| I[Scale NLU pods]
    H -->|RAG| J[Scale RAG + Qdrant]
    H -->|LLM| K[Scale LLM Router]
    H -->|DB| L[Optimise queries]

    F --> M[Check distribution]
    G --> N[Redeploy pods]
    I --> M
    J --> M
    K --> M
    L --> O[Analyse slow queries]

    M --> P{Metrics OK?}
    N --> P
    O --> P

    P -->|No| Q[Additional scaling]
    P -->|Yes| R[Continuous monitoring]

    Q --> S{Limit reached?}
    S -->|Yes| T[Infra escalation]
    S -->|No| F

    R --> U[Document actions]
    T --> U

    style C fill:#4fc3f7
    style D fill:#81c784
    style E fill:#ffb74d
    style T fill:#ff8a65

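Checks¶

[ ] All pods/containers healthy
[ ] Latency metrics back to normal
[ ] API smoke tests pass

Rollback¶

# Docker Compose: recreate the whole stack from the current configuration.
# This causes a brief outage and does not revert an image; to return to a
# previous image, repoint the service tag first (see the hotfix rollback procedure).
docker compose down && docker compose up -d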

# Kubernetes
kubectl rollout undo deploy/gateway -n salambot

E2E latency breach (p95 > 2.5s)¶

Context¶

End-to-end latency (p95) exceeds the 2500ms SLO budget (see Ops/slo-sla.md).

Prerequisites¶

Prometheus/Grafana access
Admin Policies permissions

Diagnostic steps¶

# Query the end-to-end p95 through the Prometheus HTTP API (/api/v1/query)
curl -sG http://localhost:9090/api/v1/query --data-urlencode 'query=salambot_latency_p95'

# Identify the bottleneck per component
curl -sG http://localhost:9090/api/v1/query --data-urlencode 'query=salambot_component_latency_p95'

Per-component analysis
[ ] Gateway: >50ms?
[ ] NLU: >200ms?
[ ] RAG retrieval: >500ms?
[ ] LLM generation: >1500ms?
Enable degraded mode
[ ] Reduce RAG context (top_k=3 instead of 10)
[ ] Disable re-ranking
[ ] Cap response tokens (max_tokens=150)
[ ] Switch to a faster model
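
The per-component comparison can be scripted once the p95 values have been pulled from Prometheus. This sketch assumes a hypothetical input format of one `component p95_ms` pair per line; the thresholds mirror the checklist above and the function name `over_budget` is illustrative.

```shell
# Read "component p95_ms" pairs on stdin and print those above their threshold.
over_budget() {
  awk '
    BEGIN {
      limit["gateway"] = 50
      limit["nlu"] = 200
      limit["rag"] = 500
      limit["llm"] = 1500
    }
    # Only known components are checked; unknown names are ignored.
    ($1 in limit) && ($2 + 0 > limit[$1]) {
      printf "%s %dms > %dms\n", $1, $2, limit[$1]
    }
  '
}
```

Feeding it `nlu 260` and `rag 480` would report only the NLU line, since 480ms is still within the 500ms retrieval threshold.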

Degraded-mode commands¶

# Via the Admin Policies API (PUT /v1/admin/policies)
curl -X PUT "$BASE_URL/admin/policies" \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -H "X-SalamBot-Tenant: $TENANT" \
  -d '{
    "schema_version": "1.0",
    "tenant": "*",
    "channel": "admin",
    "message_id": "msg_policy_deg",
    "correlation_id": "corr_policy_deg",
    "timestamp": "2025-08-14T10:00:00Z",
    "locale": "fr-MA",
    "data": {
      "policies": {
        "rag_top_k": 3,
        "enable_reranking": false,
        "max_tokens": 150,
        "model": "llama3-8b"
      }
    }
  }'

Checks¶

[ ] p95 latency < 2000ms
[ ] Throughput maintained
[ ] CSAT > 80% (acceptable in degraded mode)

Criteria for returning to normal¶

Latency stable < 1800ms for 15 min
System load < 70%
No active alerts

Webhook signature failures (Facebook/WhatsApp)¶

Context¶

HMAC-SHA256 validation failures on incoming Facebook/WhatsApp webhooks.

Prerequisites¶

Webhook secret configured
Gateway logs accessible
Reference: API/examples.md, HMAC section

Diagnostic steps¶

Check the signature
[ ] X-Hub-Signature-256 header present
[ ] Format: sha256=<hex_digest>
[ ] Correct webhook secret
Check the payload
[ ] Raw body used (no JSON parsing)
[ ] UTF-8 encoding
[ ] Payload not modified
Verification commands¶

# Gateway logs for webhooks
docker compose logs gateway | grep "webhook_signature"

# Local HMAC test (printf writes the payload without a trailing newline and is more portable than echo -n)
printf '%s' "payload_raw" | openssl dgst -sha256 -hmac "$FACEBOOK_APP_SECRET"

Resolution steps¶

Rotate the secret if needed

# Generate a new secret
openssl rand -hex 32

# Update Facebook App Settings
# Deploy the new config
# Test the webhook

Replay lost events
[ ] Identify the failure window
[ ] Request a replay via the Facebook Developer Console
[ ] Verify reception after the fix

Checks¶

[ ] Webhooks validated successfully
[ ] No events lost
[ ] webhook_success_rate metric > 99%

Idempotency conflicts (400 idempotency_mismatch)¶

Context¶

Requests that reuse an Idempotency-Key with a different payload are rejected with HTTP 400.

Prerequisites¶

Logs with X-Request-Id for correlation
Access to the idempotency database

Diagnostic steps¶

Identify the conflicting key

# Search the logs
docker compose logs gateway | grep "idempotency_mismatch"

Check the body hash
[ ] Compare the stored hash with the computed hash
[ ] Identify payload differences
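
A quick way to see why two seemingly identical payloads conflict is to hash them byte for byte. The sketch below assumes the service hashes the raw body with SHA-256, which the source does not specify; `body_hash` and `same_payload` are hypothetical names, and `sha256sum` comes from GNU coreutils.

```shell
# Hash a request body exactly as sent, byte for byte.
body_hash() {
  printf '%s' "$1" | sha256sum | awk '{print $1}'
}

# Return 0 when two bodies hash to the same value.
same_payload() {
  [ "$(body_hash "$1")" = "$(body_hash "$2")" ]
}
```

Under this assumption, `{"a":1}` and `{"a": 1}` are different payloads: a client that re-serialises JSON between retries can trigger the mismatch even though the data is logically unchanged.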

Client-side resolution¶

Option 1: New key

data='{"schema_version":"1.0","tenant":"acme","channel":"webchat","message_id":"msg_new","correlation_id":"corr_new","timestamp":"2025-08-14T10:00:00Z","locale":"fr-MA","data":{"text":"nouveau payload","lang":"fr"}}'
curl -X POST "$BASE_URL/messages/analyze" \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -H "X-SalamBot-Tenant: $TENANT" \
  -H "Idempotency-Key: new-unique-key-$(date +%s)" \
  -H "X-Request-Id: req-$(uuidgen)" \
  -d "$data"

Option 2: Same key + same payload

# Reuse exactly the same payload
data='{"schema_version":"1.0","tenant":"acme","channel":"webchat","message_id":"msg_orig","correlation_id":"corr_orig","timestamp":"2025-08-14T10:00:00Z","locale":"fr-MA","data":{"text":"payload original exact","lang":"fr"}}'
curl -X POST "$BASE_URL/messages/analyze" \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -H "X-SalamBot-Tenant: $TENANT" \
  -H "Idempotency-Key: existing-key" \
  -H "X-Request-Id: req-original" \
  -d "$data"

Checks¶

[ ] Request accepted (200 OK)
[ ] X-Request-Id preserved for correlation
[ ] No duplication on the business side

Emergency workaround¶

# Purge the idempotency key if needed
psql -d salambot -c "DELETE FROM idempotency_keys WHERE key = 'problematic-key';"

Knowledge Base re-indexing stuck¶

Context¶

The RAG indexing job has failed or is stuck, degrading answer quality.

Prerequisites¶

Vector DB access (Qdrant/Pinecone)
seed-kb commands available
MRR@k metrics configured

Diagnostic steps¶

# Check indexing status
docker compose logs rag | grep "indexing"

# Vector DB state
curl http://localhost:6333/collections/salambot-kb

# Queues
docker compose exec redis redis-cli LLEN indexing_queue

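Restart indexing¶

# Snapshot the collection first so that a rollback has something to restore
curl -X POST http://localhost:6333/collections/salambot-kb/snapshots

# Drop the existing index (destructive: retrieval returns nothing until re-indexing completes)
curl -X DELETE http://localhost:6333/collections/salambot-kb

# Relaunch the full job
docker compose exec orchestrateur python -m salambot.jobs.seed_kb \
  --tenant all \
  --force-reindex

# Alternative: incremental indexing for one tenant (do not drop the collection first)
docker compose exec orchestrateur python -m salambot.jobs.seed_kb \
  --tenant specific-tenant \
  --incremental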

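Post-indexing checks¶

# Count indexed documents (the Qdrant count endpoint expects a POST with a JSON body)
curl -X POST http://localhost:6333/collections/salambot-kb/points/count \
  -H "Content-Type: application/json" \
  -d '{"exact": true}'

# Basic retrieval test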
curl -X POST "$BASE_URL/v1/search/query" \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -H "X-SalamBot-Tenant: demo" \
  -d '{
    "schema_version": "1.0",
    "tenant": "demo",
    "channel": "webchat",
    "message_id": "msg_retrieval_test",
    "correlation_id": "corr_retrieval_test",
    "timestamp": "2025-08-14T10:00:00Z",
    "locale": "fr-MA",
    "data": {"query": "test query", "limit": 5}
  }'

MRR@k quality checks¶

[ ] MRR@3 > 0.7
[ ] MRR@5 > 0.8
[ ] Retrieval latency < 400ms
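
For reference, MRR@k is the mean over all test queries of 1/rank of the first relevant document, counting 0 when no relevant document appears in the top k. A minimal sketch, assuming a hypothetical input of one rank per line (0 meaning nothing relevant was retrieved); how SalamBot actually exports these ranks is not covered here.

```shell
# Compute MRR@k from stdin: one line per query holding the 1-based rank of the
# first relevant document, or 0 when none was retrieved.
# Usage: mrr_at_k <k>
mrr_at_k() {
  awk -v k="$1" '
    { n++; if ($1 >= 1 && $1 <= k) sum += 1 / $1 }
    END { if (n > 0) printf "%.4f\n", sum / n; else print "0.0000" }
  '
}
```

With ranks 1, 2, 0 and 4, `mrr_at_k 3` gives 0.3750 and `mrr_at_k 5` gives 0.4375: the fourth query only contributes once k reaches its rank.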

Rollback¶

Restore the previous Vector DB snapshot
Fall back to the backup index

Onboarding a new tenant¶

Context¶

Full activation of a new tenant with its channels and configuration.

Prerequisites¶

Tenant config template
Admin API access
Channel credentials (Facebook, WhatsApp)
Reference: Get-Started/quickstart.md

Onboarding steps¶

Create the tenant configuration

# Copy the template
cp config/tenants/template.yml config/tenants/nouveau-tenant.yml

# Edit the configuration
vim config/tenants/nouveau-tenant.yml

Enable channels

# Facebook Messenger
curl -X POST "$BASE_URL/admin/channels" \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -H "X-SalamBot-Tenant: nouveau-tenant" \
  -d '{
    "tenant": "nouveau-tenant",
    "channel": "facebook",
    "config": {
      "page_access_token": "...",
      "verify_token": "..."
    }
  }'

# WhatsApp Business
curl -X POST "$BASE_URL/admin/channels" \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -H "X-SalamBot-Tenant: nouveau-tenant" \
  -d '{
    "tenant": "nouveau-tenant",
    "channel": "whatsapp",
    "config": {
      "phone_number_id": "...",
      "access_token": "..."
    }
  }'

Webchat smoke test

curl -X POST "$BASE_URL/v1/generate/reply" \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -H "X-SalamBot-Tenant: nouveau-tenant" \
  -H "X-Request-Id: req-$(uuidgen)" \
  -H "Idempotency-Key: idem-smoke-$(date +%s)" \
  -d '{
    "schema_version": "1.0",
    "tenant": "nouveau-tenant",
    "channel": "webchat",
    "message_id": "msg_smoke",
    "correlation_id": "corr_smoke",
    "timestamp": "2025-08-14T10:00:00Z",
    "locale": "fr-MA",
    "data": {
      "prompt": "Hello, test message",
      "context": []
    }
  }'

Branding checklist¶

[ ] Tenant logo configured
[ ] Custom colours
[ ] Welcome messages
[ ] Content policies
[ ] Rate limits

Checks¶

[ ] Tenant active in the database
[ ] Channels connected and validated
[ ] Knowledge base indexed
[ ] End-to-end tests pass

Secret & token rotation¶

Context¶

Secure rotation of critical secrets with zero downtime.

Affected secrets¶

OPENAI_API_KEY / LLM providers
FACEBOOK_APP_SECRET
Webhook secrets
Database passwords
JWT signing keys

Safe procedure (overlap window)¶

Create secret N+1

# Generate a new secret (for self-managed secrets; provider-issued keys such as
# OPENAI_API_KEY are created in the provider console instead)
openssl rand -hex 32

# Add it to the configuration (keep the old one)
# secrets:
#   openai_api_key: ["new_key", "old_key"]

Deploy with dual-secret support

# Gradual deployment
docker compose up -d --no-deps gateway

# Verify that it works
curl "$BASE_URL/health" \
  -H "Authorization: Bearer $TOKEN" \
  -H "X-SalamBot-Tenant: $TENANT"

Invalidate the old secret

# Remove the old secret from the config
# Redeploy
# Revoke it on the provider side
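
For HMAC-based secrets such as webhook secrets, the overlap window amounts to accepting a signature produced with any currently active secret (N+1 or N). The sketch below illustrates that idea only; `verify_with_any` is a hypothetical helper, not the gateway's implementation, and it does not apply to provider-issued API keys.

```shell
# Return 0 if the header matches the body signed with any of the active secrets.
# Usage: verify_with_any <body_file> <header_value> <secret>...
verify_with_any() {
  body_file=$1
  header=$2
  shift 2
  for secret in "$@"; do
    digest=$(openssl dgst -sha256 -hmac "$secret" < "$body_file" | awk '{print $NF}')
    if [ "sha256=$digest" = "$header" ]; then
      return 0
    fi
  done
  return 1
}
```

Once the old secret is dropped from the argument list, requests still signed with it start failing, which is the signal to check before revoking it on the provider side.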

Checks¶

[ ] Services working with the new secret
[ ] No authentication errors
[ ] Old secret revoked
[ ] Logs error-free for 1h

Emergency rollback¶

Temporarily re-enable the old secret
Investigate the problem with the new secret
Retry the rotation

Backup & restore¶

Context¶

Backup and restore procedures for all critical components.

Components to back up¶

PostgreSQL (metadata, conversations)
MinIO (files, documents)
Vector DB (embeddings)
Configuration (secrets, tenants)

PostgreSQL backup¶

# Full dump (UTC timestamp in the file name)
pg_dump -h localhost -U salambot -d salambot > backup_$(date -u +%Y%m%d_%H%M%S).sql

# Per-tenant export: pg_dump has no row filter (no --where option), so export the rows with \copy
psql -h localhost -U salambot -d salambot \
  -c "\copy (SELECT * FROM conversations WHERE tenant = 'specific-tenant') TO 'tenant_backup.csv' WITH (FORMAT csv, HEADER)"

# Automated backup (crontab entry, daily at 02:00)
0 2 * * * /scripts/backup_postgres.sh

MinIO backup¶

# Sync to the backup bucket
mc mirror salambot-minio/documents backup-bucket/documents-$(date -u +%Y%m%d)

# Full snapshot
mc cp --recursive salambot-minio/ backup-storage/

Restore¶

# PostgreSQL
psql -h localhost -U salambot -d salambot < backup_20250114_020000.sql

# MinIO
mc cp --recursive backup-storage/documents/ salambot-minio/documents/

# Vector DB (Qdrant): snapshot recovery is a PUT; location is a URL or a file:// URI
curl -X PUT http://localhost:6333/collections/salambot-kb/snapshots/recover \
  -H "Content-Type: application/json" \
  -d '{"location": "file:///snapshots/backup_20250114.snapshot"}'

RPO/RTO objectives¶

RPO: ≤1h (maximum data loss; this requires backups more frequent than the daily 02:00 dump, e.g. hourly dumps or WAL archiving)
RTO: ≤4h (restore time)
Restore test: monthly
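
Whether the most recent dump still satisfies the RPO can be checked from its file name. This sketch assumes GNU `date` and that the timestamp embedded in `backup_YYYYMMDD_HHMMSS.sql` is in UTC; `backup_epoch` and `within_rpo` are hypothetical helper names.

```shell
# Extract the UTC epoch from a name like backup_YYYYMMDD_HHMMSS.sql (GNU date).
backup_epoch() {
  stamp=$(printf '%s\n' "$1" |
    sed -E 's/^backup_([0-9]{4})([0-9]{2})([0-9]{2})_([0-9]{2})([0-9]{2})([0-9]{2})\.sql$/\1-\2-\3 \4:\5:\6/')
  date -u -d "$stamp" +%s
}

# Return 0 when the backup is no older than the RPO at time "now".
# Usage: within_rpo <backup_file> <now_epoch> <rpo_seconds>
within_rpo() {
  age=$(( $2 - $(backup_epoch "$1") ))
  [ "$age" -ge 0 ] && [ "$age" -le "$3" ]
}
```

A typical call would be `within_rpo backup_20250114_020000.sql "$(date -u +%s)" 3600`, which fails as soon as the dump is more than one hour old.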

Controlled degraded mode¶

Context¶

Proactive activation of degraded mode to preserve availability.

Available toggles¶

Toggle	Impact	Latency gain
`rag_top_k=3`	Reduced context	-200ms
`enable_reranking=false`	No re-ranking	-150ms
`max_tokens=100`	Short answers	-300ms
`model=llama3-8b`	Fast model	-500ms

Activation via the Admin API¶

# Global degraded mode
curl -X PUT "$BASE_URL/admin/policies" \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -H "X-SalamBot-Tenant: $TENANT" \
  -d '{
    "schema_version": "1.0",
    "tenant": "*",
    "channel": "admin",
    "message_id": "msg_pol_deg_glob",
    "correlation_id": "corr_pol_deg_glob",
    "timestamp": "2025-08-14T10:00:00Z",
    "locale": "fr-MA",
    "data": {
      "policies": {
        "rag_top_k": 3,
        "enable_reranking": false,
        "max_tokens": 100,
        "model": "llama3-8b",
        "cache_ttl": 3600
      }
    }
  }'

# Per-tenant degraded mode
curl -X PUT "$BASE_URL/admin/policies" \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -H "X-SalamBot-Tenant: high-volume-tenant" \
  -d '{
    "schema_version": "1.0",
    "tenant": "high-volume-tenant",
    "channel": "admin",
    "message_id": "msg_pol_deg_tenant",
    "correlation_id": "corr_pol_deg_tenant",
    "timestamp": "2025-08-14T10:00:00Z",
    "locale": "fr-MA",
    "data": {
      "policies": {
        "rate_limit": 10,
        "max_tokens": 50
      }
    }
  }'

Activation criteria¶

p95 latency > 2000ms for 5 min
CPU > 80% for 10 min
Errors > 5% for 5 min
Load > nominal capacity

Exit criteria¶

p95 latency < 1500ms for 15 min
CPU < 60% for 15 min
Errors < 1% for 15 min
CSAT > 85% in degraded mode
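
Because the exit thresholds are stricter than the activation thresholds, the two lists form a hysteresis band: between them, the current mode is kept. The sketch below shows that decision on instantaneous values only; it deliberately ignores the duration windows, the capacity criterion and CSAT, and `next_mode` is a hypothetical name.

```shell
# Decide the next mode from the current mode and instantaneous metrics.
# Usage: next_mode <normal|degraded> <p95_ms> <cpu_pct> <error_pct>
next_mode() {
  awk -v mode="$1" -v p95="$2" -v cpu="$3" -v err="$4" 'BEGIN {
    if (p95 > 2000 || cpu > 80 || err > 5) {
      print "degraded"   # any activation criterion met
    } else if (p95 < 1500 && cpu < 60 && err < 1) {
      print "normal"     # all exit criteria met
    } else {
      print mode         # in between: keep the current mode
    }
  }'
}
```

For example, a system already in degraded mode with p95 at 1700ms stays degraded, which avoids flapping when latency hovers just below the activation threshold.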

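Quality monitoring¶

# CSAT metrics (Prometheus HTTP API: /api/v1/query)
curl -sG http://localhost:9090/api/v1/query --data-urlencode 'query=salambot_csat_avg'

# Degraded-mode latency (the query is URL-encoded so the label selector survives the shell)
curl -sG http://localhost:9090/api/v1/query --data-urlencode 'query=salambot_latency_p95{mode="degraded"}'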

PII incident / Purge & retention¶

Context¶

Handling PII incidents and running purge tasks as required by regulation.

Prerequisites¶

Compliance with Law 09-08 (Morocco) and GDPR
Redaction tools configured
Per-tenant retention policies

PII incident detection¶

# Search logs for PII: 10-digit numbers (e.g. phone numbers) and identifiers made of 1-2 letters plus 6 digits
grep -E "[0-9]{10}|[A-Z]{1,2}[0-9]{6}" logs/salambot.log

# Scan conversations
psql -d salambot -c "
  SELECT conversation_id, message
  FROM conversations
  WHERE message ~ '[0-9]{10}'
  LIMIT 10;
"

Immediate purge¶

# Purge a specific conversation
curl -X DELETE http://localhost:8080/admin/conversations/conv-123 \
  -H "X-Admin-Token: $ADMIN_TOKEN"

# Purge by tenant (GDPR); the admin token header is assumed to be required here as well
curl -X POST http://localhost:8080/admin/purge \
  -H "X-Admin-Token: $ADMIN_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "tenant": "eu-tenant",
    "user_id": "user-to-purge",
    "reason": "gdpr_request"
  }'

# Automatic purge (retention)
docker compose exec orchestrateur python -m salambot.jobs.purge_data \
  --older-than 90d \
  --tenant all

Log redaction¶

# Enable PII redaction
export SALAMBOT_PII_REDACTION=true

# Verify redaction
tail -f logs/salambot.log | grep "\[REDACTED\]"
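
Logs written before redaction was enabled can be scrubbed after the fact with the same patterns used for detection. This is a sketch of that idea, not the behaviour of `SALAMBOT_PII_REDACTION`; `redact_pii` is a hypothetical helper and the patterns are as coarse as the detection regex.

```shell
# Replace 10-digit numbers and identifiers made of 1-2 capital letters plus
# 6 digits with [REDACTED]. Reads stdin, writes stdout.
redact_pii() {
  sed -E 's/[0-9]{10}|[A-Z]{1,2}[0-9]{6}/[REDACTED]/g'
}
```

A scrubbed copy can be produced with `redact_pii < logs/salambot.log > logs/salambot.redacted.log` and reviewed before replacing the original file.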

Legal communication¶

[ ] Notify the DPO within 24h
[ ] Document the PII incident
[ ] Report to the CNDP if applicable (Morocco)
[ ] Notify users if required

Hotfix deployment & rollback¶

Context¶

Emergency deployment and fast rollback procedures.

Prerequisites¶

Tagged Docker image
Automated smoke tests
Active monitoring

Hotfix deployment¶

# Tag the hotfix image (UTC timestamp)
docker tag salambot:latest salambot:hotfix-$(date -u +%Y%m%d-%H%M%S)

# Targeted deployment
docker compose up -d --no-deps gateway

# Or Kubernetes
kubectl set image deploy/gateway gateway=salambot:hotfix-20250114-143000 -n salambot
kubectl rollout status deploy/gateway -n salambot

Post-deployment smoke tests¶

# Health check
curl "$BASE_URL/health" \
  -H "Authorization: Bearer $TOKEN" \
  -H "X-SalamBot-Tenant: $TENANT"

# Critical API test (TIMESTAMP must be set before it is used in the payload)
TIMESTAMP=$(date -u +%Y-%m-%dT%H:%M:%SZ)
curl -X POST "$BASE_URL/v1/generate/reply" \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -H "X-SalamBot-Tenant: demo" \
  -H "X-Request-Id: req-$(uuidgen)" \
  -H "Idempotency-Key: idem-hotfix-$(date +%s)" \
  -d '{
    "schema_version": "1.0",
    "tenant": "demo",
    "channel": "webchat",
    "message_id": "msg_hotfix",
    "correlation_id": "corr_hotfix",
    "timestamp": "'$TIMESTAMP'",
    "locale": "fr-MA",
    "data": {"prompt": "test hotfix", "context": []}
  }'

# Check metrics through the Prometheus HTTP API
curl -sG http://localhost:9090/api/v1/query --data-urlencode 'query=salambot_requests_total'

Backup and restore workflow¶

The following diagram shows the backup and restore procedures:

flowchart TD
    subgraph "Automatic Backup"
        CRON[⏰ Cron Job Daily]
        BACKUP_TRIGGER[🔄 Backup Trigger]

        CRON --> BACKUP_TRIGGER

        BACKUP_TRIGGER --> DB_BACKUP[💾 Backup Postgres]
        BACKUP_TRIGGER --> VECTOR_BACKUP[🔍 Backup Vector DB]
        BACKUP_TRIGGER --> CONFIG_BACKUP[⚙️ Backup Configs]
        BACKUP_TRIGGER --> LOGS_BACKUP[📝 Backup Logs]

        DB_BACKUP --> S3_DB[📦 S3 DB Backups]
        VECTOR_BACKUP --> S3_VECTOR[📦 S3 Vector Backups]
        CONFIG_BACKUP --> S3_CONFIG[📦 S3 Config Backups]
        LOGS_BACKUP --> S3_LOGS[📦 S3 Log Archives]

        S3_DB --> RETENTION_DB[🗑️ Retention 30d]
        S3_VECTOR --> RETENTION_VECTOR[🗑️ Retention 90d]
        S3_CONFIG --> RETENTION_CONFIG[🗑️ Retention 1y]
        S3_LOGS --> RETENTION_LOGS[🗑️ Retention 6mo]
    end

    subgraph "Emergency Restore"
        INCIDENT[🚨 Incident Detected]
        ASSESS[🔍 Impact Assessment]

        INCIDENT --> ASSESS

        ASSESS --> DATA_LOSS{Data loss?}
        ASSESS --> CONFIG_CORRUPT{Config corrupted?}
        ASSESS --> FULL_RESTORE{Full restore?}

        DATA_LOSS -->|Yes| RESTORE_DB[🔄 Restore DB]
        CONFIG_CORRUPT -->|Yes| RESTORE_CONFIG[🔄 Restore Config]
        FULL_RESTORE -->|Yes| RESTORE_ALL[🔄 Full Restore]

        RESTORE_DB --> VERIFY_DB[✅ DB Check]
        RESTORE_CONFIG --> VERIFY_CONFIG[✅ Config Check]
        RESTORE_ALL --> VERIFY_ALL[✅ Full Check]

        VERIFY_DB --> SMOKE_TESTS[🧪 Smoke Tests]
        VERIFY_CONFIG --> SMOKE_TESTS
        VERIFY_ALL --> SMOKE_TESTS

        SMOKE_TESTS --> SUCCESS{Tests OK?}
        SUCCESS -->|Yes| MONITORING[📊 Enhanced Monitoring]
        SUCCESS -->|No| ROLLBACK[⏪ Rollback]

        ROLLBACK --> RESTORE_PREVIOUS[🔄 Restore Backup-1]
        RESTORE_PREVIOUS --> SMOKE_TESTS
    end

    subgraph "Manual Backup"
        MANUAL_TRIGGER[👤 Manual Trigger]
        PRE_DEPLOY[🚀 Pre-deployment]
        MAINTENANCE[🔧 Maintenance]

        MANUAL_TRIGGER --> PRE_DEPLOY
        MANUAL_TRIGGER --> MAINTENANCE

        PRE_DEPLOY --> SNAPSHOT_DB[📸 Snapshot DB]
        MAINTENANCE --> SNAPSHOT_ALL[📸 Full Snapshot]

        SNAPSHOT_DB --> S3_MANUAL[📦 S3 Manual Backups]
        SNAPSHOT_ALL --> S3_MANUAL

        S3_MANUAL --> TAG_BACKUP[🏷️ Tag Version]
    end

    subgraph "Checks"
        BACKUP_HEALTH[💚 Health Check Backups]
        RESTORE_TEST[🧪 Monthly Restore Test]
        INTEGRITY_CHECK[🔒 Integrity Check]

        BACKUP_HEALTH --> ALERT_BACKUP{Backup OK?}
        RESTORE_TEST --> ALERT_RESTORE{Restore OK?}
        INTEGRITY_CHECK --> ALERT_INTEGRITY{Integrity OK?}

        ALERT_BACKUP -->|No| SLACK_ALERT[💬 Slack Alert]
        ALERT_RESTORE -->|No| SLACK_ALERT
        ALERT_INTEGRITY -->|No| SLACK_ALERT

        SLACK_ALERT --> PLATFORM_TEAM[👥 Platform Team]
    end

    %% Connections between subgraphs
    RETENTION_DB -.->|Cleanup| BACKUP_HEALTH
    RETENTION_VECTOR -.->|Cleanup| BACKUP_HEALTH
    S3_MANUAL -.->|Test Source| RESTORE_TEST

    %% Styles
    style INCIDENT fill:#f44336
    style SUCCESS fill:#4caf50
    style ROLLBACK fill:#ff9800
    style MONITORING fill:#2196f3
    style SLACK_ALERT fill:#e91e63
    style PLATFORM_TEAM fill:#9c27b0

    style CRON fill:#81c784
    style MANUAL_TRIGGER fill:#64b5f6
    style BACKUP_HEALTH fill:#aed581
    style RESTORE_TEST fill:#ffb74d
    style INTEGRITY_CHECK fill:#f06292

Rollback¶

# Docker Compose rollback: point the service image back to the previous tag
# (via an env variable or a compose override), then relaunch the service:
docker compose up -d --no-deps gateway

# Kubernetes rollback
kubectl rollout undo deploy/gateway -n salambot

Preserving Idempotency-Key¶

[ ] Keep idempotency keys stable
[ ] No purge during the hotfix
[ ] Check payload compatibility

On-call RACI matrix¶

Role	P0	P1	P2	Escalation	Communication
Platform Team	R,A	R,A	R,A	R	R
Product Owner	C,I	C	I	A	A
Security Team	C	I	I	C	C
Management	I	I	-	I	A

Legend: R=Responsible, A=Accountable, C=Consulted, I=Informed

Postmortem template¶

Incident postmortem template¶

# Postmortem - [Incident title]

**Incident ID**: INC-YYYY-NNNN  
**Date**: YYYY-MM-DD  
**Duration**: HH:MM  
**Severity**: P0/P1/P2  
**Author**: [Name]

## Executive summary

[Business impact, root cause, preventive actions]

## Timeline (UTC)

| Time   | Event         | Action |
| ------ | ------------- | ------ |
| HH:MMZ | Detection     | ...    |
| HH:MMZ | Investigation | ...    |
| HH:MMZ | Mitigation    | ...    |
| HH:MMZ | Resolution    | ...    |

## 5 Whys analysis

1. **Why** did the incident occur?
2. **Why** was this cause not detected?
3. **Why** did the alerts not work?
4. **Why** was the procedure unclear?
5. **Why** was this risk not anticipated?

## Preventive actions

- [ ] Action 1 (Owner, Due date)
- [ ] Action 2 (Owner, Due date)
- [ ] Action 3 (Owner, Due date)

## Metrics

- **MTTR**: [time to resolution]
- **MTBF**: [time between failures]
- **User impact**: [number/percentage]

Last updated: 2025-08-14 by Platform Team