Operations Runbook

Health Check

GET /api/health returns the status of the database and Redis:

{
  "status": "ok",
  "version": "0.1.0",
  "uptime": 12345.67,
  "checks": {
    "database": { "status": "ok", "latencyMs": 2 },
    "redis": { "status": "ok", "latencyMs": 1 }
  }
}

| Status | Meaning |
| --- | --- |
| ok | Both database and Redis are healthy |
| degraded | One or both dependencies are down |

Use this endpoint for load balancer health checks. A degraded status means the server is running but some functionality (sessions, job scheduling) may be impaired.
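
To make the status table concrete, here is a minimal TypeScript sketch of how a probe or dashboard might interpret the response body. The names `HealthResponse`, `deriveStatus`, and `failingChecks` are hypothetical, and treating any check status other than "ok" as unhealthy is an assumption, since the values a failing check reports are not listed here.

```typescript
// Hypothetical types mirroring the sample /api/health response.
type CheckResult = { status: string; latencyMs?: number };

interface HealthResponse {
  status: "ok" | "degraded";
  version: string;
  uptime: number;
  checks: { database: CheckResult; redis: CheckResult };
}

// Overall status as described in the table: "ok" only when both checks are ok.
// Assumption: any check status other than "ok" counts as unhealthy.
function deriveStatus(checks: HealthResponse["checks"]): "ok" | "degraded" {
  const allOk = checks.database.status === "ok" && checks.redis.status === "ok";
  return allOk ? "ok" : "degraded";
}

// Names of the dependencies that are not reporting "ok".
function failingChecks(checks: HealthResponse["checks"]): string[] {
  return (["database", "redis"] as const).filter(
    (name) => checks[name].status !== "ok"
  );
}
```

Whether a load balancer should take a degraded instance out of rotation is a deployment decision; `failingChecks` is only meant to show which dependency to look at first.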

Stuck Run Recovery

Workflow runs that remain in running status without progress are automatically failed by a periodic job.

| Setting | Env var | Default |
| --- | --- | --- |
| Timeout | STUCK_RUN_TIMEOUT_MINUTES | 30 |
| Check interval | | Every 15 minutes |

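A run is considered stuck if its updated_at timestamp is older than the timeout. Because the check only runs every 15 minutes, a run can sit without progress for up to 15 minutes beyond the timeout (about 45 minutes with the defaults) before it is failed. The run is marked as failed with the reason "Run timed out (no progress for N minutes)".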

The timeout can also be configured via Admin > Security Settings in the web UI.
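
The stuck-run rule can be sketched as a small predicate. This is an illustration under stated assumptions, not the server's code: the `WorkflowRun` shape and the function names are hypothetical, and only the running status, the updated_at comparison, the 30-minute default, and the reason format come from this section.

```typescript
// Hypothetical shape of a workflow run; only the fields used here.
interface WorkflowRun {
  status: string;
  updatedAt: Date; // corresponds to the updated_at timestamp
}

// Documented default for STUCK_RUN_TIMEOUT_MINUTES.
const DEFAULT_TIMEOUT_MINUTES = 30;

// A run is stuck when it is still "running" and updated_at is older than the timeout.
function isStuck(
  run: WorkflowRun,
  now: Date,
  timeoutMinutes: number = DEFAULT_TIMEOUT_MINUTES
): boolean {
  const ageMs = now.getTime() - run.updatedAt.getTime();
  return run.status === "running" && ageMs > timeoutMinutes * 60000;
}

// Failure reason in the documented format.
function stuckReason(timeoutMinutes: number): string {
  return `Run timed out (no progress for ${timeoutMinutes} minutes)`;
}
```

Passing `now` in as a parameter, rather than reading the clock inside the function, keeps the rule easy to test against fixed timestamps.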

BullMQ Job Processing

All background jobs use a single BullMQ queue named workflow-scheduler.

Retry Policy

Jobs are retried up to 3 times with exponential backoff (5s, 10s, 20s). Failed jobs are retained for inspection (up to 5000).
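
The delays above follow the usual exponential rule, where the n-th retry waits base × 2^(n − 1). The sketch below reproduces that arithmetic; the function names are hypothetical. In BullMQ a policy like this is normally expressed through the `attempts` and `backoff: { type: "exponential", delay: 5000 }` job options, but the exact values this project passes are not shown here. `attempts` counts total tries including the first, which is worth checking when confirming the retry count.

```typescript
// Delay before the n-th retry (1-based) under exponential backoff:
// baseMs * 2^(n - 1).
function backoffDelayMs(retry: number, baseMs: number = 5000): number {
  return baseMs * Math.pow(2, retry - 1);
}

// Full delay schedule for a given number of retries.
function backoffSchedule(retries: number, baseMs: number = 5000): number[] {
  const delays: number[] = [];
  for (let n = 1; n <= retries; n++) {
    delays.push(backoffDelayMs(n, baseMs));
  }
  return delays;
}
```

With the documented base of 5 seconds and three retries, `backoffSchedule(3)` yields 5000, 10000, and 20000 milliseconds, so a job that keeps failing is given up on roughly 35 seconds of waiting after its first failure, plus execution time.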

Job Types

| Job | Schedule | Purpose |
| --- | --- | --- |
| role-expiry-check | Hourly | Revoke expired role assignments |
| document-expiry-check | Hourly | Revoke roles with expired documents |
| entitlement-reconciliation | Daily 02:00 UTC | Reconcile entitlements with connectors |
| run-orphan-cleanup | Daily 02:30 UTC | Clean up orphaned run artifacts |
| stuck-run-recovery | Every 15 minutes | Fail runs stuck in running state |
| audit-checkpoint | Configurable | Create signed audit checkpoint |
| deliver-scheduled-report | Per-schedule cron | Generate and email scheduled reports |
| trigger-workflow | Per-schedule cron | Start scheduled workflow runs |
| escalation-reminder | Delayed | Send approval reminders |
| escalation-reassignment | Delayed | Reassign overdue approvals |

Monitoring Failed Jobs

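Use the BullMQ dashboard or query Redis directly to inspect failed jobs. BullMQ keeps the IDs of failed jobs in a single sorted set per queue rather than in one key per job, so list them with ZRANGE (this assumes the default bull key prefix):

redis-cli zrange "bull:workflow-scheduler:failed" 0 -1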

Graceful Shutdown

The server handles SIGTERM and SIGINT signals:

  1. Stops accepting new HTTP connections
  2. Completes in-flight requests
  3. Drains the BullMQ worker (waits for active jobs)
  4. Closes Redis and database connections
  5. Exits the process

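Docker and Kubernetes send SIGTERM on container stop and follow it with SIGKILL once the grace period expires. Allow a grace period of at least 30 seconds so active jobs can drain. Kubernetes defaults to 30 seconds (terminationGracePeriodSeconds), but Docker defaults to 10 seconds, so raise it with docker stop --time or stop_grace_period in Compose.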
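
The five steps above can be sketched as an ordered sequence of async calls. This is an illustration only: `ShutdownSteps` and `createShutdownHandler` are hypothetical names, the resources are injected so the sketch has no dependencies, and exiting with code 1 when a step fails is an assumption rather than documented behaviour.

```typescript
// Hypothetical handles for the resources named in the shutdown steps.
interface ShutdownSteps {
  closeHttpServer: () => Promise<void>; // stop accepting, wait for in-flight requests
  drainWorker: () => Promise<void>; // wait for active BullMQ jobs
  closeRedis: () => Promise<void>;
  closeDatabase: () => Promise<void>;
  exit: (code: number) => void;
}

// Returns a handler that runs the steps in order, at most once.
function createShutdownHandler(steps: ShutdownSteps): () => Promise<void> {
  let shuttingDown = false;
  return async () => {
    if (shuttingDown) return; // ignore repeated signals
    shuttingDown = true;
    let code = 0;
    try {
      await steps.closeHttpServer();
      await steps.drainWorker();
      await steps.closeRedis();
      await steps.closeDatabase();
    } catch {
      code = 1; // assumption: report failure if any step throws
    }
    steps.exit(code);
  };
}
```

In a Node.js process, a handler like this would typically be registered for both SIGTERM and SIGINT. The guard against repeated signals matters because an operator may press Ctrl+C a second time while jobs are still draining.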

Database Pool Tuning

| Env var | Default | Description |
| --- | --- | --- |
| DB_POOL_MAX | 10 | Maximum number of connections |
| DB_POOL_MIN | 2 | Minimum idle connections |
| DB_POOL_IDLE_TIMEOUT_MS | 30000 | Close idle connections after this time (ms) |
| DB_POOL_CONNECTION_TIMEOUT_MS | 5000 | Fail if a connection cannot be acquired within this time (ms) |

Monitor active connections with:

SELECT count(*) FROM pg_stat_activity WHERE datname = 'floh';

Log Retention

System logs are stored in the system_log table. A daily cleanup job purges entries older than LOG_RETENTION_DAYS (default: 30).

Manual purge is available via Admin > Logging > Purge Now in the web UI.

Audit Checkpoints

Signed audit checkpoints are created on a configurable schedule (default: every 6 hours). Checkpoints are stored according to AUDIT_CHECKPOINT_STORE:

| Store | Env var | Description |
| --- | --- | --- |
| file | AUDIT_CHECKPOINT_PATH | Local filesystem (default) |
| s3 | | S3-compatible storage (planned) |
| siem | | SIEM integration (planned) |

Configuration Validation

The server validates all configuration at startup. Invalid values (e.g., PORT=abc, DB_POOL_MAX=0) produce clear error messages and prevent startup.
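
As a rough illustration of fail-fast validation, the sketch below rejects the two invalid values mentioned above. The function names, the error wording, and the choice to skip PORT when it is unset are assumptions; only the variable names and the DB_POOL_MAX default of 10 come from this runbook.

```typescript
type Env = Record<string, string | undefined>;

// Parse a strictly positive integer; returns undefined for "abc", "0", "1.5", etc.
function parsePositiveInt(raw: string): number | undefined {
  if (!/^\d+$/.test(raw)) return undefined;
  const n = parseInt(raw, 10);
  return n > 0 ? n : undefined;
}

// Collect every problem instead of stopping at the first one.
function validateConfig(env: Env): string[] {
  const errors: string[] = [];

  // PORT must be a valid TCP port. Assumption: skipped when unset.
  const rawPort = env.PORT;
  if (rawPort !== undefined) {
    const port = parsePositiveInt(rawPort);
    if (port === undefined || port > 65535) {
      errors.push(`PORT must be an integer between 1 and 65535, got "${rawPort}"`);
    }
  }

  // DB_POOL_MAX must be a positive integer (documented default: 10).
  const rawMax = env.DB_POOL_MAX !== undefined ? env.DB_POOL_MAX : "10";
  if (parsePositiveInt(rawMax) === undefined) {
    errors.push(`DB_POOL_MAX must be a positive integer, got "${rawMax}"`);
  }

  return errors;
}

// Throwing before the server starts listening is what prevents startup.
function assertValidConfig(env: Env): void {
  const errors = validateConfig(env);
  if (errors.length > 0) {
    throw new Error("Invalid configuration:\n" + errors.join("\n"));
  }
}
```

Reporting all errors at once saves a restart cycle per mistake when several variables are wrong in the same deployment.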

In production, the server also validates that required secrets are not using default/fallback values. See Security for the full list.