Production Architecture Overview

Vantrexia runs on an automated CI/CD pipeline that deploys containerized services to an AWS EC2 instance. The deployment flow follows a fully automated path from code merge to production traffic:

Deployment Flow: GitHub Actions CI/CD → Build & Push to GHCR → SSH Deploy to EC2 t4g.medium (ARM64) → Docker Compose orchestration → Automated health checks & smoke tests.

The production environment runs on an AWS EC2 t4g.medium instance (ARM64 Graviton2 processor), which balances cost efficiency and performance. All services are containerized with Docker and orchestrated with Docker Compose, making the entire stack reproducible and portable.

Key architectural decisions:

  • ARM64 (Graviton2) — 20% better price-performance vs. x86 instances
  • Docker Compose — Simple orchestration without Kubernetes overhead, suitable for current scale
  • Single-host deployment — All application containers on one EC2 instance; database on managed RDS
  • GHCR (GitHub Container Registry) — Docker images stored alongside source code for unified access control

Production Stack

The production environment consists of six core services running as Docker containers, plus a managed PostgreSQL database on AWS RDS:

| Service | Technology | Role | Port |
| --- | --- | --- | --- |
| Nginx | Nginx 1.25 (Alpine) | Reverse proxy, SSL termination, static file serving, rate limiting | 80, 443 |
| Frontend | React 18 SPA (Vite build) | Provider dashboard, patient management UI, billing interface | 3000 |
| Backend | Django 4.2 / DRF 3.15 | REST API, business logic, FHIR integration, authentication | 8000 |
| Celery Worker | Celery 5.3 | Async task processing: eCW sync, notifications, report generation | n/a |
| Celery Beat | Celery 5.3 | Periodic task scheduler: billing cycles, compliance audits, data sync | n/a |
| Redis | Redis 7 (Alpine) | Celery broker, caching layer, session store, rate limiting backend | 6379 |
| PostgreSQL | AWS RDS (db.t4g.micro) | Primary database with automated backups, encryption at rest | 5432 |

Docker Compose Production

The production Docker Compose file defines all application services with health checks, resource limits, restart policies, and logging configuration. The database runs on AWS RDS and is connected via environment variables.

docker-compose.prod.yml
version: "3.8"

services:
  nginx:
    image: ghcr.io/highlandpc/vantrexia/nginx:${TAG:-latest}
    ports:
      - "80:80"
      - "443:443"
    volumes:
      - ./docker/nginx/nginx.prod.conf:/etc/nginx/nginx.conf:ro
      - ./docker/nginx/ssl:/etc/nginx/ssl:ro
      - static_volume:/app/staticfiles:ro
      - media_volume:/app/mediafiles:ro
    depends_on:
      backend:
        condition: service_healthy
      frontend:
        condition: service_started
    restart: always
    logging:
      driver: "json-file"
      options:
        max-size: "10m"
        max-file: "3"

  frontend:
    image: ghcr.io/highlandpc/vantrexia/frontend:${TAG:-latest}
    expose:
      - "3000"
    environment:
      - VITE_API_BASE_URL=${API_URL}
    restart: always

  backend:
    image: ghcr.io/highlandpc/vantrexia/backend:${TAG:-latest}
    expose:
      - "8000"
    env_file:
      - .env
    volumes:
      - static_volume:/app/staticfiles
      - media_volume:/app/mediafiles
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8000/health/"]
      interval: 30s
      timeout: 10s
      retries: 3
      start_period: 40s
    restart: always
    depends_on:
      redis:
        condition: service_healthy

  celery:
    image: ghcr.io/highlandpc/vantrexia/backend:${TAG:-latest}
    command: celery -A config worker -l info --concurrency=4
    env_file:
      - .env
    depends_on:
      backend:
        condition: service_healthy
    restart: always

  celery-beat:
    image: ghcr.io/highlandpc/vantrexia/backend:${TAG:-latest}
    command: celery -A config beat -l info --scheduler django_celery_beat.schedulers:DatabaseScheduler
    env_file:
      - .env
    depends_on:
      backend:
        condition: service_healthy
    restart: always

  redis:
    image: redis:7-alpine
    expose:
      - "6379"
    volumes:
      - redis_data:/data
    healthcheck:
      test: ["CMD", "redis-cli", "ping"]
      interval: 10s
      timeout: 5s
      retries: 5
    restart: always

volumes:
  static_volume:
  media_volume:
  redis_data:

Important: The .env file is never committed to source control. It is deployed via the CI/CD pipeline from GitHub Secrets and AWS Secrets Manager. See the Secrets Management section below.
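
As an illustration, the deploy step that renders .env on the host might look like the following sketch. The variable names mirror the secrets table in the Secrets Management section; the file path and exact set of variables are assumptions.

Rendering .env on the host (illustrative sketch)
#!/bin/bash
set -euo pipefail

# Values arrive as masked environment variables injected by the CI/CD runner.
umask 077  # new files readable by owner only
cat > /opt/vantrexia/.env <<EOF
DATABASE_URL=${DATABASE_URL}
REDIS_URL=${REDIS_URL}
SECRET_KEY=${SECRET_KEY}
ECW_CLIENT_ID=${ECW_CLIENT_ID}
ECW_CLIENT_SECRET=${ECW_CLIENT_SECRET}
EOF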

Zero-Downtime Deployment

Vantrexia uses a rolling deployment strategy to ensure zero downtime during production releases. The entire process completes in 3–5 minutes:

Step 1: Scale Up

A new container is started alongside the existing running container using docker compose up -d --no-deps --no-recreate --scale backend=2 backend. The old container continues serving traffic while the new one initializes.

Step 2: Health Check

The CI/CD pipeline polls the new container's /health/ endpoint every 5 seconds for up to 60 seconds. The health check verifies database connectivity, Redis availability, and Celery worker responsiveness.
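
A minimal poll loop equivalent to what the pipeline runs is sketched below; the per-component JSON response shape is an assumption, not a documented contract.

Health check poll (illustrative sketch)
# Poll /health/ every 5 seconds for up to 60 seconds; fail on timeout.
# Assumes a response body like {"database": "ok", "redis": "ok", "celery": "ok"}.
for i in {1..12}; do
  if curl -sf https://api.vantrexia.com/health/ \
       | jq -e '.database == "ok" and .redis == "ok" and .celery == "ok"' > /dev/null; then
    echo "✓ healthy"; exit 0
  fi
  sleep 5
done
echo "✗ health check timed out" >&2; exit 1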

Step 3: Switch Traffic

Nginx upstream configuration is reloaded to route all new requests to the healthy new container. In-flight requests on the old container are allowed to complete with a 30-second drain period.
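
In practice the switch is a graceful reload of the nginx container: nginx lets old worker processes finish in-flight requests before shutting them down. A sketch, assuming the compose service names used above:

Nginx reload (illustrative sketch)
# Validate the new configuration, then reload without dropping connections.
docker compose -f docker-compose.prod.yml exec nginx nginx -t
docker compose -f docker-compose.prod.yml exec nginx nginx -s reload
# The 30-second drain can be enforced in nginx.conf with:
#   worker_shutdown_timeout 30s;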

Step 4: Run Migrations

Database migrations are executed against production RDS: python manage.py migrate --noinput. Migrations are designed to be backward-compatible so the old container can continue processing during this step.

Step 5: Smoke Tests

Five automated smoke tests validate the deployment (a sketch of the script follows the list):

  1. Health endpoint — GET /health/ returns 200 OK
  2. Auth flow — Token acquisition and refresh cycle completes
  3. Patient API — GET /api/v1/patients/ returns a valid JSON response
  4. Static assets — Frontend index.html loads with correct bundle hash
  5. SSL verification — HTTPS certificate is valid and not expiring within 30 days
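
A simplified sketch of scripts/smoke-tests.sh consistent with these five checks; the auth endpoint path, test credentials, token response shape, and frontend root-element check are assumptions.

scripts/smoke-tests.sh (illustrative sketch)
#!/bin/bash
set -euo pipefail
API="https://api.vantrexia.com"

# 1. Health endpoint returns 200
curl -sf "$API/health/" > /dev/null

# 2. Auth flow: acquire a token (endpoint path and credentials are assumptions;
#    the refresh step is elided)
TOKEN=$(curl -sf -X POST "$API/api/v1/auth/token/" \
  -d "username=$SMOKE_USER" -d "password=$SMOKE_PASS" | jq -r '.access')

# 3. Patient API returns parseable JSON
curl -sf -H "Authorization: Bearer $TOKEN" "$API/api/v1/patients/" | jq -e . > /dev/null

# 4. Frontend index.html loads (root element is an assumption)
curl -sf https://app.vantrexia.com/ | grep -q '<div id="root">'

# 5. Certificate is valid and not expiring within 30 days (2,592,000 seconds)
echo | openssl s_client -connect api.vantrexia.com:443 -servername api.vantrexia.com 2>/dev/null \
  | openssl x509 -noout -checkend 2592000

echo "✓ All smoke tests passed"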

Step 6: Cleanup

The old container is stopped and removed. Unused Docker images are pruned to free disk space. Total deployment time: 3–5 minutes.

deploy.sh (simplified)
#!/bin/bash
set -euo pipefail

TAG="${GITHUB_SHA:-latest}"

# Pull latest images
docker compose -f docker-compose.prod.yml pull

# Scale up: start a new container alongside the old one
# (--no-recreate keeps the old replica running on the previous image)
docker compose -f docker-compose.prod.yml up -d --no-deps --no-recreate --scale backend=2 backend

# Wait for the new replica to pass its health check. The backend port is not
# published on the host, so probe from inside the container; the replica
# started by the scale-up ordinarily gets index 2.
healthy=false
for i in {1..12}; do
  if docker compose -f docker-compose.prod.yml exec -T --index 2 backend \
       curl -sf http://localhost:8000/health/ > /dev/null; then
    echo "✓ New container healthy"
    healthy=true
    break
  fi
  sleep 5
done

if [ "$healthy" = false ]; then
  echo "✗ New container failed health checks; aborting deployment" >&2
  exit 1
fi

# Run migrations (backward-compatible, so the old replica keeps serving)
docker compose -f docker-compose.prod.yml exec -T --index 2 backend \
  python manage.py migrate --noinput

# Scale back down (removes old container)
docker compose -f docker-compose.prod.yml up -d --no-deps --scale backend=1 backend

# Run smoke tests
./scripts/smoke-tests.sh

# Cleanup
docker image prune -f
echo "✓ Deployment complete"

SSL Configuration

SSL/TLS is terminated at the Nginx reverse proxy using a Cloudflare Origin Certificate. This encrypts the leg between Cloudflare's edge network and the origin server, with the additional benefits of DDoS protection and CDN caching.

Certificate Strategy: Cloudflare Origin Certificates are valid for up to 15 years and are automatically trusted by Cloudflare's edge. They are not trusted by browsers directly, ensuring that all traffic must pass through Cloudflare's protection layer.

The SSL certificate and private key are stored as GitHub Secrets (CLOUDFLARE_ORIGIN_CERT and CLOUDFLARE_ORIGIN_KEY) and deployed to the server during CI/CD. The Nginx configuration enforces TLS 1.2+ and uses modern cipher suites:

nginx SSL configuration
server {
    listen 443 ssl;
    http2 on;
    server_name api.vantrexia.com;

    ssl_certificate     /etc/nginx/ssl/cloudflare-origin.pem;
    ssl_certificate_key /etc/nginx/ssl/cloudflare-origin.key;

    ssl_protocols TLSv1.2 TLSv1.3;
    ssl_ciphers ECDHE-ECDSA-AES128-GCM-SHA256:ECDHE-RSA-AES128-GCM-SHA256;
    ssl_prefer_server_ciphers off;

    ssl_session_cache shared:SSL:10m;
    ssl_session_timeout 1d;
    ssl_session_tickets off;

    # HSTS (handled by Cloudflare, but also set at origin)
    add_header Strict-Transport-Security "max-age=63072000" always;
}
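
The CI/CD step that materializes the certificate files referenced above might look like the following sketch; the paths match the nginx volume mount in docker-compose.prod.yml, and the host directory is an assumption.

Writing certificate files during deploy (illustrative sketch)
# Secret values are injected by the pipeline as environment variables.
install -d -m 700 /opt/vantrexia/docker/nginx/ssl
printf '%s\n' "$CLOUDFLARE_ORIGIN_CERT" > /opt/vantrexia/docker/nginx/ssl/cloudflare-origin.pem
printf '%s\n' "$CLOUDFLARE_ORIGIN_KEY"  > /opt/vantrexia/docker/nginx/ssl/cloudflare-origin.key
chmod 600 /opt/vantrexia/docker/nginx/ssl/cloudflare-origin.key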

Cost Breakdown

Vantrexia's production infrastructure is optimized for a small-to-medium RPM practice. The total monthly cost is approximately $57/month, providing a HIPAA-compliant, fully managed production environment:

| Service | Tier / Size | Purpose | Monthly Cost |
| --- | --- | --- | --- |
| EC2 | t4g.medium (2 vCPU, 4 GB RAM) | Application host (all Docker containers) | ~$30.00 |
| RDS | db.t4g.micro (2 vCPU, 1 GB RAM) | PostgreSQL 15, automated backups, encryption | ~$14.00 |
| S3 | Standard | Database backups, media files, static assets | ~$3.00 |
| CloudWatch | Basic + custom metrics | Logs, alarms, performance monitoring | ~$5.00 |
| Route 53 | Hosted zone + health checks | DNS management, failover routing | ~$5.00 |
| Total Estimated Monthly Cost | | | ~$57.00 |

Cost Optimization: The ARM64 Graviton2 instance (t4g) provides approximately 20% better price-performance compared to equivalent x86 instances (t3). Additional savings come from using Cloudflare's free tier for CDN and DDoS protection, and GHCR's free storage for public container images.

Secrets Management

Vantrexia uses a two-tier secrets management strategy to keep credentials secure while enabling automated deployments:

  • AWS Secrets Manager — Source of truth for all production secrets. Secrets are rotated on a 90-day cycle and accessed at runtime by the application (see the sketch after this list).
  • GitHub Secrets — CI/CD pipeline secrets used during build and deploy. Includes SSH keys, registry credentials, and deployment configuration.
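
A runtime fetch with the AWS CLI, for illustration; the application would use the SDK equivalent, and the secret path shown is an assumption.

Fetching a secret at runtime (illustrative sketch)
# Retrieve one secret value; access is governed by the instance's IAM role.
aws secretsmanager get-secret-value \
  --secret-id vantrexia/production/DATABASE_URL \
  --query SecretString --output text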

Key Production Secrets

| Secret Name | Store | Description |
| --- | --- | --- |
| DATABASE_URL | AWS Secrets Manager | PostgreSQL connection string for RDS (includes credentials) |
| REDIS_URL | AWS Secrets Manager | Redis connection URI for Celery broker and caching |
| SECRET_KEY | AWS Secrets Manager | 256-bit secret key for Django session signing and CSRF tokens |
| ECW_CLIENT_ID | AWS Secrets Manager | eClinicalWorks FHIR API client identifier |
| ECW_CLIENT_SECRET | AWS Secrets Manager | eClinicalWorks API client secret for OAuth 2.0 authentication |
| CLOUDFLARE_ORIGIN_CERT | GitHub Secrets | Cloudflare Origin Certificate (PEM) for SSL termination |
| CLOUDFLARE_ORIGIN_KEY | GitHub Secrets | Cloudflare Origin Certificate private key |
| EC2_SSH_KEY | GitHub Secrets | SSH private key for CI/CD deployment to EC2 |
| GHCR_TOKEN | GitHub Secrets | Personal access token for pushing images to GHCR |

Security: Never log, print, or expose secrets in CI/CD output. All GitHub Actions workflows use ::add-mask:: to redact secret values. AWS Secrets Manager access is restricted via IAM roles with least-privilege policies.
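
Masking in a workflow step looks like this; the secret path is the same illustrative one used above.

Masking a fetched secret (illustrative sketch)
# Register the value with the runner so it is redacted from all later log output
DB_URL=$(aws secretsmanager get-secret-value \
  --secret-id vantrexia/production/DATABASE_URL \
  --query SecretString --output text)
echo "::add-mask::$DB_URL"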

CI/CD Pipeline

The production CI/CD pipeline runs on GitHub Actions and is triggered by pushes to the main branch. The pipeline consists of five sequential stages, with automatic rollback on failure:

Stage 1: Lint & Test

Runs in parallel: ruff linting for Python, eslint for TypeScript/React, pytest with 85%+ coverage requirement, and vitest for frontend unit tests. Fails fast if any check does not pass.

Stage 2: Build Docker Images

Docker images are cross-built for linux/arm64 (backend, frontend, and nginx services) using Docker Buildx. Images are tagged with the Git commit SHA and latest.

Stage 3: Push to GHCR

Built images are pushed to GitHub Container Registry (ghcr.io/highlandpc/vantrexia/*). Image layers are cached across builds for faster subsequent deploys.

Stage 4: Deploy to EC2 via SSH

The pipeline SSHs into the production EC2 instance, pulls the new images, and executes the zero-downtime deployment script described above. Environment variables are written from GitHub Secrets.

Stage 5: Health Check & Smoke Tests

Automated post-deploy verification confirms all services are healthy. If smoke tests fail, the pipeline executes an automatic rollback to the previous image tag. Deployment status is reported to the GitHub commit and Slack.
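
The rollback path is not shown in the workflow excerpt below; on the server it amounts to something like the following sketch, assuming the pipeline records the previously deployed tag.

Rollback (illustrative sketch)
# PREVIOUS_TAG is recorded by the pipeline before each deploy (assumption).
export TAG="$PREVIOUS_TAG"
docker compose -f docker-compose.prod.yml pull
docker compose -f docker-compose.prod.yml up -d
./scripts/smoke-tests.sh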

.github/workflows/deploy-production.yml (key steps)
name: Deploy to Production

on:
  push:
    branches: [main]

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Run backend tests
        run: |
          cd backend
          pip install -r requirements.txt
          pytest --cov --cov-fail-under=85
      - name: Run frontend tests
        run: |
          cd frontend
          npm ci && npm run test

  build-and-push:
    needs: test
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      # QEMU + Buildx are required to cross-build linux/arm64 images on x86 runners
      - uses: docker/setup-qemu-action@v3
      - uses: docker/setup-buildx-action@v3
      - name: Login to GHCR
        run: echo "${{ secrets.GHCR_TOKEN }}" | docker login ghcr.io -u ${{ github.actor }} --password-stdin
      - name: Build and push images
        run: |
          docker buildx build --platform linux/arm64 \
            -t ghcr.io/highlandpc/vantrexia/backend:${{ github.sha }} \
            -t ghcr.io/highlandpc/vantrexia/backend:latest \
            --push ./backend

  deploy:
    needs: build-and-push
    runs-on: ubuntu-latest
    steps:
      - name: Deploy to EC2
        uses: appleboy/ssh-action@v1
        with:
          host: ${{ secrets.EC2_HOST }}
          username: ubuntu
          key: ${{ secrets.EC2_SSH_KEY }}
          script: |
            cd /opt/vantrexia
            export TAG=${{ github.sha }}
            ./scripts/deploy.sh

  smoke-test:
    needs: deploy
    runs-on: ubuntu-latest
    steps:
      - name: Run smoke tests
        run: |
          curl -sf https://api.vantrexia.com/health/ || exit 1
          curl -sf https://app.vantrexia.com/ || exit 1
          echo "✓ All smoke tests passed"

Disaster Recovery

Vantrexia's disaster recovery plan is designed for a healthcare platform where data integrity and availability are critical for patient safety and HIPAA compliance.

Recovery Objectives

| Metric | Target | Description |
| --- | --- | --- |
| RTO (Recovery Time Objective) | 4 hours | Maximum time to restore full service after a catastrophic failure |
| RPO (Recovery Point Objective) | 24 hours | Maximum acceptable data loss window (worst case: 1 day of data) |
| Availability Target | 99.9% | Allows approximately 8.77 hours of downtime per year |

Backup Strategy

| Backup Type | Frequency | Retention | Storage |
| --- | --- | --- | --- |
| RDS Automated Snapshots | Daily | 7 days | AWS RDS (same region) |
| Custom pg_dump Backups | Daily | 30 days | S3 (encrypted, versioned) |
| Weekly Full Backups | Weekly (Sunday 2 AM UTC) | 90 days | S3 (cross-region replica) |
| EBS Snapshots | Daily | 14 days | AWS EBS Snapshots |
| Audit Log Archive | Monthly | 7 years (HIPAA) | S3 Glacier Deep Archive |
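
The custom pg_dump job might look like the following cron-driven sketch; the bucket name and key layout are placeholders.

Daily pg_dump to S3 (illustrative sketch)
# Stream a compressed custom-format dump straight to S3 with server-side encryption.
STAMP=$(date -u +%Y-%m-%dT%H-%M)
pg_dump "$DATABASE_URL" --format=custom \
  | aws s3 cp - "s3://vantrexia-backups/pg/vantrexia-$STAMP.dump" --sse AES256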

Recovery Procedures

  1. Application failure — Docker containers auto-restart via restart: always policy. If a container fails repeatedly, CloudWatch alarms trigger and the on-call engineer is notified via PagerDuty.
  2. EC2 instance failure — Launch a replacement t4g.medium from the latest EBS snapshot. Run docker compose up -d to restore all services. Estimated recovery: 30–60 minutes.
  3. Database corruption — Restore from the most recent RDS automated snapshot or S3 backup using point-in-time recovery (see the sketch after this list). Estimated recovery: 1–2 hours.
  4. Complete region failure — Restore S3 cross-region backup to a new RDS instance in the failover region. Deploy application stack to a new EC2 instance. Update Route 53 DNS. Estimated recovery: 2–4 hours.
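
For the database-corruption scenario, a point-in-time restore can be driven from the AWS CLI; the instance identifiers below are placeholders.

RDS point-in-time restore (illustrative sketch)
# Create a restored instance from the latest restorable point of the source
aws rds restore-db-instance-to-point-in-time \
  --source-db-instance-identifier vantrexia-prod \
  --target-db-instance-identifier vantrexia-prod-restored \
  --use-latest-restorable-time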

HIPAA Requirement: All backups are encrypted at rest using AES-256. S3 buckets enforce server-side encryption (SSE-S3) and have versioning enabled. Access to backup buckets is restricted to the vantrexia-backup-role IAM role. Backup restoration is tested quarterly and documented in compliance audit logs.