Last Updated: 2026-03-09 | Status: PRODUCTION — All systems active | Recovery Target: < 10 minutes
| What you need | Command |
|---|---|
| Run a manual backup now | bash /mnt/data/truth-si-dev-env/scripts/run-backup-once.sh [15min\|hourly\|daily] |
| Check backup health | ls -la /mnt/data/backups/enterprise/*/hourly/ 2>/dev/null |
| Restore everything | bash /mnt/data/truth-si-dev-env/scripts/full-recovery.sh |
| Dry-run recovery | bash /mnt/data/truth-si-dev-env/scripts/full-recovery.sh --dry-run |
| Restore one service | bash /mnt/data/truth-si-dev-env/scripts/full-recovery.sh --service redis |
| Take EBS snapshot | bash /mnt/data/truth-si-dev-env/scripts/ebs-snapshot.sh manual |
| Check timer status | systemctl list-timers 'truthsi-backup-*' |
Genesis AWS (p5en.48xlarge)
│
├── /mnt/data/backups/enterprise/ ← LOCAL BACKUPS (10TB EBS, persistent)
│ ├── neo4j/
│ │ ├── 15min/ (SKIPPED — Neo4j backup takes 5-8 min, exceeds window)
│ │ ├── hourly/ (48 retained = 2 days coverage)
│ │ ├── daily/ (14 retained = 2 weeks coverage)
│ │ └── weekly/ (52 retained = 1 year coverage)
│ ├── redis/ (same interval structure)
│ ├── yugabyte/ (same interval structure)
│ ├── weaviate/ (hourly/daily only — 26.8M objects, slow backup)
│ └── env/ (all intervals — small, fast)
│
├── EBS Snapshots (AWS) ← VOLUME-LEVEL BACKUPS
│ └── Tagged: Project=truth-si, daily + on spot-warning
│
└── S3 (truthsi-sovereign-backup*) ← CLOUD BACKUPS
└── Synced by s3-sovereignty-sync daemon
| Timer | Schedule | Services backed up | Timeout |
|---|---|---|---|
| truthsi-backup-15min.timer | Every 15 min | Redis, YugabyteDB, .env | 240s |
| truthsi-backup-hourly.timer | Every hour | Neo4j, Redis, YugabyteDB, .env | 1200s |
| truthsi-backup-daily.timer | Daily 00:00 | All services + Weaviate | 1800s |
Check timer status:
systemctl list-timers 'truthsi-backup-*'
systemctl status truthsi-backup-hourly.timer
journalctl -u truthsi-backup-hourly -n 50
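The unit files themselves are not reproduced in this runbook; a minimal sketch of what the hourly pair likely contains, inferred from the schedule, the 1200s timeout, and the Persistent=true behavior noted under design decisions (the shipped units may differ):

```ini
# /etc/systemd/system/truthsi-backup-hourly.timer (illustrative sketch)
[Unit]
Description=TruthSI hourly backup

[Timer]
OnCalendar=hourly
Persistent=true

[Install]
WantedBy=timers.target

# /etc/systemd/system/truthsi-backup-hourly.service (illustrative sketch)
[Unit]
Description=TruthSI hourly backup run

[Service]
Type=oneshot
TimeoutStartSec=1200
ExecStart=/usr/bin/bash /mnt/data/truth-si-dev-env/scripts/run-backup-once.sh hourly
```

Persistent=true is what gives the catch-up-after-reboot behavior that cron lacks.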
Why 15min skips Neo4j: Neo4j's 27GB+ data directory takes 5-8 minutes to docker cp and compress. A 15-minute interval would leave no buffer time. Neo4j runs on hourly and daily timers instead.
run-backup-once.sh — the workhorse of the backup system. Standalone bash script: no daemon, no Redis Streams queue, no blocking.
# Usage
bash scripts/run-backup-once.sh 15min # Fast: Redis + YugabyteDB + .env
bash scripts/run-backup-once.sh hourly # Full: all services including Neo4j
bash scripts/run-backup-once.sh daily # Full + Weaviate
bash scripts/run-backup-once.sh weekly # Same as daily, longer retention
Log: /var/log/truthsi-backup-<interval>.log
Each backup creates a timestamped directory:
/mnt/data/backups/enterprise/<service>/<interval>/<YYYYMMDD_HHMMSS>/
├── <data file> (neo4j_data.tar.gz, redis_<ts>.rdb, yugabyte_<ts>.sql.gz)
└── metadata.json (service, interval, timestamp, size_bytes, status, host)
metadata.json example:
{
"service": "redis",
"interval": "hourly",
"timestamp": "20260309_043820",
"size_bytes": 274890752,
"status": "success",
"host": "ip-10-0-0-1",
"backed_up_at": "2026-03-09T04:38:20Z"
}
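Because every backup writes this metadata.json, failed runs can be surfaced by scanning for files whose status is not "success". A quick sketch (BACKUP_ROOT is an illustrative override, not a variable the scripts use; `grep -L` prints files *without* a match):

```shell
# List any metadata.json that does NOT record "status": "success".
root="${BACKUP_ROOT:-/mnt/data/backups/enterprise}"
grep -rL '"status": "success"' --include=metadata.json "$root" 2>/dev/null || true
```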
| Interval | Retained | Coverage |
|---|---|---|
| 15min | 3 backups | 45 minutes |
| hourly | 48 backups | 2 days |
| daily | 14 backups | 2 weeks |
| weekly | 52 backups | 1 year |
Old backups are automatically pruned at the end of each backup run.
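The pruning pass can be sketched as follows, assuming the documented layout and retention counts (the actual logic inside run-backup-once.sh may differ). Because timestamp directories are named YYYYMMDD_HHMMSS, they sort lexicographically, so newest-first is just a reverse sort:

```shell
# Keep the newest $retain timestamp directories under $dir, delete the rest.
prune_interval() {
  local dir="$1" retain="$2"
  ls -1d "$dir"/*/ 2>/dev/null | sort -r | tail -n +"$((retain + 1))" |
    while read -r old; do rm -rf "$old"; done
}
# Retention counts from the table above:
prune_interval /mnt/data/backups/enterprise/redis/15min 3
prune_interval /mnt/data/backups/enterprise/redis/hourly 48
```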
AWS Spot instances get a 2-minute warning before termination.
The spot handler (scripts/spot-interruption-handler.sh) polls the IMDS endpoint every 5 seconds. When termination is detected, it runs this sequence within the 2-minute window:
| Step | Action | Time budget |
|---|---|---|
| 1 | ntfy.sh urgent alert | ~1s |
| 2 | Redis BGSAVE + docker cp | ~5s |
| 3 | .env backup | ~1s |
| 4 | Git emergency commit + push | ~15s |
| 5 | Neo4j docker cp (best-effort, 60s timeout) | up to 60s |
| 6 | EBS snapshot of all volumes | async |
| 7 | S3 sync | async |
| 8 | Stop SGLang model services gracefully | ~10s |
| 9 | Completion notification | ~1s |
Spot backups land in:
/mnt/data/backups/enterprise/redis/spot/<timestamp>/
/mnt/data/backups/enterprise/neo4j/spot/<timestamp>/
/mnt/data/backups/enterprise/env/spot/<timestamp>/
Notification: Uses ntfy.sh — topic set via NTFY_TOPIC env var in .env
Service: truthsi-spot-handler.service (persistent, always running)
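The detection side of the handler can be sketched as an IMDSv2 poll; spot/instance-action only exists once a termination notice has been issued. The IMDS variable and spot_action_pending name are illustrative; the shipped script may structure this differently:

```shell
# Sketch of the spot-interruption poll (IMDSv2); IMDS is parameterized here
# only so the sketch is testable, the real endpoint is 169.254.169.254.
IMDS="${IMDS:-http://169.254.169.254}"

spot_action_pending() {
  local token
  # IMDSv2 requires fetching a session token before any metadata read.
  token=$(curl -sf -m 2 -X PUT "$IMDS/latest/api/token" \
    -H "X-aws-ec2-metadata-token-ttl-seconds: 60") || return 1
  # spot/instance-action returns 404 until a termination notice is issued.
  curl -sf -m 2 -H "X-aws-ec2-metadata-token: $token" \
    "$IMDS/latest/meta-data/spot/instance-action" >/dev/null
}

# The handler polls every 5 seconds, then runs steps 1-9 above:
# while ! spot_action_pending; do sleep 5; done
```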
EBS snapshots capture the entire /mnt/data volume (10TB) at a point in time. This is the fastest way to restore everything after spot termination — attach the snapshot to a new instance.
Script: scripts/ebs-snapshot.sh
# Manual snapshot
bash scripts/ebs-snapshot.sh manual
# Called automatically by spot handler on interruption
bash scripts/ebs-snapshot.sh spot-warning
Retention: Snapshots older than 7 days are automatically deleted.
Tags applied to each snapshot:
- Name: TruthSI-<reason>-<timestamp>
- Project: truth-si
- Instance: <instance-id>
- Device: /dev/sdb (or whichever device)
- Reason: manual | spot-warning | daily
View snapshots:
aws ec2 describe-snapshots --filters "Name=tag:Project,Values=truth-si" \
--query 'Snapshots[*].[SnapshotId,StartTime,Description,State]' \
--output table
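The 7-day retention can be expressed with the same Project=truth-si tag filter. A hedged sketch (the real ebs-snapshot.sh may implement this differently; the GNU/BSD date fallback is illustrative):

```shell
# Delete Project=truth-si snapshots older than 7 days.
CUTOFF=$(date -u -d '7 days ago' +%Y-%m-%dT%H:%M:%S 2>/dev/null ||
         date -u -v-7d +%Y-%m-%dT%H:%M:%S)   # GNU date, then BSD fallback
if command -v aws >/dev/null; then
  aws ec2 describe-snapshots \
    --filters "Name=tag:Project,Values=truth-si" \
    --query "Snapshots[?StartTime<'${CUTOFF}'].SnapshotId" \
    --output text 2>/dev/null |
  tr '\t' '\n' | while read -r snap; do
    [ -n "$snap" ] && aws ec2 delete-snapshot --snapshot-id "$snap"
  done
fi
```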
bash /mnt/data/truth-si-dev-env/scripts/full-recovery.sh
Automatically:
1. Checks prerequisites (docker, backup root)
2. Restores .env (if not present)
3. Starts all Docker containers
4. Restores Redis (< 60s)
5. Restores YugabyteDB (< 120s)
6. Restores Neo4j (< 300s)
7. Restores Weaviate (via filesystem backup API)
8. Verifies all services healthy
9. Reports total time (target: < 10 minutes)
Backup selection: Picks the most recently timestamped backup across ALL intervals (hourly, daily, 15min, spot, weekly). A fresh 15min backup beats a stale hourly backup.
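That selection reduces to a lexicographic sort on the timestamp directory names, since YYYYMMDD_HHMMSS sorts chronologically. A sketch (latest_backup and BACKUP_ROOT are illustrative names, not part of full-recovery.sh):

```shell
# Pick the newest backup for a service across ALL interval directories.
latest_backup() {
  local root="${BACKUP_ROOT:-/mnt/data/backups/enterprise}"
  ls -d "$root/$1"/*/*/ 2>/dev/null |
    awk -F/ '{print $(NF-1), $0}' |   # prefix each path with its timestamp
    sort -r | head -1 | cut -d' ' -f2-
}
latest_backup redis   # e.g. a fresh 15min path beats a stale hourly one
```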
Supported flags:
--dry-run # Show what would be restored, no changes made
--service neo4j # Restore only one service
--service redis
--service yugabyte
--service weaviate
--service env
# Find latest backup
ls -dt /mnt/data/backups/enterprise/redis/*/*/ | head -5
# Restore uncompressed (.rdb — new format)
docker stop truthsi-redis
docker cp /mnt/data/backups/enterprise/redis/15min/<ts>/redis_<ts>.rdb truthsi-redis:/data/dump.rdb
docker start truthsi-redis
# Restore compressed (.rdb.gz — old format)
gunzip -c /mnt/data/backups/enterprise/redis/hourly/<ts>/dump.rdb.gz > /tmp/dump.rdb
docker stop truthsi-redis
docker cp /tmp/dump.rdb truthsi-redis:/data/dump.rdb
docker start truthsi-redis
TMP=/tmp/neo4j-restore-$$
mkdir -p $TMP
tar xzf /mnt/data/backups/enterprise/neo4j/hourly/<ts>/neo4j_data.tar.gz -C $TMP
docker stop truthsi-neo4j
sleep 5
docker cp ${TMP}/neo4j_bak_<ts>/. truthsi-neo4j:/data/
docker start truthsi-neo4j
sleep 25
# Verify
docker exec truthsi-neo4j cypher-shell -u neo4j -p "${NEO4J_PASSWORD}" "RETURN 1"
rm -rf $TMP
Credential note: If Neo4j password was rotated after the backup was taken, the restored auth file reflects the OLD password. Login with the old password, then update it.
YB_DUMP=/mnt/data/backups/enterprise/yugabyte/hourly/<ts>/yugabyte_<ts>.sql.gz
zcat "$YB_DUMP" | docker exec -i truthsi-yugabyte \
bash -c "PGPASSWORD='${YUGABYTE_PASSWORD}' psql -h localhost -U yugabyte -d yugabyte"
For complete instance recovery on a new spot instance:
1. Launch a p5en.48xlarge spot instance in us-west-2
2. Find the latest EBS snapshot (filter on the Project=truth-si tag, pick the latest)
3. Create a volume from the snapshot and attach it as /dev/sdb
4. Mount it: sudo mount /dev/sdb /mnt/data
5. Start the stack: cd /mnt/data/truth-si-dev-env && docker compose up -d

Check the latest backups for each service:
for svc in neo4j redis yugabyte weaviate env; do
echo "=== $svc ==="
ls -dt /mnt/data/backups/enterprise/${svc}/*/*/ 2>/dev/null | head -3 | while read d; do
echo " $(basename $d) — $(cat $d/metadata.json 2>/dev/null | \
python3 -c 'import sys,json; d=json.load(sys.stdin); \
print(d.get("status","?"), d.get("size_bytes",0)//1024//1024, "MB")' 2>/dev/null || echo 'no metadata')"
done
done
systemctl list-timers 'truthsi-backup-*' --no-pager
tail -50 /var/log/truthsi-backup-15min.log
tail -50 /var/log/truthsi-backup-hourly.log
tail -50 /var/log/truthsi-backup-daily.log
| Service | Alert if backup older than |
|---|---|
| Redis | 2 hours |
| YugabyteDB | 2 hours |
| Neo4j | 4 hours |
| Weaviate | 24 hours |
| .env | 2 hours |
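A hedged staleness check derived from this table (check_stale and BACKUP_ROOT are illustrative names, not part of the shipped scripts):

```shell
# Alert when a service's newest metadata.json is older than its limit (minutes).
check_stale() {
  local svc="$1" limit_min="$2"
  local root="${BACKUP_ROOT:-/mnt/data/backups/enterprise}"
  if ! find "$root/$svc" -name metadata.json -mmin -"$limit_min" 2>/dev/null |
      grep -q .; then
    echo "ALERT: $svc backup older than ${limit_min} minutes"
  fi
}
check_stale redis 120      # 2 hours
check_stale yugabyte 120   # 2 hours
check_stale neo4j 240      # 4 hours
check_stale weaviate 1440  # 24 hours
check_stale env 120        # 2 hours
```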
Issue: Neo4j backup takes 5-8 minutes — too slow for 15min interval. Workaround: 15min timer skips Neo4j. Neo4j runs hourly and daily only.
Issue: Old enterprise-backup-daemon.py created dump.rdb.gz; new run-backup-once.sh creates redis_<ts>.rdb.
Workaround: full-recovery.sh handles both formats automatically (decompresses .gz if needed).
Issue: genesisdeploy IAM user lacks the s3:ListBucket permission on the backup buckets (needed for ListObjectsV2 API calls).
Workaround: Attach IAM policy with s3:* on arn:aws:s3:::truthsi-sovereign-backup*, or use instance role.
| Decision | Rationale |
|---|---|
| Standalone script vs daemon | enterprise-backup-daemon.py was blocked in Redis Streams polling loop; standalone oneshot avoids that bug entirely |
| systemd timers vs cron | Timers have Persistent=true (catch-up after reboot), built-in logging via journald, dependency management |
| Skip Neo4j at 15min | 27GB tar.gz exceeds interval window; hourly coverage is sufficient |
| Most-recent timestamp wins | A fresh 15min backup is more valuable than a stale hourly of the same data |
| EBS snapshots as disaster recovery | Fastest path to full restore on new instance — no per-database restore scripts needed |
| ntfy.sh for spot alerts | Zero-infrastructure push notifications; works without Slack/email setup |
| File | Purpose |
|---|---|
| scripts/run-backup-once.sh | Main backup script (called by systemd timers) |
| scripts/full-recovery.sh | One-command full recovery |
| scripts/ebs-snapshot.sh | EBS volume snapshot automation |
| scripts/spot-interruption-handler.sh | Spot 2-minute warning handler |
| /etc/systemd/system/truthsi-backup-15min.{service,timer} | 15-minute timer |
| /etc/systemd/system/truthsi-backup-hourly.{service,timer} | Hourly timer |
| /etc/systemd/system/truthsi-backup-daily.{service,timer} | Daily timer |
| /var/log/truthsi-backup-*.log | Backup run logs |
| /mnt/data/backups/enterprise/ | Backup root directory |
Created: Session 938 EXT1 — THE ARCHITECT. Carter's mandate: "Every fucking little tiny thing where we can be back up in 5-10 minutes, not this fucking insanity."