Last Updated: 2026-03-09 | Status: PRODUCTION — All systems active | Recovery Target: < 10 minutes
| What you need | Command |
|---|---|
| Run a manual backup now | bash /mnt/data/truth-si-dev-env/scripts/run-backup-once.sh [15min\|hourly\|daily] |
| Check backup health | ls -la /mnt/data/backups/enterprise/*/hourly/ 2>/dev/null |
| Restore everything | bash /mnt/data/truth-si-dev-env/scripts/full-recovery.sh |
| Dry-run recovery | bash /mnt/data/truth-si-dev-env/scripts/full-recovery.sh --dry-run |
| Restore one service | bash /mnt/data/truth-si-dev-env/scripts/full-recovery.sh --service redis |
| Take EBS snapshot | bash /mnt/data/truth-si-dev-env/scripts/ebs-snapshot.sh manual |
| Check timer status | systemctl list-timers 'truthsi-backup-*' |
Genesis AWS (p5en.48xlarge)
│
├── /mnt/data/backups/enterprise/ ← LOCAL BACKUPS (10TB EBS, persistent)
│ ├── neo4j/
│ │ ├── 15min/ (SKIPPED — Neo4j backup takes 5-8 min, exceeds window)
│ │ ├── hourly/ (48 retained = 2 days coverage)
│ │ ├── daily/ (14 retained = 2 weeks coverage)
│ │ └── weekly/ (52 retained = 1 year coverage)
│ ├── redis/ (same interval structure)
│ ├── yugabyte/ (same interval structure)
│ ├── weaviate/ (hourly/daily only — 26.8M objects, slow backup)
│ └── env/ (all intervals — small, fast)
│
├── EBS Snapshots (AWS) ← VOLUME-LEVEL BACKUPS
│ └── Tagged: Project=truth-si, daily + on spot-warning
│
└── S3 (truthsi-sovereign-backup*) ← CLOUD BACKUPS
└── Synced by s3-sovereignty-sync daemon
| Timer | Schedule | Services backed up | Timeout |
|---|---|---|---|
| truthsi-backup-15min.timer | Every 15 min | Redis, YugabyteDB, .env | 240s |
| truthsi-backup-hourly.timer | Every hour | Neo4j, Redis, YugabyteDB, .env | 1200s |
| truthsi-backup-daily.timer | Daily 00:00 | All services + Weaviate | 1800s |
Check timer status:
systemctl list-timers 'truthsi-backup-*'
systemctl status truthsi-backup-hourly.timer
journalctl -u truthsi-backup-hourly -n 50
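The unit files themselves are not reproduced in this runbook; a minimal sketch of what the hourly pair likely contains, inferred from the schedule, the 1200s timeout, and the Persistent=true behavior noted under design decisions (the shipped units may differ):

```ini
# /etc/systemd/system/truthsi-backup-hourly.timer (illustrative sketch)
[Unit]
Description=TruthSI hourly backup

[Timer]
OnCalendar=hourly
Persistent=true

[Install]
WantedBy=timers.target

# /etc/systemd/system/truthsi-backup-hourly.service (illustrative sketch)
[Unit]
Description=TruthSI hourly backup run

[Service]
Type=oneshot
TimeoutStartSec=1200
ExecStart=/usr/bin/bash /mnt/data/truth-si-dev-env/scripts/run-backup-once.sh hourly
```

Persistent=true is what gives the catch-up-after-reboot behavior that cron lacks.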
Why 15min skips Neo4j: Neo4j's 27GB+ data directory takes 5-8 minutes to docker cp and compress. A 15-minute interval would leave no buffer time. Neo4j runs on hourly and daily timers instead.
run-backup-once.sh — the workhorse of the backup system. Standalone bash script: no daemon, no Redis Streams queue, no blocking.
# Usage
bash scripts/run-backup-once.sh 15min # Fast: Redis + YugabyteDB + .env
bash scripts/run-backup-once.sh hourly # Full: all services including Neo4j
bash scripts/run-backup-once.sh daily # Full + Weaviate
bash scripts/run-backup-once.sh weekly # Same as daily, longer retention
Log: /var/log/truthsi-backup-<interval>.log
Each backup creates a timestamped directory:
/mnt/data/backups/enterprise/<service>/<interval>/<YYYYMMDD_HHMMSS>/
├── <data file> (neo4j_data.tar.gz, redis_<ts>.rdb, yugabyte_<ts>.sql.gz)
└── metadata.json (service, interval, timestamp, size_bytes, status, host)
metadata.json example:
{
"service": "redis",
"interval": "hourly",
"timestamp": "20260309_043820",
"size_bytes": 274890752,
"status": "success",
"host": "ip-10-0-0-1",
"backed_up_at": "2026-03-09T04:38:20Z"
}
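Because every backup writes this metadata.json, failed runs can be surfaced by scanning for files whose status is not "success". A quick sketch (BACKUP_ROOT is an illustrative override, not a variable the scripts use; `grep -L` prints files *without* a match):

```shell
# List any metadata.json that does NOT record "status": "success".
root="${BACKUP_ROOT:-/mnt/data/backups/enterprise}"
grep -rL '"status": "success"' --include=metadata.json "$root" 2>/dev/null || true
```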
| Interval | Retained | Coverage |
|---|---|---|
| 15min | 3 backups | 45 minutes |
| hourly | 48 backups | 2 days |
| daily | 14 backups | 2 weeks |
| weekly | 52 backups | 1 year |
Old backups are automatically pruned at the end of each backup run.
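The pruning pass can be sketched as follows, assuming the documented layout and retention counts (the actual logic inside run-backup-once.sh may differ). Because timestamp directories are named YYYYMMDD_HHMMSS, they sort lexicographically, so newest-first is just a reverse sort:

```shell
# Keep the newest $retain timestamp directories under $dir, delete the rest.
prune_interval() {
  local dir="$1" retain="$2"
  ls -1d "$dir"/*/ 2>/dev/null | sort -r | tail -n +"$((retain + 1))" |
    while read -r old; do rm -rf "$old"; done
}
# Retention counts from the table above:
prune_interval /mnt/data/backups/enterprise/redis/15min 3
prune_interval /mnt/data/backups/enterprise/redis/hourly 48
```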
AWS Spot instances get a 2-minute warning before termination.
The spot handler (scripts/spot-interruption-handler.sh) polls the IMDS endpoint every 5 seconds. When termination is detected, it runs this sequence within the 2-minute window:
| Step | Action | Time budget |
|---|---|---|
| 1 | ntfy.sh urgent alert | ~1s |
| 2 | Redis BGSAVE + docker cp | ~5s |
| 3 | .env backup | ~1s |
| 4 | Git emergency commit + push | ~15s |
| 5 | Neo4j docker cp (best-effort, 60s timeout) | up to 60s |
| 6 | EBS snapshot of all volumes | async |
| 7 | S3 sync | async |
| 8 | Stop SGLang model services gracefully | ~10s |
| 9 | Completion notification | ~1s |
Spot backups land in:
/mnt/data/backups/enterprise/redis/spot/<timestamp>/
/mnt/data/backups/enterprise/neo4j/spot/<timestamp>/
/mnt/data/backups/enterprise/env/spot/<timestamp>/
Notification: Uses ntfy.sh — topic set via NTFY_TOPIC env var in .env
Service: truthsi-spot-handler.service (persistent, always running)
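The detection side of the handler can be sketched as an IMDSv2 poll; spot/instance-action only exists once a termination notice has been issued. The IMDS variable and spot_action_pending name are illustrative; the shipped script may structure this differently:

```shell
# Sketch of the spot-interruption poll (IMDSv2); IMDS is parameterized here
# only so the sketch is testable, the real endpoint is 169.254.169.254.
IMDS="${IMDS:-http://169.254.169.254}"

spot_action_pending() {
  local token
  # IMDSv2 requires fetching a session token before any metadata read.
  token=$(curl -sf -m 2 -X PUT "$IMDS/latest/api/token" \
    -H "X-aws-ec2-metadata-token-ttl-seconds: 60") || return 1
  # spot/instance-action returns 404 until a termination notice is issued.
  curl -sf -m 2 -H "X-aws-ec2-metadata-token: $token" \
    "$IMDS/latest/meta-data/spot/instance-action" >/dev/null
}

# The handler polls every 5 seconds, then runs steps 1-9 above:
# while ! spot_action_pending; do sleep 5; done
```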
EBS snapshots capture the entire /mnt/data volume (10TB) at a point in time. This is the fastest way to restore everything after spot termination — attach the snapshot to a new instance.
Script: scripts/ebs-snapshot.sh
# Manual snapshot
bash scripts/ebs-snapshot.sh manual
# Called automatically by spot handler on interruption
bash scripts/ebs-snapshot.sh spot-warning
Retention: Snapshots older than 7 days are automatically deleted.
Tags applied to each snapshot:
- Name: TruthSI-<reason>-<timestamp>
- Project: truth-si
- Instance: <instance-id>
- Device: /dev/sdb (or whichever device)
- Reason: manual | spot-warning | daily
View snapshots:
aws ec2 describe-snapshots --filters "Name=tag:Project,Values=truth-si" \
--query 'Snapshots[*].[SnapshotId,StartTime,Description,State]' \
--output table
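The 7-day retention can be expressed with the same Project=truth-si tag filter. A hedged sketch (the real ebs-snapshot.sh may implement this differently; the GNU/BSD date fallback is illustrative):

```shell
# Delete Project=truth-si snapshots older than 7 days.
CUTOFF=$(date -u -d '7 days ago' +%Y-%m-%dT%H:%M:%S 2>/dev/null ||
         date -u -v-7d +%Y-%m-%dT%H:%M:%S)   # GNU date, then BSD fallback
if command -v aws >/dev/null; then
  aws ec2 describe-snapshots \
    --filters "Name=tag:Project,Values=truth-si" \
    --query "Snapshots[?StartTime<'${CUTOFF}'].SnapshotId" \
    --output text 2>/dev/null |
  tr '\t' '\n' | while read -r snap; do
    [ -n "$snap" ] && aws ec2 delete-snapshot --snapshot-id "$snap"
  done
fi
```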
bash /mnt/data/truth-si-dev-env/scripts/full-recovery.sh
Automatically:
1. Checks prerequisites (docker, backup root)
2. Restores .env (if not present)
3. Starts all Docker containers
4. Restores Redis (< 60s)
5. Restores YugabyteDB (< 120s)
6. Restores Neo4j (< 300s)
7. Restores Weaviate (via filesystem backup API)
8. Verifies all services healthy
9. Reports total time (target: < 10 minutes)
Backup selection: Picks the most recently timestamped backup across ALL intervals (hourly, daily, 15min, spot, weekly). A fresh 15min backup beats a stale hourly backup.
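That selection reduces to a lexicographic sort on the timestamp directory names, since YYYYMMDD_HHMMSS sorts chronologically. A sketch (latest_backup and BACKUP_ROOT are illustrative names, not part of full-recovery.sh):

```shell
# Pick the newest backup for a service across ALL interval directories.
latest_backup() {
  local root="${BACKUP_ROOT:-/mnt/data/backups/enterprise}"
  ls -d "$root/$1"/*/*/ 2>/dev/null |
    awk -F/ '{print $(NF-1), $0}' |   # prefix each path with its timestamp
    sort -r | head -1 | cut -d' ' -f2-
}
latest_backup redis   # e.g. a fresh 15min path beats a stale hourly one
```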
Supported flags:
--dry-run # Show what would be restored, no changes made
--service neo4j # Restore only one service
--service redis
--service yugabyte
--service weaviate
--service env
# Find latest backup
ls -dt /mnt/data/backups/enterprise/redis/*/*/ | head -5
# Restore uncompressed (.rdb — new format)
docker stop truthsi-redis
docker cp /mnt/data/backups/enterprise/redis/15min/<ts>/redis_<ts>.rdb truthsi-redis:/data/dump.rdb
docker start truthsi-redis
# Restore compressed (.rdb.gz — old format)
gunzip -c /mnt/data/backups/enterprise/redis/hourly/<ts>/dump.rdb.gz > /tmp/dump.rdb
docker stop truthsi-redis
docker cp /tmp/dump.rdb truthsi-redis:/data/dump.rdb
docker start truthsi-redis
TMP=/tmp/neo4j-restore-$$
mkdir -p $TMP
tar xzf /mnt/data/backups/enterprise/neo4j/hourly/<ts>/neo4j_data.tar.gz -C $TMP
docker stop truthsi-neo4j
sleep 5
docker cp ${TMP}/neo4j_bak_<ts>/. truthsi-neo4j:/data/
docker start truthsi-neo4j
sleep 25
# Verify
docker exec truthsi-neo4j cypher-shell -u neo4j -p "${NEO4J_PASSWORD}" "RETURN 1"
rm -rf $TMP
Credential note: If Neo4j password was rotated after the backup was taken, the restored auth file reflects the OLD password. Login with the old password, then update it.
YB_DUMP=/mnt/data/backups/enterprise/yugabyte/hourly/<ts>/yugabyte_<ts>.sql.gz
zcat "$YB_DUMP" | docker exec -i truthsi-yugabyte \
bash -c "PGPASSWORD='${YUGABYTE_PASSWORD}' psql -h localhost -U yugabyte -d yugabyte"
For complete instance recovery on a new spot instance:
1. Launch a p5en.48xlarge spot instance in us-west-2
2. Find the latest EBS snapshot (filter on the Project=truth-si tag, pick the latest)
3. Create a volume from the snapshot and attach it as /dev/sdb
4. Mount it: sudo mount /dev/sdb /mnt/data
5. Start the stack: cd /mnt/data/truth-si-dev-env && docker compose up -d

Check the latest backups for each service:
for svc in neo4j redis yugabyte weaviate env; do
echo "=== $svc ==="
ls -dt /mnt/data/backups/enterprise/${svc}/*/*/ 2>/dev/null | head -3 | while read d; do
echo " $(basename $d) — $(cat $d/metadata.json 2>/dev/null | \
python3 -c 'import sys,json; d=json.load(sys.stdin); \
print(d.get("status","?"), d.get("size_bytes",0)//1024//1024, "MB")' 2>/dev/null || echo 'no metadata')"
done
done
systemctl list-timers 'truthsi-backup-*' --no-pager
tail -50 /var/log/truthsi-backup-15min.log
tail -50 /var/log/truthsi-backup-hourly.log
tail -50 /var/log/truthsi-backup-daily.log
| Service | Alert if backup older than |
|---|---|
| Redis | 2 hours |
| YugabyteDB | 2 hours |
| Neo4j | 4 hours |
| Weaviate | 24 hours |
| .env | 2 hours |
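A hedged staleness check derived from this table (check_stale and BACKUP_ROOT are illustrative names, not part of the shipped scripts):

```shell
# Alert when a service's newest metadata.json is older than its limit (minutes).
check_stale() {
  local svc="$1" limit_min="$2"
  local root="${BACKUP_ROOT:-/mnt/data/backups/enterprise}"
  if ! find "$root/$svc" -name metadata.json -mmin -"$limit_min" 2>/dev/null |
      grep -q .; then
    echo "ALERT: $svc backup older than ${limit_min} minutes"
  fi
}
check_stale redis 120      # 2 hours
check_stale yugabyte 120   # 2 hours
check_stale neo4j 240      # 4 hours
check_stale weaviate 1440  # 24 hours
check_stale env 120        # 2 hours
```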
Issue: Neo4j backup takes 5-8 minutes — too slow for 15min interval. Workaround: 15min timer skips Neo4j. Neo4j runs hourly and daily only.
Issue: Old enterprise-backup-daemon.py created dump.rdb.gz; new run-backup-once.sh creates redis_<ts>.rdb.
Workaround: full-recovery.sh handles both formats automatically (decompresses .gz if needed).
Issue: genesisdeploy IAM user lacks the s3:ListBucket permission on the backup buckets (needed for ListObjectsV2 API calls).
Workaround: Attach IAM policy with s3:* on arn:aws:s3:::truthsi-sovereign-backup*, or use instance role.
| Decision | Rationale |
|---|---|
| Standalone script vs daemon | enterprise-backup-daemon.py was blocked in Redis Streams polling loop; standalone oneshot avoids that bug entirely |
| systemd timers vs cron | Timers have Persistent=true (catch-up after reboot), built-in logging via journald, dependency management |
| Skip Neo4j at 15min | 27GB tar.gz exceeds interval window; hourly coverage is sufficient |
| Most-recent timestamp wins | A fresh 15min backup is more valuable than a stale hourly of the same data |
| EBS snapshots as disaster recovery | Fastest path to full restore on new instance — no per-database restore scripts needed |
| ntfy.sh for spot alerts | Zero-infrastructure push notifications; works without Slack/email setup |
| File | Purpose |
|---|---|
| scripts/run-backup-once.sh | Main backup script (called by systemd timers) |
| scripts/full-recovery.sh | One-command full recovery |
| scripts/ebs-snapshot.sh | EBS volume snapshot automation |
| scripts/spot-interruption-handler.sh | Spot 2-minute warning handler |
| /etc/systemd/system/truthsi-backup-15min.{service,timer} | 15-minute timer |
| /etc/systemd/system/truthsi-backup-hourly.{service,timer} | Hourly timer |
| /etc/systemd/system/truthsi-backup-daily.{service,timer} | Daily timer |
| /var/log/truthsi-backup-*.log | Backup run logs |
| /mnt/data/backups/enterprise/ | Backup root directory |
Created: Session 938 EXT1 — THE ARCHITECT. Carter's mandate: "Every fucking little tiny thing where we can be back up in 5-10 minutes, not this fucking insanity."