Created: Session 951 (2026-03-12)
Updated: Session 968 (2026-03-14) — Synced with ACTUAL running Docker containers
Cost of this lesson: $2,000+ over 3 days of downtime
Root cause: Wrong Python packages in manual venv (torch cu128 vs cu129, missing flashinfer-jit-cache, wrong sgl-fa4 version)
Solution: Run models from official Docker image lmsysorg/sglang:dev-x86
NEVER run SGLang from a manual venv again. ALWAYS use the official Docker image.
The Docker image lmsysorg/sglang:dev-x86 contains the EXACT blessed package set that SGLang developers test with. Our manual venv had:
- torch==2.9.1+cu128 (WRONG — should be cu129)
- flashinfer-python==0.6.3 (WRONG — should be 0.6.4)
- Missing flashinfer-jit-cache==0.6.4+cu129 entirely
- sgl-fa4==4.0.5 (WRONG — should be 4.0.3)
These mismatches caused SIGSEGV (segfault) during CUDA graph capture EVERY time.
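Drift like this is mechanical to detect before it costs three days. A minimal sketch (illustrative helper names, not part of the restore tooling) that diffs a `pip freeze` dump against the blessed pins documented further below:

```python
# Compare a pip-freeze dump against the blessed package pins.
# The pinned versions below are the blessed set documented in this file.
BLESSED = {
    "torch": "2.9.1+cu129",
    "flashinfer-python": "0.6.4",
    "flashinfer-jit-cache": "0.6.4+cu129",
    "sgl-fa4": "4.0.3",
}

def parse_freeze(text: str) -> dict:
    """Parse 'name==version' lines from `pip freeze` output."""
    pins = {}
    for line in text.splitlines():
        line = line.strip()
        if "==" in line and not line.startswith("#"):
            name, _, version = line.partition("==")
            pins[name] = version
    return pins

def find_mismatches(installed: dict) -> list[str]:
    """Return human-readable problems: wrong versions and missing packages."""
    problems = []
    for name, want in BLESSED.items():
        have = installed.get(name)
        if have is None:
            problems.append(f"{name}: MISSING (want {want})")
        elif have != want:
            problems.append(f"{name}: {have} (want {want})")
    return problems

# The broken venv from this incident:
broken = parse_freeze("""\
torch==2.9.1+cu128
flashinfer-python==0.6.3
sgl-fa4==4.0.5
""")
for problem in find_mismatches(broken):
    print(problem)
```

Running this against the incident venv flags all four problems, including the entirely missing flashinfer-jit-cache.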
```bash
docker run -d --name truthsi-llm-primary \
--gpus '"device=0,1,2,3"' \
--shm-size 32g \
--restart unless-stopped \
-v /opt/dlami/nvme/models/Qwen3.5-397B-A17B-FP8:/model \
-p 8010:8010 \
--env SGLANG_DISABLE_CUDNN_CHECK=1 \
--env SGLANG_ALLOW_OVERWRITE_LONGER_CONTEXT_LEN=1 \
lmsysorg/sglang:dev-x86 \
python -m sglang.launch_server \
--model-path /model \
--tp 4 \
--port 8010 \
--host 0.0.0.0 \
--trust-remote-code \
--context-length 1048576 \
--json-model-override-args '{"rope_scaling":{"rope_type":"yarn","factor":4.0,"original_max_position_embeddings":262144,"mrope_section":[11,11,10],"mrope_interleaved":true,"rope_theta":10000000,"partial_rotary_factor":0.25}}' \
--mem-fraction-static 0.92 \
--served-model-name "genesis,deepseek-chat" \
--schedule-policy lpm \
--schedule-conservativeness 0.8 \
--chunked-prefill-size 16384 \
--max-prefill-tokens 65536 \
--enable-mixed-chunk \
--disable-custom-all-reduce \
--num-continuous-decode-steps 8 \
--mamba-full-memory-ratio 1.5 \
--reasoning-parser qwen3 \
--cuda-graph-max-bs 1536 \
--enable-metrics \
--fp8-gemm-backend triton \
--moe-runner-backend triton \
--watchdog-timeout 1200
```
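The --context-length and the YaRN override must agree: the extended window is the native window multiplied by the rope scaling factor. A quick sanity check (plain arithmetic, no SGLang APIs involved):

```python
# YaRN context extension: extended window = native window * scaling factor.
native_max = 262144   # original_max_position_embeddings from the override JSON
yarn_factor = 4.0     # rope_scaling "factor" from the override JSON
extended = int(native_max * yarn_factor)
assert extended == 1048576  # must match --context-length exactly
print(extended)
```

If these two flags ever drift apart, fix the override JSON first; the --context-length value is derived from it, not the other way around.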
| Parameter | Value | Why |
|---|---|---|
| --gpus "device=0,1,2,3" | GPUs 0-3 | 4x H200 for tensor parallelism |
| --tp 4 | 4-way tensor parallel | 397B model needs 4 GPUs |
| --context-length 1048576 | 1M tokens | Extended via YaRN (4x native 262K) |
| --json-model-override-args | YaRN rope_scaling | Full mRoPE preservation (mrope_section, mrope_interleaved, rope_theta, partial_rotary_factor) |
| --mem-fraction-static 0.92 | 92% GPU memory | LOCKED parameter — maximizes KV cache |
| --served-model-name "genesis,deepseek-chat" | Dual aliases | Backward compatibility with deepseek-chat API calls |
| --schedule-policy lpm | Longest Prefix Match | Optimizes for repeated context/prefix sharing |
| --schedule-conservativeness 0.8 | 0.8 | Balances throughput vs latency |
| --chunked-prefill-size 16384 | 16K tokens per chunk | Prevents OOM on long prefills |
| --max-prefill-tokens 65536 | 64K max prefill | Limits single-request prefill memory |
| --enable-mixed-chunk | Mixed chunking | Overlaps prefill and decode for throughput |
| --disable-custom-all-reduce | Disable custom AR | Uses NCCL default (more stable on H200) |
| --num-continuous-decode-steps 8 | 8 decode steps | Batch decode optimization |
| --mamba-full-memory-ratio 1.5 | 1.5x | Memory for Mamba-style attention layers |
| --reasoning-parser qwen3 | Qwen3 parser | Parses thinking/reasoning tokens correctly |
| --cuda-graph-max-bs 1536 | Max batch 1536 | CUDA graph optimization for large batches |
| --enable-metrics | Prometheus metrics | Exposed at /metrics endpoint |
| --fp8-gemm-backend triton | Triton for FP8 GEMM | Avoids DeepGEMM assertion error on H200 |
| --moe-runner-backend triton | Triton for MoE | Stable, tested backend |
| --watchdog-timeout 1200 | 20 min timeout | Allows slow startup |
| --shm-size 32g | Shared memory | Required for NCCL communication |
| --restart unless-stopped | Auto-restart | Survives reboots/crashes |
| SGLANG_DISABLE_CUDNN_CHECK=1 | Disable CuDNN check | torch 2.9.1 + CuDNN < 9.15 compatibility |
| SGLANG_ALLOW_OVERWRITE_LONGER_CONTEXT_LEN=1 | Allow extended context | REQUIRED to set context > model's native max_position_embeddings (262144) |
| Flag | Reason NOT Used |
|---|---|
| --enable-hierarchical-cache | Incompatible with Qwen3.5 hybrid attention (linear + full). Raises "HiRadixCache only supports MHA and MLA yet" |
| --disable-cuda-graph | CUDA graphs ARE enabled (default); --cuda-graph-max-bs 1536 controls the batch size limit |
```bash
docker run -d --name truthsi-llm-critic \
--gpus '"device=4,5,6,7"' \
--shm-size 32g \
--restart unless-stopped \
-v /opt/dlami/nvme/models/GLM-4.7-FP8:/model \
-p 8011:8011 \
--env SGLANG_DISABLE_CUDNN_CHECK=1 \
lmsysorg/sglang:dev-x86 \
python -m sglang.launch_server \
--model-path /model \
--tp 4 \
--port 8011 \
--host 0.0.0.0 \
--trust-remote-code \
--served-model-name "glm-4.7-fp8" \
--mem-fraction-static 0.80 \
--max-running-requests 100 \
--cuda-graph-max-bs 1024 \
--fp8-gemm-backend triton \
--moe-runner-backend triton \
--reasoning-parser glm45 \
--tool-call-parser glm47 \
--kv-cache-dtype bf16 \
--enable-metrics \
--watchdog-timeout 1200
```
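With --served-model-name, the primary answers to two aliases and the critic to one. A minimal routing sketch (illustrative only — this is NOT our production gateway) showing how the model field of an incoming OpenAI-style request maps to the two containers:

```python
# Map the `model` field of incoming OpenAI-style requests to a backend.
# Aliases mirror the --served-model-name flags of the two containers;
# the router itself is a sketch, not the production gateway.
BACKENDS = {
    "genesis": "http://localhost:8010",        # Qwen3.5 primary
    "deepseek-chat": "http://localhost:8010",  # backward-compat alias
    "glm-4.7-fp8": "http://localhost:8011",    # GLM-4.7 critic
}

def route(model_name: str) -> str:
    """Return the base URL of the container serving the requested alias."""
    try:
        return BACKENDS[model_name]
    except KeyError:
        raise ValueError(f"unknown model: {model_name!r}") from None

print(route("deepseek-chat"))
```

The dual-alias trick is why legacy callers that still send model "deepseek-chat" keep working without a gateway change.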
| Parameter | Value | Why |
|---|---|---|
| --gpus "device=4,5,6,7" | GPUs 4-7 | 4x H200 for tensor parallelism |
| --tp 4 | 4-way tensor parallel | GLM-4.7 is 355B MoE (32B active) |
| --mem-fraction-static 0.80 | 80% GPU memory | LOCKED — shares GPU 7 with NV-Embed |
| --max-running-requests 100 | Max 100 concurrent | Prevents memory overcommit |
| --cuda-graph-max-bs 1024 | Max batch 1024 | CUDA graph optimization |
| --reasoning-parser glm45 | GLM4.5 parser | Handles interleaved thinking tokens |
| --tool-call-parser glm47 | GLM4.7 tool parser | Parses tool call output format |
| --kv-cache-dtype bf16 | BFloat16 KV cache | Better precision for review tasks |
| --served-model-name "glm-4.7-fp8" | Model name | Used in API routing |
NV-Embed runs as a systemd service, NOT Docker (separate from SGLang).
- Service: genesis-nv-embed.service
- GPU: 7 (shared with GLM-4.7)
- VRAM: ~23 GB (INT8 quantized — Session 933 breakthrough)
- Embedding dimension: 4096
- Max tokens: 32768
Image: lmsysorg/sglang:dev-x86
Pull command: docker pull lmsysorg/sglang:dev-x86
| Package | Version | Notes |
|---|---|---|
| torch | 2.9.1+cu129 | MUST be cu129 (NOT cu128!) |
| flashinfer-python | 0.6.4 | MUST be 0.6.4 (NOT 0.6.3!) |
| flashinfer-cubin | 0.6.4 | Pre-compiled CUDA binaries |
| flashinfer-jit-cache | 0.6.4+cu129 | JIT kernel cache (CRITICAL) |
| sgl-fa4 | 4.0.3 | Flash Attention 4 (NOT 4.0.5!) |
| sgl-kernel | 0.3.21 | SGLang CUDA kernels |
| triton | 3.5.1 | Triton compiler |
| cuda-bindings | 12.9.5 | CUDA 12.9 bindings |
Saved to: /mnt/data/truth-si-dev-env/docs/BLESSED_DOCKER_PIP_FREEZE.txt
```bash
docker ps --filter name=truthsi-llm
docker start truthsi-llm-primary
docker start truthsi-llm-critic
```

Copy-paste the Docker launch commands above.

```bash
docker pull lmsysorg/sglang:dev-x86
```

Then recreate containers.

Model files are on /opt/dlami/nvme (EPHEMERAL!).
Restore: bash scripts/restore-models.sh

```bash
curl -s http://localhost:8010/v1/models   # Qwen3.5
curl -s http://localhost:8011/v1/models   # GLM-4.7
```
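The curl checks above return an OpenAI-compatible model list. A small parser sketch (assuming the standard {"data":[{"id":...}]} response shape) that turns that output into a pass/fail health check:

```python
import json

def served_models(body: str) -> set[str]:
    """Extract model ids from an OpenAI-compatible /v1/models response."""
    return {entry["id"] for entry in json.loads(body).get("data", [])}

def healthy(body: str, expected: set[str]) -> bool:
    """True when every expected alias is being served."""
    return expected <= served_models(body)

# Example payload in the shape the curl commands above return:
sample = '{"object":"list","data":[{"id":"genesis"},{"id":"deepseek-chat"}]}'
print(healthy(sample, {"genesis", "deepseek-chat"}))
```

Checking for the specific aliases (not just an HTTP 200) catches the failure mode where a container is up but serving under the wrong --served-model-name.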
Location: scripts/restore-models.sh
This script contains the EXACT same flags as the running Docker containers. It is the authoritative restore procedure.
Created by THE ARCHITECT — Session 951
Updated by THE ARCHITECT — Session 968 (synced with actual running Docker containers)
THIS DOCUMENT IS SACRED. NEVER DELETE. ALWAYS UPDATE.