Created: Session 951 (2026-03-12)
Updated: Session 968 (2026-03-14) — Synced with ACTUAL running Docker containers
Cost of this lesson: $2,000+ over 3 days of downtime
Root cause: Wrong Python packages in manual venv (torch cu128 vs cu129, missing flashinfer-jit-cache, wrong sgl-fa4 version)
Solution: Run models from official Docker image lmsysorg/sglang:dev-x86
NEVER run SGLang from a manual venv again. ALWAYS use the official Docker image.
The Docker image lmsysorg/sglang:dev-x86 contains the EXACT blessed package set that SGLang developers test with. Our manual venv had:
- torch==2.9.1+cu128 (WRONG — should be cu129)
- flashinfer-python==0.6.3 (WRONG — should be 0.6.4)
- Missing flashinfer-jit-cache==0.6.4+cu129 entirely
- sgl-fa4==4.0.5 (WRONG — should be 4.0.3)
These mismatches caused SIGSEGV (segfault) during CUDA graph capture EVERY time.
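Drift like this is mechanical to detect before it costs three days. A minimal sketch (illustrative helper names, not part of the restore tooling) that diffs a `pip freeze` dump against the blessed pins documented further below:

```python
# Compare a pip-freeze dump against the blessed package pins.
# The pinned versions below are the blessed set documented in this file.
BLESSED = {
    "torch": "2.9.1+cu129",
    "flashinfer-python": "0.6.4",
    "flashinfer-jit-cache": "0.6.4+cu129",
    "sgl-fa4": "4.0.3",
}

def parse_freeze(text: str) -> dict:
    """Parse 'name==version' lines from `pip freeze` output."""
    pins = {}
    for line in text.splitlines():
        line = line.strip()
        if "==" in line and not line.startswith("#"):
            name, _, version = line.partition("==")
            pins[name] = version
    return pins

def find_mismatches(installed: dict) -> list[str]:
    """Return human-readable problems: wrong versions and missing packages."""
    problems = []
    for name, want in BLESSED.items():
        have = installed.get(name)
        if have is None:
            problems.append(f"{name}: MISSING (want {want})")
        elif have != want:
            problems.append(f"{name}: {have} (want {want})")
    return problems

# The broken venv from this incident:
broken = parse_freeze("""\
torch==2.9.1+cu128
flashinfer-python==0.6.3
sgl-fa4==4.0.5
""")
for problem in find_mismatches(broken):
    print(problem)
```

Running this against the incident venv flags all four problems, including the entirely missing flashinfer-jit-cache.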
```bash
docker run -d --name truthsi-llm-primary \
--gpus '"device=0,1,2,3"' \
--shm-size 32g \
--restart unless-stopped \
-v /opt/dlami/nvme/models/Qwen3.5-397B-A17B-FP8:/model \
-p 8010:8010 \
--env SGLANG_DISABLE_CUDNN_CHECK=1 \
--env SGLANG_ALLOW_OVERWRITE_LONGER_CONTEXT_LEN=1 \
lmsysorg/sglang:dev-x86 \
python -m sglang.launch_server \
--model-path /model \
--tp 4 \
--port 8010 \
--host 0.0.0.0 \
--trust-remote-code \
--context-length 1048576 \
--json-model-override-args '{"rope_scaling":{"rope_type":"yarn","factor":4.0,"original_max_position_embeddings":262144,"mrope_section":[11,11,10],"mrope_interleaved":true,"rope_theta":10000000,"partial_rotary_factor":0.25}}' \
--mem-fraction-static 0.92 \
--served-model-name "genesis,deepseek-chat" \
--schedule-policy lpm \
--schedule-conservativeness 0.8 \
--chunked-prefill-size 16384 \
--max-prefill-tokens 65536 \
--enable-mixed-chunk \
--disable-custom-all-reduce \
--num-continuous-decode-steps 8 \
--mamba-full-memory-ratio 1.5 \
--reasoning-parser qwen3 \
--cuda-graph-max-bs 1536 \
--enable-metrics \
--fp8-gemm-backend triton \
--moe-runner-backend triton \
--watchdog-timeout 1200
```
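The --context-length and the YaRN override must agree: the extended window is the native window multiplied by the rope scaling factor. A quick sanity check (plain arithmetic, no SGLang APIs involved):

```python
# YaRN context extension: extended window = native window * scaling factor.
native_max = 262144   # original_max_position_embeddings from the override JSON
yarn_factor = 4.0     # rope_scaling "factor" from the override JSON
extended = int(native_max * yarn_factor)
assert extended == 1048576  # must match --context-length exactly
print(extended)
```

If these two flags ever drift apart, fix the override JSON first; the --context-length value is derived from it, not the other way around.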
| Parameter | Value | Why |
|---|---|---|
| --gpus "device=0,1,2,3" | GPUs 0-3 | 4x H200 for tensor parallelism |
| --tp 4 | 4-way tensor parallel | 397B model needs 4 GPUs |
| --context-length 1048576 | 1M tokens | Extended via YaRN (4x native 262K) |
| --json-model-override-args | YaRN rope_scaling | Full mRoPE preservation (mrope_section, mrope_interleaved, rope_theta, partial_rotary_factor) |
| --mem-fraction-static 0.92 | 92% GPU memory | LOCKED parameter — maximizes KV cache |
| --served-model-name "genesis,deepseek-chat" | Dual aliases | Backward compatibility with deepseek-chat API calls |
| --schedule-policy lpm | Longest Prefix Match | Optimizes for repeated context/prefix sharing |
| --schedule-conservativeness 0.8 | 0.8 | Balances throughput vs latency |
| --chunked-prefill-size 16384 | 16K tokens per chunk | Prevents OOM on long prefills |
| --max-prefill-tokens 65536 | 64K max prefill | Limits single-request prefill memory |
| --enable-mixed-chunk | Mixed chunking | Overlaps prefill and decode for throughput |
| --disable-custom-all-reduce | Disable custom AR | Uses NCCL default (more stable on H200) |
| --num-continuous-decode-steps 8 | 8 decode steps | Batch decode optimization |
| --mamba-full-memory-ratio 1.5 | 1.5x | Memory for Mamba-style attention layers |
| --reasoning-parser qwen3 | Qwen3 parser | Parses thinking/reasoning tokens correctly |
| --cuda-graph-max-bs 1536 | Max batch 1536 | CUDA graph optimization for large batches |
| --enable-metrics | Prometheus metrics | Exposed at /metrics endpoint |
| --fp8-gemm-backend triton | Triton for FP8 GEMM | Avoids DeepGEMM assertion error on H200 |
| --moe-runner-backend triton | Triton for MoE | Stable, tested backend |
| --watchdog-timeout 1200 | 20 min timeout | Allows slow startup |
| --shm-size 32g | Shared memory | Required for NCCL communication |
| --restart unless-stopped | Auto-restart | Survives reboots/crashes |
| SGLANG_DISABLE_CUDNN_CHECK=1 | Disable CuDNN check | torch 2.9.1 + CuDNN < 9.15 compatibility |
| SGLANG_ALLOW_OVERWRITE_LONGER_CONTEXT_LEN=1 | Allow extended context | REQUIRED to set context > model's native max_position_embeddings (262144) |
| Flag | Reason NOT Used |
|---|---|
| --enable-hierarchical-cache | Incompatible with Qwen3.5 hybrid attention (linear + full). Raises "HiRadixCache only supports MHA and MLA yet" |
| --disable-cuda-graph | CUDA graphs ARE enabled (default); --cuda-graph-max-bs 1536 controls the batch size limit |
```bash
docker run -d --name truthsi-llm-critic \
--gpus '"device=4,5,6,7"' \
--shm-size 32g \
--restart unless-stopped \
-v /opt/dlami/nvme/models/GLM-4.7-FP8:/model \
-p 8011:8011 \
--env SGLANG_DISABLE_CUDNN_CHECK=1 \
lmsysorg/sglang:dev-x86 \
python -m sglang.launch_server \
--model-path /model \
--tp 4 \
--port 8011 \
--host 0.0.0.0 \
--trust-remote-code \
--served-model-name "glm-4.7-fp8" \
--mem-fraction-static 0.80 \
--max-running-requests 100 \
--cuda-graph-max-bs 1024 \
--fp8-gemm-backend triton \
--moe-runner-backend triton \
--reasoning-parser glm45 \
--tool-call-parser glm47 \
--kv-cache-dtype bf16 \
--enable-metrics \
--watchdog-timeout 1200
```
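With --served-model-name, the primary answers to two aliases and the critic to one. A minimal routing sketch (illustrative only — this is NOT our production gateway) showing how the model field of an incoming OpenAI-style request maps to the two containers:

```python
# Map the `model` field of incoming OpenAI-style requests to a backend.
# Aliases mirror the --served-model-name flags of the two containers;
# the router itself is a sketch, not the production gateway.
BACKENDS = {
    "genesis": "http://localhost:8010",        # Qwen3.5 primary
    "deepseek-chat": "http://localhost:8010",  # backward-compat alias
    "glm-4.7-fp8": "http://localhost:8011",    # GLM-4.7 critic
}

def route(model_name: str) -> str:
    """Return the base URL of the container serving the requested alias."""
    try:
        return BACKENDS[model_name]
    except KeyError:
        raise ValueError(f"unknown model: {model_name!r}") from None

print(route("deepseek-chat"))
```

The dual-alias trick is why legacy callers that still send model "deepseek-chat" keep working without a gateway change.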
| Parameter | Value | Why |
|---|---|---|
| --gpus "device=4,5,6,7" | GPUs 4-7 | 4x H200 for tensor parallelism |
| --tp 4 | 4-way tensor parallel | GLM-4.7 is 355B MoE (32B active) |
| --mem-fraction-static 0.80 | 80% GPU memory | LOCKED — shares GPU 7 with NV-Embed |
| --max-running-requests 100 | Max 100 concurrent | Prevents memory overcommit |
| --cuda-graph-max-bs 1024 | Max batch 1024 | CUDA graph optimization |
| --reasoning-parser glm45 | GLM4.5 parser | Handles interleaved thinking tokens |
| --tool-call-parser glm47 | GLM4.7 tool parser | Parses tool call output format |
| --kv-cache-dtype bf16 | BFloat16 KV cache | Better precision for review tasks |
| --served-model-name "glm-4.7-fp8" | Model name | Used in API routing |
NV-Embed runs as a systemd service, NOT Docker (separate from SGLang).
- Service: genesis-nv-embed.service
- GPU: 7 (shared with GLM-4.7)
- VRAM: ~23 GB (INT8 quantized — Session 933 breakthrough)
- Embedding dimension: 4096
- Max tokens: 32768
Image: lmsysorg/sglang:dev-x86
Pull command: docker pull lmsysorg/sglang:dev-x86
| Package | Version | Notes |
|---|---|---|
| torch | 2.9.1+cu129 | MUST be cu129 (NOT cu128!) |
| flashinfer-python | 0.6.4 | MUST be 0.6.4 (NOT 0.6.3!) |
| flashinfer-cubin | 0.6.4 | Pre-compiled CUDA binaries |
| flashinfer-jit-cache | 0.6.4+cu129 | JIT kernel cache (CRITICAL) |
| sgl-fa4 | 4.0.3 | Flash Attention 4 (NOT 4.0.5!) |
| sgl-kernel | 0.3.21 | SGLang CUDA kernels |
| triton | 3.5.1 | Triton compiler |
| cuda-bindings | 12.9.5 | CUDA 12.9 bindings |
Saved to: /mnt/data/truth-si-dev-env/docs/BLESSED_DOCKER_PIP_FREEZE.txt
```bash
docker ps --filter name=truthsi-llm
docker start truthsi-llm-primary
docker start truthsi-llm-critic
```

Copy-paste the Docker launch commands above.

```bash
docker pull lmsysorg/sglang:dev-x86
```

Then recreate containers.

Model files are on /opt/dlami/nvme (EPHEMERAL!).
Restore: bash scripts/restore-models.sh

```bash
curl -s http://localhost:8010/v1/models   # Qwen3.5
curl -s http://localhost:8011/v1/models   # GLM-4.7
```
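The curl checks above return an OpenAI-compatible model list. A small parser sketch (assuming the standard {"data":[{"id":...}]} response shape) that turns that output into a pass/fail health check:

```python
import json

def served_models(body: str) -> set[str]:
    """Extract model ids from an OpenAI-compatible /v1/models response."""
    return {entry["id"] for entry in json.loads(body).get("data", [])}

def healthy(body: str, expected: set[str]) -> bool:
    """True when every expected alias is being served."""
    return expected <= served_models(body)

# Example payload in the shape the curl commands above return:
sample = '{"object":"list","data":[{"id":"genesis"},{"id":"deepseek-chat"}]}'
print(healthy(sample, {"genesis", "deepseek-chat"}))
```

Checking for the specific aliases (not just an HTTP 200) catches the failure mode where a container is up but serving under the wrong --served-model-name.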
Location: scripts/restore-models.sh
This script contains the EXACT same flags as the running Docker containers. It is the authoritative restore procedure.
Created by THE ARCHITECT — Session 951
Updated by THE ARCHITECT — Session 968 (synced with actual running Docker containers)
THIS DOCUMENT IS SACRED. NEVER DELETE. ALWAYS UPDATE.