Fine-Tune 70B+ Models Across Fragmented GPUs. Over Standard Internet.
Zagora splits models across GPUs using pipeline parallelism, moving roughly 4000× less network data per step than DDP. No InfiniBand. No matching hardware. No cluster required.
13 model families · 4 PEFT techniques · SFT + DPO · Fault-tolerant by default
The Universal Pipeline Engine
One platform. Any model. Any PEFT method. Any training task.
13 Model Families
GPT-OSS, Llama 3, Qwen, DeepSeek-R1-Distill, Mistral, Mixtral, and more.
4 PEFT Techniques
LoRA, DoRA, IA3, or VeRA. Switch with a single flag. Future techniques work automatically via PyTorch's requires_grad graph.
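The `requires_grad` mechanism referenced above can be sketched with a minimal LoRA wrapper. Names like `LoRALinear` are illustrative, not Zagora's API: the point is that once the base weight is frozen, any optimizer or pipeline that filters on `p.requires_grad` picks up the adapter parameters automatically, which is why new PEFT techniques need no special-casing.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Minimal LoRA sketch: frozen base weight + trainable low-rank delta."""
    def __init__(self, base: nn.Linear, rank: int = 8):
        super().__init__()
        self.base = base
        self.base.weight.requires_grad = False   # freeze pretrained weight
        if self.base.bias is not None:
            self.base.bias.requires_grad = False
        # low-rank adapters: A maps d_in -> r, B maps r -> d_out
        self.lora_a = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.lora_b = nn.Parameter(torch.zeros(base.out_features, rank))

    def forward(self, x):
        return self.base(x) + x @ self.lora_a.T @ self.lora_b.T

layer = LoRALinear(nn.Linear(16, 16))
# Anything that filters on requires_grad sees only the adapter params.
trainable = [n for n, p in layer.named_parameters() if p.requires_grad]
print(trainable)  # only the LoRA adapters remain trainable
```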
SFT + DPO Training
Supervised fine-tuning and Direct Preference Optimization. DPO reference logprobs are pre-computed offline. No frozen reference copy. No 2× VRAM.
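The offline-reference trick can be sketched with the standard DPO objective. This is an illustration, not Zagora's implementation: the reference log-probabilities arrive as plain tensors computed in a one-time offline pass, so no frozen reference model has to sit in VRAM next to the policy.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """Standard DPO objective. The ref_* tensors are loaded from an
    offline pre-computation pass, so the frozen reference model is
    never resident in memory during training."""
    chosen_ratio = policy_chosen_logps - ref_chosen_logps
    rejected_ratio = policy_rejected_logps - ref_rejected_logps
    return -F.logsigmoid(beta * (chosen_ratio - rejected_ratio)).mean()

# Sanity check: while the policy still matches the reference,
# both ratios are zero and the loss is exactly log(2) ≈ 0.6931.
loss = dpo_loss(torch.tensor([-4.2]), torch.tensor([-5.0]),
                torch.tensor([-4.2]), torch.tensor([-5.0]))
print(loss.item())  # ≈ 0.6931
```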
How Zagora Works
Pipeline parallelism over commodity networks. Fault-tolerant by design.
Pipeline Split
The model is divided into contiguous groups of transformer blocks and distributed across the GPUs in proportion to each node's compute capacity.
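One simple way to realize a capacity-proportional split is largest-remainder rounding over the block count. Whether Zagora uses exactly this scheme is an assumption; the sketch below just illustrates the idea with made-up TFLOPS figures.

```python
def split_blocks(num_blocks, capacities):
    """Assign contiguous transformer-block ranges to nodes in proportion
    to each node's relative compute capacity (largest-remainder rounding).
    Illustrative only -- not Zagora's actual partitioner."""
    total = sum(capacities)
    shares = [num_blocks * c / total for c in capacities]
    counts = [int(s) for s in shares]
    # hand leftover blocks to the largest fractional remainders
    leftovers = num_blocks - sum(counts)
    order = sorted(range(len(shares)),
                   key=lambda i: shares[i] - counts[i], reverse=True)
    for i in order[:leftovers]:
        counts[i] += 1
    # convert per-node counts into contiguous (start, end) block ranges
    ranges, start = [], 0
    for c in counts:
        ranges.append((start, start + c))
        start += c
    return ranges

# e.g. 80 transformer blocks over one fast GPU and two slower ones
# (capacities are hypothetical relative-throughput numbers)
print(split_blocks(80, [82, 36, 36]))
```

The faster node receives proportionally more blocks, and every range is contiguous, so activations only cross the network at the two split points.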
Boundary Activations
Each step, nodes exchange only the activations at block boundaries, not full gradients. This is why Zagora works over standard internet.
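A back-of-envelope calculation makes the traffic gap concrete. The shapes below are assumptions chosen to be roughly Llama-70B-like (8192 hidden size, 2048-token sequences, bf16), not measured Zagora numbers; the exact ratio depends on batch size and sequence length.

```python
# Per-step network traffic: DDP gradient sync vs. one boundary activation.
# All shapes below are illustrative assumptions, not measurements.
params = 70e9             # trainable parameters (70B model)
hidden = 8192             # hidden size
micro_batch, seq = 1, 2048
bytes_per = 2             # fp16 / bf16

ddp_bytes = params * bytes_per                            # full gradient all-reduce
boundary_bytes = micro_batch * seq * hidden * bytes_per   # one activation tensor

ratio = ddp_bytes / boundary_bytes
print(f"DDP per step:     {ddp_bytes / 1e9:.0f} GB")
print(f"Boundary per hop: {boundary_bytes / 1e6:.0f} MB")
print(f"ratio: {ratio:,.0f}x")
```

Under these assumptions a boundary hop moves tens of megabytes while DDP moves the full 140 GB gradient, which is where a ~4000× figure comes from.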
Local Training
Each node runs forward and backward passes on its assigned blocks with a local optimizer.
Fault-Tolerance
Nodes checkpoint on interval. Crashed nodes restart and resume from the exact global training step. No lost progress.
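A minimal sketch of the checkpoint-and-resume loop, assuming a simple pickle-to-disk store. Zagora's actual format and its S3/DHT layers are not shown; the key idea is that a restarted node reloads the saved global step and continues from exactly there.

```python
import os
import pickle
import tempfile

def save_checkpoint(path, global_step, model_state, optim_state):
    """Atomically persist everything needed to resume at the exact step."""
    tmp = path + ".tmp"
    with open(tmp, "wb") as f:
        pickle.dump({"global_step": global_step,
                     "model": model_state,
                     "optim": optim_state}, f)
    os.replace(tmp, path)   # atomic rename: a crash never leaves a torn file

def resume(path):
    """A restarted node reloads state and continues from the saved step."""
    if not os.path.exists(path):
        return 0, None, None   # fresh start
    with open(path, "rb") as f:
        ckpt = pickle.load(f)
    return ckpt["global_step"], ckpt["model"], ckpt["optim"]

path = os.path.join(tempfile.mkdtemp(), "stage0.ckpt")
save_checkpoint(path, global_step=1200,
                model_state={"w": [0.1]}, optim_state={"lr": 2e-4})
step, model_state, optim_state = resume(path)
print(step)  # training resumes at step 1200, no lost progress
```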
Zagora vs. Distributed Frameworks
What becomes possible when training doesn't require matching hardware or InfiniBand.
| | DDP | FSDP | Zagora |
|---|---|---|---|
| Mixed GPU Support | No - identical GPUs required | No - identical GPUs required | Yes. Consumer and datacenter GPUs in the same job |
| Network Requirement | InfiniBand (50+ GB/s) | InfiniBand (50+ GB/s) | Standard 1 Gbps internet |
| VRAM to Train 70B (total across GPUs) | 1.12 TB | ~300 GB | ~52 GB (~13 GB per GPU × 4 consumer GPUs) |
| Fault Tolerance | None - one node dies, job dies | None | Checkpoint recovery + DHT self-healing + S3 durability |
| Model Coverage | Any (if you have the hardware) | Any (if you have the hardware) | 13 families, auto-registered from HuggingFace |
| PEFT Techniques | Any (manual setup) | Any (manual setup) | LoRA, DoRA, IA3, VeRA |
| DPO Training | Yes (requires 2× VRAM for reference model) | Yes (requires 2× VRAM) | Yes - reference pre-computed offline, no 2× VRAM |
Bring Your Own GPUs
You have the hardware. Zagora turns your fragmented fleet into a unified training cluster. No new purchases. No matching hardware.
Use Zagora Compute
We're building local compute partnerships to deliver low-latency training at a fraction of cloud cost. Pricing coming soon.
Start Training

Our Approach
Zagora optimizes for cost-per-completed-experiment, not raw single-node speed. Pipeline parallelism trades peak throughput for network efficiency, hardware flexibility, and fault tolerance. We believe that's the right tradeoff for teams that don't have homogeneous A100 clusters, and we're transparent about it.