Fine-Tune 70B+ Models Across Fragmented GPUs. Over Standard Internet.
Zagora splits models across GPUs using pipeline parallelism, moving roughly 4000× less network data per step than DDP. No InfiniBand. No matching hardware. No cluster required.
13 model families · 4 PEFT techniques · SFT + DPO · Fault-tolerant by default
The Universal Pipeline Engine
One platform. Any model. Any PEFT method. Any training task.
13 Model Families
GPT-OSS, Llama 3, Qwen, DeepSeek-R1-Distill, Mistral, Mixtral, and more.
4 PEFT Techniques
LoRA, DoRA, IA3, or VeRA. Switch with a single flag. Future techniques work automatically via PyTorch's requires_grad graph.
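The `requires_grad` mechanism referenced above can be sketched with a minimal LoRA wrapper. Names like `LoRALinear` are illustrative, not Zagora's API: the point is that once the base weight is frozen, any optimizer or pipeline that filters on `p.requires_grad` picks up the adapter parameters automatically, which is why new PEFT techniques need no special-casing.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Minimal LoRA sketch: frozen base weight + trainable low-rank delta."""
    def __init__(self, base: nn.Linear, rank: int = 8):
        super().__init__()
        self.base = base
        self.base.weight.requires_grad = False   # freeze pretrained weight
        if self.base.bias is not None:
            self.base.bias.requires_grad = False
        # low-rank adapters: A maps d_in -> r, B maps r -> d_out
        self.lora_a = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.lora_b = nn.Parameter(torch.zeros(base.out_features, rank))

    def forward(self, x):
        return self.base(x) + x @ self.lora_a.T @ self.lora_b.T

layer = LoRALinear(nn.Linear(16, 16))
# Anything that filters on requires_grad sees only the adapter params.
trainable = [n for n, p in layer.named_parameters() if p.requires_grad]
print(trainable)  # only the LoRA adapters remain trainable
```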
SFT + DPO Training
Supervised fine-tuning and Direct Preference Optimization. DPO reference logprobs are pre-computed offline. No frozen reference copy. No 2× VRAM.
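The offline-reference trick can be sketched with the standard DPO objective. This is an illustration, not Zagora's implementation: the reference log-probabilities arrive as plain tensors computed in a one-time offline pass, so no frozen reference model has to sit in VRAM next to the policy.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """Standard DPO objective. The ref_* tensors are loaded from an
    offline pre-computation pass, so the frozen reference model is
    never resident in memory during training."""
    chosen_ratio = policy_chosen_logps - ref_chosen_logps
    rejected_ratio = policy_rejected_logps - ref_rejected_logps
    return -F.logsigmoid(beta * (chosen_ratio - rejected_ratio)).mean()

# Sanity check: while the policy still matches the reference,
# both ratios are zero and the loss is exactly log(2) ≈ 0.6931.
loss = dpo_loss(torch.tensor([-4.2]), torch.tensor([-5.0]),
                torch.tensor([-4.2]), torch.tensor([-5.0]))
print(loss.item())  # ≈ 0.6931
```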
How Zagora Works
Pipeline parallelism over commodity networks. Fault-tolerant by design.
Pipeline Split
The model is divided into contiguous groups of transformer blocks and distributed across the GPUs in proportion to each node's compute capacity.
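One simple way to realize a capacity-proportional split is largest-remainder rounding over the block count. Whether Zagora uses exactly this scheme is an assumption; the sketch below just illustrates the idea with made-up TFLOPS figures.

```python
def split_blocks(num_blocks, capacities):
    """Assign contiguous transformer-block ranges to nodes in proportion
    to each node's relative compute capacity (largest-remainder rounding).
    Illustrative only -- not Zagora's actual partitioner."""
    total = sum(capacities)
    shares = [num_blocks * c / total for c in capacities]
    counts = [int(s) for s in shares]
    # hand leftover blocks to the largest fractional remainders
    leftovers = num_blocks - sum(counts)
    order = sorted(range(len(shares)),
                   key=lambda i: shares[i] - counts[i], reverse=True)
    for i in order[:leftovers]:
        counts[i] += 1
    # convert per-node counts into contiguous (start, end) block ranges
    ranges, start = [], 0
    for c in counts:
        ranges.append((start, start + c))
        start += c
    return ranges

# e.g. 80 transformer blocks over one fast GPU and two slower ones
# (capacities are hypothetical relative-throughput numbers)
print(split_blocks(80, [82, 36, 36]))
```

The faster node receives proportionally more blocks, and every range is contiguous, so activations only cross the network at the two split points.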
Boundary Activations
Each step, nodes exchange only the activations at block boundaries, not full gradients. This is why Zagora works over standard internet.
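A back-of-envelope calculation makes the traffic gap concrete. The shapes below are assumptions chosen to be roughly Llama-70B-like (8192 hidden size, 2048-token sequences, bf16), not measured Zagora numbers; the exact ratio depends on batch size and sequence length.

```python
# Per-step network traffic: DDP gradient sync vs. one boundary activation.
# All shapes below are illustrative assumptions, not measurements.
params = 70e9             # trainable parameters (70B model)
hidden = 8192             # hidden size
micro_batch, seq = 1, 2048
bytes_per = 2             # fp16 / bf16

ddp_bytes = params * bytes_per                            # full gradient all-reduce
boundary_bytes = micro_batch * seq * hidden * bytes_per   # one activation tensor

ratio = ddp_bytes / boundary_bytes
print(f"DDP per step:     {ddp_bytes / 1e9:.0f} GB")
print(f"Boundary per hop: {boundary_bytes / 1e6:.0f} MB")
print(f"ratio: {ratio:,.0f}x")
```

Under these assumptions a boundary hop moves tens of megabytes while DDP moves the full 140 GB gradient, which is where a ~4000× figure comes from.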
Local Training
Each node runs forward and backward passes on its assigned blocks with a local optimizer.
Fault-Tolerance
Nodes checkpoint on interval. Crashed nodes restart and resume from the exact global training step. No lost progress.
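A minimal sketch of the checkpoint-and-resume loop, assuming a simple pickle-to-disk store. Zagora's actual format and its S3/DHT layers are not shown; the key idea is that a restarted node reloads the saved global step and continues from exactly there.

```python
import os
import pickle
import tempfile

def save_checkpoint(path, global_step, model_state, optim_state):
    """Atomically persist everything needed to resume at the exact step."""
    tmp = path + ".tmp"
    with open(tmp, "wb") as f:
        pickle.dump({"global_step": global_step,
                     "model": model_state,
                     "optim": optim_state}, f)
    os.replace(tmp, path)   # atomic rename: a crash never leaves a torn file

def resume(path):
    """A restarted node reloads state and continues from the saved step."""
    if not os.path.exists(path):
        return 0, None, None   # fresh start
    with open(path, "rb") as f:
        ckpt = pickle.load(f)
    return ckpt["global_step"], ckpt["model"], ckpt["optim"]

path = os.path.join(tempfile.mkdtemp(), "stage0.ckpt")
save_checkpoint(path, global_step=1200,
                model_state={"w": [0.1]}, optim_state={"lr": 2e-4})
step, model_state, optim_state = resume(path)
print(step)  # training resumes at step 1200, no lost progress
```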
Zagora vs. Distributed Frameworks
What becomes possible when training doesn't require matching hardware or InfiniBand.
| | DDP | FSDP | Zagora |
|---|---|---|---|
| Mixed GPU Support | No - identical GPUs required | No - identical GPUs required | Yes. Consumer and datacenter GPUs in the same job |
| Network Requirement | InfiniBand (50+ GB/s) | InfiniBand (50+ GB/s) | Standard 1 Gbps internet |
| VRAM to Train 70B (total across GPUs) | 1.12 TB | ~300 GB | ~52 GB (~13 GB per GPU × 4 consumer GPUs) |
| Fault Tolerance | None - one node dies, job dies | None | Checkpoint recovery + DHT self-healing + S3 durability |
| Model Coverage | Any (if you have the hardware) | Any (if you have the hardware) | 13 families, auto-registered from HuggingFace |
| PEFT Techniques | Any (manual setup) | Any (manual setup) | LoRA, DoRA, IA3, VeRA |
| DPO Training | Yes (requires 2× VRAM for reference model) | Yes (requires 2× VRAM) | Yes - reference pre-computed offline, no 2× VRAM |
Bring Your Own GPUs
You have the hardware. Zagora turns your fragmented fleet into a unified training cluster. No new purchases. No matching hardware.
Use Zagora Compute
We're building local compute partnerships to deliver low-latency training at a fraction of cloud cost. Pricing coming soon.
Start Training

Our Approach
Zagora optimizes for cost-per-completed-experiment, not raw single-node speed. Pipeline parallelism trades peak throughput for network efficiency, hardware flexibility, and fault tolerance. We believe that's the right tradeoff for teams that don't have homogeneous A100 clusters, and we're transparent about it.