Fine-Tune 70B+ Models Across Fragmented GPUs. Over Standard Internet.

    Zagora splits models across GPUs using pipeline parallelism, moving 4000× less network data than DDP. No InfiniBand. No matching hardware. No cluster required.

    Start Training

    13 model families · 4 PEFT techniques · SFT + DPO · Fault-tolerant by default

    Scroll to explore

    The Universal Pipeline Engine

    One platform. Any model. Any PEFT method. Any training task.

    13 Model Families

    GPT-OSS, Llama 3, Qwen, DeepSeek-R1-Distill, Mistral, Mixtral, and more.

    4 PEFT Techniques

    LoRA, DoRA, IA3, or VeRA. Switch with a single flag. Future techniques work automatically via PyTorch's requires_grad graph.
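    The requires_grad property is easy to illustrate. Below is a minimal pure-PyTorch sketch (our illustration, not Zagora's code) of the one thing all four techniques share: base weights are frozen and only adapter parameters reach the optimizer, so any adapter method that follows this convention trains without special handling.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

class LoRALinear(nn.Module):
    """A frozen base Linear plus a trainable low-rank adapter (LoRA-style)."""
    def __init__(self, in_features, out_features, rank=4):
        super().__init__()
        self.base = nn.Linear(in_features, out_features)
        self.base.weight.requires_grad_(False)  # freeze base weights
        self.base.bias.requires_grad_(False)
        self.lora_a = nn.Parameter(torch.randn(rank, in_features) * 0.01)
        self.lora_b = nn.Parameter(torch.zeros(out_features, rank))

    def forward(self, x):
        return self.base(x) + x @ self.lora_a.T @ self.lora_b.T

layer = LoRALinear(16, 16)
# Only parameters with requires_grad=True reach the optimizer,
# regardless of which adapter technique created them.
trainable = [p for p in layer.parameters() if p.requires_grad]
opt = torch.optim.SGD(trainable, lr=0.1)

loss = layer(torch.randn(2, 16)).pow(2).mean()
loss.backward()
opt.step()
```

Swapping LoRA for DoRA, IA3, or VeRA only changes how the adapter computes its delta; the freeze-and-filter pattern above stays the same.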

    SFT + DPO Training

    Supervised fine-tuning and Direct Preference Optimization. DPO reference logprobs are pre-computed offline. No frozen reference copy. No 2× VRAM.
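    The VRAM savings follow from the structure of the DPO loss: the frozen reference model only contributes fixed log-probabilities, so they can be computed once offline and loaded as plain tensors. A minimal sketch of the objective (function name and shapes are our own illustration, not Zagora's API):

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """DPO loss. The ref_* tensors are precomputed offline, so no
    frozen reference copy of the model occupies VRAM during training."""
    chosen_ratio = policy_chosen_logps - ref_chosen_logps
    rejected_ratio = policy_rejected_logps - ref_rejected_logps
    return -F.logsigmoid(beta * (chosen_ratio - rejected_ratio)).mean()

# When the policy prefers the chosen response more than the reference
# does, the loss drops below -log(sigmoid(0)) ≈ 0.693.
loss = dpo_loss(torch.tensor([-5.0]), torch.tensor([-9.0]),
                torch.tensor([-6.0]), torch.tensor([-6.0]))
```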

    How Zagora Works

    Pipeline parallelism over commodity networks. Fault-tolerant by design.

    1

    Pipeline Split

    The model is split at transformer-block boundaries and distributed across the GPUs, with each node's share proportional to its compute capacity.
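    A proportional split can be sketched in a few lines (a simplified illustration under our own assumptions, not Zagora's actual scheduler; the capacity numbers are rough relative FP16 throughputs):

```python
def split_blocks(num_blocks, capacities):
    """Assign blocks to nodes proportional to capacity, using
    largest-remainder rounding so every block is placed exactly once."""
    total = sum(capacities)
    shares = [num_blocks * c / total for c in capacities]
    counts = [int(s) for s in shares]
    # Hand leftover blocks to the largest fractional remainders.
    leftovers = num_blocks - sum(counts)
    order = sorted(range(len(shares)),
                   key=lambda i: shares[i] - counts[i], reverse=True)
    for i in order[:leftovers]:
        counts[i] += 1
    return counts

# An 80-block model over an RTX 4090, an RTX 3090, and an A100
# (illustrative relative capacities):
print(split_blocks(80, [330, 142, 312]))  # [34, 14, 32]
```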

    2

    Boundary Activations

    Each step, nodes exchange only the activations at block boundaries, not full gradients. That is why Zagora works over standard internet.
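    Back-of-envelope arithmetic, using assumed sizes for a 70B-class model (our illustration, not Zagora's published methodology), shows why boundary activations are so much cheaper than a DDP gradient sync:

```python
hidden = 8192      # hidden size typical of a 70B-class model (assumed)
seq_len = 2048     # tokens per microbatch (assumed)
params = 70e9      # parameter count
bytes_fp16 = 2     # bytes per value in half precision

# One pipeline boundary carries one microbatch of activations;
# DDP must all-reduce a gradient for every parameter.
activations = seq_len * hidden * bytes_fp16
gradients = params * bytes_fp16

print(f"boundary activations: {activations / 1e6:.0f} MB")  # ~34 MB
print(f"DDP gradient sync:    {gradients / 1e9:.0f} GB")    # 140 GB
print(f"ratio: {gradients / activations:.0f}x")             # ~4000x
```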

    3

    Local Training

    Each node runs forward and backward passes on its assigned blocks with a local optimizer.
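    What each stage does can be simulated in a single process (a toy sketch with illustrative shapes; real stages run on separate machines and the "network hop" is an actual send/receive):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

stage1, stage2 = nn.Linear(8, 8), nn.Linear(8, 2)
opt1 = torch.optim.SGD(stage1.parameters(), lr=0.1)  # local optimizers:
opt2 = torch.optim.SGD(stage2.parameters(), lr=0.1)  # no global all-reduce

x = torch.randn(4, 8)

# Stage 1 forward: only the boundary activation goes downstream.
act = stage1(x)
sent = act.detach().requires_grad_(True)  # stands in for the network hop

# Stage 2: forward and backward on its own blocks, then a local step.
loss = stage2(sent).pow(2).mean()
loss.backward()
opt2.step()

# Stage 1 backward: resumes from the gradient that arrives at the boundary.
act.backward(sent.grad)
opt1.step()
```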

    4

    Fault-Tolerance

    Nodes checkpoint at regular intervals. Crashed nodes restart and resume from the exact global training step. No lost progress.
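    Resuming at the exact global step works because the step counter is saved alongside the model and optimizer state. A minimal sketch (file name and checkpoint fields are our illustration, not Zagora's format):

```python
import os
import tempfile
import torch
import torch.nn as nn

def save_ckpt(path, step, model, opt):
    torch.save({"step": step,
                "model": model.state_dict(),
                "opt": opt.state_dict()}, path)

def load_ckpt(path, model, opt):
    ckpt = torch.load(path)
    model.load_state_dict(ckpt["model"])
    opt.load_state_dict(ckpt["opt"])
    return ckpt["step"]  # training resumes from here, not from step 0

path = os.path.join(tempfile.mkdtemp(), "stage0.pt")
model = nn.Linear(4, 4)
opt = torch.optim.SGD(model.parameters(), lr=0.1)
save_ckpt(path, step=1200, model=model, opt=opt)

# Simulated crash: a fresh process rebuilds the module, then resumes.
model2 = nn.Linear(4, 4)
opt2 = torch.optim.SGD(model2.parameters(), lr=0.1)
resume_step = load_ckpt(path, model2, opt2)
print(resume_step)  # 1200
```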

    Technical Comparison

    Zagora vs. Distributed Frameworks

    What becomes possible when training doesn't require matching hardware or InfiniBand.

    |                      | Zagora | DDP | FSDP |
    |----------------------|--------|-----|------|
    | Mixed GPU Support    | Yes: consumer and datacenter GPUs in the same job | No: identical GPUs required | No: identical GPUs required |
    | Network Requirement  | Standard 1 Gbps internet | InfiniBand (50+ GB/s) | InfiniBand (50+ GB/s) |
    | VRAM to Train 70B    | ~13 GB per GPU (4× consumer GPUs) | 1.12 TB total | ~300 GB total |
    | Fault Tolerance      | Checkpoint recovery + DHT self-healing + S3 durability | None: one node dies, the job dies | None |
    | Model Coverage       | 13 families, auto-registered from HuggingFace | Any (if you have the hardware) | Any (if you have the hardware) |
    | PEFT Techniques      | LoRA, DoRA, IA3, VeRA | Any (manual setup) | Any (manual setup) |
    | DPO Training         | Yes: reference logprobs pre-computed offline, no 2× VRAM | Yes (2× VRAM for reference model) | Yes (2× VRAM for reference model) |

    Bring Your Own GPUs

    You have the hardware. Zagora turns your fragmented fleet into a unified training cluster. No new purchases. No matching hardware.

    Use Zagora Compute

    We're building local compute partnerships to deliver low-latency training at a fraction of cloud cost. Pricing coming soon.

    Start Training

    Our Approach

    Zagora optimizes for cost-per-completed-experiment, not raw single-node speed. Pipeline parallelism trades peak throughput for network efficiency, hardware flexibility, and fault tolerance. We believe that's the right tradeoff for teams that don't have homogeneous A100 clusters, and we're transparent about it.