Module 14 · Section 14.3

Training Platforms & Tools

Unsloth, Axolotl, LLaMA-Factory, torchtune, TRL, and cloud compute options for practical fine-tuning
★ Big Picture

The fine-tuning tool landscape is evolving rapidly. While you can always write a training loop from scratch using PyTorch and the PEFT library, specialized platforms can dramatically reduce setup time, optimize GPU utilization, and provide production-tested configurations out of the box. This section surveys the most important tools in the ecosystem: Unsloth for raw speed, Axolotl for configuration-driven workflows, LLaMA-Factory for a visual interface, torchtune for PyTorch-native composability, and TRL for alignment training. We also cover the cloud compute landscape to help you choose the right GPU infrastructure for your budget.

1. Unsloth: 2x Faster Fine-Tuning

Unsloth is an open-source library that achieves roughly 2x training speedup and 50% memory reduction compared to standard Hugging Face training, with zero accuracy loss. It accomplishes this through hand-written Triton kernels for attention, RoPE, cross-entropy loss, and other operations, bypassing the overhead of PyTorch's autograd in performance-critical paths.

Unsloth integrates seamlessly with the Hugging Face ecosystem: you load models through Unsloth's optimized loader, and then use standard SFTTrainer or DPOTrainer for the actual training. The output is a standard PEFT adapter that can be loaded by any tool.

[Figure 1 diagram: standard HF + PEFT at 100% memory / 1x speed vs. Unsloth at ~50% memory / ~2x speed. Key features: custom Triton kernels for attention, RoPE, and cross-entropy; support for QLoRA, LoRA, and full fine-tuning; built-in GGUF/vLLM export; drop-in HF compatibility.]
Figure 1: Unsloth reduces memory by ~50% and doubles training speed through optimized Triton kernels.
from unsloth import FastLanguageModel
from trl import SFTTrainer
from transformers import TrainingArguments
from datasets import load_dataset

# 1. Load model with Unsloth (handles quantization + LoRA setup)
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Meta-Llama-3.1-8B",
    max_seq_length=2048,
    dtype=None,              # Auto-detect (BF16 on Ampere+)
    load_in_4bit=True,       # QLoRA mode
)

# 2. Add LoRA adapters (Unsloth optimized)
model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    lora_alpha=16,
    lora_dropout=0,          # Unsloth recommends 0 for speed
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
    use_gradient_checkpointing="unsloth",  # 30% less VRAM
)

# 3. Standard SFTTrainer workflow
dataset = load_dataset("tatsu-lab/alpaca", split="train")

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    dataset_text_field="text",
    max_seq_length=2048,
    args=TrainingArguments(
        per_device_train_batch_size=2,
        gradient_accumulation_steps=4,
        warmup_steps=5,
        num_train_epochs=1,
        learning_rate=2e-4,
        bf16=True,
        logging_steps=1,
        optim="adamw_8bit",
        output_dir="outputs",
    ),
)

trainer.train()

# 4. Export to various formats
model.save_pretrained("lora_model")              # Save LoRA adapter
model.save_pretrained_merged("merged_model",      # Merged FP16
    tokenizer, save_method="merged_16bit")
model.save_pretrained_gguf("gguf_model",          # GGUF for llama.cpp
    tokenizer, quantization_method="q4_k_m")
◆ Key Insight

Unsloth's save_pretrained_gguf method directly exports to GGUF format, eliminating the separate llama.cpp conversion step. This makes the workflow from training to local deployment (via Ollama or llama.cpp) a single pipeline. For production vLLM deployments, use save_pretrained_merged instead.
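The practical payoff of quantized export is size. A rough back-of-envelope estimate of on-disk footprint (the ~4.8 bits-per-weight figure for q4_k_m is an approximation of llama.cpp's mixed-precision scheme, not an exact spec):

```python
def model_size_gb(n_params_billion, bits_per_weight):
    """Approximate on-disk size of model weights at a given precision.

    1e9 params * (bits / 8) bytes-per-param = (billions * bits / 8) GB.
    """
    return n_params_billion * bits_per_weight / 8

# An 8B model: merged FP16 export vs. q4_k_m GGUF (~4.8 bits/weight on average)
fp16_gb = model_size_gb(8, 16)    # -> 16.0 GB
q4km_gb = model_size_gb(8, 4.8)   # -> ~4.8 GB
```

This is why the GGUF export fits comfortably on a laptop while the merged FP16 model needs a serious GPU just to load.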

2. Axolotl: Configuration-Driven Training

Axolotl takes a different approach: instead of writing Python code, you define your entire training run in a YAML configuration file. This makes experiments reproducible, shareable, and easy to iterate on. Axolotl supports all major model architectures, PEFT methods, dataset formats, and training features (DeepSpeed, FSDP, multi-GPU) through configuration alone.

# axolotl_config.yml
base_model: meta-llama/Meta-Llama-3.1-8B-Instruct
model_type: LlamaForCausalLM
tokenizer_type: AutoTokenizer

# Dataset configuration
datasets:
  - path: tatsu-lab/alpaca
    type: alpaca
  - path: ./my_custom_data.jsonl
    type: sharegpt

# QLoRA configuration
load_in_4bit: true
adapter: qlora
lora_r: 32
lora_alpha: 64
lora_dropout: 0.05
lora_target_modules:
  - q_proj
  - k_proj
  - v_proj
  - o_proj
  - gate_proj
  - up_proj
  - down_proj
lora_target_linear: true

# Training parameters
sequence_len: 4096
sample_packing: true          # Pack multiple samples per sequence
pad_to_sequence_len: true
micro_batch_size: 2
gradient_accumulation_steps: 4
num_epochs: 3
learning_rate: 0.0002
lr_scheduler: cosine
warmup_ratio: 0.05
optimizer: paged_adamw_8bit
bf16: auto
gradient_checkpointing: true
flash_attention: true

# Evaluation and logging
val_set_size: 0.05
eval_steps: 100
logging_steps: 10
save_strategy: steps
save_steps: 200
output_dir: ./outputs/llama3-qlora
# Run training with a single command:
#   $ accelerate launch -m axolotl.cli.train axolotl_config.yml

# Or preprocess data first (useful for large datasets):
#   $ python -m axolotl.cli.preprocess axolotl_config.yml
#   $ accelerate launch -m axolotl.cli.train axolotl_config.yml
ⓘ Note

Axolotl's sample_packing feature concatenates multiple short training examples into a single sequence, significantly improving GPU utilization when your dataset contains many short examples. This can speed up training by 2-5x for datasets with average sequence lengths well below the maximum. Axolotl handles the attention masking automatically so that packed samples do not attend to each other.
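A minimal sketch of the idea — a greedy packer plus a block-causal attention mask (Axolotl's actual implementation differs in details such as multipack bin-packing, but the principle is the same):

```python
def pack_samples(samples, max_len):
    """Greedily concatenate tokenized samples into sequences of at most max_len.

    Returns (tokens, start_offsets) pairs; the offsets let us build an
    attention mask that blocks attention across sample boundaries.
    """
    packed, current, starts = [], [], []
    for s in samples:
        if current and len(current) + len(s) > max_len:
            packed.append((current, starts))
            current, starts = [], []
        starts.append(len(current))   # record where this sample begins
        current.extend(s)
    if current:
        packed.append((current, starts))
    return packed

def block_causal_mask(length, starts):
    """True where position i may attend to j: causal, and within one sample only."""
    ends = starts[1:] + [length]
    mask = [[False] * length for _ in range(length)]
    for lo, hi in zip(starts, ends):
        for i in range(lo, hi):
            for j in range(lo, i + 1):
                mask[i][j] = True
    return mask

# Three short samples packed into a 7-token budget: the first two share a sequence
packed = pack_samples([[1, 1, 1], [2, 2, 2], [3, 3, 3]], max_len=7)
```

Without packing, each 3-token sample would be padded out to the full sequence length; with packing, two of them fill one sequence and the mask keeps them from attending to each other.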

3. LLaMA-Factory: Web UI for Fine-Tuning

LLaMA-Factory provides a graphical web interface (LLaMA Board) for configuring and launching fine-tuning runs. It is particularly valuable for teams where not everyone is comfortable writing YAML or Python configurations. The web UI lets you select models, datasets, PEFT methods, and hyperparameters through dropdown menus and sliders, then generates and executes the corresponding training code.

# Install LLaMA-Factory
# pip install llamafactory

# Launch the web UI
# llamafactory-cli webui

# Or use CLI for scriptable workflows
import json

# LLaMA-Factory uses a JSON config (similar to Axolotl YAML)
config = {
    "model_name_or_path": "meta-llama/Meta-Llama-3.1-8B-Instruct",
    "stage": "sft",
    "finetuning_type": "lora",
    "lora_rank": 16,
    "lora_alpha": 32,
    "lora_target": "all",
    "dataset": "alpaca_en",
    "template": "llama3",
    "quantization_bit": 4,
    "per_device_train_batch_size": 4,
    "gradient_accumulation_steps": 4,
    "num_train_epochs": 3.0,
    "learning_rate": 2e-4,
    "output_dir": "./llama_factory_output",
}

# Save and run via CLI
with open("train_config.json", "w") as f:
    json.dump(config, f, indent=2)

# llamafactory-cli train train_config.json

4. torchtune: PyTorch-Native Fine-Tuning

torchtune is PyTorch's official library for fine-tuning LLMs. Its philosophy is transparency and composability: rather than hiding complexity behind abstractions, it provides well-documented, hackable recipes that you can read, understand, and modify. Each recipe is a self-contained Python script, not a framework that manages your training loop.

torchtune is the best choice when you need full control over the training process, want to implement custom training logic, or are integrating fine-tuning into an existing PyTorch codebase.

# torchtune uses YAML configs and CLI recipes
# Install: pip install torchtune

# Download a model
# tune download meta-llama/Meta-Llama-3.1-8B-Instruct \
#   --output-dir ./models/llama3-8b

# Run a built-in recipe (LoRA single GPU)
# tune run lora_finetune_single_device \
#   --config llama3_1/8B_lora_single_device

# Custom config override
# tune run lora_finetune_single_device \
#   --config llama3_1/8B_lora_single_device \
#   batch_size=4 \
#   epochs=3 \
#   lora_rank=32

# torchtune YAML config example:
# model:
#   _component_: torchtune.models.llama3_1.lora_llama3_1_8b
#   lora_attn_modules: ['q_proj', 'v_proj', 'k_proj', 'output_proj']
#   apply_lora_to_mlp: True
#   lora_rank: 16
#   lora_alpha: 32

# torchtune is also great for programmatic use:
import torch
from torchtune.models.llama3_1 import lora_llama3_1_8b
from torchtune.modules.peft import get_adapter_params

# Build LoRA model with full control
model = lora_llama3_1_8b(
    lora_attn_modules=["q_proj", "v_proj"],
    apply_lora_to_mlp=True,
    lora_rank=16,
    lora_alpha=32,
)

# Get only adapter parameters for the optimizer
adapter_params = get_adapter_params(model)
optimizer = torch.optim.AdamW(adapter_params, lr=2e-4)

5. TRL: Transformer Reinforcement Learning

TRL (Transformer Reinforcement Learning) from Hugging Face is the standard library for alignment training, including SFT, RLHF, DPO, and other preference optimization methods. While its scope extends beyond PEFT, TRL integrates deeply with the PEFT library, making it the natural choice when your fine-tuning involves alignment stages.

from trl import SFTTrainer, SFTConfig, DPOTrainer, DPOConfig
from peft import LoraConfig

# SFT with LoRA (most common PEFT + TRL pattern)
sft_config = SFTConfig(
    output_dir="./sft_output",
    max_seq_length=2048,
    per_device_train_batch_size=4,
    num_train_epochs=3,
    learning_rate=2e-4,
    packing=True,            # Sample packing for efficiency
)

peft_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules="all-linear",
    task_type="CAUSAL_LM",
)

trainer = SFTTrainer(
    model="meta-llama/Meta-Llama-3.1-8B",  # TRL loads the model from the hub ID
    args=sft_config,
    train_dataset=dataset,     # your prepared instruction dataset
    peft_config=peft_config,   # TRL handles PEFT setup automatically
)
trainer.train()

# DPO with LoRA (preference optimization after SFT)
dpo_config = DPOConfig(
    output_dir="./dpo_output",
    per_device_train_batch_size=2,
    num_train_epochs=1,
    learning_rate=5e-5,
    beta=0.1,               # DPO temperature
)

dpo_trainer = DPOTrainer(
    model=sft_model,          # Start from SFT checkpoint
    args=dpo_config,
    train_dataset=preference_dataset,
    peft_config=peft_config,
)
dpo_trainer.train()
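The beta parameter above is the temperature in the DPO objective, loss = −log σ(β·[(log π(y_w) − log π(y_l)) − (log π_ref(y_w) − log π_ref(y_l))]). A toy computation from sequence log-probabilities (illustrative, not TRL's internal code):

```python
import math

def dpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Per-example DPO loss given policy and reference sequence log-probs."""
    margin = (pi_chosen - pi_rejected) - (ref_chosen - ref_rejected)
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))  # -log sigmoid

# Before the policy has moved away from the reference, margin = 0 and
# the loss sits at -log(0.5) = log 2 ~= 0.693
loss = dpo_loss(-10.0, -12.0, -10.0, -12.0)
```

As the policy widens its chosen-vs-rejected margin beyond the reference's, the loss falls below log 2; a larger beta makes that push sharper.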

6. Tool Comparison Matrix

Feature    | Unsloth           | Axolotl         | LLaMA-Factory    | torchtune       | TRL
Interface  | Python API        | YAML config     | Web UI + CLI     | CLI + Python    | Python API
Speed      | ~2x faster        | 1x (standard)   | 1x (standard)    | 1x (standard)   | 1x (standard)
Memory     | ~50% less         | Standard        | Standard         | Standard        | Standard
Multi-GPU  | Limited           | DeepSpeed, FSDP | DeepSpeed        | FSDP native     | Accelerate
RLHF/DPO   | Via TRL           | Via TRL         | Built-in         | Recipes         | Core feature
Export     | GGUF, vLLM        | HF format       | HF, GGUF         | HF format       | HF format
Best For   | Speed, single GPU | Reproducibility | Beginners, teams | Custom research | Alignment
⚠ Warning

Unsloth's speed advantage comes from custom CUDA/Triton kernels that may lag behind the latest model architectures. When a new model is released (for example, a new Qwen or Gemma variant), it can take days to weeks before Unsloth adds optimized support. Axolotl and TRL, which rely on standard Hugging Face Transformers, typically support new models within hours of their release. Plan accordingly if you need cutting-edge model support.

7. Cloud Compute Options

Choosing the right GPU infrastructure depends on your budget, scale, and workflow preferences. Here is a comparison of the major options available for LLM fine-tuning.

Platform      | GPU Options            | Price Range            | Best For
Google Colab  | T4 (free), A100 (Pro+) | Free to $50/mo         | Prototyping, learning, small models
Lambda Labs   | A100, H100             | $1.10-$2.49/hr per GPU | On-demand training, reserved instances
RunPod        | A100, H100, A6000      | $0.44-$3.89/hr per GPU | Serverless, spot pricing, community cloud
Modal         | A100, H100, T4         | Pay-per-second         | Serverless functions, burst training
Vast.ai       | Various (marketplace)  | $0.20-$2.00/hr         | Cheapest option, community GPUs
AWS/GCP/Azure | Full range             | $1.00-$30+/hr          | Enterprise, compliance, multi-region
[Figure 2 diagram — GPU selection guide by model size: 7B (QLoRA: T4 16GB; LoRA: A100 40GB; full FT: 2x A100), 13B (QLoRA: A10G 24GB; LoRA: A100 40GB; full FT: 4x A100), 70B (QLoRA: A100 80GB; LoRA: 2x A100 80GB; full FT: 8x H100). Typical training cost estimates for 3 epochs on 50K samples: 7B QLoRA $2-5, 13B QLoRA $5-15, 70B QLoRA $30-80, based on spot/community GPU pricing; reserved instances or major cloud providers cost 2-5x more.]
Figure 2: GPU requirements and approximate costs scale with model size and fine-tuning method.
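The cost estimates above follow from simple arithmetic: total tokens divided by throughput gives GPU-hours, times the hourly rate gives dollars. A rough calculator (the throughput and spot price below are illustrative assumptions, not benchmarks):

```python
def training_cost_usd(n_samples, epochs, avg_tokens, tokens_per_sec, usd_per_hour):
    """Back-of-envelope fine-tuning cost estimate."""
    total_tokens = n_samples * epochs * avg_tokens
    gpu_hours = total_tokens / tokens_per_sec / 3600
    return gpu_hours * usd_per_hour

# e.g. 50K samples x 3 epochs x ~512 tokens, at an assumed ~3,000 tok/s
# QLoRA throughput on a spot A100 priced at an assumed $0.79/hr
cost = training_cost_usd(50_000, 3, 512, 3_000, 0.79)
```

With these assumptions the run lands around $5-6, consistent with the 7B QLoRA range in the figure; plug in your own measured throughput and provider pricing for a real estimate.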
ⓘ Note

For beginners, start with Google Colab Pro ($10/month) to experiment with QLoRA on 7B models using a T4 or A100 GPU. Once you have a working pipeline, move to RunPod or Lambda Labs for longer training runs. Modal is excellent for teams that want serverless infrastructure where you pay only for the seconds of GPU time you actually use.

8. Recommended Workflows

Here are recommended end-to-end workflows depending on your experience level and requirements.

Beginner: First Fine-Tune

  1. Use Google Colab with a free T4 GPU
  2. Install Unsloth for optimized training
  3. Fine-tune a 7B model with QLoRA (r=16)
  4. Export to GGUF and test with Ollama locally

Intermediate: Production Fine-Tune

  1. Use Axolotl for reproducible YAML-based configuration
  2. Train on RunPod or Lambda Labs with an A100
  3. Run evaluation suite before and after training
  4. Merge adapter and deploy via vLLM

Advanced: Multi-Stage Alignment

  1. SFT with TRL + LoRA on instruction data
  2. DPO with TRL + LoRA on preference pairs
  3. Merge both adapters sequentially
  4. Evaluate with custom benchmarks and human evaluation
  5. Deploy with vLLM or serve adapters via LoRAX
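Step 3 above ("merge both adapters sequentially") just folds each low-rank update into the weights in turn, W ← W + (α/r)·BA. A toy numeric sketch with rank-1 adapters on a 2x2 weight (pure Python, no GPU):

```python
def matmul(X, Y):
    """Plain nested-list matrix multiply."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*Y)] for row in X]

def merge_lora(W, A, B, alpha, r):
    """Fold one LoRA update into the base weight: W + (alpha / r) * B @ A."""
    scale = alpha / r
    delta = matmul(B, A)                       # (out, r) @ (r, in) -> (out, in)
    return [[w + scale * d for w, d in zip(wr, dr)] for wr, dr in zip(W, delta)]

W = [[1.0, 0.0], [0.0, 1.0]]                   # toy 2x2 base weight
A_sft, B_sft = [[1.0, 0.0]], [[1.0], [0.0]]    # rank-1 "SFT" adapter
A_dpo, B_dpo = [[0.0, 1.0]], [[0.0], [1.0]]    # rank-1 "DPO" adapter

W = merge_lora(W, A_sft, B_sft, alpha=1, r=1)  # first merge (SFT)
W = merge_lora(W, A_dpo, B_dpo, alpha=1, r=1)  # second merge (DPO)
```

In practice each fold is what PEFT's `merge_and_unload()` performs: merge the SFT adapter into the base, train the DPO adapter on the merged model, then merge that adapter too.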

Section 14.3 Quiz

1. What are the two main performance benefits of Unsloth, and how does it achieve them?

Answer:
Unsloth provides roughly 2x training speed and 50% memory reduction. It achieves this through custom Triton kernels for attention, RoPE embeddings, cross-entropy loss, and other operations. These kernels bypass PyTorch's autograd overhead in performance-critical paths, providing the same numerical results with less computation and memory usage.

2. What is sample packing in Axolotl, and when is it most beneficial?

Answer:
Sample packing concatenates multiple short training examples into a single sequence up to the maximum sequence length, with attention masking to prevent cross-contamination between packed samples. It is most beneficial when your dataset contains many short examples (for example, single-turn instructions averaging 200 tokens with a 4096-token max sequence length). In such cases, packing can improve GPU utilization by 2-5x by eliminating padding waste.

3. How does torchtune differ philosophically from tools like Axolotl or LLaMA-Factory?

Answer:
torchtune prioritizes transparency and composability over convenience. Instead of providing a framework that manages your training loop, it offers self-contained, readable recipes (Python scripts) that you can modify directly. This makes it the best choice when you need full control over the training process, want to implement custom training logic, or are integrating fine-tuning into an existing PyTorch codebase. Axolotl and LLaMA-Factory prioritize ease of use and rapid experimentation through configuration files or web UIs.

4. What GPU would you recommend for QLoRA fine-tuning of a 70B model, and approximately how much would a training run cost?

Answer:
A 70B model's weights occupy roughly 35 GB in 4-bit, and adapters, optimizer state, and activations push the total well beyond 40 GB, so a single A100 80GB is the minimum viable option. A typical training run (3 epochs on 50K samples) would cost roughly $30-80 on spot/community GPU pricing (RunPod, Vast.ai), or 2-5x more on reserved instances or major cloud providers. An H100 would train faster but costs more per hour.

5. You need to fine-tune a model and then run DPO alignment. Which tool combination would you use, and why?

Answer:
Use TRL (Transformer Reinforcement Learning) for both stages, as it natively supports SFT, DPO, and other alignment methods with built-in PEFT integration. For the SFT stage, use SFTTrainer with a LoRA config. For the DPO stage, use DPOTrainer starting from the SFT checkpoint. Optionally, use Unsloth as the model backend for 2x speed improvement on both stages. TRL handles the LoRA adapter management automatically across both training phases.

Key Takeaways