
What is Distillation?

The Problem

Frontier AI models like GPT-5, Claude 4.5 Opus, and Gemini 3 Pro are incredibly capable, but they have significant limitations:

  • Closed Source: You can’t inspect or modify the model, or see how it works internally
  • API-Only: Requires an internet connection and incurs ongoing usage costs
  • Privacy Concerns: Your data is sent to third-party servers
  • Rate Limits: Usage caps and throttling during periods of high demand
  • No Customization: You can’t fine-tune the weights for your specific use case

The Solution: Knowledge Distillation

Knowledge distillation is a machine learning technique where a smaller “student” model learns to replicate the behavior of a larger “teacher” model.

How It Works

  1. Generate a Dataset

    Use our datagen npm package to query the teacher model (e.g., Claude 4.5 Opus) with diverse prompts and capture its reasoning traces in the ShareGPT/messages format. Configure a high reasoning-effort setting so the teacher emits detailed chain-of-thought (CoT); a sketch of the resulting record format appears after this list.

  2. Prepare the Data

    Format the teacher’s responses into a training dataset with the proper chat template. We validate each record for completeness and filter out malformed responses (see the validation sketch below).

  3. Fine-Tune the Student

    Using SFT (Supervised Fine-Tuning), train an open-source base model (e.g., Qwen3-8B) on the teacher’s outputs. We use LoRA (Low-Rank Adaptation) for parameter-efficient training; a minimal training sketch follows this list.

  4. Export & Quantize

    Save the trained model in multiple formats: full 16-bit weights for HuggingFace and GGUF quantizations for local deployment (see the export sketch below).
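
To make step 1 concrete, here is a minimal sketch of capturing a single reasoning trace. The datagen package automates this; the sketch calls the Anthropic Python SDK directly, and the model ID, token budgets, file name, and <think>-tag convention are illustrative assumptions rather than datagen’s actual output format.

```python
import json
from anthropic import Anthropic

client = Anthropic()  # reads ANTHROPIC_API_KEY from the environment

prompt = "Explain why the sum of two odd integers is always even."

# Extended thinking captures the teacher's chain-of-thought alongside its answer.
# The model ID and token budgets are assumptions; adjust them to your account.
response = client.messages.create(
    model="claude-opus-4-5",
    max_tokens=8000,
    thinking={"type": "enabled", "budget_tokens": 4000},
    messages=[{"role": "user", "content": prompt}],
)

# Split the response into the reasoning trace and the final answer.
reasoning = "".join(b.thinking for b in response.content if b.type == "thinking")
answer = "".join(b.text for b in response.content if b.type == "text")

# One ShareGPT/messages-style record. Wrapping the CoT in <think> tags is one
# common convention for teaching the student to reason before answering.
record = {
    "messages": [
        {"role": "user", "content": prompt},
        {"role": "assistant", "content": f"<think>\n{reasoning}\n</think>\n\n{answer}"},
    ]
}

with open("teacher_traces.jsonl", "a") as f:
    f.write(json.dumps(record, ensure_ascii=False) + "\n")
```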
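
The validation in step 2 can be as simple as keeping only records that contain a user turn and end with a non-empty assistant turn, then checking that each conversation renders through the student’s chat template. A sketch, assuming the JSONL file from the previous step; the exact checks datagen applies may differ:

```python
import json
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-8B")

def is_valid(record):
    msgs = record.get("messages", [])
    # Require at least one non-empty user turn and a non-empty final assistant turn.
    has_user = any(m.get("role") == "user" and m.get("content") for m in msgs)
    ends_with_answer = bool(msgs) and msgs[-1].get("role") == "assistant" and msgs[-1].get("content")
    return bool(has_user and ends_with_answer)

with open("teacher_traces.jsonl") as f:
    records = [json.loads(line) for line in f if line.strip()]

clean = [r for r in records if is_valid(r)]

# Sanity check: every conversation must render through the chat template without errors.
for r in clean:
    tokenizer.apply_chat_template(r["messages"], tokenize=False)

with open("train.jsonl", "w") as f:
    for r in clean:
        f.write(json.dumps(r, ensure_ascii=False) + "\n")

print(f"kept {len(clean)} of {len(records)} records")
```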
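
Step 3 maps onto a standard supervised fine-tuning run. A minimal sketch, assuming the trl, peft, and datasets libraries and the cleaned train.jsonl from step 2; hyperparameters are illustrative, not tuned, and an ~8B model in bf16 with LoRA still needs a fairly large GPU:

```python
from datasets import load_dataset
from peft import LoraConfig
from trl import SFTConfig, SFTTrainer

# Conversational dataset: each row has a "messages" list; SFTTrainer applies the
# model's chat template automatically for this format.
dataset = load_dataset("json", data_files="train.jsonl", split="train")

# LoRA: train small low-rank adapter matrices instead of all base weights.
peft_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)

training_args = SFTConfig(
    output_dir="qwen3-8b-distilled",
    num_train_epochs=2,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=8,
    learning_rate=2e-4,
    logging_steps=10,
    bf16=True,
)

trainer = SFTTrainer(
    model="Qwen/Qwen3-8B",  # open-weight student base model
    train_dataset=dataset,
    args=training_args,
    peft_config=peft_config,
)
trainer.train()
trainer.save_model("qwen3-8b-distilled")  # saves the LoRA adapter
```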
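
For step 4, the LoRA adapter is merged back into the base weights and saved as full 16-bit weights; GGUF conversion and quantization then typically go through llama.cpp. A sketch, assuming peft and transformers; paths and quantization levels are illustrative:

```python
import torch
from peft import AutoPeftModelForCausalLM
from transformers import AutoTokenizer

# Load the adapter checkpoint from training and merge it into the base model.
model = AutoPeftModelForCausalLM.from_pretrained(
    "qwen3-8b-distilled", torch_dtype=torch.bfloat16
)
merged = model.merge_and_unload()

# Full 16-bit weights, ready to push to the HuggingFace Hub or load locally.
merged.save_pretrained("qwen3-8b-distilled-merged")
AutoTokenizer.from_pretrained("Qwen/Qwen3-8B").save_pretrained("qwen3-8b-distilled-merged")

# GGUF conversion and quantization happen outside Python, e.g. with llama.cpp:
#   python convert_hf_to_gguf.py qwen3-8b-distilled-merged --outfile model-f16.gguf --outtype f16
#   ./llama-quantize model-f16.gguf model-q4_k_m.gguf Q4_K_M
```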

Why SFT Distillation?

| Approach         | Pros                                      | Cons                                           |
|------------------|-------------------------------------------|------------------------------------------------|
| SFT Distillation | Simple, fast, captures style & reasoning  | May inherit teacher errors, no self-correction |
| RLHF/DPO         | Better alignment, can correct errors      | Requires preference data, more complex         |
| GRPO             | Strong reasoning, self-improvement        | Requires verifiable tasks, compute-intensive   |

Next Steps

Ready to try distillation yourself?