What is Distillation?
The Problem
Frontier AI models like GPT-5, Claude 4.5 Opus, and Gemini 3 Pro are incredibly capable, but they have significant limitations:
- Closed Source: You can’t inspect, modify, or understand how they work
- API-Only: Requires an internet connection and ongoing costs
- Privacy Concerns: Your data is sent to third-party servers
- Rate Limits: Usage caps and throttling during high demand
- No Customization: Can’t fine-tune for your specific use case
The Solution: Knowledge Distillation
Knowledge distillation is a machine learning technique where a smaller “student” model learns to replicate the behavior of a larger “teacher” model.
How It Works
1. Generate a Dataset: Use our datagen npm package to query the teacher model (e.g., Claude 4.5 Opus) with diverse prompts and capture its reasoning traces in the ShareGPT/messages format. Configure high reasoning effort for detailed CoT. A sketch of this step follows the list.
2. Prepare the Data: Format the teacher's responses into a training dataset with proper chat templates. We validate for completeness and filter out malformed responses.
3. Fine-Tune the Student: Using SFT (Supervised Fine-Tuning), train an open-source base model (e.g., Qwen3-8B) on the teacher's outputs. We use LoRA for efficient training; see the training sketch below.
4. Export & Quantize: Save the trained model in multiple formats, including full 16-bit weights for HuggingFace and GGUF quantizations for local deployment, as outlined after the list.
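As a rough illustration of steps 1 and 2, the sketch below queries a teacher model through the Anthropic Python SDK and writes each exchange to a JSONL file in the messages format. The model id, prompt list, thinking budget, and file names are placeholders; the datagen npm package wraps this kind of loop with more validation, and this is not its actual API.

```python
# Sketch of dataset generation (steps 1-2), assuming the Anthropic Python SDK.
# Model id, prompts, thinking budget, and output path are illustrative placeholders.
import json
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment
prompts = [
    "Explain why the sum of two odd numbers is always even.",
    "Walk through debugging a race condition in a thread pool.",
]

with open("distill_dataset.jsonl", "w") as f:
    for prompt in prompts:
        response = client.messages.create(
            model="claude-opus-4-5",  # placeholder teacher model id
            max_tokens=8192,
            thinking={"type": "enabled", "budget_tokens": 4096},  # capture detailed CoT
            messages=[{"role": "user", "content": prompt}],
        )
        # Keep the final text blocks; thinking blocks could also be stored
        # separately if you want explicit reasoning traces in the dataset.
        answer = "".join(b.text for b in response.content if b.type == "text")
        record = {
            "messages": [
                {"role": "user", "content": prompt},
                {"role": "assistant", "content": answer},
            ]
        }
        f.write(json.dumps(record) + "\n")
```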
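Step 3 can be run with the Hugging Face TRL and PEFT libraries. The sketch below is a minimal configuration, not our exact training script; the LoRA rank, learning rate, batch size, and output directory are assumptions.

```python
# Minimal SFT + LoRA sketch (step 3) using TRL and PEFT; hyperparameters are illustrative.
from datasets import load_dataset
from peft import LoraConfig
from trl import SFTConfig, SFTTrainer

# JSONL produced in steps 1-2, one {"messages": [...]} record per line.
dataset = load_dataset("json", data_files="distill_dataset.jsonl", split="train")

peft_config = LoraConfig(
    r=16,                      # LoRA rank (assumption)
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)

training_args = SFTConfig(
    output_dir="qwen3-8b-distilled-lora",
    per_device_train_batch_size=2,
    gradient_accumulation_steps=8,
    learning_rate=2e-4,
    num_train_epochs=2,
    bf16=True,
    logging_steps=10,
)

trainer = SFTTrainer(
    model="Qwen/Qwen3-8B",     # base student model; TRL loads it and applies its chat template
    train_dataset=dataset,
    peft_config=peft_config,
    args=training_args,
)
trainer.train()
trainer.save_model()            # writes the LoRA adapter to output_dir
```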
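For step 4, one common route (an assumption about tooling, not a description of our exact pipeline) is to merge the LoRA adapter into the base weights, save the full-precision model for HuggingFace, and then convert and quantize it with llama.cpp:

```python
# Sketch of exporting the fine-tuned student (step 4); paths are placeholders.
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

base = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3-8B", torch_dtype=torch.bfloat16)
model = PeftModel.from_pretrained(base, "qwen3-8b-distilled-lora")  # adapter from training
merged = model.merge_and_unload()   # fold the LoRA deltas into the base weights

merged.save_pretrained("qwen3-8b-distilled")          # full 16-bit weights for HuggingFace
AutoTokenizer.from_pretrained("Qwen/Qwen3-8B").save_pretrained("qwen3-8b-distilled")

# GGUF quantization is typically done afterwards with llama.cpp, e.g.:
#   python convert_hf_to_gguf.py qwen3-8b-distilled --outfile qwen3-8b-distilled-f16.gguf
#   llama-quantize qwen3-8b-distilled-f16.gguf qwen3-8b-distilled-Q4_K_M.gguf Q4_K_M
```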
Why SFT Distillation?
| Approach | Pros | Cons |
|---|---|---|
| SFT Distillation | Simple, fast, captures style & reasoning | May inherit teacher errors, no self-correction |
| RLHF/DPO | Better alignment, can correct errors | Requires preference data, more complex |
| GRPO | Strong reasoning, self-improvement | Requires verifiable tasks, compute-intensive |
Next Steps
Ready to try distillation yourself?
- Quick Start Guide - Run your first distillation in minutes
- Creating Datasets - Learn how we generate training data
- View Notebooks - Ready-to-run Colab notebooks