What is Distillation?
The Problem
Frontier AI models like GPT-5, Claude 4.5 Opus, and Gemini 3 Pro are incredibly capable, but they have significant limitations:
- Closed Source: You can’t inspect, modify, or understand how they work
- API-Only: Requires an internet connection and ongoing costs
- Privacy Concerns: Your data is sent to third-party servers
- Rate Limits: Usage caps and throttling during high demand
- No Customization: Can’t fine-tune for your specific use case
The Solution: Knowledge Distillation
Knowledge distillation is a machine learning technique where a smaller “student” model learns to replicate the behavior of a larger “teacher” model.
How It Works
1. Generate a Dataset: Use our datagen npm package to query the teacher model (e.g., Claude 4.5 Opus) with diverse prompts and capture its reasoning traces in the ShareGPT/messages format. Configure high reasoning effort for detailed CoT. A sketch of this step follows the list.
2. Prepare the Data: Format the teacher's responses into a training dataset with proper chat templates. We validate for completeness and filter out malformed responses.
3. Fine-Tune the Student: Using SFT (Supervised Fine-Tuning), train an open-source base model (e.g., Qwen3-8B) on the teacher's outputs. We use LoRA for efficient training; see the training sketch below.
4. Export & Quantize: Save the trained model in multiple formats, including full 16-bit weights for HuggingFace and GGUF quantizations for local deployment, as outlined after the list.
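As a rough illustration of steps 1 and 2, the sketch below queries a teacher model through the Anthropic Python SDK and writes each exchange to a JSONL file in the messages format. The model id, prompt list, thinking budget, and file names are placeholders; the datagen npm package wraps this kind of loop with more validation, and this is not its actual API.

```python
# Sketch of dataset generation (steps 1-2), assuming the Anthropic Python SDK.
# Model id, prompts, thinking budget, and output path are illustrative placeholders.
import json
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment
prompts = [
    "Explain why the sum of two odd numbers is always even.",
    "Walk through debugging a race condition in a thread pool.",
]

with open("distill_dataset.jsonl", "w") as f:
    for prompt in prompts:
        response = client.messages.create(
            model="claude-opus-4-5",  # placeholder teacher model id
            max_tokens=8192,
            thinking={"type": "enabled", "budget_tokens": 4096},  # capture detailed CoT
            messages=[{"role": "user", "content": prompt}],
        )
        # Keep the final text blocks; thinking blocks could also be stored
        # separately if you want explicit reasoning traces in the dataset.
        answer = "".join(b.text for b in response.content if b.type == "text")
        record = {
            "messages": [
                {"role": "user", "content": prompt},
                {"role": "assistant", "content": answer},
            ]
        }
        f.write(json.dumps(record) + "\n")
```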
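Step 3 can be run with the Hugging Face TRL and PEFT libraries. The sketch below is a minimal configuration, not our exact training script; the LoRA rank, learning rate, batch size, and output directory are assumptions.

```python
# Minimal SFT + LoRA sketch (step 3) using TRL and PEFT; hyperparameters are illustrative.
from datasets import load_dataset
from peft import LoraConfig
from trl import SFTConfig, SFTTrainer

# JSONL produced in steps 1-2, one {"messages": [...]} record per line.
dataset = load_dataset("json", data_files="distill_dataset.jsonl", split="train")

peft_config = LoraConfig(
    r=16,                      # LoRA rank (assumption)
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)

training_args = SFTConfig(
    output_dir="qwen3-8b-distilled-lora",
    per_device_train_batch_size=2,
    gradient_accumulation_steps=8,
    learning_rate=2e-4,
    num_train_epochs=2,
    bf16=True,
    logging_steps=10,
)

trainer = SFTTrainer(
    model="Qwen/Qwen3-8B",     # base student model; TRL loads it and applies its chat template
    train_dataset=dataset,
    peft_config=peft_config,
    args=training_args,
)
trainer.train()
trainer.save_model()            # writes the LoRA adapter to output_dir
```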
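For step 4, one common route (an assumption about tooling, not a description of our exact pipeline) is to merge the LoRA adapter into the base weights, save the full-precision model for HuggingFace, and then convert and quantize it with llama.cpp:

```python
# Sketch of exporting the fine-tuned student (step 4); paths are placeholders.
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

base = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3-8B", torch_dtype=torch.bfloat16)
model = PeftModel.from_pretrained(base, "qwen3-8b-distilled-lora")  # adapter from training
merged = model.merge_and_unload()   # fold the LoRA deltas into the base weights

merged.save_pretrained("qwen3-8b-distilled")          # full 16-bit weights for HuggingFace
AutoTokenizer.from_pretrained("Qwen/Qwen3-8B").save_pretrained("qwen3-8b-distilled")

# GGUF quantization is typically done afterwards with llama.cpp, e.g.:
#   python convert_hf_to_gguf.py qwen3-8b-distilled --outfile qwen3-8b-distilled-f16.gguf
#   llama-quantize qwen3-8b-distilled-f16.gguf qwen3-8b-distilled-Q4_K_M.gguf Q4_K_M
```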
Why SFT Distillation?
| Approach | Pros | Cons |
|---|---|---|
| SFT Distillation | Simple, fast, captures style & reasoning | May inherit teacher errors, no self-correction |
| RLHF/DPO | Better alignment, can correct errors | Requires preference data, more complex |
| GRPO | Strong reasoning, self-improvement | Requires verifiable tasks, compute-intensive |
Next Steps
Ready to try distillation yourself?
- Quick Start Guide - Run your first distillation in minutes
- Creating Datasets - Learn how we generate training data
- View Notebooks - Ready-to-run Colab notebooks