Supervised Fine-Tuning (SFT) is the process of training a base model on a specific dataset of prompt-response pairs. In the context of distillation, we use high-quality outputs from a “teacher” model (like Claude 4.5) to teach a “student” model (like Qwen3) how to respond.
The process is straightforward:

1. Collect a set of prompts that covers the behavior you want to transfer.
2. Generate responses to those prompts with the teacher model.
3. Fine-tune the student model on the resulting prompt-response pairs.
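As a rough illustration, here is a minimal sketch of the fine-tuning step using a Hugging Face causal LM. The checkpoint name and the two in-memory examples are placeholders, not the actual TeichAI pipeline:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder student checkpoint; substitute the model you are actually distilling into.
student_name = "Qwen/Qwen3-8B"
tokenizer = AutoTokenizer.from_pretrained(student_name)
model = AutoModelForCausalLM.from_pretrained(student_name)

# Prompt-response pairs collected from the teacher (two toy examples for illustration).
teacher_pairs = [
    ("Q: If A > B and B > C, what is the relationship between A and C?\nA:", " A > C"),
    ("Q: What is 7 + 7?\nA:", " 14"),
]

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)
model.train()

for epoch in range(3):
    for prompt, response in teacher_pairs:
        # Standard causal-LM SFT: train the student to reproduce prompt + response.
        # (In practice the prompt tokens are usually masked in the labels; omitted for brevity.)
        batch = tokenizer(prompt + response, return_tensors="pt")
        loss = model(**batch, labels=batch["input_ids"]).loss
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
```

The important point is that the objective is pure imitation: the student is rewarded for reproducing the teacher's tokens, nothing more.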
Think of it like a student copying a teacher’s notes. The student learns exactly what to write for those specific topics, but they haven’t necessarily learned the underlying logic used to derive those notes.
Supervised Fine-Tuning (SFT) is powerful, but it’s important to understand what it can and cannot do. SFT teaches a model to imitate behavior, not to develop understanding.
```
Teacher (Claude 4.5 Opus):

Q: If A > B and B > C, what is the relationship between A and C?

<think>
This is a transitive relationship problem. If A is greater than B, and B is
greater than C, then by the transitive property of inequalities, A must be
greater than C.
</think>

A: A > C (A is greater than C)
```

When a student model is trained on this, it learns the surface pattern: the response format, the <think> tags, and the specific reasoning written for this exact problem. The student model may still fail on variations of the problem that require applying the same logic to an unfamiliar structure.
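Concretely, each teacher interaction is flattened into a single training string before fine-tuning. A minimal sketch, assuming the simple Q / <think> / A layout shown above (the helper name is illustrative):

```python
def format_example(question: str, reasoning: str, answer: str) -> str:
    """Serialize one teacher interaction into a single SFT training string."""
    return (
        f"Q: {question}\n"
        f"<think>{reasoning}</think>\n"
        f"A: {answer}"
    )

text = format_example(
    "If A > B and B > C, what is the relationship between A and C?",
    "By the transitive property of inequalities, A must be greater than C.",
    "A > C (A is greater than C)",
)
```

The student is optimized to reproduce strings like this token by token, which is exactly why it absorbs the format and phrasing rather than the underlying rule.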
Without reinforcement learning, the model has no mechanism to:

- Check whether its answers are actually correct
- Explore alternative reasoning paths when the imitated one fails
- Receive feedback that rewards being right rather than sounding like the teacher
Consider this analogy:
```
MEMORIZING vs UNDERSTANDING

A student memorizes:
  "2 + 2 = 4"
  "3 + 3 = 6"
  "5 + 5 = 10"

When asked "7 + 7 = ?", they might:
  - Guess based on patterns
  - Fail completely
  - Get lucky with interpolation

A student who UNDERSTANDS addition can solve any problem.
```

SFT distillation is closer to the memorization side. The student model learns patterns, not principles.
Math Problems
May solve problems similar to training data but fail on variations. Particularly weak on multi-step problems requiring genuine reasoning.
Code Generation
Can reproduce common patterns but may struggle with novel algorithms or debugging unfamiliar code.
Logical Reasoning
May appear to reason but is often pattern matching. Breaks down on unusual logical structures.
Factual Accuracy
Inherits both correct and incorrect information from the teacher. No fact-checking mechanism.
Based on our testing, SFT-distilled models typically achieve:
| Capability | Retention vs Teacher |
|---|---|
| Writing style | 85-95% |
| Format/structure | 90-95% |
| Common knowledge | 70-85% |
| Novel reasoning | 40-60% |
| Edge cases | 30-50% |
Despite these limitations, SFT distillation is valuable for:

- Transferring the teacher's writing style and output format, where retention is highest
- Moving common knowledge into a smaller, cheaper model for everyday tasks
- Producing a strong starting point for reinforcement learning (see below)
The quality of the teacher’s output directly impacts student quality:
```python
# Better: Diverse, challenging prompts with detailed reasoning
dataset = generate_dataset(
    prompts=diverse_challenging_prompts,
    reasoning_effort="high",
    num_samples=1000,
)
```

```python
# Worse: Simple prompts with short responses
dataset = generate_dataset(
    prompts=simple_prompts,
    reasoning_effort="low",
    num_samples=100,
)
```

Deliberately include unusual problems in your dataset:
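For example, one way to do that, reusing the generate_dataset call from above (edge_case_prompts is an assumed list of deliberately unusual problems, not a predefined variable):

```python
# Mix deliberately unusual problems in with the standard ones
all_prompts = diverse_challenging_prompts + edge_case_prompts

dataset = generate_dataset(
    prompts=all_prompts,
    reasoning_effort="high",
    num_samples=1500,
)
```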
After SFT, you can apply reinforcement learning to push the student beyond imitation: rewarding answers that are verifiably correct rather than answers that merely resemble the teacher's, which helps recover some of the lost ground on novel reasoning and edge cases.
See the Reinforcement Learning (RL) guide for more details.
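To give a flavor of the difference, here is a toy sketch of a verifiable reward function: the model is scored on whether its final answer is right, not on how teacher-like the text reads. The function and format assumptions are illustrative, not the API from the RL guide:

```python
def reward(completion: str, ground_truth: str) -> float:
    """Reward the correctness of the final answer, ignoring style and phrasing."""
    # Assumes the Q / <think> / A format used above; take whatever follows the last "A:".
    answer = completion.split("A:")[-1].strip()
    return 1.0 if answer.startswith(ground_truth) else 0.0
```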
Use the distilled model alongside verification:
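A minimal sketch of that pattern; student_model.generate and check_answer here are stand-ins for your own inference call and a task-specific verifier (running unit tests, re-deriving the math, checking against a source):

```python
def answer_with_verification(prompt: str, max_attempts: int = 3):
    """Only accept answers from the distilled model that pass an independent check."""
    for _ in range(max_attempts):
        answer = student_model.generate(prompt)   # hypothetical inference call
        if check_answer(prompt, answer):          # hypothetical task-specific verifier
            return answer
    return None  # fall back to a human or a stronger model
```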
When using TeichAI distilled models, expect strong imitation of the teacher's style and format, reasonable coverage of common knowledge, and noticeably weaker performance on novel reasoning and edge cases, in line with the retention figures above.