Supervised Fine-Tuning (SFT)

What even is SFT?

Supervised Fine-Tuning (SFT) is the process of training a base model on a specific dataset of prompt-response pairs. In the context of distillation, we use high-quality outputs from a “teacher” model (like Claude 4.5) to teach a “student” model (like Qwen3) how to respond.

The process is straightforward:

  1. Dataset Collection: You gather thousands of examples where a human or teacher model provides the “ideal” answer to a prompt.
  2. Gradient Descent: The student model’s weights are adjusted so that, when given a prompt from the dataset, its predicted output closely matches the teacher’s output.
  3. Imitation: Through this, the model learns the style, format, and patterns of the teacher.

Think of it like a student copying a teacher’s notes. The student learns exactly what to write for those specific topics, but they haven’t necessarily learned the underlying logic used to derive those notes.
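In code, the "gradient descent" step amounts to ordinary next-token training on the teacher's text. Below is a minimal sketch in PyTorch with Hugging Face transformers; the checkpoint name and the single prompt-response pair are placeholders, and a real run would batch thousands of pairs and typically mask the prompt tokens out of the loss:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

student_name = "Qwen/Qwen3-4B"  # placeholder student checkpoint
tokenizer = AutoTokenizer.from_pretrained(student_name)
model = AutoModelForCausalLM.from_pretrained(student_name)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

# One prompt-response pair distilled from the teacher (illustrative only)
prompt = "Q: If A > B and B > C, what is the relationship between A and C?\n"
teacher_response = "A: A > C (A is greater than C)"

# Labels are the same token ids as the input, so the cross-entropy loss
# pushes the student's predicted tokens toward the teacher's exact wording.
batch = tokenizer(prompt + teacher_response, return_tensors="pt")
outputs = model(**batch, labels=batch["input_ids"])

outputs.loss.backward()  # one gradient descent step toward imitation
optimizer.step()
optimizer.zero_grad()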

The Reality of SFT Distillation

Supervised Fine-Tuning (SFT) is powerful, but it’s important to understand what it can and cannot do. SFT teaches a model to imitate behavior, not to develop understanding.

What SFT Does Well

  1. Style Transfer - The student model learns to write like the teacher
  2. Format Replication - Structured outputs, chain-of-thought formatting
  3. Knowledge Extraction - Facts and patterns present in training data
  4. Response Quality - General improvement in coherence and helpfulness

What SFT Cannot Do

1. Transfer True Reasoning

Teacher (Claude 4.5 Opus):
Q: If A > B and B > C, what is the relationship between A and C?
<think>
This is a transitive relationship problem. If A is greater than B,
and B is greater than C, then by the transitive property of
inequalities, A must be greater than C.
</think>
A: A > C (A is greater than C)

When a student model is trained on this transcript, it learns the following (see the training-example sketch after the list):

  • ✅ To use <think> tags
  • ✅ To mention “transitive property”
  • ✅ To format similar problems similarly
  • ❌ To actually understand transitivity and apply it to novel situations
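Here is a sketch of what the training example built from that transcript actually looks like; the field names are placeholders for whatever schema your SFT pipeline uses. The point is that the <think> block is just more target tokens, so the loss rewards reproducing the reasoning's surface form, not the reasoning itself:

# Hypothetical schema: a single SFT record built from the teacher transcript
training_example = {
    "prompt": "Q: If A > B and B > C, what is the relationship between A and C?",
    "response": (
        "<think>\n"
        "This is a transitive relationship problem. If A is greater than B,\n"
        "and B is greater than C, then by the transitive property of\n"
        "inequalities, A must be greater than C.\n"
        "</think>\n"
        "A: A > C (A is greater than C)"
    ),  # every token here, reasoning included, is supervised identically
}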

2. Generalize Beyond Training Distribution

The student model may fail on:

  • Problems slightly different from training examples
  • Edge cases the teacher handled well but weren’t in the dataset
  • Novel combinations of concepts

3. Self-Correct Errors

Without reinforcement learning, the model has no mechanism to:

  • Verify its own reasoning
  • Catch and fix mistakes
  • Learn from incorrect outputs

The Imitation Problem

Consider this analogy:

MEMORIZING vs UNDERSTANDING

A student memorizes:
"2 + 2 = 4"
"3 + 3 = 6"
"5 + 5 = 10"

When asked "7 + 7 = ?", they might:
- Guess based on patterns
- Fail completely
- Get lucky with interpolation

A student who UNDERSTANDS addition can solve any problem.

SFT distillation is closer to the memorization side. The student model learns patterns, not principles.

Real-World Implications

Math Problems

May solve problems similar to training data but fail on variations. Particularly weak on multi-step problems requiring genuine reasoning.

Code Generation

Can reproduce common patterns but may struggle with novel algorithms or debugging unfamiliar code.

Logical Reasoning

May appear to reason but is often pattern matching. Breaks down on unusual logical structures.

Factual Accuracy

Inherits both correct and incorrect information from the teacher. No fact-checking mechanism.

Quantifying the Gap

Based on our testing, SFT-distilled models typically achieve:

Capability       | Retention vs Teacher
-----------------|---------------------
Writing style    | 85-95%
Format/structure | 90-95%
Common knowledge | 70-85%
Novel reasoning  | 40-60%
Edge cases       | 30-50%

When SFT Distillation Is Appropriate

Despite limitations, SFT distillation is valuable for:

  1. Making frontier capabilities accessible - A 4B local model with 60% of Claude’s reasoning is still incredibly useful
  2. Specific domain adaptation - When you only need performance on a narrow task set
  3. Cost reduction - Running locally vs. paying per API call
  4. Privacy - Keeping data on-premises
  5. Speed - Faster inference with smaller models
  6. Experimentation - Quick way to test if distillation helps your use case

Mitigating Limitations

1. Use High-Quality Datasets

The quality of the teacher’s output directly impacts student quality:

# Better: Diverse, challenging prompts with detailed reasoning
dataset = generate_dataset(
    prompts=diverse_challenging_prompts,
    reasoning_effort="high",
    num_samples=1000,
)

# Worse: Simple prompts with short responses
dataset = generate_dataset(
    prompts=simple_prompts,
    reasoning_effort="low",
    num_samples=100,
)

2. Include Edge Cases

Deliberately include unusual problems in your dataset (one way to fold them in is sketched after the list):

  • Problems with multiple valid approaches
  • Problems with no solution
  • Problems requiring correction of initial assumptions
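Reusing the generate_dataset helper from the previous section, one way to mix edge cases into generation looks like this; the individual prompt lists are placeholders for your own collections:

# Hypothetical prompt lists covering the edge cases above
edge_case_prompts = (
    multi_approach_prompts        # multiple valid solution paths
    + unsolvable_prompts          # problems with no valid solution
    + revised_assumption_prompts  # require correcting an initial assumption
)

dataset = generate_dataset(
    prompts=diverse_challenging_prompts + edge_case_prompts,
    reasoning_effort="high",
    num_samples=1000,
)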

3. Reinforcement Learning (RL)

After SFT, you can apply reinforcement learning to:

  • Improve reasoning quality
  • Teach self-correction
  • Align with specific objectives

See the Reinforcement Learning (RL) guide for more details.

4. Ensemble Approaches

Use the distilled model alongside verification (a voting sketch follows the list):

  • Generate multiple responses
  • Use voting or ranking
  • Apply external verification for critical tasks
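A minimal sketch of the voting idea; generate_response stands in for whatever inference call you use against the distilled model, and answers are compared as plain strings, so real use would normalize or parse them first:

from collections import Counter

def majority_vote(prompt, generate_response, n_samples=5):
    """Sample the distilled model several times and keep the most common answer."""
    answers = [generate_response(prompt) for _ in range(n_samples)]
    best, count = Counter(answers).most_common(1)[0]
    return best, count / n_samples  # answer plus a rough agreement score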

Honest Expectations

When using TeichAI distilled models, expect:

  • ✅ Significant improvement over base models on reasoning tasks
  • ✅ Ability to produce thoughtful, well-structured responses
  • ✅ Good performance on common problem types
  • ⚠️ Degradation on unusual or complex problems
  • ⚠️ Occasional confident but incorrect reasoning
  • ❌ Full capability parity with the teacher model

Next Steps