Supervised Fine-Tuning (SFT)

What even is SFT?

Supervised Fine-Tuning (SFT) is the process of training a base model on a specific dataset of prompt-response pairs. In the context of distillation, we use high-quality outputs from a “teacher” model (like Claude 4.5) to teach a “student” model (like Qwen3) how to respond.

The process is straightforward:

  1. Dataset Collection: You gather thousands of examples where a human or teacher model provides the “ideal” answer to a prompt.
  2. Gradient Descent: The student model’s weights are adjusted so that, when given a prompt from the dataset, its predicted output closely matches the teacher’s output.
  3. Imitation: Through this, the model learns the style, format, and patterns of the teacher.

Think of it like a student copying a teacher’s notes. The student learns exactly what to write for those specific topics, but they haven’t necessarily learned the underlying logic used to derive those notes.
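In code, the "gradient descent" step amounts to ordinary next-token training on the teacher's text. Below is a minimal sketch in PyTorch with Hugging Face transformers; the checkpoint name and the single prompt-response pair are placeholders, and a real run would batch thousands of pairs and typically mask the prompt tokens out of the loss:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

student_name = "Qwen/Qwen3-4B"  # placeholder student checkpoint
tokenizer = AutoTokenizer.from_pretrained(student_name)
model = AutoModelForCausalLM.from_pretrained(student_name)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

# One prompt-response pair distilled from the teacher (illustrative only)
prompt = "Q: If A > B and B > C, what is the relationship between A and C?\n"
teacher_response = "A: A > C (A is greater than C)"

# Labels are the same token ids as the input, so the cross-entropy loss
# pushes the student's predicted tokens toward the teacher's exact wording.
batch = tokenizer(prompt + teacher_response, return_tensors="pt")
outputs = model(**batch, labels=batch["input_ids"])

outputs.loss.backward()  # one gradient descent step toward imitation
optimizer.step()
optimizer.zero_grad()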

The Reality of SFT Distillation

Supervised Fine-Tuning (SFT) is powerful, but it’s important to understand what it can and cannot do. SFT teaches a model to imitate behavior, not to develop understanding.

What SFT Does Well

  1. Style Transfer - The student model learns to write like the teacher
  2. Format Replication - Structured outputs, chain-of-thought formatting
  3. Knowledge Extraction - Facts and patterns present in training data
  4. Response Quality - General improvement in coherence and helpfulness

What SFT Cannot Do

1. Transfer True Reasoning

Teacher (Claude 4.5 Opus):
Q: If A > B and B > C, what is the relationship between A and C?
<think>
This is a transitive relationship problem. If A is greater than B,
and B is greater than C, then by the transitive property of
inequalities, A must be greater than C.
</think>
A: A > C (A is greater than C)

When a student model is trained on this transcript, it learns the following (see the training-example sketch after the list):

  • ✅ To use <think> tags
  • ✅ To mention “transitive property”
  • ✅ To format similar problems similarly
  • ❌ To actually understand transitivity and apply it to novel situations
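Here is a sketch of what the training example built from that transcript actually looks like; the field names are placeholders for whatever schema your SFT pipeline uses. The point is that the <think> block is just more target tokens, so the loss rewards reproducing the reasoning's surface form, not the reasoning itself:

# Hypothetical schema: a single SFT record built from the teacher transcript
training_example = {
    "prompt": "Q: If A > B and B > C, what is the relationship between A and C?",
    "response": (
        "<think>\n"
        "This is a transitive relationship problem. If A is greater than B,\n"
        "and B is greater than C, then by the transitive property of\n"
        "inequalities, A must be greater than C.\n"
        "</think>\n"
        "A: A > C (A is greater than C)"
    ),  # every token here, reasoning included, is supervised identically
}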

2. Generalize Beyond Training Distribution

The student model may fail on:

  • Problems slightly different from training examples
  • Edge cases the teacher handled well but weren’t in the dataset
  • Novel combinations of concepts

3. Self-Correct Errors

Without reinforcement learning, the model has no mechanism to:

  • Verify its own reasoning
  • Catch and fix mistakes
  • Learn from incorrect outputs

The Imitation Problem

Consider this analogy:

MEMORIZING vs UNDERSTANDING

A student memorizes:
"2 + 2 = 4"
"3 + 3 = 6"
"5 + 5 = 10"

When asked "7 + 7 = ?", they might:
- Guess based on patterns
- Fail completely
- Get lucky with interpolation

A student who UNDERSTANDS addition can solve any problem.

SFT distillation is closer to the memorization side. The student model learns patterns, not principles.

Real-World Implications

Math Problems

May solve problems similar to training data but fail on variations. Particularly weak on multi-step problems requiring genuine reasoning.

Code Generation

Can reproduce common patterns but may struggle with novel algorithms or debugging unfamiliar code.

Logical Reasoning

May appear to reason but is often pattern matching. Breaks down on unusual logical structures.

Factual Accuracy

Inherits both correct and incorrect information from the teacher. No fact-checking mechanism.

Quantifying the Gap

Based on our testing, SFT-distilled models typically achieve:

Capability       | Retention vs Teacher
-----------------|---------------------
Writing style    | 85-95%
Format/structure | 90-95%
Common knowledge | 70-85%
Novel reasoning  | 40-60%
Edge cases       | 30-50%

When SFT Distillation Is Appropriate

Despite limitations, SFT distillation is valuable for:

  1. Making frontier capabilities accessible - A 4B local model with 60% of Claude’s reasoning is still incredibly useful
  2. Specific domain adaptation - When you only need performance on a narrow task set
  3. Cost reduction - Running locally vs. paying per API call
  4. Privacy - Keeping data on-premises
  5. Speed - Faster inference with smaller models
  6. Experimentation - Quick way to test if distillation helps your use case

Mitigating Limitations

1. Use High-Quality Datasets

The quality of the teacher’s output directly impacts student quality:

# Better: Diverse, challenging prompts with detailed reasoning
dataset = generate_dataset(
    prompts=diverse_challenging_prompts,
    reasoning_effort="high",
    num_samples=1000,
)

# Worse: Simple prompts with short responses
dataset = generate_dataset(
    prompts=simple_prompts,
    reasoning_effort="low",
    num_samples=100,
)

2. Include Edge Cases

Deliberately include unusual problems in your dataset (one way to fold them in is sketched after the list):

  • Problems with multiple valid approaches
  • Problems with no solution
  • Problems requiring correction of initial assumptions
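Reusing the generate_dataset helper from the previous section, one way to mix edge cases into generation looks like this; the individual prompt lists are placeholders for your own collections:

# Hypothetical prompt lists covering the edge cases above
edge_case_prompts = (
    multi_approach_prompts        # multiple valid solution paths
    + unsolvable_prompts          # problems with no valid solution
    + revised_assumption_prompts  # require correcting an initial assumption
)

dataset = generate_dataset(
    prompts=diverse_challenging_prompts + edge_case_prompts,
    reasoning_effort="high",
    num_samples=1000,
)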

3. Reinforcement Learning (RL)

After SFT, you can apply reinforcement learning to:

  • Improve reasoning quality
  • Teach self-correction
  • Align with specific objectives

See the Reinforcement Learning (RL) guide for more details.

4. Ensemble Approaches

Use the distilled model alongside verification (a voting sketch follows the list):

  • Generate multiple responses
  • Use voting or ranking
  • Apply external verification for critical tasks
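A minimal sketch of the voting idea; generate_response stands in for whatever inference call you use against the distilled model, and answers are compared as plain strings, so real use would normalize or parse them first:

from collections import Counter

def majority_vote(prompt, generate_response, n_samples=5):
    """Sample the distilled model several times and keep the most common answer."""
    answers = [generate_response(prompt) for _ in range(n_samples)]
    best, count = Counter(answers).most_common(1)[0]
    return best, count / n_samples  # answer plus a rough agreement score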

Honest Expectations

When using TeichAI distilled models, expect:

  • ✅ Significant improvement over base models on reasoning tasks
  • ✅ Ability to produce thoughtful, well-structured responses
  • ✅ Good performance on common problem types
  • ⚠️ Degradation on unusual or complex problems
  • ⚠️ Occasional confident but incorrect reasoning
  • ❌ Full capability parity with the teacher model

Next Steps