While SFT teaches a model to imitate, Reinforcement Learning (RL) teaches a model to improve its outputs based on feedback. This is a fundamental difference.
SFT: "Here are good examples. Copy them."RL: "Here's a goal. Try things until you achieve it."Consider a math problem:
Q: What is 847 × 293?

SFT-trained model output:

<think>
I need to multiply 847 by 293.
847 × 293 = 847 × 300 - 847 × 7 = 254,100 - 5,929 = 248,171
</think>
The answer is 248,171.

The reasoning looks plausible, but is it correct? (Actual answer: 248,171 ✓)
Now consider:
Q: What is 847 × 293?

SFT-trained model output:

<think>
I need to multiply 847 by 293.
847 × 293 = 847 × 300 - 847 × 6 = 254,100 - 5,082 = 249,018
</think>
The answer is 249,018.

This looks equally plausible, but it's wrong. (Model's answer: 249,018; correct answer: 248,171 ✗)
The SFT model cannot tell the difference between correct and incorrect reasoning; it only knows how to generate text that looks like reasoning.
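A few lines of code, however, can check what the model itself cannot. A minimal sketch of such a check against the two outputs above (check_answer and its answer-extraction regex are illustrative helpers, not from any library):

import re

def check_answer(expected: int, model_output: str) -> bool:
    # Pull the stated answer ("The answer is N") out of the model's output.
    match = re.search(r"The answer is ([\d,]+)", model_output)
    if match is None:
        return False
    return int(match.group(1).replace(",", "")) == expected

ground_truth = 847 * 293                              # 248,171
check_answer(ground_truth, "The answer is 248,171.")  # True:  first output is correct
check_answer(ground_truth, "The answer is 249,018.")  # False: second output is wrong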
RL addresses this limitation in several ways:

Verifiable Feedback: RL can use verifiers (e.g., code execution, math checkers) to provide ground-truth feedback.
Self-Correction: Models learn to check their work and revise incorrect answers (see the sketch after this list).
Exploration: RL encourages exploring different solution paths, not just imitating one.
Optimization: The model is optimized directly toward the goal (a correct answer) rather than toward stylistic similarity.
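To make the self-correction point concrete, here is a minimal revise-on-failure loop. It is only a sketch: model.generate and verifier.check are the same placeholder interfaces used in the code later in this section, and the feedback prompt wording is made up:

def generate_with_revision(model, problem, verifier, max_revisions=3):
    solution = model.generate(problem)
    for _ in range(max_revisions):
        if verifier.check(problem, solution):
            return solution
        # Tell the model its attempt failed and ask it to check its work.
        feedback = (
            f"{problem}\n\nYour previous solution was incorrect:\n{solution}\n"
            "Check your work and provide a corrected solution."
        )
        solution = model.generate(feedback)
    return solution  # best effort after max_revisions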
In practice, several RL recipes are used for post-training:

RLHF (reward model trained from human preferences)
Pros: Captures nuanced human preferences. Cons: Expensive, doesn't scale well for reasoning.

DPO (direct preference optimization)
Pros: Simpler to implement, more stable training. Cons: Still requires preference data.

RLVR (RL with verifiable rewards, e.g., GRPO)
Pros: Scalable, works with verifiable domains. Cons: Requires verifiable tasks (math, code).
GRPO is particularly relevant for reasoning tasks:
# Simplified GRPO concept
for problem in training_problems:
    # Generate multiple candidate solutions
    solutions = model.generate(problem, num_samples=8)

    # Verify each solution
    rewards = [verifier.check(problem, solution) for solution in solutions]

    # Update the model to favor correct solutions
    model.update(solutions, rewards)

This is how models like DeepSeek-R1 and OpenAI's o1 achieve strong reasoning: they learn from verification, not just imitation.
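The "group" in GRPO refers to the batch of solutions sampled for the same problem: each solution's reward is normalized against the group's mean and standard deviation to form its advantage, which removes the need for a separate value model. A minimal sketch of that step (NumPy, with made-up rewards; not DeepSeek's actual implementation):

import numpy as np

def group_relative_advantages(rewards, eps=1e-8):
    # rewards: one scalar reward per sampled solution for a single problem.
    rewards = np.asarray(rewards, dtype=np.float64)
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# 8 samples; the verifier marked three of them correct (reward 1).
rewards = [0, 1, 0, 0, 1, 0, 1, 0]
advantages = group_relative_advantages(rewards)
# Correct solutions receive positive advantages, incorrect ones negative,
# so the policy update shifts probability toward the verified solutions.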
The most effective approach combines both:
TRAINING PIPELINE

    Stage 1: SFT                      Stage 2: RL (GRPO)
    Learn format and style    -->     Learn to be correct

Output: a model that
  - Writes well-formatted reasoning (from SFT)
  - Actually reasons correctly (from RL)

We focus on SFT distillation because, for many use cases, SFT-distilled models work well.
Use our SFT models as a starting point for your own RL training:
# Using Unsloth's GRPO support
from unsloth import FastLanguageModel
from trl import GRPOTrainer, GRPOConfig

# Load our SFT-distilled model
model, tokenizer = FastLanguageModel.from_pretrained(
    "TeichAI/Qwen3-8B-DeepSeek-v3.2-Speciale-Distill"
)

# Define a verifier (reward function) for your domain
def math_verifier(problem, solution):
    # Extract the answer from the solution,
    # compare it to ground truth, and return the result
    return is_correct

# Train with GRPO
trainer = GRPOTrainer(
    model=model,
    processing_class=tokenizer,
    reward_funcs=math_verifier,
    train_dataset=train_dataset,  # your dataset of problems/prompts
    args=GRPOConfig(...),
)
trainer.train()
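One caveat on the sketch above: the exact keyword names and the reward-function signature depend on your trl version. In recent trl releases, GRPO reward functions receive the batch of completions (plus any extra dataset columns as keyword arguments) and return one score per completion. A hedged example in that style; the answer column name and the extraction regex are assumptions for illustration:

import re

def math_verifier(completions, answer, **kwargs):
    # completions: generated solutions for a batch of prompts.
    # answer: ground-truth answers, forwarded from the dataset's "answer" column.
    scores = []
    for completion, truth in zip(completions, answer):
        match = re.search(r"The answer is ([\d,]+)", completion)
        predicted = match.group(1).replace(",", "") if match else None
        scores.append(1.0 if predicted == str(truth) else 0.0)
    return scores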
Alternatively, skip further training and use the SFT model with external verification at inference time:

def verified_generate(model, problem, verifier, max_attempts=5):
    for _ in range(max_attempts):
        solution = model.generate(problem)
        if verifier.check(problem, solution):
            return solution
    return None  # Failed to find a correct solution
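Here verifier is assumed to be any object exposing a check(problem, solution) method; a toy example wired to an exact-match arithmetic check (the ArithmeticVerifier class is hypothetical, not from any library):

class ArithmeticVerifier:
    def __init__(self, expected_answer: int):
        self.expected_answer = expected_answer

    def check(self, problem: str, solution: str) -> bool:
        # Accept the solution if it states the expected answer (commas ignored).
        return str(self.expected_answer) in solution.replace(",", "")

# e.g. verified_generate(model, "What is 847 × 293?", ArithmeticVerifier(847 * 293))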