
Reinforcement Learning (RL)

Beyond Imitation

While SFT teaches a model to imitate, Reinforcement Learning (RL) teaches a model to improve its outputs based on feedback. This is a fundamental difference.

SFT: "Here are good examples. Copy them."
RL: "Here's a goal. Try things until you achieve it."

Why SFT Alone Isn’t Enough

The Reasoning Verification Problem

Consider a math problem:

Q: What is 847 × 293?
SFT-trained model output:
<think>
I need to multiply 847 by 293.
847 × 293 = 847 × 300 - 847 × 7
= 254,100 - 5,929
= 248,171
</think>
The answer is 248,171.

The reasoning looks plausible, but is it correct? (Actual answer: 248,171 ✓)

Now consider:

Q: What is 847 × 293?
SFT-trained model output:
<think>
I need to multiply 847 by 294.
847 × 294 = 847 × 300 - 847 × 6
= 254,100 - 5,082
= 249,018
</think>
The answer is 249,018.

This looks equally plausible, but it’s wrong: the model silently substituted 294 for 293. (Model: 249,018 vs Correct: 248,171 ✗)

The SFT model cannot tell the difference between correct and incorrect reasoning; it only knows how to generate text that looks like reasoning.
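
A simple program, on the other hand, can tell the two transcripts above apart instantly, and that is exactly the signal the SFT objective never sees:

# The correctness of each transcript is trivially machine-checkable,
# even though the SFT loss only rewards plausible-looking text.
assert 847 * 293 == 248_171   # first transcript: correct
assert 847 * 293 != 249_018   # second transcript: wrong, despite plausible steps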

How RL Fixes This

Verifiable Feedback

RL can use verifiers (e.g., code execution, math checkers) to provide ground truth feedback.
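
As a rough sketch (the regex and the reward values are illustrative, not taken from any particular library), a math checker can be as simple as extracting the final number from a solution and comparing it to the known answer:

import re

def check_math_answer(solution_text: str, ground_truth: float) -> float:
    # Return a reward of 1.0 if the last number in the solution matches
    # the ground truth, and 0.0 otherwise.
    numbers = re.findall(r"-?\d[\d,]*\.?\d*", solution_text)
    if not numbers:
        return 0.0
    last = float(numbers[-1].replace(",", ""))
    return 1.0 if abs(last - ground_truth) < 1e-6 else 0.0

check_math_answer("The answer is 248,171.", 248171)  # -> 1.0
check_math_answer("The answer is 249,018.", 248171)  # -> 0.0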

Self-Correction

Models learn to check their work and revise incorrect answers.

Exploration

RL encourages exploring different solution paths, not just imitating one.

Optimization

Direct optimization toward the goal (correct answer) rather than stylistic similarity.

Types of RL for LLMs

RLHF (Reinforcement Learning from Human Feedback)

  • Human annotators rank model outputs
  • Model learns to produce preferred outputs
  • Used by OpenAI, Anthropic for alignment

Pros: Captures nuanced human preferences
Cons: Expensive, doesn’t scale well for reasoning
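
At the core of RLHF is a reward model trained on those human rankings. A minimal sketch of the pairwise (Bradley-Terry style) loss, assuming the scalar scores for the preferred and rejected responses have already been computed:

import torch
import torch.nn.functional as F

def reward_model_loss(score_chosen: torch.Tensor, score_rejected: torch.Tensor) -> torch.Tensor:
    # Pairwise loss: push the reward model to score the human-preferred
    # response above the rejected one.
    return -F.logsigmoid(score_chosen - score_rejected).mean()

The policy is then optimized (typically with PPO) to maximize this learned reward.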

DPO (Direct Preference Optimization)

  • Simplified alternative to RLHF
  • Uses paired examples (preferred vs rejected)
  • No separate reward model needed

Pros: Simpler to implement, more stable training
Cons: Still requires preference data
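
A minimal sketch of the DPO objective, assuming the sequence log-probabilities under the current policy and a frozen reference model have already been computed (variable names are illustrative):

import torch
import torch.nn.functional as F

def dpo_loss(policy_logp_chosen, policy_logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    # Implicit rewards are log-prob ratios against the reference model;
    # the loss widens the margin between preferred and rejected responses.
    chosen_ratio = policy_logp_chosen - ref_logp_chosen
    rejected_ratio = policy_logp_rejected - ref_logp_rejected
    return -F.logsigmoid(beta * (chosen_ratio - rejected_ratio)).mean()

Because the reward is implicit in these ratios, no separate reward model has to be trained, which is where the simplicity and stability come from.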

GRPO (Group Relative Policy Optimization)

  • Generates multiple solutions for each problem
  • Uses verifier (e.g., code execution) to determine correct ones
  • Optimizes policy to favor correct solutions

Pros: Scalable, works with verifiable domains
Cons: Requires verifiable tasks (math, code)

The GRPO Approach

GRPO is particularly relevant for reasoning tasks:

# Simplified GRPO concept
for problem in training_problems:
    # Generate multiple solutions per problem
    solutions = model.generate(problem, num_samples=8)
    # Verify each solution (e.g., run the code, check the final answer)
    rewards = [verifier.check(problem, solution) for solution in solutions]
    # Update the model to favor the solutions that verified as correct
    model.update(solutions, rewards)
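
The "group relative" part refers to how those rewards become a learning signal: each sampled solution is scored relative to the other samples for the same problem. A minimal sketch of that normalization (the epsilon and the exact formulation vary by implementation):

import statistics

def group_relative_advantages(rewards, eps=1e-6):
    # Normalize rewards within one problem's group of samples:
    # above-average solutions get a positive advantage, below-average a negative one.
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

group_relative_advantages([1.0, 0.0, 0.0, 1.0])  # -> [1.0, -1.0, -1.0, 1.0] (approximately)

Because the baseline comes from the group itself, GRPO does not need a separate value network.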

This is how models like DeepSeek-R1 and OpenAI’s o1 achieve strong reasoning: they learn from verification, not just imitation.

SFT + RL: The Complete Pipeline

The most effective approach combines both:

┌──────────────────────────────────────────────────┐
│                TRAINING PIPELINE                 │
├──────────────────────────────────────────────────┤
│                                                  │
│   ┌──────────────┐        ┌──────────────┐       │
│   │  Stage 1:    │  ───►  │  Stage 2:    │       │
│   │  SFT         │        │  RL (GRPO)   │       │
│   │ Learn format │        │ Learn to be  │       │
│   │ and style    │        │ correct      │       │
│   └──────────────┘        └──────────────┘       │
│                                                  │
│   Output: Model that                             │
│   - Writes well-formatted reasoning (from SFT)   │
│   - Actually reasons correctly (from RL)         │
└──────────────────────────────────────────────────┘

Why TeichAI Uses SFT Only (For Now)

We focus on SFT distillation because:

  1. Accessibility - SFT is simpler and runs on consumer hardware
  2. Speed - Quick iteration and experimentation
  3. Foundation - SFT provides the base for future RL
  4. Practical Value - Even SFT-only models are useful for many tasks

What You Can Do

Option 1: Use Our SFT Models As-Is

For many use cases, SFT-distilled models work well:

  • General conversation
  • Content generation
  • Common reasoning tasks
  • Code completion (with verification)

Option 2: Apply RL Yourself

Use our SFT models as a starting point for your own RL training:

# Using Unsloth's GRPO support
from unsloth import FastLanguageModel
from trl import GRPOTrainer, GRPOConfig

# Load our SFT-distilled model
model, tokenizer = FastLanguageModel.from_pretrained(
    "TeichAI/Qwen3-8B-DeepSeek-v3.2-Speciale-Distill"
)

# Define a verifier for your domain
def math_verifier(problem, solution):
    # Extract the final answer from the solution text and
    # compare it to the ground truth for this problem
    is_correct = ...  # your extraction-and-comparison logic
    return is_correct

# Train with GRPO (argument names are a sketch; check the trl docs for the
# exact GRPOTrainer/GRPOConfig signature in your installed version)
trainer = GRPOTrainer(
    model=model,
    tokenizer=tokenizer,
    reward_fn=math_verifier,
    args=GRPOConfig(...)
)
trainer.train()

Option 3: Ensemble with Verification

Use the SFT model with external verification:

def verified_generate(model, problem, verifier, max_attempts=5):
    for _ in range(max_attempts):
        solution = model.generate(problem)
        if verifier.check(problem, solution):
            return solution
    return None  # Failed to find a correct solution within max_attempts
