
Reinforcement Learning (RL)

Beyond Imitation

While SFT teaches a model to imitate, Reinforcement Learning (RL) teaches a model to improve its outputs based on feedback. This is a fundamental difference.

SFT: "Here are good examples. Copy them."
RL: "Here's a goal. Try things until you achieve it."

Why SFT Alone Isn’t Enough

The Reasoning Verification Problem

Consider a math problem:

Q: What is 847 × 293?
SFT-trained model output:
<think>
I need to multiply 847 by 293.
847 × 293 = 847 × 300 - 847 × 7
= 254,100 - 5,929
= 248,171
</think>
The answer is 248,171.

The reasoning looks plausible, but is it correct? (Actual answer: 248,171 ✓)

Now consider:

Q: What is 847 × 293?
SFT-trained model output:
<think>
I need to multiply 847 by 294.
847 × 294 = 847 × 300 - 847 × 6
= 254,100 - 5,082
= 249,018
</think>
The answer is 249,018.

This looks equally plausible, but it’s wrong: the model silently substituted 294 for 293. (Model: 249,018 vs Correct: 248,171 ✗)

The SFT model cannot tell the difference between correct and incorrect reasoning; it only knows how to generate text that looks like reasoning.
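
A simple program, on the other hand, can tell the two transcripts above apart instantly, and that is exactly the signal the SFT objective never sees:

# The correctness of each transcript is trivially machine-checkable,
# even though the SFT loss only rewards plausible-looking text.
assert 847 * 293 == 248_171   # first transcript: correct
assert 847 * 293 != 249_018   # second transcript: wrong, despite plausible steps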

How RL Fixes This

Verifiable Feedback

RL can use verifiers (e.g., code execution, math checkers) to provide ground truth feedback.
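
As a rough sketch (the regex and the reward values are illustrative, not taken from any particular library), a math checker can be as simple as extracting the final number from a solution and comparing it to the known answer:

import re

def check_math_answer(solution_text: str, ground_truth: float) -> float:
    # Return a reward of 1.0 if the last number in the solution matches
    # the ground truth, and 0.0 otherwise.
    numbers = re.findall(r"-?\d[\d,]*\.?\d*", solution_text)
    if not numbers:
        return 0.0
    last = float(numbers[-1].replace(",", ""))
    return 1.0 if abs(last - ground_truth) < 1e-6 else 0.0

check_math_answer("The answer is 248,171.", 248171)  # -> 1.0
check_math_answer("The answer is 249,018.", 248171)  # -> 0.0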

Self-Correction

Models learn to check their work and revise incorrect answers.

Exploration

RL encourages exploring different solution paths, not just imitating one.

Optimization

Direct optimization toward the goal (correct answer) rather than stylistic similarity.

Types of RL for LLMs

RLHF (Reinforcement Learning from Human Feedback)

  • Human annotators rank model outputs
  • Model learns to produce preferred outputs
  • Used by OpenAI, Anthropic for alignment

Pros: Captures nuanced human preferences
Cons: Expensive, doesn’t scale well for reasoning
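
At the core of RLHF is a reward model trained on those human rankings. A minimal sketch of the pairwise (Bradley-Terry style) loss, assuming the scalar scores for the preferred and rejected responses have already been computed:

import torch
import torch.nn.functional as F

def reward_model_loss(score_chosen: torch.Tensor, score_rejected: torch.Tensor) -> torch.Tensor:
    # Pairwise loss: push the reward model to score the human-preferred
    # response above the rejected one.
    return -F.logsigmoid(score_chosen - score_rejected).mean()

The policy is then optimized (typically with PPO) to maximize this learned reward.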

DPO (Direct Preference Optimization)

  • Simplified alternative to RLHF
  • Uses paired examples (preferred vs rejected)
  • No separate reward model needed

Pros: Simpler to implement, more stable training
Cons: Still requires preference data
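
A minimal sketch of the DPO objective, assuming the sequence log-probabilities under the current policy and a frozen reference model have already been computed (variable names are illustrative):

import torch
import torch.nn.functional as F

def dpo_loss(policy_logp_chosen, policy_logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    # Implicit rewards are log-prob ratios against the reference model;
    # the loss widens the margin between preferred and rejected responses.
    chosen_ratio = policy_logp_chosen - ref_logp_chosen
    rejected_ratio = policy_logp_rejected - ref_logp_rejected
    return -F.logsigmoid(beta * (chosen_ratio - rejected_ratio)).mean()

Because the reward is implicit in these ratios, no separate reward model has to be trained, which is where the simplicity and stability come from.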

GRPO (Group Relative Policy Optimization)

  • Generates multiple solutions for each problem
  • Uses verifier (e.g., code execution) to determine correct ones
  • Optimizes policy to favor correct solutions

Pros: Scalable, works with verifiable domains
Cons: Requires verifiable tasks (math, code)

The GRPO Approach

GRPO is particularly relevant for reasoning tasks:

# Simplified GRPO concept
for problem in training_problems:
    # Generate multiple solutions per problem
    solutions = model.generate(problem, num_samples=8)
    # Verify each solution (e.g., run the code, check the final answer)
    rewards = [verifier.check(problem, solution) for solution in solutions]
    # Update the model to favor the solutions that verified as correct
    model.update(solutions, rewards)
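
The "group relative" part refers to how those rewards become a learning signal: each sampled solution is scored relative to the other samples for the same problem. A minimal sketch of that normalization (the epsilon and the exact formulation vary by implementation):

import statistics

def group_relative_advantages(rewards, eps=1e-6):
    # Normalize rewards within one problem's group of samples:
    # above-average solutions get a positive advantage, below-average a negative one.
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

group_relative_advantages([1.0, 0.0, 0.0, 1.0])  # -> [1.0, -1.0, -1.0, 1.0] (approximately)

Because the baseline comes from the group itself, GRPO does not need a separate value network.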

This is how models like DeepSeek-R1 and OpenAI’s o1 achieve strong reasoning: they learn from verification, not just imitation.

SFT + RL: The Complete Pipeline

The most effective approach combines both:

┌──────────────────────────────────────────────────┐
│                TRAINING PIPELINE                 │
├──────────────────────────────────────────────────┤
│                                                  │
│   ┌──────────────┐        ┌──────────────┐       │
│   │  Stage 1:    │  ───►  │  Stage 2:    │       │
│   │  SFT         │        │  RL (GRPO)   │       │
│   │ Learn format │        │ Learn to be  │       │
│   │ and style    │        │ correct      │       │
│   └──────────────┘        └──────────────┘       │
│                                                  │
│   Output: Model that                             │
│   - Writes well-formatted reasoning (from SFT)   │
│   - Actually reasons correctly (from RL)         │
└──────────────────────────────────────────────────┘

Why TeichAI Uses SFT Only (For Now)

We focus on SFT distillation because:

  1. Accessibility - SFT is simpler and runs on consumer hardware
  2. Speed - Quick iteration and experimentation
  3. Foundation - SFT provides the base for future RL
  4. Practical Value - Even SFT-only models are useful for many tasks

What You Can Do

Option 1: Use Our SFT Models As-Is

For many use cases, SFT-distilled models work well:

  • General conversation
  • Content generation
  • Common reasoning tasks
  • Code completion (with verification)

Option 2: Apply RL Yourself

Use our SFT models as a starting point for your own RL training:

# Using Unsloth's GRPO support
from unsloth import FastLanguageModel
from trl import GRPOTrainer, GRPOConfig

# Load our SFT-distilled model
model, tokenizer = FastLanguageModel.from_pretrained(
    "TeichAI/Qwen3-8B-DeepSeek-v3.2-Speciale-Distill"
)

# Define a verifier for your domain
def math_verifier(problem, solution):
    # Extract the final answer from the solution text and
    # compare it to the ground truth for this problem
    is_correct = ...  # your extraction-and-comparison logic
    return is_correct

# Train with GRPO (argument names are a sketch; check the trl docs for the
# exact GRPOTrainer/GRPOConfig signature in your installed version)
trainer = GRPOTrainer(
    model=model,
    tokenizer=tokenizer,
    reward_fn=math_verifier,
    args=GRPOConfig(...)
)
trainer.train()

Option 3: Ensemble with Verification

Use the SFT model with external verification:

def verified_generate(model, problem, verifier, max_attempts=5):
    for _ in range(max_attempts):
        solution = model.generate(problem)
        if verifier.check(problem, solution):
            return solution
    return None  # Failed to find a correct solution within max_attempts
