Conclusions & Practical Guidance

Why this page

This page distills what actually matters when you train with our approach. Use it as a reality‑check and a set of starting heuristics—not rigid rules. Results vary significantly with base model, dataset, and objective.

How we train (very briefly)

  • Load an Unsloth base (Qwen3 family, including 2507 long‑context variants) with 4‑bit weights and an appropriate chat template (the full flow is sketched in code after this list).
  • Apply LoRA (typically r=32, attention + MLP targets, use_gradient_checkpointing="unsloth").
  • Format ShareGPT/messages datasets via tokenizer.apply_chat_template(..., add_generation_prompt=False) into a text field.
  • Train with TRL SFTTrainer using SFTConfig:
    • max_length ~ 8192 (trainer truncation)
    • per_device_train_batch_size=1, gradient_accumulation_steps=4
    • warmup_ratio=0.05, learning_rate=2e-4, optim="adamw_8bit"
    • Save regularly; upload merged and GGUF variants on finish
  • Validate and filter data: balanced <think> tags (if present), conversations that end on an assistant turn, and non‑empty content (a minimal validator is sketched below).
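
For concreteness, here is a minimal sketch of that flow, assuming recent Unsloth and TRL APIs. The model name, data file, and step/save counts are illustrative assumptions, not recommendations:

```python
# Minimal sketch of the training flow above, assuming recent Unsloth + TRL.
# The model name, data file, and step/save counts are illustrative only.
from datasets import load_dataset
from trl import SFTConfig, SFTTrainer
from unsloth import FastLanguageModel

# 1) Load a 4-bit Unsloth base (hypothetical choice; match your goal/memory).
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Qwen3-8B",
    max_seq_length=8192,
    load_in_4bit=True,
)

# 2) LoRA on attention + MLP projections.
model = FastLanguageModel.get_peft_model(
    model,
    r=32,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    use_gradient_checkpointing="unsloth",
)

# 3) Render messages into a plain "text" field with the model's chat template.
def to_text(example):
    return {"text": tokenizer.apply_chat_template(
        example["messages"], tokenize=False, add_generation_prompt=False)}

dataset = load_dataset("json", data_files="train.jsonl", split="train")
dataset = dataset.map(to_text)

# 4) Train with TRL's SFTTrainer.
trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    args=SFTConfig(
        output_dir="outputs",
        dataset_text_field="text",
        max_length=8192,                   # trainer-side truncation
        per_device_train_batch_size=1,
        gradient_accumulation_steps=4,
        warmup_ratio=0.05,
        learning_rate=2e-4,
        optim="adamw_8bit",
        max_steps=2000,                    # generous; early-stop on the loss curve
        logging_steps=10,
        save_steps=250,                    # save regularly
    ),
)
trainer.train()
```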

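The validation step is worth running as an explicit filter before training. A minimal sketch, assuming records shaped like the messages format above:

```python
# Minimal sketch of the validation pass, assuming records shaped like
# {"messages": [{"role": "...", "content": "..."}, ...]}.
def is_valid(example):
    messages = example.get("messages") or []
    if not messages or messages[-1].get("role") != "assistant":
        return False                       # must end on an assistant turn
    final = messages[-1].get("content") or ""
    if final.count("<think>") != final.count("</think>"):
        return False                       # <think> tags must be balanced
    # Require non-empty content after </think> (or overall if no tags).
    answer = final.split("</think>")[-1]
    return bool(answer.strip())

dataset = dataset.filter(is_valid)         # drop malformed rows before training
```
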
SFT’s limits (what to expect)

  • SFT imitates the teacher’s outputs; it does not transfer the teacher’s internal reasoning or calibration.
  • Generalization beyond the training distribution is limited.
  • Over‑training on narrow data can reduce robustness.
  • High‑quality, diverse data matters more than anything else.

See the detailed discussion

What actually moves the needle

  • Base model choice dominates. Thinking vs Instruct, 2507 long‑context vs standard, dense vs MoE—all change memory needs and step dynamics.
  • Dataset size and distribution heavily influence step count and convergence.
  • Larger models (more parameters) often reach a target loss in fewer steps on the same dataset.
  • Chat template correctness is critical: thinking models need the <think> markup their template produces (see the snippet below).
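
One way to pin the template down is Unsloth's get_chat_template helper; the qwen3-thinking name follows the checklist below, so confirm it matches your base model:

```python
# Sketch: pin the tokenizer's template to the base model family.
# "qwen3-thinking" follows the checklist below; verify it fits your base.
from unsloth.chat_templates import get_chat_template

tokenizer = get_chat_template(tokenizer, chat_template="qwen3-thinking")

# Spot-check: a rendered assistant turn should carry the <think> markup.
sample = [{"role": "user", "content": "hi"},
          {"role": "assistant", "content": "<think>\n...\n</think>\nHello!"}]
print(tokenizer.apply_chat_template(sample, tokenize=False,
                                    add_generation_prompt=False))
```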

Loss targets and training length (our heuristics)

  • We typically aim for training loss to settle in the ~0.04–0.10 band.
  • Start with a generous max_steps, watch the curve, and early‑stop once the loss is stable in that band (a stopping callback is sketched after this list).
  • A practical rule of thumb:
    • 250–1000 examples: ~2000 steps works well. Training far beyond 2000 often degrades quality.
    • Larger datasets: prefer at least one full epoch so the model sees the whole set once; scale steps accordingly.
  • Bigger models may need fewer steps to hit the same loss on the same data.
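
To make the step math concrete: with per_device_train_batch_size=1 and gradient_accumulation_steps=4, one optimizer step consumes 4 examples, so one full pass over a 4,000‑row set is roughly 1,000 steps. For the early stop, here is a sketch of a loss‑band callback (the band edges and patience are assumptions to tune):

```python
# Sketch: stop once the logged loss sits in the target band for a while.
# The band edges and patience are assumptions to tune, not fixed values.
from transformers import TrainerCallback

class StopInLossBand(TrainerCallback):
    def __init__(self, low=0.04, high=0.10, patience=20):
        self.low, self.high, self.patience = low, high, patience
        self.streak = 0

    def on_log(self, args, state, control, logs=None, **kwargs):
        loss = (logs or {}).get("loss")
        if loss is None:
            return
        # Count consecutive logged losses inside the band.
        self.streak = self.streak + 1 if self.low <= loss <= self.high else 0
        if self.streak >= self.patience:
            control.should_training_stop = True   # stable in band: stop

trainer.add_callback(StopInLossBand())  # register before trainer.train()
```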

Practical checklist

  • Pick a base model that matches your goal (Thinking vs Instruct) and your memory budget (standard vs 2507 long‑context, dense vs MoE).
  • Verify the chat template matches the base model. Thinking models require the qwen3-thinking template.
  • Validate data rigorously: messages list, assistant‑final turn, balanced <think> tags, non‑empty after </think>.
  • Start training with defaults; watch loss and GPU memory.
  • Early‑stop when loss stabilizes in the ~0.04–0.10 band.
  • For >4000 rows, favor at least one full pass over the data.
  • Export merged + GGUF for broad usability (see the export sketch below).
  • Always test your model a few times before uploading it, especially if you’re using a new base model or dataset.
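
For the export step, a sketch using Unsloth's merged/GGUF save helpers; the paths and quantization method are assumptions:

```python
# Sketch: export merged weights and a GGUF build after training.
# Paths and quantization method are assumptions; adjust to your release flow.
model.save_pretrained_merged("outputs/merged", tokenizer,
                             save_method="merged_16bit")
model.save_pretrained_gguf("outputs/gguf", tokenizer,
                           quantization_method="q4_k_m")
```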

Final word

Distillation is as much engineering judgment as it is code. The guidance here—and across this site—is meant to get you productive quickly, then out of your way. Measure, adjust, and prefer small, fast experiments over rigid recipes.