
Quick Start

Prerequisites

Before you begin, you’ll need:

  • A Google account (for Google Colab)
  • A Hugging Face account with a write token
  • ~2-4 hours of GPU time (free Colab tier should work)
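
If you want to make sure the token is valid and tied to the right account before spending GPU time, a quick check with the huggingface_hub library (already installed as a dependency of the tools used here) looks roughly like this:

from huggingface_hub import HfApi

api = HfApi(token="hf_...")   # paste your write token here
print(api.whoami()["name"])   # raises an error if the token is invalid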

Option 1: Google Colab

The fastest way to get started is with our pre-built notebooks:

  1. Choose a notebook

    Visit our Notebooks page and select a model size.

  2. Open in Colab

    Click the “Open in Colab” button. The notebook includes all dependencies and is ready to run.

  3. Configure your settings

    Update the configuration cell with your details:

    hf_account = "your-username" # Your HuggingFace username
    hf_token = "hf_..." # Your HF write token
    output_model_name = "My-Model" # Name for your distilled model
  4. Select a dataset

    Choose from our pre-built reasoning datasets, or bring your own (the expected format is sketched after this list):

    # Option A: Use a TeichAI dataset
    dataset_id = "TeichAI/claude-4.5-opus-high-reasoning-250x"
    # Option B: Use your own dataset
    dataset_file = "my-dataset.jsonl"
  5. Run all cells

    Click Runtime → Run all. Full distillation typically takes 2-4 hours (depending on your GPU).

  6. Download your model

    After training, your model is uploaded to Hugging Face in two formats:

    • Transformers format (merged 16-bit weights)
    • GGUF format (f16, q8_0 quantizations)
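
A note on step 4: if you bring your own dataset (Option B), the training code reads a messages column, i.e. each JSONL line is a JSON object holding a list of role/content turns. That is what the Option 2 script below expects, and the same layout is a reasonable assumption for the notebooks. A minimal sketch of producing such a file (contents purely illustrative):

import json

rows = [
    {
        "messages": [
            {"role": "user", "content": "What is 17 * 24?"},
            {"role": "assistant", "content": "17 * 24 = 408."},
        ]
    },
    # ...one object per training conversation
]

with open("my-dataset.jsonl", "w") as f:
    for row in rows:
        f.write(json.dumps(row) + "\n")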

Option 2: Run Locally

If you have a GPU with at least 16GB VRAM, you can run the training locally.

Install Dependencies

pip install unsloth
pip install datasets transformers trl
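
Since this route assumes at least 16GB of VRAM, it is worth confirming that PyTorch can actually see your GPU before going any further; a minimal check:

import torch

if not torch.cuda.is_available():
    raise SystemExit("No CUDA device detected")
props = torch.cuda.get_device_properties(0)
print(f"{props.name}: {props.total_memory / 1024**3:.1f} GiB VRAM")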

Create a Training Script

Create a new file train.py:

import os
import multiprocessing as mp

os.environ["TOKENIZERS_PARALLELISM"] = "false"
os.environ["HF_DATASETS_DISABLE_MULTIPROCESSING"] = "1"

from datasets import load_dataset
from unsloth import FastLanguageModel
from unsloth.chat_templates import get_chat_template
from trl import SFTTrainer, SFTConfig
import torch

# Configuration
hf_account = "your-username"
hf_token = "hf_your_token_here"
input_model = "unsloth/Qwen3-4B"
dataset_id = "TeichAI/claude-4.5-opus-high-reasoning-250x"
output_model_name = "Qwen3-4B-My-Distill"
chat_template = "qwen3"

# Load model
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name=input_model,
    max_seq_length=8192,
    load_in_4bit=True,
    token=hf_token,
    attn_implementation="eager",
)

# Apply LoRA
model = FastLanguageModel.get_peft_model(
    model,
    r=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    lora_alpha=32,
    lora_dropout=0,
    bias="none",
    use_gradient_checkpointing="unsloth",
    random_state=3407,
)

# Load and format dataset
tokenizer = get_chat_template(tokenizer, chat_template=chat_template)
raw_dataset = load_dataset(dataset_id, split="train")

def formatting_prompts_func(examples):
    convos = examples["messages"]
    texts = [tokenizer.apply_chat_template(convo, tokenize=False,
                                           add_generation_prompt=False) for convo in convos]
    return {"text": texts}

train_dataset = raw_dataset.map(formatting_prompts_func, batched=True)

# Train
if __name__ == "__main__":
    mp.freeze_support()

    trainer = SFTTrainer(
        model=model,
        processing_class=tokenizer,
        train_dataset=train_dataset,
        args=SFTConfig(
            dataset_text_field="text",
            max_length=8192,
            per_device_train_batch_size=1,
            gradient_accumulation_steps=4,
            warmup_ratio=0.05,
            max_steps=2000,
            learning_rate=2e-4,
            optim="adamw_8bit",
            output_dir="outputs",
        ),
    )
    trainer.train()

    # Upload to HuggingFace
    model.push_to_hub_merged(
        f"{hf_account}/{output_model_name}",
        tokenizer,
        save_method="merged_16bit",
        token=hf_token,
    )

    # Create GGUF versions
    model.push_to_hub_gguf(
        f"{hf_account}/{output_model_name}-GGUF",
        tokenizer,
        quantization_method=["bf16", "f16", "q8_0"],
        token=hf_token,
    )

Run Training

python train.py
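
Long runs occasionally get interrupted. The Trainer writes periodic checkpoints into the outputs/ directory configured above (every 500 steps with stock transformers settings, though that default is worth double-checking for your version), so resuming is typically a one-line change in train.py:

# Resume from the most recent checkpoint in output_dir instead of starting over
# (assumes at least one checkpoint has already been saved).
trainer.train(resume_from_checkpoint=True)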

Using Your Distilled Model

With Ollama

# Download the GGUF file
huggingface-cli download your-username/Qwen3-4B-My-Distill-GGUF \
--include "*.gguf" --local-dir ./models
# Create an Ollama Modelfile
echo 'FROM ./models/model-q8_0.gguf' > Modelfile
# Import to Ollama
ollama create my-model -f Modelfile
# Run it!
ollama run my-model
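
If you prefer calling the model from Python rather than the ollama CLI, the official ollama client (pip install ollama, an extra dependency not used elsewhere in this guide) exposes a simple chat call. A minimal sketch, reusing the my-model name created above:

import ollama

response = ollama.chat(
    model="my-model",  # the name passed to `ollama create` above
    messages=[{"role": "user", "content": "Explain quantum entanglement"}],
)
print(response["message"]["content"])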

With LM Studio

  1. Open LM Studio
  2. Go to Discover → My Models
  3. Click Import and select your GGUF file
  4. Start chatting!
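
LM Studio can also serve the imported model through its OpenAI-compatible local server (once enabled, it listens on http://localhost:1234/v1 by default). A minimal sketch using the openai Python client, an extra dependency here; the model identifier should match whatever LM Studio displays for your import:

from openai import OpenAI

# LM Studio ignores the API key, but the client requires a non-empty string
client = OpenAI(base_url="http://localhost:1234/v1", api_key="lm-studio")
completion = client.chat.completions.create(
    model="qwen3-4b-my-distill",  # use the identifier LM Studio shows
    messages=[{"role": "user", "content": "Explain quantum entanglement"}],
)
print(completion.choices[0].message.content)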

With Transformers

from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "your-username/Qwen3-4B-My-Distill",
    torch_dtype="auto",
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("your-username/Qwen3-4B-My-Distill")

messages = [{"role": "user", "content": "Explain quantum entanglement"}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
outputs = model.generate(inputs, max_new_tokens=512)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
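
For long reasoning outputs it is often nicer to stream tokens as they are generated; transformers ships a TextStreamer that plugs straight into generate. A sketch that continues from the snippet above (model, tokenizer, and inputs already defined):

from transformers import TextStreamer

# Prints tokens to stdout as they are generated, skipping the prompt
streamer = TextStreamer(tokenizer, skip_prompt=True)
_ = model.generate(inputs, max_new_tokens=512, streamer=streamer)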

Next Steps