
Quick Start

Prerequisites

Before you begin, you’ll need:

  • A Google account (for Google Colab)
  • A Hugging Face account with a write token
  • ~2-4 hours of GPU time (free Colab tier should work)
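
If you want to make sure the token is valid and tied to the right account before spending GPU time, a quick check with the huggingface_hub library (already installed as a dependency of the tools used here) looks roughly like this:

from huggingface_hub import HfApi

api = HfApi(token="hf_...")   # paste your write token here
print(api.whoami()["name"])   # raises an error if the token is invalid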

Option 1: Google Colab

The fastest way to get started is with our pre-built notebooks:

  1. Choose a notebook

    Visit our Notebooks page and select a model size.

  2. Open in Colab

    Click the “Open in Colab” button. The notebook includes all dependencies and is ready to run.

  3. Configure your settings

    Update the configuration cell with your details:

    hf_account = "your-username" # Your HuggingFace username
    hf_token = "hf_..." # Your HF write token
    output_model_name = "My-Model" # Name for your distilled model
  4. Select a dataset

    Choose from our pre-built reasoning datasets, or bring your own (the expected format is sketched after this list):

    # Option A: Use a TeichAI dataset
    dataset_id = "TeichAI/claude-4.5-opus-high-reasoning-250x"
    # Option B: Use your own dataset
    dataset_file = "my-dataset.jsonl"
  5. Run all cells

    Click Runtime → Run all. Full distillation typically takes 2-4 hours (depending on your GPU).

  6. Download your model

    After training, your model is uploaded to Hugging Face in two formats:

    • Transformers format (merged 16-bit weights)
    • GGUF format (f16, q8_0 quantizations)
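
A note on step 4: if you bring your own dataset (Option B), the training code reads a messages column, i.e. each JSONL line is a JSON object holding a list of role/content turns. That is what the Option 2 script below expects, and the same layout is a reasonable assumption for the notebooks. A minimal sketch of producing such a file (contents purely illustrative):

import json

rows = [
    {
        "messages": [
            {"role": "user", "content": "What is 17 * 24?"},
            {"role": "assistant", "content": "17 * 24 = 408."},
        ]
    },
    # ...one object per training conversation
]

with open("my-dataset.jsonl", "w") as f:
    for row in rows:
        f.write(json.dumps(row) + "\n")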

Option 2: Run Locally

If you have a GPU with at least 16GB VRAM, you can run the training locally.

Install Dependencies

pip install unsloth
pip install datasets transformers trl
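
Since this route assumes at least 16GB of VRAM, it is worth confirming that PyTorch can actually see your GPU before going any further; a minimal check:

import torch

if not torch.cuda.is_available():
    raise SystemExit("No CUDA device detected")
props = torch.cuda.get_device_properties(0)
print(f"{props.name}: {props.total_memory / 1024**3:.1f} GiB VRAM")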

Create a Training Script

Create a new file train.py:

import os
import multiprocessing as mp

os.environ["TOKENIZERS_PARALLELISM"] = "false"
os.environ["HF_DATASETS_DISABLE_MULTIPROCESSING"] = "1"

from datasets import load_dataset
from unsloth import FastLanguageModel
from unsloth.chat_templates import get_chat_template
from trl import SFTTrainer, SFTConfig
import torch

# Configuration
hf_account = "your-username"
hf_token = "hf_your_token_here"
input_model = "unsloth/Qwen3-4B"
dataset_id = "TeichAI/claude-4.5-opus-high-reasoning-250x"
output_model_name = "Qwen3-4B-My-Distill"
chat_template = "qwen3"

# Load model
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name=input_model,
    max_seq_length=8192,
    load_in_4bit=True,
    token=hf_token,
    attn_implementation="eager",
)

# Apply LoRA
model = FastLanguageModel.get_peft_model(
    model,
    r=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    lora_alpha=32,
    lora_dropout=0,
    bias="none",
    use_gradient_checkpointing="unsloth",
    random_state=3407,
)

# Load and format dataset
tokenizer = get_chat_template(tokenizer, chat_template=chat_template)
raw_dataset = load_dataset(dataset_id, split="train")

def formatting_prompts_func(examples):
    convos = examples["messages"]
    texts = [tokenizer.apply_chat_template(convo, tokenize=False,
                                           add_generation_prompt=False) for convo in convos]
    return {"text": texts}

train_dataset = raw_dataset.map(formatting_prompts_func, batched=True)

# Train
if __name__ == "__main__":
    mp.freeze_support()

    trainer = SFTTrainer(
        model=model,
        processing_class=tokenizer,
        train_dataset=train_dataset,
        args=SFTConfig(
            dataset_text_field="text",
            max_length=8192,
            per_device_train_batch_size=1,
            gradient_accumulation_steps=4,
            warmup_ratio=0.05,
            max_steps=2000,
            learning_rate=2e-4,
            optim="adamw_8bit",
            output_dir="outputs",
        ),
    )
    trainer.train()

    # Upload to HuggingFace
    model.push_to_hub_merged(
        f"{hf_account}/{output_model_name}",
        tokenizer,
        save_method="merged_16bit",
        token=hf_token,
    )

    # Create GGUF versions
    model.push_to_hub_gguf(
        f"{hf_account}/{output_model_name}-GGUF",
        tokenizer,
        quantization_method=["bf16", "f16", "q8_0"],
        token=hf_token,
    )

Run Training

python train.py
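
Long runs occasionally get interrupted. The Trainer writes periodic checkpoints into the outputs/ directory configured above (every 500 steps with stock transformers settings, though that default is worth double-checking for your version), so resuming is typically a one-line change in train.py:

# Resume from the most recent checkpoint in output_dir instead of starting over
# (assumes at least one checkpoint has already been saved).
trainer.train(resume_from_checkpoint=True)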

Using Your Distilled Model

With Ollama

# Download the GGUF file
huggingface-cli download your-username/Qwen3-4B-My-Distill-GGUF \
--include "*.gguf" --local-dir ./models
# Create an Ollama Modelfile
echo 'FROM ./models/model-q8_0.gguf' > Modelfile
# Import to Ollama
ollama create my-model -f Modelfile
# Run it!
ollama run my-model
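
If you prefer calling the model from Python rather than the ollama CLI, the official ollama client (pip install ollama, an extra dependency not used elsewhere in this guide) exposes a simple chat call. A minimal sketch, reusing the my-model name created above:

import ollama

response = ollama.chat(
    model="my-model",  # the name passed to `ollama create` above
    messages=[{"role": "user", "content": "Explain quantum entanglement"}],
)
print(response["message"]["content"])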

With LM Studio

  1. Open LM Studio
  2. Go to Discover → My Models
  3. Click Import and select your GGUF file
  4. Start chatting!
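
LM Studio can also serve the imported model through its OpenAI-compatible local server (once enabled, it listens on http://localhost:1234/v1 by default). A minimal sketch using the openai Python client, an extra dependency here; the model identifier should match whatever LM Studio displays for your import:

from openai import OpenAI

# LM Studio ignores the API key, but the client requires a non-empty string
client = OpenAI(base_url="http://localhost:1234/v1", api_key="lm-studio")
completion = client.chat.completions.create(
    model="qwen3-4b-my-distill",  # use the identifier LM Studio shows
    messages=[{"role": "user", "content": "Explain quantum entanglement"}],
)
print(completion.choices[0].message.content)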

With Transformers

from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "your-username/Qwen3-4B-My-Distill",
    torch_dtype="auto",
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("your-username/Qwen3-4B-My-Distill")

messages = [{"role": "user", "content": "Explain quantum entanglement"}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
outputs = model.generate(inputs, max_new_tokens=512)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
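
For long reasoning outputs it is often nicer to stream tokens as they are generated; transformers ships a TextStreamer that plugs straight into generate. A sketch that continues from the snippet above (model, tokenizer, and inputs already defined):

from transformers import TextStreamer

# Prints tokens to stdout as they are generated, skipping the prompt
streamer = TextStreamer(tokenizer, skip_prompt=True)
_ = model.generate(inputs, max_new_tokens=512, streamer=streamer)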

Next Steps