Quick Start
Prerequisites
Before you begin, you’ll need:
- A Google account (for Google Colab)
- A Hugging Face account with a write token
- ~2-4 hours of GPU time (free Colab tier should work)
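If this is the first time you are using the write token from your environment, it is worth confirming it is valid before starting a multi-hour run. One way to do that, sketched with the `huggingface_hub` client (the token string below is a placeholder):

```python
# Optional: verify your Hugging Face token before starting a long run.
from huggingface_hub import login, whoami

login(token="hf_...")      # paste your write token here
print(whoami()["name"])    # should print your Hugging Face username
```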
Option 1: Use Our Colab Notebooks (Recommended)
The fastest way to get started is with our pre-built notebooks:
1. Choose a notebook

   Visit our Notebooks page and select a model size:

   - Qwen3-4B-2507 Distillation - Best for the free Colab tier
   - Qwen3-8B Distillation - Still possible, but may take longer on the free tier
2. Open in Colab

   Click the “Open in Colab” button. The notebook includes all dependencies and is ready to run.
3. Configure your settings

   Update the configuration cell with your details:

   ```python
   hf_account = "your-username"        # Your HuggingFace username
   hf_token = "hf_..."                 # Your HF write token
   output_model_name = "My-Model"      # Name for your distilled model
   ```
4. Select a dataset

   Choose one of our pre-built reasoning datasets, or point the notebook at your own file (see the example record after these steps):

   ```python
   # Option A: Use a TeichAI dataset
   dataset_id = "TeichAI/claude-4.5-opus-high-reasoning-250x"

   # Option B: Use your own dataset
   dataset_file = "my-dataset.jsonl"
   ```
5. Run all cells

   Click Runtime → Run all. A full distillation run typically takes 2-4 hours, depending on your GPU.
6. Download your model

   After training, your model will be uploaded to HuggingFace in both:

   - Transformers format (merged 16-bit weights)
   - GGUF format (f16 and q8_0 quantizations)
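If you bring your own dataset (Option B in step 4), the training code below formats a "messages" column with the chat template, so your JSONL file should follow the same messages-style schema as the TeichAI datasets. Here is a minimal sketch of writing one record to a hypothetical my-dataset.jsonl; the file name and the example conversation are placeholders:

```python
# A minimal sketch of one record in my-dataset.jsonl, assuming the same
# "messages" schema the TeichAI datasets use: each line is a JSON object
# containing a list of role/content turns.
import json

record = {
    "messages": [
        {"role": "user", "content": "What is 17 * 24?"},
        {"role": "assistant", "content": "17 * 24 = 408."},
    ]
}

with open("my-dataset.jsonl", "a", encoding="utf-8") as f:
    f.write(json.dumps(record) + "\n")
```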
Option 2: Run Locally
If you have a GPU with at least 16GB VRAM, you can run the training locally.
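Not sure how much VRAM your GPU has? A quick check with PyTorch (a small convenience snippet, separate from the training script):

```python
# Report the installed GPU and its total VRAM.
import torch

if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(0)
    print(f"{props.name}: {props.total_memory / 1024**3:.1f} GB VRAM")
else:
    print("No CUDA GPU detected")
```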
Install Dependencies
```bash
pip install unsloth
pip install datasets transformers trl
```

Create a Training Script
Create a new file train.py:
```python
import os
import multiprocessing as mp

os.environ["TOKENIZERS_PARALLELISM"] = "false"
os.environ["HF_DATASETS_DISABLE_MULTIPROCESSING"] = "1"

from datasets import load_dataset
from unsloth import FastLanguageModel
from unsloth.chat_templates import get_chat_template
from trl import SFTTrainer, SFTConfig
import torch

# Configuration
hf_account = "your-username"
hf_token = "hf_your_token_here"
input_model = "unsloth/Qwen3-4B"
dataset_id = "TeichAI/claude-4.5-opus-high-reasoning-250x"
output_model_name = "Qwen3-4B-My-Distill"
chat_template = "qwen3"

# Load model
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name=input_model,
    max_seq_length=8192,
    load_in_4bit=True,
    token=hf_token,
    attn_implementation="eager",
)

# Apply LoRA
model = FastLanguageModel.get_peft_model(
    model,
    r=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    lora_alpha=32,
    lora_dropout=0,
    bias="none",
    use_gradient_checkpointing="unsloth",
    random_state=3407,
)

# Load and format dataset
tokenizer = get_chat_template(tokenizer, chat_template=chat_template)
raw_dataset = load_dataset(dataset_id, split="train")

def formatting_prompts_func(examples):
    convos = examples["messages"]
    texts = [
        tokenizer.apply_chat_template(convo, tokenize=False, add_generation_prompt=False)
        for convo in convos
    ]
    return {"text": texts}

train_dataset = raw_dataset.map(formatting_prompts_func, batched=True)

# Train
if __name__ == "__main__":
    mp.freeze_support()
    trainer = SFTTrainer(
        model=model,
        processing_class=tokenizer,
        train_dataset=train_dataset,
        args=SFTConfig(
            dataset_text_field="text",
            max_length=8192,
            per_device_train_batch_size=1,
            gradient_accumulation_steps=4,
            warmup_ratio=0.05,
            max_steps=2000,
            learning_rate=2e-4,
            optim="adamw_8bit",
            output_dir="outputs",
        ),
    )
    trainer.train()

    # Upload to HuggingFace
    model.push_to_hub_merged(
        f"{hf_account}/{output_model_name}",
        tokenizer,
        save_method="merged_16bit",
        token=hf_token,
    )

    # Create GGUF versions
    model.push_to_hub_gguf(
        f"{hf_account}/{output_model_name}-GGUF",
        tokenizer,
        quantization_method=["bf16", "f16", "q8_0"],
        token=hf_token,
    )
```
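With per_device_train_batch_size=1 and gradient_accumulation_steps=4, each optimizer step consumes 4 examples, so max_steps=2000 corresponds to roughly 8,000 training examples seen. Before committing to a run, a quick back-of-envelope check of how many passes over your dataset that implies (a sketch; the dataset size below is an example value):

```python
# How many passes over the dataset does max_steps imply?
# Values mirror the config above; set dataset_size to len(train_dataset).
batch_size = 1          # per_device_train_batch_size
grad_accum = 4          # gradient_accumulation_steps
max_steps = 2000
dataset_size = 250      # example value only

examples_seen = batch_size * grad_accum * max_steps      # 8000
epochs = examples_seen / dataset_size
print(f"~{examples_seen} examples seen, ~{epochs:.0f} passes over the data")
```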
Run Training

```bash
python train.py
```
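A 2,000-step run can take a few hours; if it is interrupted, the checkpoints the Trainer periodically writes into outputs/ can usually be resumed instead of starting over. A minimal sketch of the one-line change in train.py:

```python
# Resume from the newest checkpoint in output_dir instead of starting fresh.
trainer.train(resume_from_checkpoint=True)
```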
Using Your Distilled Model

With Ollama
```bash
# Download the GGUF file
huggingface-cli download your-username/Qwen3-4B-My-Distill-GGUF \
  --include "*.gguf" --local-dir ./models

# Create an Ollama Modelfile
echo 'FROM ./models/model-q8_0.gguf' > Modelfile

# Import to Ollama
ollama create my-model -f Modelfile

# Run it!
ollama run my-model
```

With LM Studio
- Open LM Studio
- Go to Discover → My Models
- Click Import and select your GGUF file
- Start chatting!
With Transformers
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "your-username/Qwen3-4B-My-Distill",
    torch_dtype="auto",
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("your-username/Qwen3-4B-My-Distill")

messages = [{"role": "user", "content": "Explain quantum entanglement"}]
# Add the generation prompt and move inputs to the model's device.
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
outputs = model.generate(inputs, max_new_tokens=512)
print(tokenizer.decode(outputs[0]))
```
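Distilled reasoning models can emit a long chain of thought before the final answer, so streaming tokens as they are generated is often nicer than waiting for generate() to return. This optional sketch reuses model, tokenizer, and inputs from the block above together with Transformers' TextStreamer:

```python
from transformers import TextStreamer

# Print tokens to stdout as they are generated, skipping the echoed prompt.
streamer = TextStreamer(tokenizer, skip_prompt=True)
outputs = model.generate(inputs, max_new_tokens=512, streamer=streamer)
```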
Next Steps
- Training Parameters Guide - Optimize your training
- Creating Datasets - Make your own training data
- SFT Limitations - Understand the tradeoffs