Training Parameters
Overview
This guide explains the key training parameters used in TeichAI distillation scripts and how to optimize them for your use case.
LoRA Configuration
LoRA (Low-Rank Adaptation) enables efficient fine-tuning by training only a small set of adapter weights.
```python
model = FastLanguageModel.get_peft_model(
    model,
    r=32,                    # LoRA rank
    target_modules=[         # Layers to adapt
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
    lora_alpha=32,           # Scaling factor
    lora_dropout=0,          # Dropout rate
    bias="none",             # Bias training
    use_gradient_checkpointing="unsloth",
    random_state=3407,
)
```
Key Parameters
r (LoRA Rank)
Controls the capacity of the adapter. Higher = more expressive but more VRAM.
| Rank | VRAM Impact | Capacity | Use Case |
|---|---|---|---|
| 8 | Minimal | Low | Simple tasks, small datasets |
| 16 | Low | Medium | General distillation |
| 32 | Moderate | High | Recommended default |
| 64 | Higher | Very High | Complex tasks, large datasets |
| 128 | Significant | Maximum | Full capability transfer |
Loss-based early stopping (Qwen3 distillation)
These practices reflect how we run Qwen3 distillation in production:
- Set a large `max_steps` initially (e.g., 8k–12k) so you're unlikely to hit the cap prematurely.
- Monitor training loss live. When loss is consistently between 0.10 and 0.04 for ~100–200 steps, stop the run.
- Note the global step you stopped at, update `max_steps` to that value, and resume from the nearest checkpoint to finalize and export.
Example resume flow:
```python
# 1) Start with a large cap
args = SFTConfig(
    max_steps=12000,
    logging_steps=1,
    save_strategy="steps",
    save_steps=200,
    output_dir="outputs",
)

# 2) Train, watch loss, stop early (Ctrl+C) once stable around 0.10–0.04

# 3) Resume at the nearest checkpoint and pin max_steps to your last good step
args = SFTConfig(
    max_steps=5400,  # replace with your observed stopping step
    logging_steps=1,
    save_strategy="steps",
    save_steps=200,
    output_dir="outputs",
)

trainer = SFTTrainer(
    model=model,
    processing_class=tokenizer,
    train_dataset=train_dataset,
    args=args,
)
trainer.train(resume_from_checkpoint=True)

# 4) Merge LoRA and export (HF and GGUF) after the resume completes
```
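If you would rather automate the stop than watch the logs and interrupt the run by hand, a `TrainerCallback` can end training once the logged loss has stayed inside the target band long enough. A minimal sketch, assuming the 0.10–0.04 band and ~150-step patience described above (the `LossBandStop` name and thresholds are illustrative, not part of our scripts):

```python
from transformers import TrainerCallback
from trl import SFTTrainer

class LossBandStop(TrainerCallback):
    """Stop training after `patience` consecutive logged losses fall inside [low, high]."""

    def __init__(self, low=0.04, high=0.10, patience=150):
        self.low, self.high, self.patience = low, high, patience
        self.hits = 0

    def on_log(self, args, state, control, logs=None, **kwargs):
        loss = (logs or {}).get("loss")
        if loss is None:
            return
        # Count consecutive in-band losses; reset on any excursion.
        self.hits = self.hits + 1 if self.low <= loss <= self.high else 0
        if self.hits >= self.patience:
            control.should_training_stop = True

trainer = SFTTrainer(
    model=model,
    processing_class=tokenizer,
    train_dataset=train_dataset,
    args=args,
    callbacks=[LossBandStop()],
)
trainer.train()
```

Because `logging_steps=1`, the patience value corresponds directly to training steps; after the callback stops the run, you still pin `max_steps` and resume as above before exporting.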
lora_alpha
Scaling factor for LoRA updates. Usually set equal to rank or 2× rank.
```python
# Common configurations
lora_alpha = r      # Conservative (our default)
lora_alpha = r * 2  # More aggressive updates
```
target_modules
Which layers to apply LoRA to. Our default covers all transformer layers:
```python
target_modules = [
    "q_proj", "k_proj", "v_proj", "o_proj",  # Attention
    "gate_proj", "up_proj", "down_proj",     # MLP
]
```
For memory savings, you can target fewer modules:
```python
# Attention only (less VRAM)
target_modules = ["q_proj", "v_proj"]

# MLP only (different behavior)
target_modules = ["gate_proj", "up_proj", "down_proj"]
```
Training Configuration
```python
args = SFTConfig(
    # Data
    dataset_text_field="text",
    max_length=8192,

    # Batch size
    per_device_train_batch_size=1,
    gradient_accumulation_steps=4,

    # Learning
    warmup_ratio=0.05,
    max_steps=2000,
    learning_rate=2e-4,
    lr_scheduler_type="linear",

    # Optimization
    optim="adamw_8bit",
    weight_decay=0.01,

    # Logging & Saving
    logging_steps=1,
    save_strategy="steps",
    save_steps=200,
    save_total_limit=20,
    output_dir="outputs",
)
```
Batch Size & Accumulation
Effective batch size = per_device_train_batch_size × gradient_accumulation_steps
| Batch Size | Accumulation | Effective Batch |
|---|---|---|
| 1 | 4 | 4 |
| 2 | 4 | 8 |
| 4 | 4 | 16 |
| 8 | 4 | 32 |
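When VRAM is tight, you can hold the effective batch size fixed while shifting work between the two knobs. A minimal sketch of two equivalent settings (values illustrative):

```python
from trl import SFTConfig

# Both take an optimizer step every 8 examples; the first keeps fewer
# sequences in memory at once, the second has higher GPU utilization.
low_vram = SFTConfig(
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,
    output_dir="outputs",
)
high_throughput = SFTConfig(
    per_device_train_batch_size=4,
    gradient_accumulation_steps=2,
    output_dir="outputs",
)
```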
Sequence Length
Two places control sequence length:
- SFTConfig `max_length` (trainer truncation)
- Model load `max_seq_length` (tokenizer/model context window)
```python
# Trainer truncation
args = SFTConfig(max_length=8192)

# Model context window
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name=input_model,
    max_seq_length=8192,
)
```
Trade-offs:
- Longer = captures full reasoning traces but uses more VRAM
- Shorter = faster training but may truncate important content
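If you are unsure where to land on this trade-off, a quick token-length check of the dataset tells you how much would actually be truncated. A rough sketch, assuming `train_dataset` exposes the same `"text"` field used by `dataset_text_field`:

```python
# Sample up to 1000 rows and count how many exceed the planned max_length.
sample = train_dataset.select(range(min(1000, len(train_dataset))))
lengths = [len(tokenizer(row["text"])["input_ids"]) for row in sample]

max_length = 8192
truncated = sum(1 for n in lengths if n > max_length)
print(f"Longest sampled example: {max(lengths)} tokens")
print(f"{truncated}/{len(lengths)} sampled rows exceed max_length={max_length}")
```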
Learning Rate
We use 2e-4 as the default, which works well for LoRA fine-tuning.
```python
# Conservative (slower but stable)
learning_rate = 1e-4

# Default (good balance)
learning_rate = 2e-4

# Aggressive (faster but may overshoot)
learning_rate = 5e-4
```
Training Duration
`max_steps` controls how long to train. We scale it based on dataset size:
```python
# Our formula
dataset_rows = len(raw_dataset)
steps = max(1000, int(2000 * (dataset_rows / 1000)))

# Examples:
#   250 samples → 1000 steps (minimum)
#  1000 samples → 2000 steps
#  3000 samples → 6000 steps
```
Memory Optimization
Gradient Checkpointing
Trades compute for memory by recomputing activations during the backward pass.
```python
use_gradient_checkpointing="unsloth"  # Recommended - 30% less VRAM
use_gradient_checkpointing=True       # Standard checkpointing
use_gradient_checkpointing=False      # Fastest but most memory
```
Quantization
4-bit quantization dramatically reduces memory:
```python
load_in_4bit=True   # ~75% memory reduction
load_in_8bit=True   # ~50% memory reduction
load_in_4bit=False  # Full precision (most memory)
```
Optimizer
8-bit Adam reduces optimizer state memory:
```python
optim="adamw_8bit"   # Recommended - less memory
optim="adamw_torch"  # Standard Adam
optim="sgd"          # Minimal memory but worse convergence
```
Configuration Templates
```python
# For RTX 3060, T4, etc.
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Qwen3-4B-Thinking-2507",
    max_seq_length=4096,
    load_in_4bit=True,
)

model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    lora_alpha=16,
    use_gradient_checkpointing="unsloth",
)

args = SFTConfig(
    max_length=4096,
    per_device_train_batch_size=1,
    gradient_accumulation_steps=4,
    max_steps=1000,
    learning_rate=2e-4,
    optim="adamw_8bit",
)
```

```python
# For RTX 4090, A10, etc.
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Qwen3-8B",
    max_seq_length=8192,
    load_in_4bit=True,
)

model = FastLanguageModel.get_peft_model(
    model,
    r=32,
    lora_alpha=32,
    use_gradient_checkpointing="unsloth",
)

args = SFTConfig(
    max_length=8192,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=4,
    max_steps=2000,
    learning_rate=2e-4,
    optim="adamw_8bit",
)
```

```python
# For A100, H100, etc.
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Qwen3-14B",
    max_seq_length=8192,
    load_in_4bit=True,
)

model = FastLanguageModel.get_peft_model(
    model,
    r=64,
    lora_alpha=64,
    use_gradient_checkpointing="unsloth",
)

args = SFTConfig(
    max_length=8192,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    max_steps=2000,
    learning_rate=2e-4,
    optim="adamw_8bit",
)
```
Troubleshooting
Out of Memory (OOM)
- Reduce sequence length (`max_length` in SFTConfig and/or `max_seq_length` when loading the model)
- Reduce `per_device_train_batch_size` to 1
- Reduce LoRA `r` to 16 or 8
- Ensure `use_gradient_checkpointing="unsloth"`
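Applied together, these adjustments look roughly like the sketch below (model name and values are illustrative, not a tuned recipe):

```python
from unsloth import FastLanguageModel
from trl import SFTConfig

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Qwen3-4B-Thinking-2507",
    max_seq_length=2048,          # shorter context window
    load_in_4bit=True,
)

model = FastLanguageModel.get_peft_model(
    model,
    r=8,                          # smaller LoRA rank
    lora_alpha=8,
    use_gradient_checkpointing="unsloth",
)

args = SFTConfig(
    max_length=2048,              # match the shorter context
    per_device_train_batch_size=1,
    gradient_accumulation_steps=4,
    optim="adamw_8bit",
    output_dir="outputs",
)
```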
Loss Not Decreasing
- Increase `learning_rate` slightly
- Check dataset quality - bad data = bad training
- Increase LoRA `r` for more capacity
- Ensure chat template matches model type
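For the last point, one quick sanity check is to compare a formatted training row against what the model's own chat template produces. A hedged sketch, assuming `train_dataset` stores preformatted text in the `"text"` field and the tokenizer ships a chat template:

```python
# Role/turn markers as stored in the training data
print(train_dataset[0]["text"][:500])

# Role/turn markers the model actually expects
reference = tokenizer.apply_chat_template(
    [{"role": "user", "content": "hi"},
     {"role": "assistant", "content": "hello"}],
    tokenize=False,
)
print(reference)
```

If the two disagree (different special tokens or role markers), regenerate the dataset with the matching template before tuning anything else.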
Training Too Slow
- Reduce sequence length (`max_length` in SFTConfig and/or `max_seq_length` when loading the model)
- Increase `per_device_train_batch_size` if memory allows
- Consider a smaller base model
- Use `load_in_4bit=True`