
Training Parameters

Overview

This guide explains the key training parameters used in TeichAI distillation scripts and how to optimize them for your use case.

LoRA Configuration

LoRA (Low-Rank Adaptation) enables efficient fine-tuning by training only a small set of adapter weights.

model = FastLanguageModel.get_peft_model(
    model,
    r=32,                                   # LoRA rank
    target_modules=[                        # Layers to adapt
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
    lora_alpha=32,                          # Scaling factor
    lora_dropout=0,                         # Dropout rate
    bias="none",                            # Bias training
    use_gradient_checkpointing="unsloth",
    random_state=3407,
)

Key Parameters

r (LoRA Rank)

Controls the capacity of the adapter. Higher = more expressive but more VRAM.

Rank | VRAM Impact | Capacity  | Use Case
-----|-------------|-----------|------------------------------
8    | Minimal     | Low       | Simple tasks, small datasets
16   | Low         | Medium    | General distillation
32   | Moderate    | High      | Recommended default
64   | Higher      | Very High | Complex tasks, large datasets
128  | Significant | Maximum   | Full capability transfer

Loss-Based Early Stopping (Qwen3 Distillation)

These practices reflect how we run Qwen3 distillation in production:

  • Set a large max_steps initially (e.g., 8k–12k) so you’re unlikely to hit the cap prematurely.
  • Monitor training loss live. When the loss holds steadily in the 0.04–0.10 range for ~100–200 steps, stop the run.
  • Note the global step you stopped at, update max_steps to that value, and resume from the nearest checkpoint to finalize and export.

Example resume flow:

# 1) Start with a large cap
args = SFTConfig(
    max_steps=12000,
    logging_steps=1,
    save_strategy="steps",
    save_steps=200,
    output_dir="outputs",
)

# 2) Train, watch loss, stop early (Ctrl+C) once stable around 0.04–0.10
# 3) Resume at the nearest checkpoint and pin max_steps to your last good step
args = SFTConfig(
    max_steps=5400,  # replace with your observed stopping step
    logging_steps=1,
    save_strategy="steps",
    save_steps=200,
    output_dir="outputs",
)
trainer = SFTTrainer(
    model=model,
    processing_class=tokenizer,
    train_dataset=train_dataset,
    args=args,
)
trainer.train(resume_from_checkpoint=True)

# 4) Merge LoRA and export (HF and GGUF) after the resume completes
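
A minimal sketch of step 4 using Unsloth's export helpers. The output directories and the q4_k_m quantization choice are illustrative assumptions, not values fixed by the scripts above:

# Merge the LoRA adapters into the base weights and save an HF-format copy.
model.save_pretrained_merged(
    "outputs/merged",                # illustrative output directory
    tokenizer,
    save_method="merged_16bit",
)
# Export a GGUF file for llama.cpp-compatible runtimes.
model.save_pretrained_gguf(
    "outputs/gguf",                  # illustrative output directory
    tokenizer,
    quantization_method="q4_k_m",    # illustrative quantization choice
)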

lora_alpha

Scaling factor for LoRA updates. Usually set equal to rank or 2× rank.

# Common configurations
lora_alpha = r # Conservative (our default)
lora_alpha = r * 2 # More aggressive updates

target_modules

Which layers to apply LoRA to. Our default adapts the attention and MLP projections in every transformer block:

target_modules = [
    "q_proj", "k_proj", "v_proj", "o_proj",  # Attention
    "gate_proj", "up_proj", "down_proj",     # MLP
]

For memory savings, you can target fewer modules:

# Attention only (less VRAM)
target_modules = ["q_proj", "v_proj"]
# MLP only (different behavior)
target_modules = ["gate_proj", "up_proj", "down_proj"]

Training Configuration

args = SFTConfig(
    # Data
    dataset_text_field="text",
    max_length=8192,
    # Batch size
    per_device_train_batch_size=1,
    gradient_accumulation_steps=4,
    # Learning
    warmup_ratio=0.05,
    max_steps=2000,
    learning_rate=2e-4,
    lr_scheduler_type="linear",
    # Optimization
    optim="adamw_8bit",
    weight_decay=0.01,
    # Logging & Saving
    logging_steps=1,
    save_strategy="steps",
    save_steps=200,
    save_total_limit=20,
    output_dir="outputs",
)

Batch Size & Accumulation

Effective batch size = per_device_train_batch_size × gradient_accumulation_steps

Batch Size | Accumulation | Effective Batch
-----------|--------------|----------------
1          | 4            | 4
2          | 4            | 8
4          | 4            | 16
8          | 4            | 32
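
As a quick sanity check, you can derive gradient_accumulation_steps from a target effective batch size; the target value here is an illustrative assumption:

# per_device_train_batch_size × gradient_accumulation_steps = effective batch
per_device_train_batch_size = 1
target_effective_batch = 8  # illustrative target

gradient_accumulation_steps = target_effective_batch // per_device_train_batch_size
assert per_device_train_batch_size * gradient_accumulation_steps == target_effective_batch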

Sequence Length

Two places control sequence length:

  • SFTConfig max_length (trainer truncation)
  • Model load max_seq_length (tokenizer/model context window)

# Trainer truncation
args = SFTConfig(max_length=8192)

# Model context window
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name=input_model,
    max_seq_length=8192,
)

Trade-offs:

  • Longer = captures full reasoning traces but uses more VRAM
  • Shorter = faster training but may truncate important content
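
One way to pick a length is to measure token counts on a sample of your dataset before training. This sketch assumes rows carry a "text" field (matching dataset_text_field="text" above):

import numpy as np

# Tokenize a sample of rows and inspect the length distribution.
sample = train_dataset.select(range(min(1000, len(train_dataset))))
lengths = [len(tokenizer(row["text"])["input_ids"]) for row in sample]

print("p50:", int(np.percentile(lengths, 50)))
print("p95:", int(np.percentile(lengths, 95)))
print("max:", max(lengths))
# If p95 fits comfortably under max_length, most traces train untruncated.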

Learning Rate

We use 2e-4 as the default, which works well for LoRA fine-tuning.

# Conservative (slower but stable)
learning_rate = 1e-4
# Default (good balance)
learning_rate = 2e-4
# Aggressive (faster but may overshoot)
learning_rate = 5e-4

Training Duration

max_steps controls how long to train. We scale based on dataset size:

# Our formula
dataset_rows = len(raw_dataset)
steps = max(1000, int(2000 * (dataset_rows / 1000)))
# Examples:
# 250 samples → 1000 steps (minimum)
# 1000 samples → 2000 steps
# 3000 samples → 6000 steps

Memory Optimization

Gradient Checkpointing

Trades compute for memory by recomputing activations during the backward pass.

use_gradient_checkpointing="unsloth" # Recommended - 30% less VRAM
use_gradient_checkpointing=True # Standard checkpointing
use_gradient_checkpointing=False # Fastest but most memory

Quantization

4-bit quantization dramatically reduces memory:

load_in_4bit=True # ~75% memory reduction
load_in_8bit=True # ~50% memory reduction
load_in_4bit=False # Full precision (most memory)

Optimizer

8-bit Adam reduces optimizer state memory:

optim="adamw_8bit" # Recommended - less memory
optim="adamw_torch" # Standard Adam
optim="sgd" # Minimal memory but worse convergence

Configuration Templates

# For RTX 3060, T4, etc.
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Qwen3-4B-Thinking-2507",
    max_seq_length=4096,
    load_in_4bit=True,
)
model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    lora_alpha=16,
    use_gradient_checkpointing="unsloth",
)
args = SFTConfig(
    max_length=4096,
    per_device_train_batch_size=1,
    gradient_accumulation_steps=4,
    max_steps=1000,
    learning_rate=2e-4,
    optim="adamw_8bit",
)

Troubleshooting

Out of Memory (OOM)

  1. Reduce sequence length (max_length in SFTConfig and/or max_seq_length when loading the model)
  2. Reduce per_device_train_batch_size to 1
  3. Reduce LoRA r to 16 or 8
  4. Ensure use_gradient_checkpointing="unsloth" (see the combined low-memory sketch below)
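
Putting these adjustments together, a reduced-memory run might look like this sketch; the specific values are illustrative assumptions, not a tested recipe:

# Illustrative low-memory settings; tune to your GPU.
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name=input_model,
    max_seq_length=2048,                   # 1) shorter context window
    load_in_4bit=True,
)
model = FastLanguageModel.get_peft_model(
    model,
    r=8,                                   # 3) smaller LoRA rank
    lora_alpha=8,
    use_gradient_checkpointing="unsloth",  # 4) Unsloth checkpointing
)
args = SFTConfig(
    max_length=2048,                       # 1) matching trainer truncation
    per_device_train_batch_size=1,         # 2) smallest per-device batch
    gradient_accumulation_steps=8,         # keep the effective batch reasonable
)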

Loss Not Decreasing

  1. Increase learning_rate slightly
  2. Check dataset quality - bad data = bad training
  3. Increase LoRA r for more capacity
  4. Ensure the chat template matches the model type (see the quick check below)
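
A quick way to verify the template: render a toy conversation with the model's own chat template and compare it against a row of your formatted training text. The example messages are purely illustrative:

# Render with the tokenizer's built-in chat template.
messages = [
    {"role": "user", "content": "Example question"},
    {"role": "assistant", "content": "Example answer"},
]
rendered = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=False,
)
print(rendered)                         # special tokens the model expects
print(train_dataset[0]["text"][:500])   # should use the same structure/tokens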

Training Too Slow

  1. Reduce sequence length (max_length in SFTConfig and/or max_seq_length when loading the model)
  2. Increase per_device_train_batch_size if memory allows (see the sketch below)
  3. Consider a smaller base model
  4. Use load_in_4bit=True
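
If memory allows, you can trade accumulation for a larger per-device batch while keeping the same effective batch size; the values below are illustrative:

args = SFTConfig(
    max_length=4096,                # 1) shorter sequences
    per_device_train_batch_size=4,  # 2) larger per-device batch
    gradient_accumulation_steps=1,  #    effective batch stays at 4
)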