
Training Parameters

Overview

This guide explains the key training parameters used in TeichAI distillation scripts and how to optimize them for your use case.

LoRA Configuration

LoRA (Low-Rank Adaptation) enables efficient fine-tuning by training only a small set of adapter weights.

model = FastLanguageModel.get_peft_model(
    model,
    r=32,                                   # LoRA rank
    target_modules=[                        # Layers to adapt
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
    lora_alpha=32,                          # Scaling factor
    lora_dropout=0,                         # Dropout rate
    bias="none",                            # Bias training
    use_gradient_checkpointing="unsloth",
    random_state=3407,
)

Key Parameters

r (LoRA Rank)

Controls the capacity of the adapter. Higher = more expressive but more VRAM.

Rank | VRAM Impact | Capacity  | Use Case
-----|-------------|-----------|------------------------------
8    | Minimal     | Low       | Simple tasks, small datasets
16   | Low         | Medium    | General distillation
32   | Moderate    | High      | Recommended default
64   | Higher      | Very High | Complex tasks, large datasets
128  | Significant | Maximum   | Full capability transfer

Loss-Based Early Stopping (Qwen3 Distillation)

These practices reflect how we run Qwen3 distillation in production:

  • Set a large max_steps initially (e.g., 8k–12k) so you’re unlikely to hit the cap prematurely.
  • Monitor training loss live. When the loss holds steadily in the 0.04–0.10 range for ~100–200 steps, stop the run.
  • Note the global step you stopped at, update max_steps to that value, and resume from the nearest checkpoint to finalize and export.

Example resume flow:

# 1) Start with a large cap
args = SFTConfig(
    max_steps=12000,
    logging_steps=1,
    save_strategy="steps",
    save_steps=200,
    output_dir="outputs",
)

# 2) Train, watch loss, stop early (Ctrl+C) once stable around 0.04–0.10
# 3) Resume at the nearest checkpoint and pin max_steps to your last good step
args = SFTConfig(
    max_steps=5400,  # replace with your observed stopping step
    logging_steps=1,
    save_strategy="steps",
    save_steps=200,
    output_dir="outputs",
)
trainer = SFTTrainer(
    model=model,
    processing_class=tokenizer,
    train_dataset=train_dataset,
    args=args,
)
trainer.train(resume_from_checkpoint=True)

# 4) Merge LoRA and export (HF and GGUF) after the resume completes
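
A minimal sketch of step 4 using Unsloth's export helpers. The output directories and the q4_k_m quantization choice are illustrative assumptions, not values fixed by the scripts above:

# Merge the LoRA adapters into the base weights and save an HF-format copy.
model.save_pretrained_merged(
    "outputs/merged",                # illustrative output directory
    tokenizer,
    save_method="merged_16bit",
)
# Export a GGUF file for llama.cpp-compatible runtimes.
model.save_pretrained_gguf(
    "outputs/gguf",                  # illustrative output directory
    tokenizer,
    quantization_method="q4_k_m",    # illustrative quantization choice
)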

lora_alpha

Scaling factor for LoRA updates. Usually set equal to rank or 2× rank.

# Common configurations
lora_alpha = r # Conservative (our default)
lora_alpha = r * 2 # More aggressive updates

target_modules

Which layers to apply LoRA to. Our default adapts the attention and MLP projections in every transformer block:

target_modules = [
    "q_proj", "k_proj", "v_proj", "o_proj",  # Attention
    "gate_proj", "up_proj", "down_proj",     # MLP
]

For memory savings, you can target fewer modules:

# Attention only (less VRAM)
target_modules = ["q_proj", "v_proj"]
# MLP only (different behavior)
target_modules = ["gate_proj", "up_proj", "down_proj"]

Training Configuration

args = SFTConfig(
    # Data
    dataset_text_field="text",
    max_length=8192,
    # Batch size
    per_device_train_batch_size=1,
    gradient_accumulation_steps=4,
    # Learning
    warmup_ratio=0.05,
    max_steps=2000,
    learning_rate=2e-4,
    lr_scheduler_type="linear",
    # Optimization
    optim="adamw_8bit",
    weight_decay=0.01,
    # Logging & Saving
    logging_steps=1,
    save_strategy="steps",
    save_steps=200,
    save_total_limit=20,
    output_dir="outputs",
)

Batch Size & Accumulation

Effective batch size = per_device_train_batch_size × gradient_accumulation_steps

Batch Size | Accumulation | Effective Batch
-----------|--------------|----------------
1          | 4            | 4
2          | 4            | 8
4          | 4            | 16
8          | 4            | 32
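
As a quick sanity check, you can derive gradient_accumulation_steps from a target effective batch size; the target value here is an illustrative assumption:

# per_device_train_batch_size × gradient_accumulation_steps = effective batch
per_device_train_batch_size = 1
target_effective_batch = 8  # illustrative target

gradient_accumulation_steps = target_effective_batch // per_device_train_batch_size
assert per_device_train_batch_size * gradient_accumulation_steps == target_effective_batch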

Sequence Length

Two places control sequence length:

  • SFTConfig max_length (trainer truncation)
  • Model load max_seq_length (tokenizer/model context window)

# Trainer truncation
args = SFTConfig(max_length=8192)

# Model context window
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name=input_model,
    max_seq_length=8192,
)

Trade-offs:

  • Longer = captures full reasoning traces but uses more VRAM
  • Shorter = faster training but may truncate important content
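
One way to pick a length is to measure token counts on a sample of your dataset before training. This sketch assumes rows carry a "text" field (matching dataset_text_field="text" above):

import numpy as np

# Tokenize a sample of rows and inspect the length distribution.
sample = train_dataset.select(range(min(1000, len(train_dataset))))
lengths = [len(tokenizer(row["text"])["input_ids"]) for row in sample]

print("p50:", int(np.percentile(lengths, 50)))
print("p95:", int(np.percentile(lengths, 95)))
print("max:", max(lengths))
# If p95 fits comfortably under max_length, most traces train untruncated.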

Learning Rate

We use 2e-4 as the default, which works well for LoRA fine-tuning.

# Conservative (slower but stable)
learning_rate = 1e-4
# Default (good balance)
learning_rate = 2e-4
# Aggressive (faster but may overshoot)
learning_rate = 5e-4

Training Duration

max_steps controls how long to train. We scale based on dataset size:

# Our formula
dataset_rows = len(raw_dataset)
steps = max(1000, int(2000 * (dataset_rows / 1000)))
# Examples:
# 250 samples → 1000 steps (minimum)
# 1000 samples → 2000 steps
# 3000 samples → 6000 steps

Memory Optimization

Gradient Checkpointing

Trades compute for memory by recomputing activations during the backward pass.

use_gradient_checkpointing="unsloth" # Recommended - 30% less VRAM
use_gradient_checkpointing=True # Standard checkpointing
use_gradient_checkpointing=False # Fastest but most memory

Quantization

4-bit quantization dramatically reduces memory:

load_in_4bit=True # ~75% memory reduction
load_in_8bit=True # ~50% memory reduction
load_in_4bit=False # Full precision (most memory)

Optimizer

8-bit Adam reduces optimizer state memory:

optim="adamw_8bit" # Recommended - less memory
optim="adamw_torch" # Standard Adam
optim="sgd" # Minimal memory but worse convergence

Configuration Templates

# For RTX 3060, T4, etc.
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Qwen3-4B-Thinking-2507",
    max_seq_length=4096,
    load_in_4bit=True,
)
model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    lora_alpha=16,
    use_gradient_checkpointing="unsloth",
)
args = SFTConfig(
    max_length=4096,
    per_device_train_batch_size=1,
    gradient_accumulation_steps=4,
    max_steps=1000,
    learning_rate=2e-4,
    optim="adamw_8bit",
)

Troubleshooting

Out of Memory (OOM)

  1. Reduce sequence length (max_length in SFTConfig and/or max_seq_length when loading the model)
  2. Reduce per_device_train_batch_size to 1
  3. Reduce LoRA r to 16 or 8
  4. Ensure use_gradient_checkpointing="unsloth" (see the combined low-memory sketch below)
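
Putting these adjustments together, a reduced-memory run might look like this sketch; the specific values are illustrative assumptions, not a tested recipe:

# Illustrative low-memory settings; tune to your GPU.
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name=input_model,
    max_seq_length=2048,                   # 1) shorter context window
    load_in_4bit=True,
)
model = FastLanguageModel.get_peft_model(
    model,
    r=8,                                   # 3) smaller LoRA rank
    lora_alpha=8,
    use_gradient_checkpointing="unsloth",  # 4) Unsloth checkpointing
)
args = SFTConfig(
    max_length=2048,                       # 1) matching trainer truncation
    per_device_train_batch_size=1,         # 2) smallest per-device batch
    gradient_accumulation_steps=8,         # keep the effective batch reasonable
)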

Loss Not Decreasing

  1. Increase learning_rate slightly
  2. Check dataset quality - bad data = bad training
  3. Increase LoRA r for more capacity
  4. Ensure the chat template matches the model type (see the quick check below)
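
A quick way to verify the template: render a toy conversation with the model's own chat template and compare it against a row of your formatted training text. The example messages are purely illustrative:

# Render with the tokenizer's built-in chat template.
messages = [
    {"role": "user", "content": "Example question"},
    {"role": "assistant", "content": "Example answer"},
]
rendered = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=False,
)
print(rendered)                         # special tokens the model expects
print(train_dataset[0]["text"][:500])   # should use the same structure/tokens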

Training Too Slow

  1. Reduce sequence length (max_length in SFTConfig and/or max_seq_length when loading the model)
  2. Increase per_device_train_batch_size if memory allows (see the sketch below)
  3. Consider a smaller base model
  4. Use load_in_4bit=True
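
If memory allows, you can trade accumulation for a larger per-device batch while keeping the same effective batch size; the values below are illustrative:

args = SFTConfig(
    max_length=4096,                # 1) shorter sequences
    per_device_train_batch_size=4,  # 2) larger per-device batch
    gradient_accumulation_steps=1,  #    effective batch stays at 4
)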