Configuration Options
Overview
This reference documents all configuration options used in TeichAI training scripts.
Global Configuration
# Identityhf_account = "your-username" # HuggingFace username or organizationhf_token = "hf_..." # HuggingFace write tokenoutput_model_name = "My-Distill" # Name for output model
# Modelinput_model = "unsloth/Qwen3-4B" # Base model to fine-tunechat_template = "qwen3" # Chat template type
# Datasetdataset_id = "TeichAI/dataset-name" # HuggingFace dataset IDdataset_file = "" # Local JSONL file path (alternative)
# Trainingmax_len = 8192 # Maximum sequence lengthsteps = 2000 # Training stepsresume = False # Resume from checkpoint
# Uploadprivate_upload = False # Upload as private modelModel Configuration
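Because uploads happen only at the end of a run, an invalid hf_token surfaces late. A quick optional sanity check up front, assuming huggingface_hub is available in the environment:

```python
from huggingface_hub import whoami

# Fails fast with a clear error if the token is invalid.
info = whoami(token=hf_token)
print(f"Authenticated to the Hub as: {info['name']}")
```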
Model Configuration
input_model
The base model to fine-tune. Use Unsloth model IDs for optimized training.
| Model Type | Example ID |
|---|---|
| Qwen3 Dense | unsloth/Qwen3-4B, unsloth/Qwen3-8B |
| Qwen3 Thinking | unsloth/Qwen3-4B-Thinking-2507 |
| Qwen3 Instruct | unsloth/Qwen3-4B-Instruct-2507 |
| Qwen3 MoE | unsloth/Qwen3-30B-A3B-Thinking-2507 |
| Nemotron | nvidia/Nemotron-Cascade-8B-Thinking |
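These IDs are passed to Unsloth's loader. A minimal loading sketch, assuming Unsloth's FastLanguageModel API and the max_len variable from the global configuration; load_in_4bit is an illustrative choice, not a setting from this reference:

```python
from unsloth import FastLanguageModel

# Load the base model and tokenizer for fine-tuning.
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name=input_model,   # e.g. "unsloth/Qwen3-4B"
    max_seq_length=max_len,   # matches the training max length
    load_in_4bit=True,        # illustrative: 4-bit base weights to reduce VRAM
)
```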
chat_template
Chat template for formatting conversations. Must match model type.
| Template | Use Case |
|---|---|
| qwen3 | Base Qwen3 models |
| qwen3-thinking | Thinking variants with <think> tags |
| qwen3-instruct | Instruct variants |
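In Unsloth-based scripts the template is typically attached to the tokenizer before the dataset is formatted. A hedged sketch using Unsloth's get_chat_template helper; the chat_template variable comes from the global configuration, and the example messages are illustrative:

```python
from unsloth.chat_templates import get_chat_template

# Attach the matching template so conversations render with the correct
# special tokens (including <think> blocks for thinking variants).
tokenizer = get_chat_template(tokenizer, chat_template=chat_template)

# Format a two-turn conversation into the text the model trains on.
messages = [
    {"role": "user", "content": "What is 2 + 2?"},
    {"role": "assistant", "content": "4"},
]
text = tokenizer.apply_chat_template(messages, tokenize=False)
```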
LoRA Configuration
```python
model = FastLanguageModel.get_peft_model(
    model,
    r=32,                                  # LoRA rank
    target_modules=[...],                  # Layers to adapt
    lora_alpha=32,                         # Scaling factor
    lora_dropout=0,                        # Dropout rate
    bias="none",                           # Bias training
    use_gradient_checkpointing="unsloth",  # Memory optimization
    random_state=3407,                     # Reproducibility seed
    use_rslora=False,                      # Rank-stabilized LoRA
    loftq_config=None,                     # LoftQ configuration
)
```
r (LoRA Rank)
| Value | Memory | Capacity | Recommendation |
|---|---|---|---|
| 8 | Low | Basic | Simple tasks |
| 16 | Low | Medium | Resource-constrained |
| 32 | Medium | High | Default |
| 64 | Higher | Very High | Complex tasks |
| 128 | High | Maximum | Full capability |
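Higher ranks add trainable parameters (and adapter size) roughly in proportion to r; lora_alpha is commonly kept equal to r so the effective scaling lora_alpha / r stays at 1. A quick, hedged way to see the impact of a chosen rank, assuming the PEFT-style helper is available on the model returned by get_peft_model:

```python
# Report trainable adapter parameters vs. total parameters for the chosen rank.
model.print_trainable_parameters()
```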
target_modules
Layers to apply LoRA adapters to:
target_modules = [ "q_proj", # Query projection "k_proj", # Key projection "v_proj", # Value projection "o_proj", # Output projection "gate_proj", # Gate projection (MLP) "up_proj", # Up projection (MLP) "down_proj", # Down projection (MLP)]use_gradient_checkpointing
| Value | Effect |
|---|---|
| False | Fastest, most memory |
| True | Standard checkpointing |
| "unsloth" | Recommended - 30% less VRAM |
SFT Configuration
```python
args = SFTConfig(
    # Data
    dataset_text_field="text",
    max_length=8192,

    # Batch
    per_device_train_batch_size=1,
    gradient_accumulation_steps=4,

    # Learning
    warmup_ratio=0.05,
    max_steps=2000,
    learning_rate=2e-4,
    lr_scheduler_type="linear",

    # Optimization
    optim="adamw_8bit",
    weight_decay=0.01,

    # Logging
    logging_steps=1,
    report_to="none",

    # Checkpoints
    output_dir="outputs",
    save_strategy="steps",
    save_steps=200,
    save_total_limit=20,

    # System
    seed=3447,
    dataloader_num_workers=0,
)
```
Key Training Parameters
| Parameter | Default | Description |
|---|---|---|
| max_length | 8192 | Maximum tokens per example |
| per_device_train_batch_size | 1 | Batch size per GPU |
| gradient_accumulation_steps | 4 | Steps before optimizer update |
| learning_rate | 2e-4 | Learning rate |
| max_steps | 2000 | Total training steps |
| warmup_ratio | 0.05 | Fraction of total steps for LR warmup |
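With the defaults above, the effective batch size is per_device_train_batch_size × gradient_accumulation_steps = 1 × 4 = 4, and LR warmup lasts 0.05 × 2000 = 100 steps. The config object is then handed to the trainer; a minimal sketch assuming trl's SFTTrainer and the model, tokenizer, and dataset objects prepared earlier:

```python
from trl import SFTTrainer

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,     # recent trl versions name this processing_class
    train_dataset=dataset,   # formatted dataset with a "text" column
    args=args,               # the SFTConfig shown above
)
trainer.train(resume_from_checkpoint=resume)  # resume comes from the global configuration
```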
optim
| Value | Memory | Accuracy |
|---|---|---|
| adamw_8bit | Low | Good |
| adamw_torch | High | Best |
| sgd | Lowest | Worse |
lr_scheduler_type
| Value | Behavior |
|---|---|
| linear | Linear decay to 0 |
| cosine | Cosine decay |
| constant | No decay |
Export Configuration
Merged Model Upload
```python
model.push_to_hub_merged(
    f"{hf_account}/{output_model_name}",
    tokenizer,
    save_method="merged_16bit",  # "merged_16bit" or "merged_4bit"
    token=hf_token,
    private=False,               # Private repository
)
```
GGUF Export
```python
model.push_to_hub_gguf(
    f"{hf_account}/{output_model_name}-GGUF",
    tokenizer,
    quantization_method=[
        "bf16",    # BFloat16 (full precision)
        "f16",     # Float16 (full precision)
        "q8_0",    # 8-bit quantization
        "q6_k",    # 6-bit k-quant
        "q5_k_m",  # 5-bit k-quant mixed
        "q4_k_m",  # 4-bit k-quant mixed
    ],
    token=hf_token,
    private=False,
)
```
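If you want the GGUF files locally instead of (or before) pushing to the Hub, Unsloth exposes an analogous local-save call; a hedged sketch, with the output directory name chosen purely for illustration:

```python
# Write GGUF files to a local directory instead of uploading.
model.save_pretrained_gguf(
    "gguf-out",                    # illustrative output directory
    tokenizer,
    quantization_method="q4_k_m",  # 4-bit k-quant mixed
)
```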
os.environ["TOKENIZERS_PARALLELISM"] = "false"os.environ["HF_DATASETS_DISABLE_MULTIPROCESSING"] = "1"
# Debug modesos.environ["CHECK_DATASET_ONLY"] = "1" # Validate dataset and exitos.environ["CHECK_LENGTHS_ONLY"] = "1" # Check token lengths and exitos.environ["SANITY_MAXLEN"] = "1" # Debug max length settings
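The intended workflow is to set one of these flags, run the training script, and let it exit after the check instead of starting a full run. A hypothetical sketch of how a script can honor such a flag; the print statement stands in for whatever validation the TeichAI scripts actually perform:

```python
import os
import sys

# Hypothetical early-exit branch: report on the dataset, then stop before training.
if os.environ.get("CHECK_DATASET_ONLY") == "1":
    print(f"Dataset loaded with {len(dataset)} examples")  # stand-in for the real checks
    sys.exit(0)
```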