# Exporting & Quantization

## Overview
After training, you’ll want to export your model in formats suitable for:
- HuggingFace Transformers - For Python inference and further training
- GGUF - For local deployment with llama.cpp, Ollama, LM Studio
## Exporting to HuggingFace

### Merged Model (16-bit)
Merges LoRA weights into the base model and saves full-precision weights:
```python
model.push_to_hub_merged(
    f"{hf_account}/{output_model_name}",
    tokenizer,
    save_method="merged_16bit",
    token=hf_token,
    private=False,  # Set True for private models
)
```

This creates a standard HuggingFace model that works with:
- The `transformers` library
- Text Generation Inference (TGI)
- vLLM
- Most inference frameworks
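As a quick sanity check, the merged model loads like any other HuggingFace checkpoint. A minimal sketch, assuming the placeholder repo id below is replaced with the repo you pushed:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

repo_id = "your-username/Your-Model"  # placeholder; use your own repo id

tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = AutoModelForCausalLM.from_pretrained(repo_id, device_map="auto", torch_dtype="auto")

# Build a chat prompt and generate a short reply
messages = [{"role": "user", "content": "What is 2+2?"}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
outputs = model.generate(inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```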
### LoRA Adapters Only
If you want to keep the adapters separate (smaller upload, more flexible):
```python
model.push_to_hub(
    f"{hf_account}/{output_model_name}-LoRA",
    tokenizer,
    token=hf_token,
)
```
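Adapter-only repos can later be attached to the base model with the `peft` library. A minimal sketch, assuming the adapter repo id below matches what you pushed:

```python
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

base_id = "unsloth/Qwen3-8B"                  # base model the adapters were trained on
adapter_id = "your-username/Your-Model-LoRA"  # placeholder adapter repo id

tokenizer = AutoTokenizer.from_pretrained(base_id)
base = AutoModelForCausalLM.from_pretrained(base_id, device_map="auto", torch_dtype="auto")
model = PeftModel.from_pretrained(base, adapter_id)

# Optionally fold the adapters into the base weights for faster inference
model = model.merge_and_unload()
```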
## Creating GGUF Quantizations

GGUF is the standard format for llama.cpp and derived tools (Ollama, LM Studio).
### Basic Export
```python
model.push_to_hub_gguf(
    f"{hf_account}/{output_model_name}-GGUF",
    tokenizer,
    quantization_method=["bf16", "f16", "q8_0"],
    token=hf_token,
)
```

### Quantization Methods
| Method | Size | Quality | Speed | Use Case |
|---|---|---|---|---|
| bf16 | 100% | Perfect | Baseline | A100/H100 GPUs |
| f16 | 100% | Perfect | Baseline | Most GPUs |
| q8_0 | 50% | Excellent | Fast | Recommended default |
| q6_k | 37% | Very Good | Faster | Good balance |
| q5_k_m | 31% | Good | Faster | Memory-constrained |
| q4_k_m | 25% | Acceptable | Fastest | Minimum viable |
| q4_0 | 25% | Lower | Fastest | Maximum compression |
### Full Quality Ladder
For maximum compatibility, export multiple quantizations:
```python
model.push_to_hub_gguf(
    f"{hf_account}/{output_model_name}-GGUF",
    tokenizer,
    quantization_method=[
        "bf16",    # Full precision (bfloat16)
        "f16",     # Full precision (float16)
        "q8_0",    # 8-bit quantization
        "q6_k",    # 6-bit quantization
        "q5_k_m",  # 5-bit mixed quantization
        "q4_k_m",  # 4-bit mixed quantization
    ],
    token=hf_token,
)
```

## Local Export (Without Upload)
### Save Merged Model Locally
```python
model.save_pretrained_merged(
    "./my-model",
    tokenizer,
    save_method="merged_16bit",
)
```

### Create Local GGUF
```python
model.save_pretrained_gguf(
    "./my-model-gguf",
    tokenizer,
    quantization_method="q8_0",
)
```

## Using Your GGUF Model
### With Ollama
```bash
# Ollama can pull directly from HuggingFace
ollama run hf.co/your-username/Your-Model-GGUF:q8_0

# Create a Modelfile
echo 'FROM ./your-model-q8_0.gguf' > Modelfile

# Import to Ollama
ollama create my-model -f Modelfile

# Run
ollama run my-model
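```

Once imported, the model can also be queried programmatically through Ollama's local HTTP API (served on port 11434 by default). A minimal sketch using `requests`:

```python
import requests

# Chat with the "my-model" created above via Ollama's local API
response = requests.post(
    "http://localhost:11434/api/chat",
    json={
        "model": "my-model",
        "messages": [{"role": "user", "content": "What is 2+2?"}],
        "stream": False,
    },
)
print(response.json()["message"]["content"])
```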
### With LM Studio

- Open LM Studio
- Go to the My Models tab
- Click Import Model
- Select your `.gguf` file
- Start chatting!
### With llama.cpp
```bash
# Clone llama.cpp
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
make

# Run inference
./main -m /path/to/model.gguf -p "What is 2+2?"

# Or interactive mode
./main -m /path/to/model.gguf -i
```

### With Python (llama-cpp-python)
```python
from llama_cpp import Llama

llm = Llama(
    model_path="./your-model-q8_0.gguf",
    n_ctx=8192,
    n_gpu_layers=-1,  # Use all GPU layers
)

output = llm.create_chat_completion(
    messages=[
        {"role": "user", "content": "Explain quantum computing"}
    ]
)
print(output["choices"][0]["message"]["content"])
```

## Choosing Quantization
### Memory Requirements
For a Qwen3-8B model:
| Quantization | Model Size | RAM/VRAM Needed |
|---|---|---|
| f16 | ~16GB | ~18GB |
| q8_0 | ~8GB | ~10GB |
| q6_k | ~6GB | ~8GB |
| q4_k_m | ~4GB | ~6GB |
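The file sizes above follow roughly from parameter count times bits per weight. A back-of-the-envelope sketch (nominal bits only, ignoring quantization-scale overhead and KV-cache memory):

```python
def approx_model_size_gb(n_params: float, bits_per_weight: float) -> float:
    """Rough on-disk size: parameters x bits per weight, converted to gigabytes."""
    return n_params * bits_per_weight / 8 / 1e9

# Qwen3-8B (~8 billion parameters) at the quantization levels from the table
for name, bits in [("f16", 16), ("q8_0", 8), ("q6_k", 6), ("q4_k_m", 4)]:
    print(f"{name}: ~{approx_model_size_gb(8e9, bits):.0f} GB")
```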
### Quality vs Size Trade-off
Quality: bf16 ≈ f16 > q8_0 > q6_k > q5_k_m > q4_k_m > q4_0
- For most users: q8_0 offers the best quality/size balance
- For memory-constrained setups: q4_k_m is the minimum we recommend

## Model Card
When uploading to HuggingFace, include a descriptive model card:
```markdown
---
tags:
- qwen3
- unsloth
- distillation
base_model: unsloth/Qwen3-8B
datasets:
- TeichAI/claude-4.5-opus-high-reasoning-250x
license: apache-2.0
---

# Qwen3-8B-Claude-4.5-Opus-Distill

This model was created by distilling Claude 4.5 Opus reasoning
into Qwen3-8B using TeichAI's distillation methodology.

## Training

- Base Model: Qwen3-8B
- Dataset: claude-4.5-opus-high-reasoning-250x
- Method: SFT with LoRA (r=32)
- Training: 2000 steps

## Usage

Works with any Qwen3-compatible inference setup.

## Limitations

This is an SFT-distilled model. See TeichAI docs for
limitations of SFT-only distillation.
```
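If you write the card by hand, one way to attach it to an existing repo is with `huggingface_hub`. A minimal sketch, assuming the card is saved locally as README.md and the placeholder repo id and token are replaced with yours:

```python
from huggingface_hub import HfApi

api = HfApi(token="hf_...")  # placeholder token

# Upload the model card to the root of the model repo
api.upload_file(
    path_or_fileobj="README.md",
    path_in_repo="README.md",
    repo_id="your-username/Your-Model-GGUF",  # placeholder repo id
    repo_type="model",
)
```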