Exporting & Quantization

Overview

After training, you’ll want to export your model in formats suitable for:

  1. HuggingFace Transformers - For Python inference and further training
  2. GGUF - For local deployment with llama.cpp, Ollama, LM Studio

Exporting to HuggingFace

Merged Model (16-bit)

Merges LoRA weights into the base model and saves full-precision weights:

model.push_to_hub_merged(
    f"{hf_account}/{output_model_name}",
    tokenizer,
    save_method="merged_16bit",
    token=hf_token,
    private=False,  # Set True for private models
)

This creates a standard HuggingFace model that works with:

  • transformers library
  • Text Generation Inference (TGI)
  • vLLM
  • Most inference frameworks
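
For example, the merged repo loads like any ordinary HuggingFace checkpoint. A minimal sketch using the transformers API (the repo ID below is a placeholder for whatever name you pushed above):

from transformers import AutoModelForCausalLM, AutoTokenizer

repo_id = "your-username/Your-Model"  # placeholder -- use the repo you pushed above

tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = AutoModelForCausalLM.from_pretrained(
    repo_id,
    torch_dtype="auto",   # keep the dtype the weights were saved in
    device_map="auto",    # requires accelerate; places layers on available devices
)

inputs = tokenizer("What is 2+2?", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))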

LoRA Adapters Only

If you want to keep the adapters separate from the base model (a much smaller upload, and the adapters stay reusable across setups):

model.push_to_hub(
    f"{hf_account}/{output_model_name}-LoRA",
    tokenizer,
    token=hf_token,
)
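
The adapters can later be reattached to the base model at load time, which is what makes this route flexible. A minimal sketch using the peft library (repo IDs are placeholders; assumes the same base model used for training):

from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

base_id = "unsloth/Qwen3-8B"                  # base model the adapters were trained on
adapter_id = "your-username/Your-Model-LoRA"  # adapter repo pushed above

base = AutoModelForCausalLM.from_pretrained(base_id, torch_dtype="auto", device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(adapter_id)
model = PeftModel.from_pretrained(base, adapter_id)

# Optionally fold the adapters into the base weights for faster inference
model = model.merge_and_unload()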

Creating GGUF Quantizations

GGUF is the standard format for llama.cpp and derived tools (Ollama, LM Studio).

Basic Export

model.push_to_hub_gguf(
    f"{hf_account}/{output_model_name}-GGUF",
    tokenizer,
    quantization_method=["bf16", "f16", "q8_0"],
    token=hf_token,
)

Quantization Methods

Method  | Size | Quality    | Speed    | Use Case
bf16    | 100% | Perfect    | Baseline | A100/H100 GPUs
f16     | 100% | Perfect    | Baseline | Most GPUs
q8_0    | 50%  | Excellent  | Fast     | Recommended default
q6_k    | 37%  | Very Good  | Faster   | Good balance
q5_k_m  | 31%  | Good       | Faster   | Memory-constrained
q4_k_m  | 25%  | Acceptable | Fastest  | Minimum viable
q4_0    | 25%  | Lower      | Fastest  | Maximum compression

Full Quality Ladder

For maximum compatibility, export multiple quantizations:

model.push_to_hub_gguf(
    f"{hf_account}/{output_model_name}-GGUF",
    tokenizer,
    quantization_method=[
        "bf16",    # Full precision (bfloat16)
        "f16",     # Full precision (float16)
        "q8_0",    # 8-bit quantization
        "q6_k",    # 6-bit quantization
        "q5_k_m",  # 5-bit mixed quantization
        "q4_k_m",  # 4-bit mixed quantization
    ],
    token=hf_token,
)

Local Export (Without Upload)

Save Merged Model Locally

model.save_pretrained_merged(
    "./my-model",
    tokenizer,
    save_method="merged_16bit",
)

Create Local GGUF

model.save_pretrained_gguf(
    "./my-model-gguf",
    tokenizer,
    quantization_method="q8_0",
)

Using Your GGUF Model

With Ollama

# Ollama can pull directly from HuggingFace
ollama run hf.co/your-username/Your-Model-GGUF:q8_0
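
The same model can also be called from Python through the official ollama client. A sketch assuming `pip install ollama`, a running Ollama server, and that the model tag matches the reference pulled above:

import ollama

response = ollama.chat(
    model="hf.co/your-username/Your-Model-GGUF:q8_0",  # placeholder tag from the pull above
    messages=[{"role": "user", "content": "What is 2+2?"}],
)
print(response["message"]["content"])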

With LM Studio

  1. Open LM Studio
  2. Go to My Models tab
  3. Click Import Model
  4. Select your .gguf file
  5. Start chatting!

With llama.cpp

# Clone and build llama.cpp
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
cmake -B build && cmake --build build --config Release
# Run inference (the CLI binary is named llama-cli in current builds)
./build/bin/llama-cli -m /path/to/model.gguf -p "What is 2+2?"
# Or interactive mode
./build/bin/llama-cli -m /path/to/model.gguf -i

With Python (llama-cpp-python)

from llama_cpp import Llama

llm = Llama(
    model_path="./your-model-q8_0.gguf",
    n_ctx=8192,
    n_gpu_layers=-1,  # Offload all layers to GPU
)
output = llm.create_chat_completion(
    messages=[
        {"role": "user", "content": "Explain quantum computing"}
    ]
)
print(output["choices"][0]["message"]["content"])

Choosing Quantization

Memory Requirements

For a Qwen3-8B model:

Quantization | Model Size | RAM/VRAM Needed
f16          | ~16GB      | ~18GB
q8_0         | ~8GB       | ~10GB
q6_k         | ~6GB       | ~8GB
q4_k_m       | ~4GB       | ~6GB
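
These figures follow roughly from parameter count × bits per weight, plus a couple of GB of headroom for the KV cache and runtime buffers. A back-of-the-envelope estimator (the bits-per-weight values are ballpark assumptions, not exact GGUF numbers):

BITS_PER_WEIGHT = {"f16": 16.0, "q8_0": 8.5, "q6_k": 6.6, "q5_k_m": 5.7, "q4_k_m": 4.8}

def estimate_gguf_gb(n_params: float, method: str) -> float:
    """Approximate on-disk size in GB, ignoring GGUF metadata."""
    return n_params * BITS_PER_WEIGHT[method] / 8 / 1e9

for method in BITS_PER_WEIGHT:
    print(f"{method}: ~{estimate_gguf_gb(8e9, method):.1f} GB")  # ~8B params for Qwen3-8B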

Quality vs Size Trade-off

  • Quality ranking: bf16 ≈ f16 > q8_0 > q6_k > q5_k_m > q4_k_m > q4_0
  • For most users: q8_0 offers the best quality/size balance
  • For memory-constrained setups: q4_k_m is the minimum we recommend

Model Card

When uploading to HuggingFace, include a descriptive model card:

---
tags:
- qwen3
- unsloth
- distillation
base_model: unsloth/Qwen3-8B
datasets:
- TeichAI/claude-4.5-opus-high-reasoning-250x
license: apache-2.0
---
# Qwen3-8B-Claude-4.5-Opus-Distill
This model was created by distilling Claude 4.5 Opus reasoning
into Qwen3-8B using TeichAI's distillation methodology.
## Training
- Base Model: Qwen3-8B
- Dataset: claude-4.5-opus-high-reasoning-250x
- Method: SFT with LoRA (r=32)
- Training: 2000 steps
## Usage
Works with any Qwen3-compatible inference setup.
## Limitations
This is an SFT-distilled model. See TeichAI docs for
limitations of SFT-only distillation.
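
The card can be written by hand in the Hub UI, or pushed programmatically. A sketch using huggingface_hub's ModelCard helper (repo ID and token are placeholders; card_text would be the full card shown above):

from huggingface_hub import ModelCard

card_text = """---
tags:
- qwen3
- unsloth
- distillation
base_model: unsloth/Qwen3-8B
license: apache-2.0
---
# Qwen3-8B-Claude-4.5-Opus-Distill
...
"""  # abbreviated; paste the full card from above

card = ModelCard(card_text)
card.push_to_hub("your-username/Your-Model", token="hf_...")  # placeholder repo ID and token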