Exporting & Quantization

Overview

After training, you’ll want to export your model in formats suitable for:

  1. HuggingFace Transformers - For Python inference and further training
  2. GGUF - For local deployment with llama.cpp, Ollama, LM Studio

Exporting to HuggingFace

Merged Model (16-bit)

Merges LoRA weights into the base model and saves full-precision weights:

model.push_to_hub_merged(
    f"{hf_account}/{output_model_name}",
    tokenizer,
    save_method="merged_16bit",
    token=hf_token,
    private=False,  # Set True for private models
)

This creates a standard HuggingFace model that works with:

  • transformers library
  • Text Generation Inference (TGI)
  • vLLM
  • Most inference frameworks
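
For example, the merged repo loads like any ordinary HuggingFace checkpoint. A minimal sketch using the transformers API (the repo ID below is a placeholder for whatever name you pushed above):

from transformers import AutoModelForCausalLM, AutoTokenizer

repo_id = "your-username/Your-Model"  # placeholder -- use the repo you pushed above

tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = AutoModelForCausalLM.from_pretrained(
    repo_id,
    torch_dtype="auto",   # keep the dtype the weights were saved in
    device_map="auto",    # requires accelerate; places layers on available devices
)

inputs = tokenizer("What is 2+2?", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))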

LoRA Adapters Only

If you want to keep the adapters separate from the base model (a much smaller upload, and the adapters stay reusable across setups):

model.push_to_hub(
    f"{hf_account}/{output_model_name}-LoRA",
    tokenizer,
    token=hf_token,
)
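
The adapters can later be reattached to the base model at load time, which is what makes this route flexible. A minimal sketch using the peft library (repo IDs are placeholders; assumes the same base model used for training):

from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

base_id = "unsloth/Qwen3-8B"                  # base model the adapters were trained on
adapter_id = "your-username/Your-Model-LoRA"  # adapter repo pushed above

base = AutoModelForCausalLM.from_pretrained(base_id, torch_dtype="auto", device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(adapter_id)
model = PeftModel.from_pretrained(base, adapter_id)

# Optionally fold the adapters into the base weights for faster inference
model = model.merge_and_unload()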

Creating GGUF Quantizations

GGUF is the standard format for llama.cpp and derived tools (Ollama, LM Studio).

Basic Export

model.push_to_hub_gguf(
    f"{hf_account}/{output_model_name}-GGUF",
    tokenizer,
    quantization_method=["bf16", "f16", "q8_0"],
    token=hf_token,
)

Quantization Methods

Method  | Size | Quality    | Speed    | Use Case
bf16    | 100% | Perfect    | Baseline | A100/H100 GPUs
f16     | 100% | Perfect    | Baseline | Most GPUs
q8_0    | 50%  | Excellent  | Fast     | Recommended default
q6_k    | 37%  | Very Good  | Faster   | Good balance
q5_k_m  | 31%  | Good       | Faster   | Memory-constrained
q4_k_m  | 25%  | Acceptable | Fastest  | Minimum viable
q4_0    | 25%  | Lower      | Fastest  | Maximum compression

Full Quality Ladder

For maximum compatibility, export multiple quantizations:

model.push_to_hub_gguf(
    f"{hf_account}/{output_model_name}-GGUF",
    tokenizer,
    quantization_method=[
        "bf16",    # Full precision (bfloat16)
        "f16",     # Full precision (float16)
        "q8_0",    # 8-bit quantization
        "q6_k",    # 6-bit quantization
        "q5_k_m",  # 5-bit mixed quantization
        "q4_k_m",  # 4-bit mixed quantization
    ],
    token=hf_token,
)

Local Export (Without Upload)

Save Merged Model Locally

model.save_pretrained_merged(
    "./my-model",
    tokenizer,
    save_method="merged_16bit",
)

Create Local GGUF

model.save_pretrained_gguf(
    "./my-model-gguf",
    tokenizer,
    quantization_method="q8_0",
)

Using Your GGUF Model

With Ollama

# Ollama can pull directly from HuggingFace
ollama run hf.co/your-username/Your-Model-GGUF:q8_0
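
The same model can also be called from Python through the official ollama client. A sketch assuming `pip install ollama`, a running Ollama server, and that the model tag matches the reference pulled above:

import ollama

response = ollama.chat(
    model="hf.co/your-username/Your-Model-GGUF:q8_0",  # placeholder tag from the pull above
    messages=[{"role": "user", "content": "What is 2+2?"}],
)
print(response["message"]["content"])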

With LM Studio

  1. Open LM Studio
  2. Go to My Models tab
  3. Click Import Model
  4. Select your .gguf file
  5. Start chatting!

With llama.cpp

# Clone and build llama.cpp
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
cmake -B build && cmake --build build --config Release
# Run inference (the CLI binary is named llama-cli in current builds)
./build/bin/llama-cli -m /path/to/model.gguf -p "What is 2+2?"
# Or interactive mode
./build/bin/llama-cli -m /path/to/model.gguf -i

With Python (llama-cpp-python)

from llama_cpp import Llama

llm = Llama(
    model_path="./your-model-q8_0.gguf",
    n_ctx=8192,
    n_gpu_layers=-1,  # Offload all layers to GPU
)
output = llm.create_chat_completion(
    messages=[
        {"role": "user", "content": "Explain quantum computing"}
    ]
)
print(output["choices"][0]["message"]["content"])

Choosing Quantization

Memory Requirements

For a Qwen3-8B model:

Quantization | Model Size | RAM/VRAM Needed
f16          | ~16GB      | ~18GB
q8_0         | ~8GB       | ~10GB
q6_k         | ~6GB       | ~8GB
q4_k_m       | ~4GB       | ~6GB
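
These figures follow roughly from parameter count × bits per weight, plus a couple of GB of headroom for the KV cache and runtime buffers. A back-of-the-envelope estimator (the bits-per-weight values are ballpark assumptions, not exact GGUF numbers):

BITS_PER_WEIGHT = {"f16": 16.0, "q8_0": 8.5, "q6_k": 6.6, "q5_k_m": 5.7, "q4_k_m": 4.8}

def estimate_gguf_gb(n_params: float, method: str) -> float:
    """Approximate on-disk size in GB, ignoring GGUF metadata."""
    return n_params * BITS_PER_WEIGHT[method] / 8 / 1e9

for method in BITS_PER_WEIGHT:
    print(f"{method}: ~{estimate_gguf_gb(8e9, method):.1f} GB")  # ~8B params for Qwen3-8B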

Quality vs Size Trade-off

  • Quality ranking: bf16 ≈ f16 > q8_0 > q6_k > q5_k_m > q4_k_m > q4_0
  • For most users: q8_0 offers the best quality/size balance
  • For memory-constrained setups: q4_k_m is the minimum we recommend

Model Card

When uploading to HuggingFace, include a descriptive model card:

---
tags:
- qwen3
- unsloth
- distillation
base_model: unsloth/Qwen3-8B
datasets:
- TeichAI/claude-4.5-opus-high-reasoning-250x
license: apache-2.0
---
# Qwen3-8B-Claude-4.5-Opus-Distill
This model was created by distilling Claude 4.5 Opus reasoning
into Qwen3-8B using TeichAI's distillation methodology.
## Training
- Base Model: Qwen3-8B
- Dataset: claude-4.5-opus-high-reasoning-250x
- Method: SFT with LoRA (r=32)
- Training: 2000 steps
## Usage
Works with any Qwen3-compatible inference setup.
## Limitations
This is an SFT-distilled model. See TeichAI docs for
limitations of SFT-only distillation.
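
The card can be written by hand in the Hub UI, or pushed programmatically. A sketch using huggingface_hub's ModelCard helper (repo ID and token are placeholders; card_text would be the full card shown above):

from huggingface_hub import ModelCard

card_text = """---
tags:
- qwen3
- unsloth
- distillation
base_model: unsloth/Qwen3-8B
license: apache-2.0
---
# Qwen3-8B-Claude-4.5-Opus-Distill
...
"""  # abbreviated; paste the full card from above

card = ModelCard(card_text)
card.push_to_hub("your-username/Your-Model", token="hf_...")  # placeholder repo ID and token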