The quality of your distillation directly depends on the quality of your training data. This guide covers how TeichAI creates reasoning datasets and how you can create your own.
All our datasets use the standard ShareGPT/messages format:
{ "messages": [ {"role": "system", "content": "You are a helpful assistant."}, {"role": "user", "content": "What is the capital of France?"}, {"role": "assistant", "content": "<think>\nThe user is asking about geography...\n</think>\n\nThe capital of France is Paris."} ]}We generate datasets with our team-maintained datagen npm package. It queries teacher models, formats outputs in the ShareGPT/messages schema (with optional <think>), and writes JSONL.
Install the CLI and run it against a config file:

```bash
npm i -g datagen   # or your internal package name
datagen --config config.yml
```

An example config.yml:

```yaml
model: openai/gpt-4o-mini
prompts: ./prompts.txt
out: ./dataset.jsonl
api: https://openrouter.ai/api/v1
system: |
  You are a helpful assistant. Answer concisely.
store-system: true
concurrent: 2
reasoningEffort: high
openrouter:
  provider:
    - openai
    - anthropic
  providerSort: throughput
no-progress: true
```

Prepare Prompts
Create prompt files (e.g., prompts/general.txt) or reference existing sources. Datagen will read and shard prompts across providers.
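Prompt files are typically plain text. The example below is an illustrative prompts/general.txt, assuming datagen expects one prompt per line; adapt it to whatever format your prompt sources actually use.

```text
Explain the difference between TCP and UDP.
Write a function that checks whether a string is a palindrome.
What are the trade-offs between SQL and NoSQL databases?
```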
Run datagen
Execute datagen --config config.yml. Datagen queries the teacher model(s), captures outputs, and writes JSONL in the messages format.
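Before packaging, a quick pass over the output helps confirm the run behaved as expected. A minimal sketch, assuming the JSONL layout shown earlier; the statistics and file name are illustrative.

```ts
import { readFileSync } from "node:fs";

type Message = { role: string; content: string };

// Load the generated JSONL (path is illustrative) and skip blank lines.
const lines = readFileSync("./dataset.jsonl", "utf8")
  .split("\n")
  .filter((l) => l.trim().length > 0);

let withThink = 0;
let totalAssistantChars = 0;

for (const line of lines) {
  const { messages } = JSON.parse(line) as { messages: Message[] };
  const assistant = messages.find((m) => m.role === "assistant");
  if (!assistant) continue;
  totalAssistantChars += assistant.content.length;
  if (assistant.content.includes("<think>")) withThink++;
}

console.log(`records: ${lines.length}`);
console.log(`records with <think>: ${withThink}`);
console.log(`avg assistant length: ${Math.round(totalAssistantChars / lines.length)} chars`);
```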
Package
Keep a README with model, prompt sources, generation settings, and any post-processing details.
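One way to keep that metadata consistent across datasets is a short README template; every value below is a placeholder, not a required format.

```markdown
# my-reasoning-dataset

- Teacher model: openai/gpt-4o-mini (via https://openrouter.ai/api/v1)
- Samples: 1000
- Prompt sources: prompts/general.txt, prompts/code.txt
- Generation settings: reasoningEffort: high, concurrent: 2 (see config.yml)
- Post-processing: deduplicated by user prompt, dropped empty assistant turns
```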
Use separate topic files (e.g., prompts/math.txt, prompts/code.txt) to mix proportions explicitly; one way to do that is sketched below. Make sure your provider key is set in the API_KEY environment variable before running datagen.
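If you want tight control over the topic mix, you can assemble prompts.txt from the topic files before running datagen. A minimal sketch, assuming one prompt per line; the file names, weights, and target size are illustrative.

```ts
import { readFileSync, writeFileSync } from "node:fs";

// Desired share of each topic in the final prompt list (weights are illustrative).
const mix = [
  { file: "prompts/math.txt", weight: 0.6 },
  { file: "prompts/code.txt", weight: 0.4 },
];

const targetSize = 1000; // total prompts to emit (illustrative)
const output: string[] = [];

for (const { file, weight } of mix) {
  const prompts = readFileSync(file, "utf8")
    .split("\n")
    .map((l) => l.trim())
    .filter((l) => l.length > 0);
  if (prompts.length === 0) continue;

  // Take the requested share, repeating the pool if a topic file is small.
  const count = Math.round(targetSize * weight);
  for (let i = 0; i < count; i++) {
    output.push(prompts[i % prompts.length]);
  }
}

// Shuffle so topics are interleaved rather than blocked together.
for (let i = output.length - 1; i > 0; i--) {
  const j = Math.floor(Math.random() * (i + 1));
  [output[i], output[j]] = [output[j], output[i]];
}

writeFileSync("./prompts.txt", output.join("\n") + "\n");
```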
You can use our pre-built datasets instead of creating your own:

- Claude 4.5 Opus (claude-4.5-opus-high-reasoning-250x): 250 samples, premium quality, ~100 likes
- Gemini 3 Pro (gemini-3-pro-preview-high-reasoning-1000x): 1000 samples, high reasoning, ~50 likes
- DeepSeek Code (deepseek-v3.2-speciale-OpenCodeReasoning-3k): 3000 code samples, great for coding
- GPT-5.1: 1000 samples, diverse reasoning