
Creating Datasets

Overview

The quality of your distillation directly depends on the quality of your training data. This guide covers how TeichAI creates reasoning datasets and how you can create your own.

Dataset Format

All our datasets use the standard ShareGPT/messages format:

{
  "messages": [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is the capital of France?"},
    {"role": "assistant", "content": "<think>\nThe user is asking about geography...\n</think>\n\nThe capital of France is Paris."}
  ]
}
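If you want to sanity-check a generated file against this format, a short script suffices. The following is a minimal sketch in Node/TypeScript; the dataset.jsonl path and the strictness of the checks are assumptions, not part of any TeichAI tooling.

import { readFileSync } from "node:fs";

// Minimal schema check for a JSONL file in the messages format.
// Assumes one JSON object per line, as datagen writes it.
const VALID_ROLES = new Set(["system", "user", "assistant"]);

const lines = readFileSync("dataset.jsonl", "utf8").split("\n").filter(Boolean);
lines.forEach((line, i) => {
  const record = JSON.parse(line);
  if (!Array.isArray(record.messages) || record.messages.length === 0) {
    throw new Error(`Line ${i + 1}: missing or empty "messages" array`);
  }
  for (const m of record.messages) {
    if (!VALID_ROLES.has(m.role) || typeof m.content !== "string") {
      throw new Error(`Line ${i + 1}: invalid message ${JSON.stringify(m)}`);
    }
  }
});
console.log(`OK: ${lines.length} records validated`);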

Using datagen (Node/CLI)

We generate datasets with our team-maintained datagen npm package. It queries teacher models, formats their outputs in the ShareGPT/messages schema (optionally including <think> reasoning blocks), and writes JSONL.

  • Install: npm i -g datagen (or your internal package name)
  • Run: datagen --config config.yml

Example config

model: openai/gpt-4o-mini
prompts: ./prompts.txt
out: ./dataset.jsonl
api: https://openrouter.ai/api/v1
system: |
  You are a helpful assistant.
  Answer concisely.
store-system: true
concurrent: 2
reasoningEffort: high
openrouter:
  provider:
    - openai
    - anthropic
  providerSort: throughput
no-progress: true

Dataset Generation Pipeline

  1. Prepare Prompts

    Create prompt files (e.g., prompts/general.txt) or reference existing sources. Datagen will read and shard prompts across providers.

  2. Run datagen

    Execute datagen --config config.yml. Datagen queries the teacher model(s), captures outputs, and writes JSONL in the messages format.

  3. Package

    Keep a README with model, prompt sources, generation settings, and any post-processing details.
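For step 3, a README as short as the following is usually enough; every value here is a placeholder to be replaced with your actual settings.

Teacher model: openai/gpt-4o-mini (via OpenRouter)
Prompt sources: prompts/general.txt
Generation: datagen --config config.yml (reasoningEffort: high, concurrent: 2)
Post-processing: deduplicated by user prompt; empty completions dropped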

Prompt Sources

  • Point datagen at prompt files (plain text, one per line).
  • Maintain separate files per domain (e.g., prompts/math.txt, prompts/code.txt) to mix proportions explicitly.
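To mix domains at a fixed ratio, you can assemble a combined prompt file before running datagen. This is a sketch under assumptions: the file names follow the examples above, and the 70/30 split and 1,000-prompt cap are arbitrary.

import { readFileSync, writeFileSync } from "node:fs";

// Read one prompt per line from each domain file.
const read = (path: string) =>
  readFileSync(path, "utf8").split("\n").filter(Boolean);

const math = read("prompts/math.txt");
const code = read("prompts/code.txt");

// Assumed target mix: 70% math, 30% code, capped by availability.
const total = Math.min(1000, math.length + code.length);
const nMath = Math.min(math.length, Math.round(total * 0.7));
const nCode = Math.min(code.length, total - nMath);

const mixed = [...math.slice(0, nMath), ...code.slice(0, nCode)];
writeFileSync("prompts/mixed.txt", mixed.join("\n") + "\n");
console.log(`Wrote ${mixed.length} prompts (${nMath} math, ${nCode} code)`);

Shuffling the combined list before writing is worth adding if your training pipeline does not shuffle samples itself.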

Provider Setup

  • Configure API key via the API_KEY environment variable
  • In your datagen config, set the teacher model identifier (e.g., Claude Opus, GPT‑5.1, Gemini 3 Pro) and reasoning depth
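Putting both together, a run looks like this (the key value is a placeholder; the model and reasoning settings come from the config file, as in the example above):

export API_KEY=your-provider-key
datagen --config config.yml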

TeichAI Datasets

You can use our pre-built datasets instead of creating your own.

Best Practices

  1. Diversity matters - Cover many topics and difficulty levels
  2. Quality over quantity - 250 excellent samples beat 10,000 poor ones
  3. Include edge cases - Problems with no solution, multiple approaches
  4. Validate thoroughly - Bad data = bad model (a starter validation script follows this list)
  5. Document your process - Others should be able to replicate your work
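One concrete starting point for points 2 and 4: drop exact duplicate prompts and empty completions before training. The sketch below assumes the file names used earlier in this guide; real validation should add domain-specific checks on top.

import { readFileSync, writeFileSync } from "node:fs";

type Message = { role: string; content: string };

const lines = readFileSync("dataset.jsonl", "utf8").split("\n").filter(Boolean);
const seen = new Set<string>();
const kept: string[] = [];

for (const line of lines) {
  const { messages } = JSON.parse(line) as { messages: Message[] };
  const user = messages.find((m) => m.role === "user")?.content ?? "";
  const assistant = messages.find((m) => m.role === "assistant")?.content ?? "";
  // Drop empty completions and exact duplicate prompts.
  if (!assistant.trim() || seen.has(user)) continue;
  seen.add(user);
  kept.push(line);
}

writeFileSync("dataset.clean.jsonl", kept.join("\n") + "\n");
console.log(`Kept ${kept.length}/${lines.length} records`);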