
Creating Datasets

Overview

The quality of your distillation directly depends on the quality of your training data. This guide covers how TeichAI creates reasoning datasets and how you can create your own.

Dataset Format

All our datasets use the standard ShareGPT/messages format:

{
  "messages": [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is the capital of France?"},
    {"role": "assistant", "content": "<think>\nThe user is asking about geography...\n</think>\n\nThe capital of France is Paris."}
  ]
}
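If you want to sanity-check a generated file against this format, a short script suffices. The following is a minimal sketch in Node/TypeScript; the dataset.jsonl path and the strictness of the checks are assumptions, not part of any TeichAI tooling.

import { readFileSync } from "node:fs";

// Minimal schema check for a JSONL file in the messages format.
// Assumes one JSON object per line, as datagen writes it.
const VALID_ROLES = new Set(["system", "user", "assistant"]);

const lines = readFileSync("dataset.jsonl", "utf8").split("\n").filter(Boolean);
lines.forEach((line, i) => {
  const record = JSON.parse(line);
  if (!Array.isArray(record.messages) || record.messages.length === 0) {
    throw new Error(`Line ${i + 1}: missing or empty "messages" array`);
  }
  for (const m of record.messages) {
    if (!VALID_ROLES.has(m.role) || typeof m.content !== "string") {
      throw new Error(`Line ${i + 1}: invalid message ${JSON.stringify(m)}`);
    }
  }
});
console.log(`OK: ${lines.length} records validated`);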

Using datagen (Node/CLI)

We generate datasets with our team-maintained datagen npm package. It queries teacher models, formats their outputs in the ShareGPT/messages schema (optionally including <think> reasoning blocks), and writes JSONL.

  • Install: npm i -g datagen (or your internal package name)
  • Run: datagen --config config.yml

Example config

model: openai/gpt-4o-mini
prompts: ./prompts.txt
out: ./dataset.jsonl
api: https://openrouter.ai/api/v1
system: |
  You are a helpful assistant.
  Answer concisely.
store-system: true
concurrent: 2
reasoningEffort: high
openrouter:
  provider:
    - openai
    - anthropic
  providerSort: throughput
no-progress: true

Dataset Generation Pipeline

  1. Prepare Prompts

    Create prompt files (e.g., prompts/general.txt) or reference existing sources. Datagen will read and shard prompts across providers.

  2. Run datagen

    Execute datagen --config config.yml. Datagen queries the teacher model(s), captures outputs, and writes JSONL in the messages format.

  3. Package

    Keep a README with model, prompt sources, generation settings, and any post-processing details.
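For step 3, a README as short as the following is usually enough; every value here is a placeholder to be replaced with your actual settings.

Teacher model: openai/gpt-4o-mini (via OpenRouter)
Prompt sources: prompts/general.txt
Generation: datagen --config config.yml (reasoningEffort: high, concurrent: 2)
Post-processing: deduplicated by user prompt; empty completions dropped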

Prompt Sources

  • Point datagen at prompt files (plain text, one per line).
  • Maintain separate files per domain (e.g., prompts/math.txt, prompts/code.txt) to mix proportions explicitly.
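To mix domains at a fixed ratio, you can assemble a combined prompt file before running datagen. This is a sketch under assumptions: the file names follow the examples above, and the 70/30 split and 1,000-prompt cap are arbitrary.

import { readFileSync, writeFileSync } from "node:fs";

// Read one prompt per line from each domain file.
const read = (path: string) =>
  readFileSync(path, "utf8").split("\n").filter(Boolean);

const math = read("prompts/math.txt");
const code = read("prompts/code.txt");

// Assumed target mix: 70% math, 30% code, capped by availability.
const total = Math.min(1000, math.length + code.length);
const nMath = Math.min(math.length, Math.round(total * 0.7));
const nCode = Math.min(code.length, total - nMath);

const mixed = [...math.slice(0, nMath), ...code.slice(0, nCode)];
writeFileSync("prompts/mixed.txt", mixed.join("\n") + "\n");
console.log(`Wrote ${mixed.length} prompts (${nMath} math, ${nCode} code)`);

Shuffling the combined list before writing is worth adding if your training pipeline does not shuffle samples itself.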

Provider Setup

  • Configure API key via the API_KEY environment variable
  • In your datagen config, set the teacher model identifier (e.g., Claude Opus, GPT‑5.1, Gemini 3 Pro) and reasoning depth
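Putting both together, a run looks like this (the key value is a placeholder; the model and reasoning settings come from the config file, as in the example above):

export API_KEY=your-provider-key
datagen --config config.yml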

TeichAI Datasets

You can use our pre-built datasets instead of creating your own.

Best Practices

  1. Diversity matters - Cover many topics and difficulty levels
  2. Quality over quantity - 250 excellent samples beat 10,000 poor ones
  3. Include edge cases - Problems with no solution, multiple approaches
  4. Validate thoroughly - Bad data = bad model (a starter validation script follows this list)
  5. Document your process - Others should be able to replicate your work
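One concrete starting point for points 2 and 4: drop exact duplicate prompts and empty completions before training. The sketch below assumes the file names used earlier in this guide; real validation should add domain-specific checks on top.

import { readFileSync, writeFileSync } from "node:fs";

type Message = { role: string; content: string };

const lines = readFileSync("dataset.jsonl", "utf8").split("\n").filter(Boolean);
const seen = new Set<string>();
const kept: string[] = [];

for (const line of lines) {
  const { messages } = JSON.parse(line) as { messages: Message[] };
  const user = messages.find((m) => m.role === "user")?.content ?? "";
  const assistant = messages.find((m) => m.role === "assistant")?.content ?? "";
  // Drop empty completions and exact duplicate prompts.
  if (!assistant.trim() || seen.has(user)) continue;
  seen.add(user);
  kept.push(line);
}

writeFileSync("dataset.clean.jsonl", kept.join("\n") + "\n");
console.log(`Kept ${kept.length}/${lines.length} records`);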