AfterImage is now open source for infrastructure-level dataset generation
April 13, 2026 · Yusuf Sarıgöz
Originally published on Medium

A Python library and CLI for synthetic conversational datasets — grounded, diverse, and observable.
Today we are releasing AfterImage as open source software under the Apache 2.0 license, with packages on PyPI (pip install afterimage / uv add afterimage) and documentation at afterimage.altai.dev.
If you have ever tried to scale instruction tuning or evaluation data with an LLM, you have probably felt the same tension: raw volume is easy; useful volume is not. AfterImage exists to treat synthetic dataset generation as a systems problem — one where you can steer grounding, diversity, structure, and quality instead of hoping a clever prompt loop will "just work."
Why we built it
Modern LLM workflows lean heavily on synthetic data: bootstrapped instructions, evolved difficulty, self-curation, RAG-style corpora, and more. The research and practice around these methods converge on a point that is easy to miss in day-to-day engineering: the value of synthetic data is not automatic. Without controls, you get shallow diversity, weak grounding to sources, evaluator leakage, and answers that sound right but are not anchored in anything you trust.
When we first designed the architecture of AfterImage internally, the core idea was simple to state and hard to operationalize: synthetic data quality has to be measured and steered, not assumed. The system was designed around explicit roles, composable stages, schema constraints where you need them, and monitoring that does not get in the way of throughput.
That design is still the spine of the project. What changed since that early write-up is everything around it: a YAML-driven CLI for getting JSONL in one command, export to common fine-tuning shapes, preference / DPO-style pair generation, broader provider support (including local OpenAI-compatible servers), a public docs site, and a codebase that has been exercised on real pipelines rather than described only on paper.
What AfterImage actually does
AfterImage simulates conversations between two modeled roles:
- Correspondent — the side that initiates: questions, tasks, follow-ups. Behavior is driven by instruction generators and can be shaped by personas (tone, expertise, intent) so your dataset is not "one generic user, forever."
- Respondent — the assistant side, defined by your system prompt and runtime prompt modifiers (for example, injecting retrieved chunks for RAG-style grounding).
Generation is async-first, so you can run concurrent workers and make better use of API throughput. Persistence (JSONL or SQL) is decoupled from the core loop, so the same generation logic can land in different environments without rewrites.
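The shape of that loop can be sketched in a few lines. This is an illustrative sketch only: names like `generate_conversation` and `JsonlSink` are assumptions for the example, not AfterImage's actual API. The point it shows is the separation of concerns, with a semaphore capping concurrent workers and a pluggable sink receiving finished records.

```python
import asyncio
import json

class JsonlSink:
    """Persistence decoupled from generation: swap for SQL, etc."""
    def __init__(self):
        self.lines = []

    async def write(self, record):
        self.lines.append(json.dumps(record))

async def generate_conversation(task_id):
    # Stand-in for the real correspondent/respondent exchange.
    await asyncio.sleep(0)  # yield to the event loop, as an API call would
    return {"id": task_id, "messages": [{"role": "user", "content": f"q{task_id}"}]}

async def run(num_tasks, concurrency, sink):
    sem = asyncio.Semaphore(concurrency)  # cap in-flight API calls

    async def worker(task_id):
        async with sem:
            record = await generate_conversation(task_id)
            await sink.write(record)

    await asyncio.gather(*(worker(i) for i in range(num_tasks)))

sink = JsonlSink()
asyncio.run(run(num_tasks=8, concurrency=3, sink=sink))
print(len(sink.lines))  # 8 records persisted
```

Because the sink is injected rather than hard-coded, the same generation logic lands in JSONL locally and in SQL in production without touching the loop.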
There are two "front doors," intentionally:
- CLI + config — describe a run in YAML, set keys in the environment, and run afterimage generate. Dry-run with --dry-run when you want to validate the plan without spending tokens. When you are ready to train, afterimage export converts datasets into formats like ShareGPT-style or Hugging Face messages layouts; afterimage preference helps produce preference pairs for alignment workflows.
- Python API — compose instruction generators, respondent prompt modifiers, stopping criteria, storage, judges, and monitoring the same way the CLI does under the hood. The API is where specialized flows live: document-grounded instruction, structured extraction, tool-calling-oriented setups, custom storage, and tighter integration with your stack.
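For concreteness, here is what the two export shapes mentioned above look like side by side. The converter is my own sketch for illustration (afterimage export is the supported path); the ShareGPT and Hugging Face messages layouts themselves are the widely used conventions.

```python
def sharegpt_to_messages(row):
    """Convert a ShareGPT-style row ({"conversations": [{"from", "value"}]})
    into a Hugging Face-style row ({"messages": [{"role", "content"}]})."""
    role_map = {"system": "system", "human": "user", "gpt": "assistant"}
    return {
        "messages": [
            {"role": role_map[turn["from"]], "content": turn["value"]}
            for turn in row["conversations"]
        ]
    }

row = {"conversations": [
    {"from": "human", "value": "Summarize the doc."},
    {"from": "gpt", "value": "Here is a summary."},
]}
print(sharegpt_to_messages(row))
```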
If you want your coding agents to get up and running with AfterImage quickly, the project publishes llms.txt with install steps, CLI entry points, and links to the Markdown guides and examples.
Grounding, structure, and quality loops
Document providers can feed local files, JSONL, in-memory lists, or Qdrant — so "what the model is allowed to know" can be a first-class input to generation, not an afterthought pasted into a prompt.
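The "first-class input" idea boils down to making retrieved context an explicit, inspectable part of the respondent's prompt. The helper below is a toy sketch of that injection step, not AfterImage's actual prompt-modifier interface; the function name and prompt wording are assumptions.

```python
def make_grounded_prompt(base_system_prompt, chunks):
    """Prepend retrieved chunks so the respondent answers from them,
    and so each run records exactly what the model was allowed to know."""
    context = "\n\n".join(f"[doc {i}] {c}" for i, c in enumerate(chunks))
    return (
        f"{base_system_prompt}\n\n"
        f"Answer using only the context below.\n\n{context}"
    )

chunks = [
    "AfterImage is Apache 2.0 licensed.",
    "Install with pip install afterimage.",
]
prompt = make_grounded_prompt("You are a helpful assistant.", chunks)
print("[doc 0]" in prompt)  # True: the chunk is part of the prompt record
```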
Structured generation uses Pydantic schemas so single-turn outputs are valid JSON-shaped objects: extraction from unstructured text, synthetic rows with typed fields, or evaluation artifacts you can consume downstream without brittle parsing.
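The payoff of schema-constrained outputs is that downstream code validates instead of parsing. A minimal sketch of the pattern, with a schema invented for this example (it is not one shipped with AfterImage):

```python
from pydantic import BaseModel

class ExtractedFact(BaseModel):
    """Typed shape for a single-turn extraction output."""
    subject: str
    claim: str
    confidence: float

# A model's raw JSON output: validation raises on malformed or
# wrongly typed fields instead of silently passing bad rows along.
raw = '{"subject": "AfterImage", "claim": "released under Apache 2.0", "confidence": 0.98}'
fact = ExtractedFact.model_validate_json(raw)
print(fact.subject)
```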
Evaluation is meant to be practical at scale: embedding-based signals for fast filtering (coherence, relevance, grounding-style checks where configured) and LLM-as-judge style rubrics when you need a second pass. The framework also supports quality gates and regeneration paths (for example, auto_improve workflows) when you want the generator to retry or revise under explicit criteria rather than silently accepting bad rows.
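An embedding-based gate of the kind described can be as simple as a cosine-similarity threshold between an answer's embedding and its source chunk's. The vectors below are fake stand-ins for real embedding-model outputs, and the threshold is an arbitrary example value.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def passes_gate(answer_vec, source_vec, threshold=0.8):
    """Keep a generated answer only if it stays close to its source."""
    return cosine(answer_vec, source_vec) >= threshold

source    = [1.0, 0.0, 0.1]
grounded  = [0.9, 0.1, 0.2]  # near the source: kept
off_topic = [0.0, 1.0, 0.0]  # orthogonal to the source: retry or drop

print(passes_gate(grounded, source))   # True
print(passes_gate(off_topic, source))  # False
```

Fast gates like this filter cheaply in bulk; the LLM-as-judge pass is then reserved for the rows that survive.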
Monitoring tracks latency, token usage, errors, and evaluation signals over time so long runs remain operable — synthetic data generation is as much an engineering workload as a modeling one.
Design choices that survived contact with reality
Two ideas from the original technical framing stayed true in the shipped library:
Composition over inheritance. You should not need to fork a monolithic "generator" class for every new corpus or policy. Behavior is injected through callbacks and strategies — instruction logic, prompt modification, storage, evaluation — so the core stays stable while your pipeline evolves.
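The pattern looks roughly like this. All names here are illustrative, not AfterImage's API; what matters is that the engine never needs subclassing because every behavior arrives as a callable.

```python
class Engine:
    """A fixed core loop; instruction logic, prompt modification, and
    acceptance criteria are injected rather than inherited."""
    def __init__(self, make_instruction, modify_prompt, accept):
        self.make_instruction = make_instruction
        self.modify_prompt = modify_prompt
        self.accept = accept

    def run_once(self, system_prompt):
        instruction = self.make_instruction()
        prompt = self.modify_prompt(system_prompt)  # e.g. inject retrieved chunks
        answer = f"(answer to: {instruction})"      # stand-in for a model call
        return answer if self.accept(answer) else None

engine = Engine(
    make_instruction=lambda: "Explain grounding.",
    modify_prompt=lambda p: p + "\nUse the provided context.",
    accept=lambda a: len(a) > 0,
)
print(engine.run_once("You are a helpful assistant."))
```

Swapping a corpus or a policy means swapping one callable, not forking the engine.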
Provider portability. AfterImage normalizes chat sessions, structured calls, token accounting, and metadata across Gemini, OpenAI-compatible APIs (including DeepSeek and OpenRouter where applicable), and local OpenAI-compatible servers (for example vLLM, Ollama, llama.cpp). Conversation state is managed independently of any single vendor's chat API quirks.
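Vendor independence usually means one internal turn type rendered into each provider's wire format at call time. The dataclass and renderers below are a sketch of that idea, not AfterImage's actual adapters; the OpenAI and Gemini message shapes are the public conventions of those APIs.

```python
from dataclasses import dataclass

@dataclass
class Turn:
    role: str      # internal roles: "user" or "assistant"
    content: str

def to_openai(turns):
    """OpenAI-compatible chat format: role/content dicts."""
    return [{"role": t.role, "content": t.content} for t in turns]

def to_gemini(turns):
    """Gemini-style contents: the assistant role is called "model"
    and text lives inside a parts list."""
    role = {"user": "user", "assistant": "model"}
    return [{"role": role[t.role], "parts": [{"text": t.content}]} for t in turns]

turns = [Turn("user", "hi"), Turn("assistant", "hello")]
print(to_openai(turns)[1]["role"])   # assistant
print(to_gemini(turns)[1]["role"])   # model
```

Conversation state stays in the internal representation, so a vendor's quirks are confined to its renderer.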
Internally, recent work strengthened the parts that matter most in production: smarter context and persona sampling (including coverage-oriented behavior), clearer separation between orchestration, sampling, and quality gating, and optional capture of reasoning / thinking content from compatible providers when you want that signal preserved in your dataset.
What AfterImage is not
AfterImage is not a promise that synthetic data is "as good as hand-labeled gold for all sorts of use cases" without validation, and it is not a theory of instruction learning. It is a pragmatic engine for people who need repeatable, configurable, observable generation — from a quick experiment to a large batch job with budgets and filters.
The same limitations that applied when we wrote the internal report still apply in public form: evaluator bias can leak into curated sets; LLM judges cost time and money; personas need care so diversity does not collapse into caricature. We ship controls and visibility so you can see those trade-offs instead of discovering them only after training.
Try it
- Repository: github.com/altaidevorg/afterimage
- Documentation: afterimage.altai.dev
- Install: Python 3.11+, then pip install afterimage or uv add afterimage
- Quick start: afterimage generate -c examples/configs/basic.yaml (see examples/configs/ for RAG, local models, stopping budgets, and more)
Optional extras cover local embeddings (embeddings-local), a small FastAPI server (server), and a training / demo stack (training) for the Gradio demo and training helpers — see pyproject.toml for the exact dependency sets.
Closing
Open sourcing AfterImage is our invitation to treat synthetic data as infrastructure: something you configure, measure, and iterate on like any other part of the ML stack. If you build something with it — or hit a wall we should address — we would love to hear from you in issues and discussions on GitHub.