Skip to content

AfterImage

Generate synthetic multi-turn chat datasets from your documents. Point AfterImage at a corpus — PDFs, markdown files, internal wikis — and get SFT-ready JSONL out.

Open Source

AfterImage is free and open source. View on GitHub →

Why AfterImage

When you fine-tune a model on a specific domain, you usually have the documents but not the conversations. Generating flat Q&A pairs is easy — but models trained on single-turn data sound like templates. They can't handle follow-ups, clarifications, or how users actually navigate a topic.

AfterImage generates multi-turn dialogues grounded strictly in your source documents. A simulated user (Correspondent) and a simulated assistant (Respondent) hold a real back-and-forth, both anchored to context chunks from your corpus.

Honest caveats:

  • Quality degrades past 5–6 turns — keep conversations short
  • Weak models (< 20B class) generate repetitive loops or superficial answers — use a capable mode such as Gemma4 31B or Deepseek Reasoner that explicitly allows distillation, or choose GPT-4o or higher, Claude 3.5 or higher, Gemini 2.5 Flash or gigher for researchpurposes and at your own risk.

Installation

bash
pip install afterimage

Python 3.11+ required.

Optional extras:

bash
pip install "afterimage[embeddings]"   # quality gating via embeddings
pip install "afterimage[server]"       # FastAPI=based API
pip install "afterimage[training]"     # training tool integrations on the demo

Quick Start

CLI

You can use the CLI to generate conversations from your documents in common use cases.

bash
afterimage generate --help
Usage: afterimage generate [OPTIONS]

  Generate synthetic conversation dataset from config.

Options:
  -c, --config PATH  Path to YAML config file.  [required]
  --dry-run          Validate config and print plan without generating.
  --help             Show this message and exit.

A sample config file can be found in the examples directory.

bash
# Generate conversations from a folder of documents
afterimage generate -c basic.yaml

# Export to ShareGPT format for Axolotl / Unsloth
afterimage export --format sharegpt --output dataset.jsonl

# Push to HuggingFace Hub
afterimage push --repo your-org/your-dataset

Python API

Afterimage is even more powerful when used as a library in your Python code. You can customize every aspect of the conversation generation process with minimal code thanks to our callback-based architecture and composable components. If you want to learn more, read Architecture & Design in the docs.

python
import os
import asyncio
from afterimage import (
    ConversationGenerator,
    PersonaInstructionGeneratorCallback,
    PersonaGenerator,
    InMemoryDocumentProvider,
    WithContextRespondentPromptModifier,
)

# Get API key
api_key = os.getenv("GEMINI_API_KEY")
if not api_key:
    raise ValueError("Set the GEMINI_API_KEY environment variable!")

# Define the respondent prompt
respondent_prompt = """You are a knowledgeable coffee expert.
Your task is to answer questions about coffee brewing, beans, and history.
Provide detailed, accurate, and enthusiastic responses.
"""

# Prepare contextual documents
# In a real scenario, you might load these from files using DirectoryDocumentProvider or JSONLDocumentProvider
texts = [
    """
    Pour-over coffee is a method of brewing coffee by pouring hot water over ground coffee beans through a filter. 
    The water drains through the coffee and filter into a carafe or mug. 
    Pour-over brewing allows for intricate flavor extraction. 
    Common devices include the V60, Chemex, and Kalita Wave.
    Key variables are grind size, water temperature, and pouring technique.
    """,
    """
    Espresso is a concentrated coffee beverage brewed by forcing hot water under high pressure (9-10 bars) through finely-ground coffee beans.
    It is the base for many drinks like lattes, cappuccinos, and macchiatos.
    A good espresso has a layer of crema on top.
    """,
]
documents = InMemoryDocumentProvider(texts)


async def main():
    print("1. Generating Personas from documents...")
    # Initialize PersonaGenerator
    persona_gen = PersonaGenerator(api_key=api_key)

    # Generate personas for the documents
    # This will populate the .personas attribute of each Document in the provider
    await persona_gen.generate_from_documents(documents)

    # Inspect generated personas
    for i, doc in enumerate(documents.get_all()):
        print(f"\nDocument {i + 1} Personas:")
        for p_entry in doc.personas:
            for p in p_entry.descriptions:
                print(f"- {p}")

    print("\n2. Setting up Conversation Generator...")

    # Set up the persona instruction generator callback
    # This callback will select a random persona from the document's personas
    # and instruct the LLM to roleplay that persona when asking questions.
    instruction_generator_callback = PersonaInstructionGeneratorCallback(
        api_key=api_key,
        documents=documents,
        num_random_contexts=1,
        n_instructions=2,
    )

    # Set up the respondent prompt modifier (optional, but good for RAG/Context usage)
    respondent_prompt_modifier = WithContextRespondentPromptModifier()

    # Initialize the ConversationGenerator
    conv_gen = ConversationGenerator(
        respondent_prompt=respondent_prompt,
        api_key=api_key,
        model_name="gemini-2.5-flash",
        instruction_generator_callback=instruction_generator_callback,
        respondent_prompt_modifier=respondent_prompt_modifier,
    )

    print("3. Generating Conversations...")
    # Generate conversations
    # The 'persona' field in the output will indicate which persona was used.
    await conv_gen.generate(
        num_dialogs=4,
        max_turns=1,
        max_concurrency=2,
    )

    print("\n4. Load generated conversations...")
    conversations = conv_gen.load_conversations()

    print(f"\nGenerated {len(conversations)} conversations.")

    # Display a sample
    if conversations:
        conv = conversations[0]
        print("\nSample Conversation:")
        print(f"Persona: {conv.persona}")
        print(f"Context: {conv.instruction_context[:100]}...")
        for turn in conv.conversations:
            print(f"{turn.role}: {turn.content[:100]} ...")


if __name__ == "__main__":
    asyncio.run(main())

Documentation

You can read the full documentation for afterimage atafterimage.altai.dev, but here's some useful links:

Your model. Not theirs.