OpenSimula examples walkthrough

A guided explanation of the OpenSimula pieces used in examples/simula.

This walkthrough explains the scripts in examples/simula/ as OpenSimula compositions, not just as commands to run. The goal is to make the moving parts visible so you can replace the policy examples with your own documents, task format, checkpoint layout, model provider, or downstream export path.

The examples exercise afterimage.simula, an experimental open implementation of Simula-style synthetic data mechanisms inspired by Davidson et al., "Reasoning-Driven Synthetic Data Generation and Evaluation." They are not a Google reference implementation. The examples default to gemini-2.5-flash because it is fast enough for iterative taxonomy, scenario, and critic loops while staying close to the teacher-model family used in the paper.

There are three runnable scripts:

| Script | What it demonstrates |
| --- | --- |
| minimal_pipeline.py | One document-grounded single-QA datapoint with taxonomy, strategy sampling, meta-prompting, and requirement-critic refinement. |
| mcq_pipeline.py | One four-option MCQ datapoint with the same global/local pipeline plus the MCQ double-critic gate. |
| corpus_batch_qa.py | A production-shaped batch run with a larger static corpus, checkpoint files, resume support, bounded concurrency, JSONL append, and optional Hub upload. |

The Mental Model

OpenSimula separates global coverage from local variety:

| Phase | OpenSimula piece | Purpose |
| --- | --- | --- |
| Dataset intent | instruction_y | Describes the target domain, audience, format, constraints, and non-goals. |
| Optional source grounding | DocumentProvider | Supplies bounded excerpts used while creating factors and taxonomies. |
| Global diversification | OpenSimula.build_taxonomy() | Builds factor taxonomies: the conceptual space the dataset should cover. |
| Joint sampling | OpenSimula.infer_strategies() and sample_mix() | Chooses compatible taxonomy nodes to combine into one datapoint requirement mix. |
| Local diversification | OpenSimula.draw_meta_prompt() | Generates scenario/meta-prompt candidates for a sampled mix, then optionally complexifies one. |
| Task generation | generate_single_qa_datapoint() or generate_mcq_datapoint() | Produces a row and checks it against the requirements. |
| Persistence | Checkpointer and append_datapoints_jsonl() | Saves reusable taxonomy/strategy artifacts and accepted datapoints. |

The generation loop in the minimal single-QA path looks like this:

text
instruction_y + optional documents
  -> build_taxonomy(y, S, D, N)
  -> infer_strategies(bundle)
  -> sample_mix(bundle, strategy)
  -> draw_meta_prompt(y, bundle, mix, K, complexify_c)
  -> generate_single_qa_datapoint(...)
  -> requirement critic accepts or refines
  -> DataPointRecord JSON

For MCQ, the final task step adds a second gate:

text
MCQ JSON
  -> requirement critic/refine loop
  -> double-critic probes the labeled answer
  -> accepted DataPointRecord or None

Imports: The Building Blocks

The examples all start by creating an LLM provider:

python
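import os

from afterimage.providers import LLMFactory

# The key comes from the GEMINI_API_KEY environment variable (see "Running the Examples").
api_key = os.environ["GEMINI_API_KEY"]

llm = LLMFactory.create(
    provider="gemini",
    model_name="gemini-2.5-flash",
    api_key=api_key,
)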

OpenSimula takes that provider and uses it for every structured LLM step: taxonomy proposals, critic merges, strategy inference, meta-prompts, task JSON, requirement critiques, refinements, and MCQ double-critic probes.

The smallest examples import:

python
from afterimage.providers import InMemoryDocumentProvider, LLMFactory
from afterimage.simula import OpenSimula, configure_example_console

You can read them as roles:

  • LLMFactory creates the provider used by the OpenSimula facade.
  • InMemoryDocumentProvider supplies the example policy excerpts as source material.
  • OpenSimula orchestrates taxonomy, sampling, meta-prompt, and datapoint generation.
  • configure_example_console() keeps progress output readable by muting noisy third-party INFO logs.

The batch script adds persistence helpers:

python
from afterimage.simula import (
    Checkpointer,
    OpenSimulaRunConfig,
    append_datapoints_jsonl,
    load_checkpoint,
)

These helpers are the difference between a toy single-row run and a reusable workflow. They let you save the expensive global scaffold once, resume from it, and append accepted task rows as generation completes.

Dataset Intent: instruction_y

Each script defines one INSTRUCTION_Y string. This is the global dataset specification, which corresponds to y in the paper's notation.

In minimal_pipeline.py, the intent is:

python
INSTRUCTION_Y = """\
You are generating synthetic **training Q&A** for enterprise employees...
"""

The wording does a lot of work:

  • It names the audience: enterprise employees.
  • It names the task format: training Q&A.
  • It establishes grounding: answers must use the policy excerpts.
  • It sets style and size constraints: factual tone, no panic language, word limits.
  • It discourages hallucination: no vendor-specific products or laws that the excerpts do not mention.

When adapting these examples, this is usually the first thing to rewrite. A good instruction_y is specific enough that an evaluator could reject rows that miss the mark.
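
One way to test whether an instruction_y is specific enough is to ask whether its constraints can be turned into mechanical checks. The sketch below is a rough illustration of that idea, not part of OpenSimula: `check_answer`, the 80-word limit, and the banned phrases are all hypothetical stand-ins for whatever limits your own instruction states.

```python
def check_answer(
    answer: str,
    max_words: int = 80,  # hypothetical word limit
    banned_phrases: tuple[str, ...] = ("urgent!!!", "catastrophic"),  # hypothetical
) -> list[str]:
    """Return a list of constraint violations; an empty list means no violation found."""
    problems = []
    word_count = len(answer.split())
    if word_count > max_words:
        problems.append(f"too long: {word_count} words > {max_words}")
    lowered = answer.lower()
    for phrase in banned_phrases:
        if phrase in lowered:
            problems.append(f"banned phrase: {phrase!r}")
    return problems
```

If a constraint in your instruction cannot be expressed as a check like this, or as a clear question for an LLM critic, it is probably too vague to enforce.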

Optional Documents: Source Grounding

minimal_pipeline.py and corpus_batch_qa.py use static policy excerpts:

python
docs = InMemoryDocumentProvider(POLICY_EXCERPTS)

Those excerpts are used while building the taxonomy. They help OpenSimula infer factors and branches that are actually present in the domain material. In the minimal script, the source material is intentionally tiny. In the batch script, CORPUS_EXCERPTS has six policy-style snippets so the taxonomy has more surface area.

mcq_pipeline.py passes:

python
document_provider=None

That script demonstrates the no-corpus path. The taxonomy is inferred from instruction_y alone, which is useful when the target domain is a format or assessment style rather than a closed set of documents.

For real runs, replace InMemoryDocumentProvider with a provider that matches your source material. Directory, JSONL, or vector-backed providers are better fits once the corpus no longer fits comfortably in a Python list.
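
As an intermediate step before a dedicated provider, a small corpus stored as text files can be read into the same list-of-strings shape that POLICY_EXCERPTS uses. This is a minimal standard-library sketch: `load_excerpts`, the `*.txt` glob, and the `max_chars` cap are hypothetical choices, and it assumes InMemoryDocumentProvider accepts a list of strings, as the example call suggests.

```python
from pathlib import Path


def load_excerpts(corpus_dir: str, max_chars: int = 2000) -> list[str]:
    """Read every .txt file in corpus_dir and return non-empty, length-bounded excerpts."""
    excerpts = []
    for path in sorted(Path(corpus_dir).glob("*.txt")):
        text = path.read_text(encoding="utf-8").strip()
        if text:
            # Truncate so a single long file cannot dominate the prompt budget.
            excerpts.append(text[:max_chars])
    return excerpts


# Possible usage (assumption: the provider takes a list of strings):
# docs = InMemoryDocumentProvider(load_excerpts("./corpus"))
```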

Console Setup and Progress

The examples call:

python
configure_example_console()

OpenSimula makes many structured LLM calls. Without quiet logging, HTTP client and SDK messages can bury the useful progress signal. The examples also pass:

python
show_progress=True

to build_taxonomy(). That enables nested tqdm progress for factor proposal and per-factor breadth-first taxonomy expansion.

If a script appears paused after constructing OpenSimula, it is usually inside build_taxonomy(). The first structured call can take tens of seconds, and taxonomy expansion is sequential by design because each critic and planning step depends on the current tree state.

Taxonomy Construction

All three scripts build a TaxonomyBundle:

python
bundle = await sim.build_taxonomy(
    INSTRUCTION_Y,
    document_provider=docs,
    target_depth_D=TARGET_DEPTH_D,
    proposal_N=PROPOSAL_N,
    max_factors=MAX_FACTORS,
    max_children_per_node=MAX_CHILDREN_PER_NODE,
    max_frontier_per_depth=MAX_FRONTIER_PER_DEPTH,
    show_progress=True,
)
OpenSimula.validate_taxonomy_bundle(bundle)

Important knobs:

| Argument | Meaning |
| --- | --- |
| target_depth_D | Maximum taxonomy depth per factor. Deeper trees create finer control but cost more. |
| proposal_N | Number of independent child proposals before a critic merges them. |
| max_factors | Caps how many top-level factors are expanded. |
| max_children_per_node | Caps breadth after the critic merge step. |
| max_frontier_per_depth | Caps how many nodes are expanded at each depth. |
| show_progress | Shows tqdm progress during the expensive taxonomy phase. |

The caps matter. A taxonomy is a branching structure; allowing too many factors, children, or frontier nodes can multiply into hundreds of sequential LLM calls. The examples choose conservative defaults so local runs are predictable.
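
To get a feel for how the caps interact, the following back-of-the-envelope estimator computes an upper bound on taxonomy calls. It rests on a simplified cost model that is an assumption, not a description of OpenSimula's internals: every expanded node costs proposal_N proposal calls plus one critic merge call, and factor proposal and planning calls are ignored. `estimate_taxonomy_calls` is a hypothetical helper.

```python
def estimate_taxonomy_calls(
    target_depth_D: int,
    proposal_N: int,
    max_factors: int,
    max_children_per_node: int,
    max_frontier_per_depth: int,
) -> int:
    """Upper bound on LLM calls under the assumed cost model (see lead-in)."""
    calls_per_expansion = proposal_N + 1  # assumed: N proposals + 1 critic merge
    total_expansions = 0
    for _factor in range(max_factors):
        frontier = 1  # each factor starts from a single root node
        for _depth in range(target_depth_D):
            expanded = min(frontier, max_frontier_per_depth)
            total_expansions += expanded
            # Worst case: every expanded node keeps the maximum number of children.
            frontier = expanded * max_children_per_node
    return total_expansions * calls_per_expansion


print(estimate_taxonomy_calls(2, 3, 3, 4, 4))   # conservative caps: 60
print(estimate_taxonomy_calls(3, 3, 6, 6, 12))  # looser caps: 456
```

Even under this rough model, adding one level of depth and doubling the factor count moves the bound from dozens of calls to several hundred.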

The resulting bundle stores:

  • the original instruction_y;
  • document digests for bounded excerpts used during construction;
  • accepted factors;
  • one taxonomy tree per factor;
  • expansion traces and per-depth plans for auditability.

Sampling Strategies and Mixes

After the taxonomy is built, the scripts infer sampling strategies:

python
spec = await sim.infer_strategies(bundle)
mix = sim.sample_mix(bundle, spec)

The strategy spec describes which factors can be sampled together and with what weights. This prevents arbitrary combinations of taxonomy leaves from producing incoherent requirements.

sample_mix() draws one concrete Mix: a tuple of taxonomy nodes that becomes the requirement set for the next datapoint. In batch mode, every sample gets its own independent mix.

If you need full control, you can hand-author a SamplingStrategySpec, but the examples use infer_strategies() because it is the fastest way to get a reasonable mechanism from a new taxonomy.
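
The idea of weighted, compatibility-aware sampling can be illustrated with a toy model. The structures below are hypothetical and much simpler than the real TaxonomyBundle, SamplingStrategySpec, and Mix types: a strategy here is just a weighted group of factors that may appear together, and a mix is one leaf per factor in the chosen group.

```python
import random

# Toy stand-ins (hypothetical): factor name -> leaf nodes.
TAXONOMY = {
    "topic": ["password resets", "device loss", "phishing reports"],
    "role": ["new hire", "manager"],
    "difficulty": ["recall", "scenario judgment"],
}

# Each strategy lists factors that are compatible, with a sampling weight.
STRATEGIES = [
    {"factors": ("topic", "role"), "weight": 0.7},
    {"factors": ("topic", "difficulty"), "weight": 0.3},
]


def sample_toy_mix(rng: random.Random) -> tuple[tuple[str, str], ...]:
    """Pick one strategy by weight, then one leaf for each factor it names."""
    weights = [s["weight"] for s in STRATEGIES]
    strategy = rng.choices(STRATEGIES, weights=weights, k=1)[0]
    return tuple((factor, rng.choice(TAXONOMY[factor])) for factor in strategy["factors"])


print(sample_toy_mix(random.Random(0)))
```

Because "role" and "difficulty" never share a strategy in this toy setup, no mix can combine them, which is the kind of incoherent pairing a strategy spec is meant to rule out.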

Meta-Prompts: Local Diversity

The sampled mix defines requirements. The meta-prompt turns those requirements into a local scenario:

python
meta = await sim.draw_meta_prompt(
    instruction_y=bundle.instruction_y,
    bundle=bundle,
    mix=mix,
    K=META_PROMPT_K,
    complexify_c=COMPLEXIFY_C,
    sequential=False,
)

K controls how many candidate scenarios are generated before one is sampled from the set. Larger values give the model more chances to vary framing while keeping the same global requirements.

complexify_c is the probability of running an additional complexification step. It is orthogonal to taxonomy coverage: complexity changes the difficulty or nuance of a scenario, while the mix controls which conceptual requirements are present.

The examples set sequential=False for speed. The sequential path generates scenario candidates one by one with prior attempts in context, which can reduce mode collapse at higher cost.
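
The roles of K and complexify_c can be shown without any LLM calls. In this sketch the K candidates are plain strings and `complexify` is a caller-supplied function; `pick_scenario` is a hypothetical helper that mirrors the described behavior (choose one candidate, then complexify with probability complexify_c), not the library's implementation.

```python
import random
from typing import Callable, Sequence


def pick_scenario(
    candidates: Sequence[str],
    complexify_c: float,
    rng: random.Random,
    complexify: Callable[[str], str],
) -> str:
    """Choose one of the K candidates, then complexify it with probability complexify_c."""
    if not candidates:
        raise ValueError("need at least one candidate scenario")
    chosen = rng.choice(candidates)
    # rng.random() is in [0, 1), so c=0.0 never complexifies and c=1.0 always does.
    if rng.random() < complexify_c:
        chosen = complexify(chosen)
    return chosen


candidates = ["lost badge at reception", "shared password in chat", "unknown USB stick"]
print(pick_scenario(candidates, 0.5, random.Random(3), lambda s: s + " (with a time constraint)"))
```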

Single-QA Generation

minimal_pipeline.py uses:

python
row = await sim.generate_single_qa_datapoint(
    instruction_y=bundle.instruction_y,
    bundle=bundle,
    mix=mix,
    meta=meta,
)

This produces one DataPointRecord with a single question and answer. The task generator first emits task JSON, then the requirement critic checks whether the row satisfies instruction_y, the sampled mix, and the meta-prompt.

If the critic rejects the row, OpenSimula can ask the LLM to refine it. The default maximum is four refine rounds. If the row still does not satisfy the requirements, the method returns None.

That is why the script handles both outcomes:

python
if row is None:
    print("No row accepted...")
else:
    print(row.model_dump_json(indent=2))

For a real dataset, treat None as a normal rejected sample, not an exception. Batch pipelines should count accepted rows and continue.
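
The accept-or-refine control flow can be sketched with stub functions in place of LLM calls. `generate_with_refine`, `critic`, and `refine` are hypothetical names; the only detail taken from the text above is the bounded number of refine rounds and the None result when the budget runs out.

```python
from typing import Callable, Optional, Tuple


def generate_with_refine(
    draft: str,
    critic: Callable[[str], Tuple[bool, str]],
    refine: Callable[[str, str], str],
    max_refine_rounds: int = 4,
) -> Optional[str]:
    """Return an accepted candidate, or None once the refine budget is exhausted."""
    candidate = draft
    for round_idx in range(max_refine_rounds + 1):
        accepted, feedback = critic(candidate)
        if accepted:
            return candidate
        if round_idx < max_refine_rounds:
            candidate = refine(candidate, feedback)
    return None


# Stub critic: accept answers of at most five words. Stub refiner: drop the last word.
critic = lambda text: (len(text.split()) <= 5, "too long")
refine = lambda text, feedback: " ".join(text.split()[:-1])

drafts = ["one two three four five six seven", "word " * 20]
rows = [generate_with_refine(d, critic, refine) for d in drafts]
accepted = [r for r in rows if r is not None]
print(f"accepted {len(accepted)} of {len(rows)}")
```

The second draft is still too long after four refinements, so it comes back as None and is simply left out of the accepted count.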

MCQ Generation and Double-Critic

mcq_pipeline.py follows the same taxonomy, strategy, mix, and meta-prompt path. The task-specific call changes:

python
row = await sim.generate_mcq_datapoint(
    instruction_y=bundle.instruction_y,
    bundle=bundle,
    mix=mix,
    meta=meta,
    num_choices=NUM_CHOICES,
)

The MCQ path still runs the requirement critic and refinement loop. After that, it adds a double-critic gate for labeled-answer quality. The gate asks two independent structured probes about the answer label: one from the "this is correct" angle and one from the "this is incorrect" angle. A row is accepted only when the probes support a verifiable single correct answer.
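
One plausible reading of the gate's decision rule, written as plain Python: accept only when the "this is correct" probe holds up and the "this is incorrect" probe fails to make its case. The `ProbeResult` type and `double_critic_gate` function are hypothetical illustrations of that reading, not OpenSimula's actual types or logic.

```python
from dataclasses import dataclass


@dataclass
class ProbeResult:
    holds: bool      # True if the probe's own framing was supported
    rationale: str   # free-text explanation kept for auditing


def double_critic_gate(correct_probe: ProbeResult, incorrect_probe: ProbeResult) -> bool:
    """Accept only if the label is affirmed and the attempt to refute it fails."""
    return correct_probe.holds and not incorrect_probe.holds


affirm = ProbeResult(holds=True, rationale="Option B matches the policy excerpt.")
refute = ProbeResult(holds=False, rationale="No evidence that option B is wrong.")
print(double_critic_gate(affirm, refute))  # True
```

Under this rule, a row where both probes succeed is rejected: if a model can argue convincingly both for and against the label, the answer is not verifiable enough to keep.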

Use this path when the output label matters as much as the question text. For free-form QA, the single-QA requirement critic is usually enough.

Batch Generation

corpus_batch_qa.py turns the single-QA pattern into a multi-sample workflow:

python
async for _idx, rec in sim.aiter_single_qa_samples(
    instruction_y=bundle.instruction_y,
    bundle=bundle,
    spec=spec,
    n=args.num_samples,
    K=META_PROMPT_K,
    complexify_c=COMPLEXIFY_C,
    sequential=False,
    max_concurrency=args.max_concurrency,
    rng=rng,
):
    if rec is not None:
        append_datapoints_jsonl(jsonl_path, [rec])

Each sample independently draws:

text
mix -> meta-prompt -> task row -> critic/refine

max_concurrency bounds how many sample pipelines run at once. Increasing it can improve throughput, but each worker still performs several LLM calls. Start with 1 or 2 when using tight provider quotas.

Rows are appended as each async sample completes. This is deliberate: if a long run crashes after accepting some rows, those rows are already in data/train.jsonl.
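
The same pattern, bounded concurrency plus append-on-completion, can be reproduced with the standard library alone. In this sketch `fake_sample` is a hypothetical stand-in for one full sample pipeline, and `run_batch` is not the library's batch API; it only demonstrates a semaphore limiting in-flight work while accepted rows are written as soon as they finish.

```python
import asyncio
import json
from pathlib import Path


async def fake_sample(idx: int):
    """Stand-in for one sample pipeline; every third sample is 'rejected'."""
    await asyncio.sleep(0.01)
    return None if idx % 3 == 0 else {"idx": idx, "question": f"q{idx}"}


async def run_batch(n: int, max_concurrency: int, jsonl_path: Path) -> int:
    semaphore = asyncio.Semaphore(max_concurrency)

    async def bounded(idx: int):
        async with semaphore:  # at most max_concurrency samples run at once
            return await fake_sample(idx)

    accepted = 0
    tasks = [asyncio.create_task(bounded(i)) for i in range(n)]
    for finished in asyncio.as_completed(tasks):
        row = await finished
        if row is None:
            continue  # rejected sample: expected, not an error
        # Append immediately so accepted rows survive a later crash.
        with jsonl_path.open("a", encoding="utf-8") as fh:
            fh.write(json.dumps(row) + "\n")
        accepted += 1
    return accepted


# Example: asyncio.run(run_batch(6, 2, Path("train.jsonl")))
```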

Checkpoints and Resume

The batch script saves the expensive global artifacts under opensimula/:

text
<output-dir>/
  opensimula/
    manifest.json
    taxonomy_bundle.json
    sampling_strategy.json
    run_config.json
  data/
    train.jsonl

The save path is:

python
with Checkpointer(out) as cp:
    bundle.save(cp)
    spec.save(cp)
    cp.write_run_config(run_cfg)

run_config.json is typed as OpenSimulaRunConfig; it records the model, temperature, taxonomy knobs, sample count, seed, output path, and corpus size. manifest.json identifies the subtree as an AfterImage opensimula checkpoint.

On resume, the script skips taxonomy construction and strategy inference when possible:

python
ckpt = load_checkpoint(out)
bundle = ckpt.bundle
spec = ckpt.sampling_strategy

Use --resume when you want to append more rows using the same conceptual scaffold. This keeps follow-up generation cheaper and makes dataset expansion more consistent.
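
Before attempting a resume, a script can do a cheap pre-flight check that the checkpoint files from the layout above exist. `can_resume` is a hypothetical helper that only tests for file presence; it does not validate contents and is not how load_checkpoint itself decides whether a checkpoint is usable.

```python
from pathlib import Path

# File names taken from the checkpoint layout shown above.
REQUIRED_FILES = ("manifest.json", "taxonomy_bundle.json", "sampling_strategy.json")


def can_resume(output_dir: str) -> bool:
    """True if the opensimula/ subtree contains every file needed to skip the global phase."""
    ckpt_dir = Path(output_dir) / "opensimula"
    return all((ckpt_dir / name).is_file() for name in REQUIRED_FILES)


if can_resume("./runs/corpus1"):
    print("checkpoint found: taxonomy and strategy can be reused")
else:
    print("no complete checkpoint: taxonomy must be built first")
```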

Running the Examples

All scripts require GEMINI_API_KEY unless you edit the provider setup:

bash
export GEMINI_API_KEY="your_api_key_here"

From examples/simula/:

bash
python minimal_pipeline.py
python mcq_pipeline.py
python corpus_batch_qa.py --output-dir ./runs/corpus1 --num-samples 8 --max-concurrency 2
python corpus_batch_qa.py --output-dir ./runs/corpus1 --resume --num-samples 4

For Hub upload from the batch script, set HF_TOKEN or HUGGINGFACE_HUB_TOKEN, then pass:

bash
python corpus_batch_qa.py --output-dir ./runs/corpus1 --push-hf username/dataset-name

The push path uploads the opensimula/ checkpoint subtree and a generated dataset README. Uploading is not a substitute for inspecting your local JSONL before you scale up.

How To Adapt This Example

Use this checklist when adapting examples/simula/ to another domain:

  1. Rewrite INSTRUCTION_Y so it states the audience, task format, constraints, and non-goals.
  2. Replace the static policy excerpts with realistic source material, or pass document_provider=None for an instruction-only taxonomy.
  3. Start with target_depth_D=2, proposal_N=2 or 3, and capped factors while you inspect the taxonomy.
  4. Validate the bundle after construction.
  5. Infer strategies once per taxonomy bundle, then sample mixes many times.
  6. Tune META_PROMPT_K and COMPLEXIFY_C only after the basic rows look grounded and on-format.
  7. Use single-QA for free-form answers; use MCQ when labeled choices need the double-critic gate.
  8. Save checkpoints for any run whose taxonomy cost you do not want to repeat.
  9. Append accepted rows incrementally and treat rejected rows as expected sampling loss.
  10. Inspect data/train.jsonl before increasing num_samples or concurrency.
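
For the final inspection step, a few lines of standard-library Python are enough to count rows, catch malformed lines, and preview the first records. `summarize_jsonl` is a hypothetical helper and deliberately makes no assumptions about the field names inside each row.

```python
import json
from pathlib import Path


def summarize_jsonl(path: str, preview: int = 3) -> dict:
    """Count parseable rows and malformed lines, and keep the first few rows for review."""
    rows, malformed = [], 0
    for line in Path(path).read_text(encoding="utf-8").splitlines():
        if not line.strip():
            continue  # skip blank lines
        try:
            rows.append(json.loads(line))
        except json.JSONDecodeError:
            malformed += 1
    return {"rows": len(rows), "malformed": malformed, "preview": rows[:preview]}


# Example:
# summary = summarize_jsonl("./runs/corpus1/data/train.jsonl")
# print(summary["rows"], "rows,", summary["malformed"], "malformed")
```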

Common Variations

Use Files Instead of Inline Strings

Replace InMemoryDocumentProvider with a provider that reads your corpus. Keep the rest of the OpenSimula flow the same:

text
documents -> taxonomy -> strategies -> mixes -> meta-prompts -> datapoints

Generate a Larger QA Dataset

Use corpus_batch_qa.py as the starting point. Increase --num-samples, keep --max-concurrency conservative, and use --resume so you do not rebuild the taxonomy for every extension run.

Build a Multiple-Choice Benchmark

Start from mcq_pipeline.py, strengthen INSTRUCTION_Y around answer verifiability, and keep num_choices=4 unless your downstream benchmark format requires a different shape.

Feed Simula Scenarios Into Multi-Turn Conversations

OpenSimula can also prepare first-turn scenarios for ConversationGenerator via SimulaInstructionGeneratorCallback. That bridge does not call the LLM itself; it replays precomputed scenario text and metadata so the normal AfterImage conversation machinery can generate multi-turn assistant/user dialogs.

What These Examples Teach

The reusable lesson is not "security policy Q&A." The reusable pattern is:

text
dataset intent
  -> optional source grounding
  -> factor taxonomies for global coverage
  -> weighted strategy mixes
  -> local meta-prompts for scenario diversity
  -> task-specific generation and critics
  -> checkpointed artifacts plus accepted JSONL rows

Once you understand that shape, you can apply OpenSimula to employee training, technical assessments, policy education, support simulations, benchmark-style MCQ generation, or any domain where you want explicit control over both coverage and local variation.
