OpenSimula examples walkthrough
This walkthrough explains the scripts in examples/simula/ as OpenSimula compositions, not just as commands to run. The goal is to make the moving parts visible so you can replace the policy examples with your own documents, task format, checkpoint layout, model provider, or downstream export path.
The examples exercise afterimage.simula, an experimental open implementation of Simula-style synthetic data mechanisms inspired by Davidson et al., Reasoning-Driven Synthetic Data Generation and Evaluation. They are not a Google reference implementation. The examples default to gemini-2.5-flash because it is fast enough for iterative taxonomy, scenario, and critic loops while staying close to the teacher-model family used in the paper.
There are three runnable scripts:
| Script | What it demonstrates |
|---|---|
| minimal_pipeline.py | One document-grounded single-QA datapoint with taxonomy, strategy sampling, meta-prompting, and requirement-critic refinement. |
| mcq_pipeline.py | One four-option MCQ datapoint with the same global/local pipeline plus the MCQ double-critic gate. |
| corpus_batch_qa.py | A production-shaped batch run with a larger static corpus, checkpoint files, resume support, bounded concurrency, JSONL append, and optional Hub upload. |
The Mental Model
OpenSimula separates global coverage from local variety:
| Phase | OpenSimula piece | Purpose |
|---|---|---|
| Dataset intent | instruction_y | Describes the target domain, audience, format, constraints, and non-goals. |
| Optional source grounding | DocumentProvider | Supplies bounded excerpts used while creating factors and taxonomies. |
| Global diversification | OpenSimula.build_taxonomy() | Builds factor taxonomies: the conceptual space the dataset should cover. |
| Joint sampling | OpenSimula.infer_strategies() and sample_mix() | Chooses compatible taxonomy nodes to combine into one datapoint requirement mix. |
| Local diversification | OpenSimula.draw_meta_prompt() | Generates scenario/meta-prompt candidates for a sampled mix, then optionally complexifies one. |
| Task generation | generate_single_qa_datapoint() or generate_mcq_datapoint() | Produces a row and checks it against the requirements. |
| Persistence | Checkpointer and append_datapoints_jsonl() | Saves reusable taxonomy/strategy artifacts and accepted datapoints. |
The generation loop in the minimal single-QA path looks like this:
instruction_y + optional documents
-> build_taxonomy(y, S, D, N)
-> infer_strategies(bundle)
-> sample_mix(bundle, strategy)
-> draw_meta_prompt(y, bundle, mix, K, complexify_c)
-> generate_single_qa_datapoint(...)
-> requirement critic accepts or refines
-> DataPointRecord JSON
For MCQ, the final task step adds a second gate:
MCQ JSON
-> requirement critic/refine loop
-> double-critic probes the labeled answer
-> accepted DataPointRecord or None
Imports: The Building Blocks
The examples all start by creating an LLM provider:
from afterimage.providers import LLMFactory
llm = LLMFactory.create(
    provider="gemini",
    model_name="gemini-2.5-flash",
    api_key=api_key,
)
OpenSimula takes that provider and uses it for every structured LLM step: taxonomy proposals, critic merges, strategy inference, meta-prompts, task JSON, requirement critiques, refinements, and MCQ double-critic probes.
The smallest examples import:
from afterimage.providers import InMemoryDocumentProvider, LLMFactory
from afterimage.simula import OpenSimula, configure_example_console
You can read them as roles:
- LLMFactory creates the provider used by the OpenSimula facade.
- InMemoryDocumentProvider supplies the example policy excerpts as source material.
- OpenSimula orchestrates taxonomy, sampling, meta-prompt, and datapoint generation.
- configure_example_console() keeps progress output readable by muting noisy third-party INFO logs.
The batch script adds persistence helpers:
from afterimage.simula import (
    Checkpointer,
    OpenSimulaRunConfig,
    append_datapoints_jsonl,
    load_checkpoint,
)
These helpers are the difference between a toy single-row run and a reusable workflow. They let you save the expensive global scaffold once, resume from it, and append accepted task rows as generation completes.
Dataset Intent: instruction_y
Each script defines one INSTRUCTION_Y string. This is the global dataset specification, usually called y in the paper mapping.
In minimal_pipeline.py, the intent is:
INSTRUCTION_Y = """\
You are generating synthetic **training Q&A** for enterprise employees...
"""
The wording does a lot of work:
- It names the audience: enterprise employees.
- It names the task format: training Q&A.
- It establishes grounding: answers must use the policy excerpts.
- It sets style and size constraints: factual tone, no panic language, word limits.
- It prevents unwanted hallucination: no vendor-specific products or laws that are not mentioned.
When adapting these examples, this is usually the first thing to rewrite. A good instruction_y is specific enough that an evaluator could reject rows that miss the mark.
Optional Documents: Source Grounding
minimal_pipeline.py and corpus_batch_qa.py use static policy excerpts:
docs = InMemoryDocumentProvider(POLICY_EXCERPTS)
Those excerpts are used while building the taxonomy. They help OpenSimula infer factors and branches that are actually present in the domain material. In the minimal script, the source material is intentionally tiny. In the batch script, CORPUS_EXCERPTS has six policy-style snippets so the taxonomy has more surface area.
mcq_pipeline.py passes:
document_provider=None
That script demonstrates the no-corpus path. The taxonomy is inferred from instruction_y alone, which is useful when the target domain is a format or assessment style rather than a closed set of documents.
For real runs, replace InMemoryDocumentProvider with a provider that matches your source material. Directory, JSONL, or vector-backed providers are better fits once the corpus no longer fits comfortably in a Python list.
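As a minimal sketch of moving beyond inline strings, the helper below reads plain-text files from a directory and truncates each one to a bounded excerpt. The function name load_excerpts, the .txt glob, and the max_chars limit are illustrative assumptions, not AfterImage APIs. The resulting list could be passed wherever the examples pass POLICY_EXCERPTS, assuming the provider accepts a list of strings as it does there.

```python
from pathlib import Path


def load_excerpts(corpus_dir: str, max_chars: int = 2000) -> list[str]:
    """Read every .txt file in corpus_dir and return bounded excerpts.

    Hypothetical helper: the name, glob pattern, and truncation limit are
    illustrative choices, not part of AfterImage.
    """
    excerpts = []
    # Sort for a stable order so repeated runs see the same corpus layout.
    for path in sorted(Path(corpus_dir).glob("*.txt")):
        text = path.read_text(encoding="utf-8").strip()
        if text:  # skip empty or whitespace-only files
            excerpts.append(text[:max_chars])
    return excerpts
```

Truncating on read keeps each excerpt bounded, which matches how the walkthrough describes excerpts being used during taxonomy construction.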
Console Setup and Progress
The examples call:
configure_example_console()
OpenSimula makes many structured LLM calls. Without quiet logging, HTTP client and SDK messages can bury the useful progress signal. The examples also pass:
show_progress=True
to build_taxonomy(). That enables nested tqdm progress for factor proposal and per-factor breadth-first taxonomy expansion.
If a script appears paused after constructing OpenSimula, it is usually inside build_taxonomy(). The first structured call can take tens of seconds, and taxonomy expansion is sequential by design because each critic and planning step depends on the current tree state.
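The internals of configure_example_console() are not shown in this walkthrough. If you need a similar effect in your own script, the log muting it is described as doing can be approximated with the standard logging module. The logger names below are assumptions about which libraries are noisy in a given environment, and quiet_third_party_logs is a hypothetical helper name.

```python
import logging

# Assumed logger names; adjust to whatever your HTTP client and SDK use.
NOISY_LOGGERS = ("httpx", "httpcore", "urllib3")


def quiet_third_party_logs(level: int = logging.WARNING) -> None:
    """Raise the threshold on noisy loggers so their INFO messages are dropped."""
    for name in NOISY_LOGGERS:
        logging.getLogger(name).setLevel(level)


quiet_third_party_logs()
```

Only the named loggers are affected, so your own progress output and any tqdm bars stay visible.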
Taxonomy Construction
All three scripts build a TaxonomyBundle:
bundle = await sim.build_taxonomy(
    INSTRUCTION_Y,
    document_provider=docs,
    target_depth_D=TARGET_DEPTH_D,
    proposal_N=PROPOSAL_N,
    max_factors=MAX_FACTORS,
    max_children_per_node=MAX_CHILDREN_PER_NODE,
    max_frontier_per_depth=MAX_FRONTIER_PER_DEPTH,
    show_progress=True,
)
OpenSimula.validate_taxonomy_bundle(bundle)
Important knobs:
| Argument | Meaning |
|---|---|
| target_depth_D | Maximum taxonomy depth per factor. Deeper trees create finer control but cost more. |
| proposal_N | Number of independent child proposals before a critic merges them. |
| max_factors | Caps how many top-level factors are expanded. |
| max_children_per_node | Caps breadth after the critic merge step. |
| max_frontier_per_depth | Caps how many nodes are expanded at each depth. |
| show_progress | Shows tqdm progress during the expensive taxonomy phase. |
The caps matter. A taxonomy is a branching structure; allowing too many factors, children, or frontier nodes can multiply into hundreds of sequential LLM calls. The examples choose conservative defaults so local runs are predictable.
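To get a feel for how the caps interact, here is a rough back-of-the-envelope estimator. It rests on a simplified cost model that is an assumption, not a description of the library: each expanded node costs proposal_n proposal calls plus one critic merge, and factor proposal and per-depth planning calls are ignored. The function name is hypothetical.

```python
def estimate_taxonomy_calls(
    max_factors: int,
    target_depth: int,
    max_children: int,
    max_frontier: int,
    proposal_n: int,
) -> int:
    """Upper-bound LLM calls under the assumed cost model described above."""
    expansions_per_factor = 0
    frontier = 1  # the factor root is the only node at depth 0
    for _ in range(target_depth):
        expanded = min(frontier, max_frontier)
        expansions_per_factor += expanded
        # Each expanded node yields at most max_children nodes one level down.
        frontier = expanded * max_children
    return max_factors * expansions_per_factor * (proposal_n + 1)


print(estimate_taxonomy_calls(4, 2, 4, 6, 2))    # conservative caps: 60
print(estimate_taxonomy_calls(10, 4, 8, 50, 3))  # loose caps: 4360
```

Even under this simplified model, loosening every cap at once moves the estimate from dozens of calls to thousands, which is why it helps to inspect a small taxonomy first.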
The resulting bundle stores:
- the original instruction_y;
- document digests for bounded excerpts used during construction;
- accepted factors;
- one taxonomy tree per factor;
- expansion traces and per-depth plans for auditability.
Sampling Strategies and Mixes
After the taxonomy is built, the scripts infer sampling strategies:
spec = await sim.infer_strategies(bundle)
mix = sim.sample_mix(bundle, spec)
The strategy spec describes which factors can be sampled together and with what weights. This prevents arbitrary combinations of taxonomy leaves from producing incoherent requirements.
sample_mix() draws one concrete Mix: a tuple of taxonomy nodes that becomes the requirement set for the next datapoint. In batch mode, every sample gets its own independent mix.
If you need full control, you can hand-author a SamplingStrategySpec, but the examples use infer_strategies() because it is the fastest way to get a reasonable mechanism from a new taxonomy.
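The idea of weighted, compatibility-aware sampling can be illustrated with a toy model. Everything below is hypothetical: the TAXONOMY and STRATEGIES structures, the draw_toy_mix name, and the uniform choice of node within a factor are stand-ins for illustration and do not reflect the real SamplingStrategySpec or Mix types.

```python
import random

# Hypothetical toy data: factor name -> candidate taxonomy nodes.
TAXONOMY = {
    "topic": ["phishing", "password hygiene", "device loss"],
    "audience": ["new hire", "manager"],
    "difficulty": ["recall", "applied scenario"],
}

# Hypothetical strategy spec: (weight, factors that may be combined).
STRATEGIES = [
    (3.0, ("topic", "audience")),
    (1.0, ("topic", "difficulty")),
]


def draw_toy_mix(rng: random.Random) -> dict[str, str]:
    """Pick a strategy by weight, then one node for each of its factors."""
    weights = [weight for weight, _ in STRATEGIES]
    _, factors = rng.choices(STRATEGIES, weights=weights, k=1)[0]
    return {factor: rng.choice(TAXONOMY[factor]) for factor in factors}


rng = random.Random(0)  # seeded so a run can be reproduced
for _ in range(3):
    print(draw_toy_mix(rng))
```

Because only factor groups listed in STRATEGIES are ever combined, a mix such as audience plus difficulty without a topic cannot occur in this toy setup.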
Meta-Prompts: Local Diversity
The sampled mix defines requirements. The meta-prompt turns those requirements into a local scenario:
meta = await sim.draw_meta_prompt(
    instruction_y=bundle.instruction_y,
    bundle=bundle,
    mix=mix,
    K=META_PROMPT_K,
    complexify_c=COMPLEXIFY_C,
    sequential=False,
)
K controls how many candidate scenarios are generated before one is subsampled. Larger values give the model more chances to vary framing while keeping the same global requirements.
complexify_c is the probability of running an additional complexification step. It is orthogonal to taxonomy coverage: complexity changes the difficulty or nuance of a scenario, while the mix controls which conceptual requirements are present.
The examples set sequential=False for speed. The sequential path generates scenario candidates one by one with prior attempts in context, which can reduce mode collapse at higher cost.
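The roles of K and complexify_c can be sketched with stub callables in place of LLM calls. This is an illustration under assumptions: draw_scenario, propose, and complexify are hypothetical names, and uniform subsampling is assumed because the walkthrough only says that one candidate is subsampled.

```python
import random
from typing import Callable


def draw_scenario(
    propose: Callable[[int], list[str]],
    complexify: Callable[[str], str],
    k: int,
    complexify_c: float,
    rng: random.Random,
) -> str:
    """Generate k candidates, keep one at random, then maybe complexify it."""
    candidates = propose(k)
    chosen = rng.choice(candidates)
    # rng.random() is in [0.0, 1.0): c=0.0 never fires, c=1.0 always fires.
    if rng.random() < complexify_c:
        chosen = complexify(chosen)
    return chosen


# Stubs standing in for the LLM-backed steps.
def stub_propose(k: int) -> list[str]:
    return [f"scenario {i}" for i in range(k)]


def stub_complexify(text: str) -> str:
    return "[harder] " + text


print(draw_scenario(stub_propose, stub_complexify, 4, 0.5, random.Random(1)))
```

Raising k widens the pool of framings without touching the sampled mix, while complexify_c changes only how often the extra step runs.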
Single-QA Generation
minimal_pipeline.py uses:
row = await sim.generate_single_qa_datapoint(
    instruction_y=bundle.instruction_y,
    bundle=bundle,
    mix=mix,
    meta=meta,
)
This produces one DataPointRecord with a single question and answer. The task generator first emits task JSON, then the requirement critic checks whether the row satisfies instruction_y, the sampled mix, and the meta-prompt.
If the critic rejects the row, OpenSimula can ask the LLM to refine it. The default maximum is four refine rounds. If the row still does not satisfy the requirements, the method returns None.
That is why the script handles both outcomes:
if row is None:
    print("No row accepted...")
else:
    print(row.model_dump_json(indent=2))
For a real dataset, treat None as a normal rejected sample, not an exception. Batch pipelines should count accepted rows and continue.
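The accept, refine, or give-up behavior described above can be sketched with plain callables. This is an illustrative model rather than the library's implementation: generate_with_refinement and its three callables are hypothetical, and the final check after the last refinement is an assumption about how the round limit is applied.

```python
from typing import Callable, Optional


def generate_with_refinement(
    generate: Callable[[], str],
    critique: Callable[[str], Optional[str]],
    refine: Callable[[str, str], str],
    max_refine_rounds: int = 4,
) -> Optional[str]:
    """Return an accepted row, or None if it never passes the critic.

    critique returns None when the row is acceptable, otherwise feedback text.
    """
    row = generate()
    for _ in range(max_refine_rounds):
        feedback = critique(row)
        if feedback is None:
            return row
        row = refine(row, feedback)
    # Assumed behavior: one final check after the last refinement.
    return row if critique(row) is None else None


rows = [
    generate_with_refinement(
        lambda: "a",
        lambda r: None if len(r) >= 3 else "too short",
        lambda r, fb: r + "!",
    ),
    generate_with_refinement(lambda: "a", lambda r: "never good enough", lambda r, fb: r),
]
accepted = [r for r in rows if r is not None]
print(len(accepted), "of", len(rows), "accepted")  # 1 of 2 accepted
```

The caller filters out None and keeps counting, which is the same pattern a batch pipeline would follow.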
MCQ Generation and Double-Critic
mcq_pipeline.py follows the same taxonomy, strategy, mix, and meta-prompt path. The task-specific call changes:
row = await sim.generate_mcq_datapoint(
    instruction_y=bundle.instruction_y,
    bundle=bundle,
    mix=mix,
    meta=meta,
    num_choices=NUM_CHOICES,
)
The MCQ path still runs the requirement critic and refinement loop. After that, it adds a double-critic gate for labeled-answer quality. The gate asks two independent structured probes about the answer label: one from the "this is correct" angle and one from the "this is incorrect" angle. A row is accepted only when the probes support a verifiable single correct answer.
Use this path when the output label matters as much as the question text. For free-form QA, the single-QA requirement critic is usually enough.
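One plausible way to combine the two probes is shown below, with stub probes in place of LLM calls. The decision rule (the "correct" probe must agree and the "incorrect" probe must find no fault) is an assumption for illustration, and double_critic_accepts and the Probe signature are hypothetical names rather than library APIs.

```python
from typing import Callable

# A probe receives the question, the choices, and the labeled answer index.
Probe = Callable[[str, list[str], int], bool]


def double_critic_accepts(
    question: str,
    choices: list[str],
    labeled_index: int,
    probe_is_correct: Probe,
    probe_is_incorrect: Probe,
) -> bool:
    """Accept only if the label is affirmed and no fault is found with it."""
    says_correct = probe_is_correct(question, choices, labeled_index)
    says_incorrect = probe_is_incorrect(question, choices, labeled_index)
    return says_correct and not says_incorrect


# Stub probes: the label is "right" only when it points at "B".
def affirm(q: str, choices: list[str], i: int) -> bool:
    return choices[i] == "B"


def refute(q: str, choices: list[str], i: int) -> bool:
    return choices[i] != "B"


print(double_critic_accepts("Pick B", ["A", "B", "C", "D"], 1, affirm, refute))  # True
print(double_critic_accepts("Pick B", ["A", "B", "C", "D"], 0, affirm, refute))  # False
```

Asking from both angles means a label that a single lenient probe would wave through can still be rejected when the opposing probe finds a problem.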
Batch Generation
corpus_batch_qa.py turns the single-QA pattern into a multi-sample workflow:
async for _idx, rec in sim.aiter_single_qa_samples(
    instruction_y=bundle.instruction_y,
    bundle=bundle,
    spec=spec,
    n=args.num_samples,
    K=META_PROMPT_K,
    complexify_c=COMPLEXIFY_C,
    sequential=False,
    max_concurrency=args.max_concurrency,
    rng=rng,
):
    if rec is not None:
        append_datapoints_jsonl(jsonl_path, [rec])
Each sample independently draws:
mix -> meta-prompt -> task row -> critic/refine
max_concurrency bounds how many sample pipelines run at once. Increasing it can improve throughput, but each worker still performs several LLM calls. Start with 1 or 2 when using tight provider quotas.
Rows are appended as each async sample completes. This is deliberate: if a long run crashes after accepting some rows, those rows are already in data/train.jsonl.
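The combination of bounded concurrency and append-on-completion can be reproduced with the standard library alone. The sketch below is not how aiter_single_qa_samples is implemented; fake_sample and run_batch are hypothetical stand-ins that use asyncio.Semaphore to cap in-flight work and write each accepted row as soon as it finishes.

```python
import asyncio
import json
from pathlib import Path
from typing import Optional


async def fake_sample(idx: int) -> Optional[dict]:
    """Stand-in for one mix -> meta-prompt -> row -> critic pipeline."""
    await asyncio.sleep(0.01)
    # Pretend every third sample is rejected by the critic.
    return None if idx % 3 == 0 else {"idx": idx, "question": f"q{idx}"}


async def run_batch(n: int, max_concurrency: int, jsonl_path: Path) -> int:
    """Run n sample pipelines with bounded concurrency; append accepted rows."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def guarded(idx: int) -> Optional[dict]:
        async with semaphore:  # at most max_concurrency pipelines in flight
            return await fake_sample(idx)

    accepted = 0
    jsonl_path.parent.mkdir(parents=True, exist_ok=True)
    tasks = [asyncio.create_task(guarded(i)) for i in range(n)]
    for finished in asyncio.as_completed(tasks):
        row = await finished
        if row is None:
            continue  # rejected sample: expected, not an error
        # Open in append mode per row so finished rows survive a later crash.
        with jsonl_path.open("a", encoding="utf-8") as fh:
            fh.write(json.dumps(row) + "\n")
        accepted += 1
    return accepted
```

A call such as asyncio.run(run_batch(9, 2, Path("runs/demo/data/train.jsonl"))) would leave six rows on disk, since three of the nine fake samples are rejected.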
Checkpoints and Resume
The batch script saves the expensive global artifacts under opensimula/:
<output-dir>/
  opensimula/
    manifest.json
    taxonomy_bundle.json
    sampling_strategy.json
    run_config.json
  data/
    train.jsonl
The save path is:
with Checkpointer(out) as cp:
    bundle.save(cp)
    spec.save(cp)
    cp.write_run_config(run_cfg)
run_config.json is typed as OpenSimulaRunConfig; it records the model, temperature, taxonomy knobs, sample count, seed, output path, and corpus size. The manifest.json identifies the subtree as an AfterImage opensimula checkpoint.
On resume, the script skips taxonomy construction and strategy inference when possible:
ckpt = load_checkpoint(out)
bundle = ckpt.bundle
spec = ckpt.sampling_strategy
Use --resume when you want to append more rows using the same conceptual scaffold. This keeps follow-up generation cheaper and makes dataset expansion more consistent.
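Before resuming, a script may want to confirm that the checkpoint files are actually present. The guard below is a hypothetical pre-check written with pathlib only; can_resume is not a library function, and the file names are the ones listed in the checkpoint layout earlier in this section.

```python
from pathlib import Path

# File names taken from the checkpoint layout shown in this walkthrough.
REQUIRED_FILES = ("manifest.json", "taxonomy_bundle.json", "sampling_strategy.json")


def can_resume(output_dir: str) -> bool:
    """Return True if every file needed to skip the global phase exists."""
    ckpt_dir = Path(output_dir) / "opensimula"
    return all((ckpt_dir / name).is_file() for name in REQUIRED_FILES)
```

A script could fall back to a fresh taxonomy build when this returns False instead of failing partway through loading.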
Running the Examples
All scripts require GEMINI_API_KEY unless you edit the provider setup:
export GEMINI_API_KEY="your_api_key_here"
From examples/simula/:
python minimal_pipeline.py
python mcq_pipeline.py
python corpus_batch_qa.py --output-dir ./runs/corpus1 --num-samples 8 --max-concurrency 2
python corpus_batch_qa.py --output-dir ./runs/corpus1 --resume --num-samples 4
For Hub upload from the batch script, set HF_TOKEN or HUGGINGFACE_HUB_TOKEN, then pass:
python corpus_batch_qa.py --output-dir ./runs/corpus1 --push-hf username/dataset-name
The push path uploads the opensimula/ checkpoint subtree and a generated dataset README. It does not replace inspecting your local JSONL before scaling.
How To Adapt This Example
Use this checklist when adapting examples/simula/ to another domain:
- Rewrite INSTRUCTION_Y so it states the audience, task format, constraints, and non-goals.
- Replace the static policy excerpts with realistic source material, or pass document_provider=None for an instruction-only taxonomy.
- Start with target_depth_D=2, proposal_N=2 or 3, and capped factors while you inspect the taxonomy.
- Validate the bundle after construction.
- Infer strategies once per taxonomy bundle, then sample mixes many times.
- Tune META_PROMPT_K and COMPLEXIFY_C only after the basic rows look grounded and on-format.
- Use single-QA for free-form answers; use MCQ when labeled choices need the double-critic gate.
- Save checkpoints for any run whose taxonomy cost you do not want to repeat.
- Append accepted rows incrementally and treat rejected rows as expected sampling loss.
- Inspect data/train.jsonl before increasing num_samples or concurrency.
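For the inspection step at the end of the checklist, a small stdlib script is often enough. The sketch below assumes only that each non-empty line is a JSON object; it makes no assumption about which fields a DataPointRecord contains, and summarize_jsonl is a hypothetical helper name.

```python
import json
from collections import Counter
from pathlib import Path


def summarize_jsonl(path: str) -> tuple[int, Counter]:
    """Count rows and how often each top-level key appears."""
    rows = 0
    keys: Counter = Counter()
    with Path(path).open(encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if not line:
                continue  # tolerate blank lines
            record = json.loads(line)
            rows += 1
            keys.update(record.keys())
    return rows, keys
```

If a key appears in fewer rows than the total, some records are missing a field, which is worth checking before generating more.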
Common Variations
Use Files Instead of Inline Strings
Replace InMemoryDocumentProvider with a provider that reads your corpus. Keep the rest of the OpenSimula flow the same:
documents -> taxonomy -> strategies -> mixes -> meta-prompts -> datapoints
Generate a Larger QA Dataset
Use corpus_batch_qa.py as the starting point. Increase --num-samples, keep --max-concurrency conservative, and use --resume so you do not rebuild the taxonomy for every extension run.
Build a Multiple-Choice Benchmark
Start from mcq_pipeline.py, strengthen INSTRUCTION_Y around answer verifiability, and keep num_choices=4 unless your downstream benchmark format requires a different shape.
Feed Simula Scenarios Into Multi-Turn Conversations
OpenSimula can also prepare first-turn scenarios for ConversationGenerator via SimulaInstructionGeneratorCallback. That bridge does not call the LLM itself; it replays precomputed scenario text and metadata so the normal AfterImage conversation machinery can generate multi-turn assistant/user dialogs.
What These Examples Teach
The reusable lesson is not "security policy Q&A." The reusable pattern is:
dataset intent
-> optional source grounding
-> factor taxonomies for global coverage
-> weighted strategy mixes
-> local meta-prompts for scenario diversity
-> task-specific generation and critics
-> checkpointed artifacts plus accepted JSONL rows
Once you understand that shape, you can apply OpenSimula to employee training, technical assessments, policy education, support simulations, benchmark-style MCQ generation, or any domain where you want explicit control over both coverage and local variation.