Caselaw RAG generation walkthrough

This walkthrough explains examples/caselaw_rag/generate.py as an AfterImage composition, not just as a script to run. The goal is to make each moving part visible so you can replace the caselaw pieces with your own documents, prompts, storage, retrieval backend, or model provider.

The script generates synthetic legal research conversations. For demonstration, we use a small slice of the free-law/Caselaw_Access_Project_embeddings dataset: cleaned U.S. court opinion text with precomputed BAAI/bge-base-en-v1.5 vectors. This keeps the example easy to run while still showing the full RAG data-generation pattern.

You can use the same structure with your own corpus. The important requirement is that index_corpus.py or your own ingestion job creates a Qdrant collection with document text in a known payload field and vectors that match the embedding model used later for query-time retrieval.

The Mental Model

At a high level, the script wires together two related but different context flows:

Flow	AfterImage piece	Purpose
Instruction-side context	`QdrantDocumentProvider` + `ContextualInstructionGeneratorCallback`	Samples documents so the simulated user asks grounded questions.
Response-side retrieval	`QdrantRetriever` + `WithRAGRespondentPromptModifier`	Retrieves relevant excerpts for the assistant before it answers.

That split is the main idea. The user side gets a sampled briefing so it can ask realistic questions. The assistant side gets retrieval results so it can answer from the corpus instead of inventing facts.

The generation loop then looks like this:

text

Qdrant collection
  -> QdrantDocumentProvider samples context
  -> ContextualInstructionGeneratorCallback creates user instructions
  -> QdrantRetriever retrieves answer context for each instruction
  -> WithRAGRespondentPromptModifier injects retrieved context
  -> ConversationGenerator simulates user/assistant turns
  -> JSONLStorage writes conversations

Imports: The Building Blocks

The afterimage imports are the reusable pieces:

python

from afterimage import (
    ConversationGenerator,
    ContextualInstructionGeneratorCallback,
    EmbeddingProviderFactory,
    GenerationMonitor,
    WithRAGRespondentPromptModifier,
)
from afterimage.providers import QdrantDocumentProvider
from afterimage.retrievers import QdrantRetriever
from afterimage.storage import JSONLStorage

You can read them as roles:

ConversationGenerator is the orchestrator-facing facade. It runs the simulated conversation and saves rows.
ContextualInstructionGeneratorCallback creates the initial user questions from sampled document context.
QdrantDocumentProvider is a document sampler. It gives AfterImage source text for instruction generation.
QdrantRetriever is a semantic retriever. It searches Qdrant for passages relevant to the current instruction.
WithRAGRespondentPromptModifier adds retrieved passages to the assistant prompt before the assistant answers.
EmbeddingProviderFactory creates the query embedding backend used by QdrantRetriever.
GenerationMonitor records generation metrics and alerts.
JSONLStorage writes the final dataset.

The Qdrant client imports are infrastructure rather than AfterImage concepts:

python

from qdrant_client import AsyncQdrantClient, QdrantClient

This example uses the sync client for document sampling and the async client for retrieval.

Prompts: Defining the Two Actors

The script defines two system prompts:

python

RESPONDENT_PROMPT = """You are a careful senior legal research assistant..."""
CORRESPONDENT_PROMPT = """You are an experienced lawyer or a legally curious client..."""

In AfterImage terms:

The respondent is the assistant being trained or simulated.
The correspondent is the user side of the conversation.

The respondent prompt is strict about grounding:

Use retrieved court opinions and excerpts.
Preserve identifiers when available.
Explain in plain English.
Say what is missing when the context is insufficient.
Do not provide real-world legal advice.

The correspondent prompt shapes the simulated user:

Ask realistic legal questions.
Stay in role.
Do not invent docket numbers or citations.
Use the provided briefing as inspiration.

When adapting this example, these prompts are usually the first thing to change. For customer support, the respondent might be a support agent and the correspondent might be a customer. For medical education, the respondent might be a study tutor and the correspondent might be a student. The architecture can stay the same while the roles change.

CLI Arguments: Making the Example Reusable

_build_parser() exposes the parts you are likely to tune:

python

p.add_argument("--qdrant-url", ...)
p.add_argument("--collection", ...)
p.add_argument("--num-dialogs", ...)
p.add_argument("--max-turns", ...)
p.add_argument("--max-concurrency", ...)
p.add_argument("--gemini-model", ...)
p.add_argument("--embedding-model", ...)
p.add_argument("--output", ...)

Most flags also read from environment variables. That makes the same script usable in local demos, scheduled jobs, and documentation examples without editing source code.

Important knobs:

Flag	Why it matters
`--collection`	Which Qdrant collection to read from.
`--content-key`	Payload field that contains document text. It must match indexing.
`--max-docs`	How many documents the document provider can sample.
`--num-dialogs`	Target number of conversations to generate.
`--max-turns`	Maximum turns per conversation. AfterImage samples uniformly from `1..max_turns`.
`--max-concurrency`	Number of concurrent conversation workers. Keep low for tight API quotas.
`--gemini-max-retries`	Retry count for transient Gemini 429/5xx responses.
`--embedding-model`	Query embedding model. It must match the indexed vector family and dimension.
`--auto-improve`	Enables evaluator-based quality retries. This costs additional LLM and embedding calls.

Qdrant Connection

The helper _qdrant_kwargs() builds shared client settings:

python

def _qdrant_kwargs(url: str, api_key: str | None) -> dict:
    kwargs: dict = {"url": url, "timeout": 120.0}
    if api_key:
        kwargs["api_key"] = api_key
    return kwargs

Both local Qdrant and Qdrant Cloud use the same shape. Cloud adds an API key; local Docker usually does not.

In _async_main(), the script creates both clients:

python

qd = QdrantClient(**qd_kw)
qd_async = AsyncQdrantClient(**qd_kw)

The async client is closed in finally:

python

await qd_async.close()

That pattern is worth copying for any script that owns network clients.

Monitoring and Alerts

The monitor records metrics such as generation time, success rate, error rate, token usage, and conversation length:

python

monitor = GenerationMonitor(
    log_dir=str(args.log_dir),
    alert_handlers=[on_alert],
    metrics_interval=60,
)

In this example, alerts are printed:

python

def on_alert(alert) -> None:
    print(f"alert - {alert.name} - {alert.message}")

For production workflows, this could send alerts to a dashboard, Slack, logs, or your own observability pipeline. For documentation and demos, printing is enough because it shows when generation quality or provider reliability is degrading.

Instruction-Side Context

This block creates the document provider:

python

documents = QdrantDocumentProvider(
    client=qd,
    collection_name=args.collection,
    content_key=args.content_key,
    max_docs=args.max_docs,
)

QdrantDocumentProvider gives AfterImage documents to sample from. It is not the same as retrieval. It is used before the conversation starts, so the instruction generator can create grounded user questions.

Then the script creates the instruction callback:

python

instruction_cb = ContextualInstructionGeneratorCallback(
    api_key=api_key,
    documents=documents,
    model_name=args.gemini_model,
    num_random_contexts=1,
    llm_create_extras={...},
)

This callback asks the LLM to produce user instructions from sampled context. Its output is a GeneratedInstructions object containing:

instructions: one or more user prompts.
context: the sampled source text.
context_id / context_ids: metadata for coverage tracking.

In your own project, swap QdrantDocumentProvider for another provider if your source material lives somewhere else:

InMemoryDocumentProvider for small examples.
DirectoryDocumentProvider for local files.
JSONLDocumentProvider for prepared document rows.
QdrantDocumentProvider for vector database-backed corpora.

Query Embeddings

The retriever needs to embed user instructions before searching Qdrant:

python

embedding_provider = EmbeddingProviderFactory.create(
    {
        "type": "process",
        "model": args.embedding_model,
        "workers": args.embedding_workers,
    },
)

This example uses a local SentenceTransformer process pool. That is why setup requires:

bash

uv sync --extra embeddings-local

The important rule is that query embeddings must be compatible with indexed vectors. Here, index_corpus.py stores 768-dimensional BAAI/bge-base-en-v1.5 vectors, so generate.py defaults to the same model.

If you index with a different model, change both indexing and querying together. Otherwise Qdrant search may fail or return poor matches.

Response-Side Retrieval

The retriever is built separately from the document provider:

python

retriever = QdrantRetriever(
    client=qd,
    collection_name=args.collection,
    embedding_provider=embedding_provider,
    async_client=qd_async,
    payload_key=args.content_key,
    limit=3,
)

This retrieves the top matching excerpts for a generated instruction. limit=3 means the assistant will see up to three retrieved passages.

Then the retriever is wrapped in a respondent prompt modifier:

python

modifier = WithRAGRespondentPromptModifier(retriever=retriever)

The modifier runs before the assistant answers. It takes the base respondent prompt and adds retrieved context, so the assistant response is grounded in search results.

This is the pattern to copy when you want RAG-style synthetic conversations:

text

DocumentProvider -> helps produce the user question
Retriever        -> helps produce the assistant answer
PromptModifier   -> injects retrieved answer context

Storage

The generated rows are written to JSONL:

python

storage = JSONLStorage(conversations_path=str(args.output))

JSONL is a good default because it is easy to inspect, stream, and convert later. Each row contains the conversation plus metadata such as sampled context, retrieved context, and context ids.

If your workflow needs a database, AfterImage also supports SQL storage. The rest of the generation composition can stay mostly the same.

The ConversationGenerator

This is where the pieces come together:

python

conv_gen = ConversationGenerator(
    respondent_prompt=RESPONDENT_PROMPT,
    correspondent_prompt=CORRESPONDENT_PROMPT,
    api_key=api_key,
    model_name=args.gemini_model,
    monitor=monitor,
    auto_improve=args.auto_improve,
    storage=storage,
    instruction_generator_callback=instruction_cb,
    respondent_prompt_modifier=modifier,
    embedding_provider=embedding_provider if args.auto_improve else None,
    llm_factory_kwargs={...},
)

The most important fields are:

Argument	Meaning
`respondent_prompt`	System prompt for the assistant side.
`correspondent_prompt`	System prompt for the simulated user side.
`instruction_generator_callback`	Produces the first user message from sampled context.
`respondent_prompt_modifier`	Adds retrieved context before the assistant answers.
`storage`	Persists generated conversations.
`monitor`	Tracks metrics, alerts, and plots.
`auto_improve`	Runs evaluator-based retries when enabled.
`llm_factory_kwargs`	Passes provider-specific options, such as Gemini retry settings.

This construction is the reusable recipe. The caselaw dataset is just one instance of it.

Running Generation

The actual generation call is short:

python

await conv_gen.generate(
    num_dialogs=args.num_dialogs,
    max_turns=args.max_turns,
    max_concurrency=args.max_concurrency,
)

num_dialogs controls how many conversations to produce.

max_turns is not "always exactly this many turns." AfterImage samples the actual turn count uniformly from 1 through max_turns. With max_turns=1, every conversation is single-turn.

max_concurrency controls how many conversation workers run at once. Higher values can improve throughput but also increase API pressure. For Gemini free or tight quotas, 1 is the safest starting point.

Cleanup

The script closes resources in a finally block:

python

finally:
    await embedding_provider.aclose()
    await qd_async.close()
    monitor.shutdown()

This matters because generation scripts often run for a long time and own process pools, async HTTP sessions, and monitoring threads.

If you add another resource, such as a database client or custom retriever, close it here too.

How To Adapt This Example

Use this checklist when adapting the example to another domain:

Replace RESPONDENT_PROMPT with the assistant behavior you want.
Replace CORRESPONDENT_PROMPT with the user role you want to simulate.
Choose a document provider for your source material.
Make sure your indexed vectors and query embedding model match.
Choose a retriever and prompt modifier if the assistant should use RAG.
Pick storage: JSONL for files, SQL for database-backed runs.
Start with low num_dialogs and max_concurrency.
Turn on auto_improve only when you are ready to pay for quality retries.
Inspect output JSONL before scaling up.

Common Variations

Use local files instead of Qdrant for instruction context

If you do not need vector-backed sampling, use a file or directory provider for the instruction side. You can still use a retriever for response-side RAG, or skip retrieval entirely and use WithContextRespondentPromptModifier for simpler context injection.

Generate customer support data

Keep the same shape:

text

support articles -> document provider -> customer questions
support articles -> retriever -> agent answers

Change the prompts so the respondent is a support agent and the correspondent is a customer with realistic issues.

Generate domain tutoring dialogs

Use textbook sections, lecture notes, or documentation pages as source material. The correspondent can be a beginner, advanced learner, or examiner. The respondent can be a tutor that explains concepts and asks clarifying questions.

Skip RAG for simpler datasets

If every generated answer can rely on the same sampled context, you can remove QdrantRetriever and WithRAGRespondentPromptModifier, then use a simpler context prompt modifier. The tradeoff is less dynamic answer-time retrieval.

What This Example Teaches

The important lesson is not "caselaw plus Qdrant." The reusable pattern is:

text

source material
  -> instruction generation
  -> optional retrieval
  -> controlled assistant/user simulation
  -> monitored JSONL dataset

Once you understand those pieces, you can assemble your own synthetic dataset pipeline for legal research, customer support, technical documentation, medical education, internal knowledge bases, or any domain where conversations should be grounded in source documents.

Convert your raw corpus to SFT data: A walkthrough with AfterImage to generate legal research conversations

Caselaw RAG generation walkthrough ​

The Mental Model ​

Imports: The Building Blocks ​

Prompts: Defining the Two Actors ​

CLI Arguments: Making the Example Reusable ​

Qdrant Connection ​

Monitoring and Alerts ​

Instruction-Side Context ​

Query Embeddings ​

Response-Side Retrieval ​

Storage ​

The ConversationGenerator ​

Running Generation ​

Cleanup ​

How To Adapt This Example ​

Common Variations ​

Use local files instead of Qdrant for instruction context ​

Generate customer support data ​

Generate domain tutoring dialogs ​

Skip RAG for simpler datasets ​

What This Example Teaches ​

Caselaw RAG generation walkthrough

The Mental Model

Imports: The Building Blocks

Prompts: Defining the Two Actors

CLI Arguments: Making the Example Reusable

Qdrant Connection

Monitoring and Alerts

Instruction-Side Context

Query Embeddings

Response-Side Retrieval

Storage

The ConversationGenerator

Running Generation

Cleanup

How To Adapt This Example

Common Variations

Use local files instead of Qdrant for instruction context

Generate customer support data

Generate domain tutoring dialogs

Skip RAG for simpler datasets

What This Example Teaches