Caselaw RAG generation walkthrough
This walkthrough explains examples/caselaw_rag/generate.py as an AfterImage composition, not just as a script to run. The goal is to make each moving part visible so you can replace the caselaw pieces with your own documents, prompts, storage, retrieval backend, or model provider.
The script generates synthetic legal research conversations. For demonstration, we use a small slice of the free-law/Caselaw_Access_Project_embeddings dataset: cleaned U.S. court opinion text with precomputed BAAI/bge-base-en-v1.5 vectors. This keeps the example easy to run while still showing the full RAG data-generation pattern.
You can use the same structure with your own corpus. The important requirement is that index_corpus.py or your own ingestion job creates a Qdrant collection with document text in a known payload field and vectors that match the embedding model used later for query-time retrieval.
The Mental Model
At a high level, the script wires together two related but different context flows:
| Flow | AfterImage piece | Purpose |
|---|---|---|
| Instruction-side context | QdrantDocumentProvider + ContextualInstructionGeneratorCallback | Samples documents so the simulated user asks grounded questions. |
| Response-side retrieval | QdrantRetriever + WithRAGRespondentPromptModifier | Retrieves relevant excerpts for the assistant before it answers. |
That split is the main idea. The user side gets a sampled briefing so it can ask realistic questions. The assistant side gets retrieval results so it can answer from the corpus instead of inventing facts.
The generation loop then looks like this:
Qdrant collection
-> QdrantDocumentProvider samples context
-> ContextualInstructionGeneratorCallback creates user instructions
-> QdrantRetriever retrieves answer context for each instruction
-> WithRAGRespondentPromptModifier injects retrieved context
-> ConversationGenerator simulates user/assistant turns
-> JSONLStorage writes conversationsImports: The Building Blocks
The afterimage imports are the reusable pieces:
from afterimage import (
ConversationGenerator,
ContextualInstructionGeneratorCallback,
EmbeddingProviderFactory,
GenerationMonitor,
WithRAGRespondentPromptModifier,
)
from afterimage.providers import QdrantDocumentProvider
from afterimage.retrievers import QdrantRetriever
from afterimage.storage import JSONLStorageYou can read them as roles:
ConversationGeneratoris the orchestrator-facing facade. It runs the simulated conversation and saves rows.ContextualInstructionGeneratorCallbackcreates the initial user questions from sampled document context.QdrantDocumentProvideris a document sampler. It gives AfterImage source text for instruction generation.QdrantRetrieveris a semantic retriever. It searches Qdrant for passages relevant to the current instruction.WithRAGRespondentPromptModifieradds retrieved passages to the assistant prompt before the assistant answers.EmbeddingProviderFactorycreates the query embedding backend used byQdrantRetriever.GenerationMonitorrecords generation metrics and alerts.JSONLStoragewrites the final dataset.
The Qdrant client imports are infrastructure rather than AfterImage concepts:
from qdrant_client import AsyncQdrantClient, QdrantClientThis example uses the sync client for document sampling and the async client for retrieval.
Prompts: Defining the Two Actors
The script defines two system prompts:
RESPONDENT_PROMPT = """You are a careful senior legal research assistant..."""
CORRESPONDENT_PROMPT = """You are an experienced lawyer or a legally curious client..."""In AfterImage terms:
- The respondent is the assistant being trained or simulated.
- The correspondent is the user side of the conversation.
The respondent prompt is strict about grounding:
- Use retrieved court opinions and excerpts.
- Preserve identifiers when available.
- Explain in plain English.
- Say what is missing when the context is insufficient.
- Do not provide real-world legal advice.
The correspondent prompt shapes the simulated user:
- Ask realistic legal questions.
- Stay in role.
- Do not invent docket numbers or citations.
- Use the provided briefing as inspiration.
When adapting this example, these prompts are usually the first thing to change. For customer support, the respondent might be a support agent and the correspondent might be a customer. For medical education, the respondent might be a study tutor and the correspondent might be a student. The architecture can stay the same while the roles change.
CLI Arguments: Making the Example Reusable
_build_parser() exposes the parts you are likely to tune:
p.add_argument("--qdrant-url", ...)
p.add_argument("--collection", ...)
p.add_argument("--num-dialogs", ...)
p.add_argument("--max-turns", ...)
p.add_argument("--max-concurrency", ...)
p.add_argument("--gemini-model", ...)
p.add_argument("--embedding-model", ...)
p.add_argument("--output", ...)Most flags also read from environment variables. That makes the same script usable in local demos, scheduled jobs, and documentation examples without editing source code.
Important knobs:
| Flag | Why it matters |
|---|---|
--collection | Which Qdrant collection to read from. |
--content-key | Payload field that contains document text. It must match indexing. |
--max-docs | How many documents the document provider can sample. |
--num-dialogs | Target number of conversations to generate. |
--max-turns | Maximum turns per conversation. AfterImage samples uniformly from 1..max_turns. |
--max-concurrency | Number of concurrent conversation workers. Keep low for tight API quotas. |
--gemini-max-retries | Retry count for transient Gemini 429/5xx responses. |
--embedding-model | Query embedding model. It must match the indexed vector family and dimension. |
--auto-improve | Enables evaluator-based quality retries. This costs additional LLM and embedding calls. |
Qdrant Connection
The helper _qdrant_kwargs() builds shared client settings:
def _qdrant_kwargs(url: str, api_key: str | None) -> dict:
kwargs: dict = {"url": url, "timeout": 120.0}
if api_key:
kwargs["api_key"] = api_key
return kwargsBoth local Qdrant and Qdrant Cloud use the same shape. Cloud adds an API key; local Docker usually does not.
In _async_main(), the script creates both clients:
qd = QdrantClient(**qd_kw)
qd_async = AsyncQdrantClient(**qd_kw)The async client is closed in finally:
await qd_async.close()That pattern is worth copying for any script that owns network clients.
Monitoring and Alerts
The monitor records metrics such as generation time, success rate, error rate, token usage, and conversation length:
monitor = GenerationMonitor(
log_dir=str(args.log_dir),
alert_handlers=[on_alert],
metrics_interval=60,
)In this example, alerts are printed:
def on_alert(alert) -> None:
print(f"alert - {alert.name} - {alert.message}")For production workflows, this could send alerts to a dashboard, Slack, logs, or your own observability pipeline. For documentation and demos, printing is enough because it shows when generation quality or provider reliability is degrading.
Instruction-Side Context
This block creates the document provider:
documents = QdrantDocumentProvider(
client=qd,
collection_name=args.collection,
content_key=args.content_key,
max_docs=args.max_docs,
)QdrantDocumentProvider gives AfterImage documents to sample from. It is not the same as retrieval. It is used before the conversation starts, so the instruction generator can create grounded user questions.
Then the script creates the instruction callback:
instruction_cb = ContextualInstructionGeneratorCallback(
api_key=api_key,
documents=documents,
model_name=args.gemini_model,
num_random_contexts=1,
llm_create_extras={...},
)This callback asks the LLM to produce user instructions from sampled context. Its output is a GeneratedInstructions object containing:
instructions: one or more user prompts.context: the sampled source text.context_id/context_ids: metadata for coverage tracking.
In your own project, swap QdrantDocumentProvider for another provider if your source material lives somewhere else:
InMemoryDocumentProviderfor small examples.DirectoryDocumentProviderfor local files.JSONLDocumentProviderfor prepared document rows.QdrantDocumentProviderfor vector database-backed corpora.
Query Embeddings
The retriever needs to embed user instructions before searching Qdrant:
embedding_provider = EmbeddingProviderFactory.create(
{
"type": "process",
"model": args.embedding_model,
"workers": args.embedding_workers,
},
)This example uses a local SentenceTransformer process pool. That is why setup requires:
uv sync --extra embeddings-localThe important rule is that query embeddings must be compatible with indexed vectors. Here, index_corpus.py stores 768-dimensional BAAI/bge-base-en-v1.5 vectors, so generate.py defaults to the same model.
If you index with a different model, change both indexing and querying together. Otherwise Qdrant search may fail or return poor matches.
Response-Side Retrieval
The retriever is built separately from the document provider:
retriever = QdrantRetriever(
client=qd,
collection_name=args.collection,
embedding_provider=embedding_provider,
async_client=qd_async,
payload_key=args.content_key,
limit=3,
)This retrieves the top matching excerpts for a generated instruction. limit=3 means the assistant will see up to three retrieved passages.
Then the retriever is wrapped in a respondent prompt modifier:
modifier = WithRAGRespondentPromptModifier(retriever=retriever)The modifier runs before the assistant answers. It takes the base respondent prompt and adds retrieved context, so the assistant response is grounded in search results.
This is the pattern to copy when you want RAG-style synthetic conversations:
DocumentProvider -> helps produce the user question
Retriever -> helps produce the assistant answer
PromptModifier -> injects retrieved answer contextStorage
The generated rows are written to JSONL:
storage = JSONLStorage(conversations_path=str(args.output))JSONL is a good default because it is easy to inspect, stream, and convert later. Each row contains the conversation plus metadata such as sampled context, retrieved context, and context ids.
If your workflow needs a database, AfterImage also supports SQL storage. The rest of the generation composition can stay mostly the same.
The ConversationGenerator
This is where the pieces come together:
conv_gen = ConversationGenerator(
respondent_prompt=RESPONDENT_PROMPT,
correspondent_prompt=CORRESPONDENT_PROMPT,
api_key=api_key,
model_name=args.gemini_model,
monitor=monitor,
auto_improve=args.auto_improve,
storage=storage,
instruction_generator_callback=instruction_cb,
respondent_prompt_modifier=modifier,
embedding_provider=embedding_provider if args.auto_improve else None,
llm_factory_kwargs={...},
)The most important fields are:
| Argument | Meaning |
|---|---|
respondent_prompt | System prompt for the assistant side. |
correspondent_prompt | System prompt for the simulated user side. |
instruction_generator_callback | Produces the first user message from sampled context. |
respondent_prompt_modifier | Adds retrieved context before the assistant answers. |
storage | Persists generated conversations. |
monitor | Tracks metrics, alerts, and plots. |
auto_improve | Runs evaluator-based retries when enabled. |
llm_factory_kwargs | Passes provider-specific options, such as Gemini retry settings. |
This construction is the reusable recipe. The caselaw dataset is just one instance of it.
Running Generation
The actual generation call is short:
await conv_gen.generate(
num_dialogs=args.num_dialogs,
max_turns=args.max_turns,
max_concurrency=args.max_concurrency,
)num_dialogs controls how many conversations to produce.
max_turns is not "always exactly this many turns." AfterImage samples the actual turn count uniformly from 1 through max_turns. With max_turns=1, every conversation is single-turn.
max_concurrency controls how many conversation workers run at once. Higher values can improve throughput but also increase API pressure. For Gemini free or tight quotas, 1 is the safest starting point.
Cleanup
The script closes resources in a finally block:
finally:
await embedding_provider.aclose()
await qd_async.close()
monitor.shutdown()This matters because generation scripts often run for a long time and own process pools, async HTTP sessions, and monitoring threads.
If you add another resource, such as a database client or custom retriever, close it here too.
How To Adapt This Example
Use this checklist when adapting the example to another domain:
- Replace
RESPONDENT_PROMPTwith the assistant behavior you want. - Replace
CORRESPONDENT_PROMPTwith the user role you want to simulate. - Choose a document provider for your source material.
- Make sure your indexed vectors and query embedding model match.
- Choose a retriever and prompt modifier if the assistant should use RAG.
- Pick storage: JSONL for files, SQL for database-backed runs.
- Start with low
num_dialogsandmax_concurrency. - Turn on
auto_improveonly when you are ready to pay for quality retries. - Inspect output JSONL before scaling up.
Common Variations
Use local files instead of Qdrant for instruction context
If you do not need vector-backed sampling, use a file or directory provider for the instruction side. You can still use a retriever for response-side RAG, or skip retrieval entirely and use WithContextRespondentPromptModifier for simpler context injection.
Generate customer support data
Keep the same shape:
support articles -> document provider -> customer questions
support articles -> retriever -> agent answersChange the prompts so the respondent is a support agent and the correspondent is a customer with realistic issues.
Generate domain tutoring dialogs
Use textbook sections, lecture notes, or documentation pages as source material. The correspondent can be a beginner, advanced learner, or examiner. The respondent can be a tutor that explains concepts and asks clarifying questions.
Skip RAG for simpler datasets
If every generated answer can rely on the same sampled context, you can remove QdrantRetriever and WithRAGRespondentPromptModifier, then use a simpler context prompt modifier. The tradeoff is less dynamic answer-time retrieval.
What This Example Teaches
The important lesson is not "caselaw plus Qdrant." The reusable pattern is:
source material
-> instruction generation
-> optional retrieval
-> controlled assistant/user simulation
-> monitored JSONL datasetOnce you understand those pieces, you can assemble your own synthetic dataset pipeline for legal research, customer support, technical documentation, medical education, internal knowledge bases, or any domain where conversations should be grounded in source documents.