
May 16, 2025 · Yusuf Sarıgöz

Originally published on Medium


Serving files for hungry LLMs

Are you tired of getting only garbage instead of Markdown from PDFs? Yeah, same.

That's why we built llm-food — a FastAPI-based service that converts documents and URLs into clean, LLM-friendly Markdown. It supports batch processing, integrates with Google's Gemini for blazing-fast PDF OCR, and wraps everything in a CLI and Python client that doesn't make you cry.

tl;dr

  • Converts PDFs, DOCX, RTF, PPTX, and HTML/web pages into Markdown.
  • Synchronous and async modes via HTTP API, CLI, and Python client.
  • Batch PDFs go through Gemini Batch Prediction API for high throughput ($1 for ~6,000 pages).
  • Tracks batch jobs in DuckDB.
  • Fully dockerized. Optional auth. One-liner CLI.

Why?

Extracting clean text from PDFs is still a mess.

You've probably seen docling, marker, or pymupdf4llm. They're okay, but they're either slow, resource-hungry, or AGPL-licensed (ouch!). If you just want clean Markdown to fine-tune or RAG with your LLM, it's way more effort than it should be.

Enter Gemini Batch Prediction. It's fast and insanely cheap: $1 per ~6,000 pages. But it's not dev-friendly.

So we wrapped it all up in a neat little microservice: llm-food.

What You Get

Features

  • Convert:
    • PDFs (Gemini, pymupdf4llm, or pypdf)
    • DOCX (via mammoth)
    • RTF (via striprtf)
    • PPTX (python-pptx)
    • HTML or URLs (trafilatura)
  • Run as:
    • FastAPI server (sync and async modes)
    • CLI (llm-food convert-file my.pdf)
    • Python client
  • Async batch jobs with task tracking (DuckDB-based).
  • Docker-ready, with optional Bearer token auth.
  • Extensible and configurable.

Batch Mode: Fast, Cheap, Parallel

PDFs get chunked and sent to Gemini's Batch API. You just upload files and give it a GCS path — we handle the queuing, status tracking, and Markdown export.

Other formats (DOCX, RTF, PPTX) are handled individually as background tasks.
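Client-side, waiting on a batch job is just a poll loop. A minimal sketch, with the status fetcher injected so the loop doesn't depend on the exact API; the state names checked here are assumptions, not the service's documented values:

```python
import time
from typing import Callable

def wait_for_batch(fetch_status: Callable[[], str],
                   poll_seconds: float = 5.0,
                   max_polls: int = 120) -> str:
    """Poll a batch job until it settles.

    fetch_status is any zero-arg callable returning the job's state,
    e.g. a wrapper around `llm-food batch-status <task_id>` or the
    HTTP API. The "completed"/"failed" strings are assumptions."""
    for _ in range(max_polls):
        state = fetch_status()
        if state in ("completed", "failed"):
            return state
        time.sleep(poll_seconds)
    raise TimeoutError("batch job did not settle in time")
```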

How to Use It

Install

bash
pip install 'llm-food[server]'          # Full server
pip install 'llm-food[server,pymupdf]'  # Optional: if you want pymupdf backend

Or just the client:

bash
uv add llm-food

Start the Server

bash
llm-food-serve

The interactive API docs live at http://localhost:8000/docs.

Want Docker?

bash
docker build -t llm-food .
docker run -d -p 8000:8000 --env-file .env llm-food

CLI Usage

bash
llm-food convert-file ./paper.pdf
llm-food convert-url https://example.com
llm-food batch-create doc1.pdf doc2.pdf gs://my-bucket/outputs/
llm-food batch-status <task_id>
llm-food batch-results <task_id>

Set server URL/token via env:

bash
export LLMFOOD_SERVERURL=http://my-host:8000
export LLMFOOD_APITOKEN=secret

Auth & Config

Add a .env file to configure:

  • GCS credentials
  • Gemini API location/project
  • File size limits
  • PDF backend
  • Auth token
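To make the list concrete, here's a sketch of what such a .env might hold. Every key name below (except the standard GOOGLE_APPLICATION_CREDENTIALS) is an illustrative placeholder, not necessarily the project's real setting; check the repository's example configuration for the actual names.

```bash
# Illustrative placeholders only -- the real key names may differ.
GOOGLE_APPLICATION_CREDENTIALS=/path/to/service-account.json  # GCS credentials
GEMINI_PROJECT_ID=my-gcp-project        # hypothetical: Gemini project
GEMINI_LOCATION=us-central1             # hypothetical: Gemini location
MAX_FILE_SIZE_BYTES=10485760            # hypothetical: 10 MB upload cap
PDF_BACKEND=gemini                      # hypothetical: gemini | pymupdf4llm | pypdf
API_AUTH_TOKEN=secret                   # hypothetical: enables Bearer auth
```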

Run locally or on a containerized server.

Roadmap

  • Support for transcription of media files (video and audio)
  • YouTube crawling and batch transcription
  • Configurable automatic chunking

Final Thoughts

Whether you're building a RAG pipeline, a search agent, or just need clean Markdown from messy enterprise docs, llm-food gets you there fast.

But let's be real: raw Markdown is just raw ingredients. We're cooking up an enterprise-grade engine that builds custom synthetic datasets and trains LLMs tailored to your docs. Join the waitlist or drop us a line if that's what you're hungry for.

Your model. Not theirs.