
May 16, 2025 · Yusuf Sarıgöz

Originally published on Medium


Serving files for hungry LLMs

Are you tired of getting only garbage instead of Markdown from PDFs? Yeah, same.

That's why we built llm-food — a FastAPI-based service that converts documents and URLs into clean, LLM-friendly Markdown. It supports batch processing, integrates with Google's Gemini for blazing-fast PDF OCR, and wraps everything in a CLI and Python client that doesn't make you cry.

tl;dr

  • Converts PDFs, DOCX, RTF, PPTX, and HTML/web pages into Markdown.
  • Synchronous and async modes via HTTP API, CLI, and Python client.
  • Batch PDFs go through Gemini Batch Prediction API for high throughput ($1 for ~6,000 pages).
  • Tracks batch jobs in DuckDB.
  • Fully dockerized. Optional auth. One-liner CLI.

Why?

Extracting clean text from PDFs is still a mess.

You've probably seen docling, marker, or pymupdf4llm. They're okay, but they're either slow, resource-hungry, or AGPL-licensed (ouch!). If you just want clean Markdown to fine-tune or RAG with your LLM, it's way more effort than it should be.

Enter Gemini Batch Prediction. It's fast and insanely cheap: $1 per ~6,000 pages. But it's not dev-friendly.

So we wrapped it all up in a neat little microservice: llm-food.

What You Get

Features

  • Convert:
    • PDFs (Gemini, pymupdf4llm, or pypdf)
    • DOCX (via mammoth)
    • RTF (via striprtf)
    • PPTX (python-pptx)
    • HTML or URLs (trafilatura)
  • Run as:
    • FastAPI server (sync and async modes)
    • CLI (llm-food convert-file my.pdf)
    • Python client
  • Async batch jobs with task tracking (DuckDB-based).
  • Docker-ready, with optional Bearer token auth.
  • Extensible and configurable.

Batch Mode: Fast, Cheap, Parallel

PDFs get chunked and sent to Gemini's Batch API. You just upload files and give it a GCS path — we handle the queuing, status tracking, and Markdown export.

Other formats (DOCX, RTF, PPTX) are handled individually as background tasks.
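Client-side, waiting on a batch job is just a poll loop. A minimal sketch, with the status fetcher injected so the loop doesn't depend on the exact API; the state names checked here are assumptions, not the service's documented values:

```python
import time
from typing import Callable

def wait_for_batch(fetch_status: Callable[[], str],
                   poll_seconds: float = 5.0,
                   max_polls: int = 120) -> str:
    """Poll a batch job until it settles.

    fetch_status is any zero-arg callable returning the job's state,
    e.g. a wrapper around `llm-food batch-status <task_id>` or the
    HTTP API. The "completed"/"failed" strings are assumptions."""
    for _ in range(max_polls):
        state = fetch_status()
        if state in ("completed", "failed"):
            return state
        time.sleep(poll_seconds)
    raise TimeoutError("batch job did not settle in time")
```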

How to Use It

Install

bash
pip install 'llm-food[server]'          # Full server
pip install 'llm-food[server,pymupdf]'  # Optional: if you want pymupdf backend

Or just the client:

bash
uv add llm-food

Start the Server

bash
llm-food-serve

The interactive API docs live at http://localhost:8000/docs.

Want Docker?

bash
docker build -t llm-food .
docker run -d -p 8000:8000 --env-file .env llm-food

CLI Usage

bash
llm-food convert-file ./paper.pdf
llm-food convert-url https://example.com
llm-food batch-create doc1.pdf doc2.pdf gs://my-bucket/outputs/
llm-food batch-status <task_id>
llm-food batch-results <task_id>

Set server URL/token via env:

bash
export LLMFOOD_SERVERURL=http://my-host:8000
export LLMFOOD_APITOKEN=secret

Auth & Config

Add a .env file to configure:

  • GCS credentials
  • Gemini API location/project
  • File size limits
  • PDF backend
  • Auth token
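To make the list concrete, here's a sketch of what such a .env might hold. Every key name below (except the standard GOOGLE_APPLICATION_CREDENTIALS) is an illustrative placeholder, not necessarily the project's real setting; check the repository's example configuration for the actual names.

```bash
# Illustrative placeholders only -- the real key names may differ.
GOOGLE_APPLICATION_CREDENTIALS=/path/to/service-account.json  # GCS credentials
GEMINI_PROJECT_ID=my-gcp-project        # hypothetical: Gemini project
GEMINI_LOCATION=us-central1             # hypothetical: Gemini location
MAX_FILE_SIZE_BYTES=10485760            # hypothetical: 10 MB upload cap
PDF_BACKEND=gemini                      # hypothetical: gemini | pymupdf4llm | pypdf
API_AUTH_TOKEN=secret                   # hypothetical: enables Bearer auth
```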

Run locally or on a containerized server.

Roadmap

  • Support for transcription of media files (video and audio)
  • YouTube crawling and batch transcription
  • Configurable automatic chunking

Final Thoughts

Whether you're building a RAG pipeline, a search agent, or just need clean Markdown from messy enterprise docs, llm-food gets you there fast.

But let's be real: raw Markdown is just raw ingredients. We're cooking up an enterprise-grade engine that builds custom synthetic datasets and trains LLMs tailored to your docs. Join the waitlist or drop us a line if that's what you're hungry for.

Your model. Not theirs.