A Model Context Protocol (MCP) server that extracts text from PDF files, with optional OCR support for scanned documents.
- Text extraction from PDF files using
pypdf - Word Document support - read
.docxfiles with formatting (headers, tables) - OCR support via Tesseract for scanned/image-based PDFs and DOCX images
- URL support - read documents directly from HTTP/HTTPS URLs
- Page selection - extract specific pages or ranges (PDF only)
- Password protection - decrypt protected PDFs
For OCR support (highly recommended), install Tesseract OCR:
- Windows: Download from UB-Mannheim or run
choco install tesseract - Default path:
C:\Program Files\Tesseract-OCR\tesseract.exe
No manual setup required. The included start_server.bat script will automatically create a virtual environment and install all dependencies the first time you run it.
python -m venv .venv
source .venv/bin/activate # or .venv\Scripts\activate on Windows
pip install -r requirements.txtMethod A: Using Batch Script (Recommended for Windows)
gemini mcp add pdf-parser "C:\Path\To\LLM-PDF-Parser\start_server.bat"Method B: Using Python Directly
gemini mcp add pdf-parser "python" --args "C:\Path\To\LLM-PDF-Parser\server.py"Add the following to your MCP configuration file (e.g., claude_desktop_config.json or VS Code settings).
Method A: Using Batch Script (Recommended for Windows) Ensures dependencies are managed automatically.
{
"mcpServers": {
"pdf-parser": {
"command": "C:\\Path\\To\\LLM-PDF-Parser\\start_server.bat",
"args": [],
"env": {
"PYTHONIOENCODING": "utf-8"
}
}
}
}Method B: Using Python Directly Requires you to handle the virtual environment and dependencies manually.
{
"mcpServers": {
"pdf-parser": {
"command": "python",
"args": ["C:\\Path\\To\\LLM-PDF-Parser\\server.py"],
"env": {
"PYTHONIOENCODING": "utf-8"
}
}
}
}| Parameter | Type | Required | Description |
|---|---|---|---|
file_path |
string | Yes | Absolute path to PDF or URL (http/https) |
pages |
string | No | Page selection: "0,1,5" or "0-5" (0-indexed) |
password |
string | No | Password for encrypted PDFs |
ocr |
boolean | No | Enable Tesseract OCR for images (default: true) |
| Parameter | Type | Required | Description |
|---|---|---|---|
file_path |
string | Yes | Absolute path to DOCX or URL (http/https) |
ocr |
boolean | No | Enable Tesseract OCR for embedded images (default: true) |
# Read entire PDF
read_pdf("C:/docs/report.pdf")
# Read specific pages from PDF
read_pdf("C:/docs/report.pdf", pages="0,2-5")
# Read DOCX file (preserves headers and tables)
read_docx("C:/docs/notes.docx")
# Read from URL
read_pdf("https://example.com/document.pdf")
read_docx("https://example.com/meeting.docx")
# Disable OCR (enabled by default)
read_pdf("C:/docs/scanned.pdf", ocr=False)
# Decrypt protected PDF
read_pdf("C:/docs/protected.pdf", password="secret")- Python 3.10+
pypdf- PDF text extractionpython-docx- Word document extractionpytesseract- OCR wrapper (optional)Pillow- Image processing (optional)requests- URL downloadsmcp- MCP server framework
- Uses pure Python
pypdfinstead ofpymupdffor ARM64 compatibility - OCR requires Tesseract to be installed on the system
- Page numbers are 0-indexed in the
pagesparameter