PDF Parser MCP Server

A Model Context Protocol (MCP) server that extracts text from PDF files, with optional OCR support for scanned documents.

Features

Text extraction from PDF files using pypdf
Word Document support - read .docx files with formatting (headers, tables)
OCR support via Tesseract for scanned/image-based PDFs and DOCX images
URL support - read documents directly from HTTP/HTTPS URLs
Page selection - extract specific pages or ranges (PDF only)
Password protection - decrypt protected PDFs

Prerequisites

For OCR support (highly recommended), install Tesseract OCR:

Windows: Download from UB-Mannheim or run choco install tesseract
Default path: C:\Program Files\Tesseract-OCR\tesseract.exe

Installation & Setup

Option 1: Automatic (Windows)

No manual setup required. The included start_server.bat script will automatically create a virtual environment and install all dependencies the first time you run it.

Option 2: Manual (Mac/Linux/Custom)

python -m venv .venv
source .venv/bin/activate  # or .venv\Scripts\activate on Windows
pip install -r requirements.txt

Configuration

1. Gemini CLI

Method A: Using Batch Script (Recommended for Windows)

gemini mcp add pdf-parser "C:\Path\To\LLM-PDF-Parser\start_server.bat"

Method B: Using Python Directly

gemini mcp add pdf-parser "python" --args "C:\Path\To\LLM-PDF-Parser\server.py"

2. Claude Desktop & Other MCP Clients

Add the following to your MCP configuration file (e.g., claude_desktop_config.json or VS Code settings).

Method A: Using Batch Script (Recommended for Windows) Ensures dependencies are managed automatically.

{
  "mcpServers": {
    "pdf-parser": {
      "command": "C:\\Path\\To\\LLM-PDF-Parser\\start_server.bat",
      "args": [],
      "env": {
        "PYTHONIOENCODING": "utf-8"
      }
    }
  }
}

Method B: Using Python Directly Requires you to handle the virtual environment and dependencies manually.

{
  "mcpServers": {
    "pdf-parser": {
      "command": "python",
      "args": ["C:\\Path\\To\\LLM-PDF-Parser\\server.py"],
      "env": {
        "PYTHONIOENCODING": "utf-8"
      }
    }
  }
}

Tool: `read_pdf`

Parameter	Type	Required	Description
`file_path`	string	Yes	Absolute path to PDF or URL (http/https)
`pages`	string	No	Page selection: `"0,1,5"` or `"0-5"` (0-indexed)
`password`	string	No	Password for encrypted PDFs
`ocr`	boolean	No	Enable Tesseract OCR for images (default: true)

Tool: `read_docx`

Parameter	Type	Required	Description
`file_path`	string	Yes	Absolute path to DOCX or URL (http/https)
`ocr`	boolean	No	Enable Tesseract OCR for embedded images (default: true)

Examples

# Read entire PDF
read_pdf("C:/docs/report.pdf")

# Read specific pages from PDF
read_pdf("C:/docs/report.pdf", pages="0,2-5")

# Read DOCX file (preserves headers and tables)
read_docx("C:/docs/notes.docx")

# Read from URL
read_pdf("https://example.com/document.pdf")
read_docx("https://example.com/meeting.docx")

# Disable OCR (enabled by default)
read_pdf("C:/docs/scanned.pdf", ocr=False)

# Decrypt protected PDF
read_pdf("C:/docs/protected.pdf", password="secret")

Requirements

Python 3.10+
pypdf - PDF text extraction
python-docx - Word document extraction
pytesseract - OCR wrapper (optional)
Pillow - Image processing (optional)
requests - URL downloads
mcp - MCP server framework

Notes

Uses pure Python pypdf instead of pymupdf for ARM64 compatibility
OCR requires Tesseract to be installed on the system
Page numbers are 0-indexed in the pages parameter

Name		Name	Last commit message	Last commit date
Latest commit History 2 Commits
.gitattributes		.gitattributes
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
mcp_config_template.json		mcp_config_template.json
requirements.txt		requirements.txt
server.py		server.py
start_server.bat		start_server.bat
test.pdf		test.pdf
test.py		test.py
test_parser.py		test_parser.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

PDF Parser MCP Server

Features

Prerequisites

Installation & Setup

Option 1: Automatic (Windows)

Option 2: Manual (Mac/Linux/Custom)

Configuration

1. Gemini CLI

2. Claude Desktop & Other MCP Clients

Tool: `read_pdf`

Tool: `read_docx`

Examples

Requirements

Notes

About

Uh oh!

Releases

Packages

Languages

License

Beledarian/LLM-PDF-Parser

Folders and files

Latest commit

History

Repository files navigation

PDF Parser MCP Server

Features

Prerequisites

Installation & Setup

Option 1: Automatic (Windows)

Option 2: Manual (Mac/Linux/Custom)

Configuration

1. Gemini CLI

2. Claude Desktop & Other MCP Clients

Tool: read_pdf

Tool: read_docx

Examples

Requirements

Notes

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Languages

Tool: `read_pdf`

Tool: `read_docx`

Packages