GitHub - sssxks/pdf2md_cli

2md

Convert one or more documents/images to Markdown using Mistral OCR or GLM OCR.

Install (uv)

uv tool install https://github.com/sssxks/pdf2md_cli.git --reinstall

Use as a library

from pdf2md_cli import ConvertOptions, convert_file, convert_files

# single file
res = convert_file("docs/sample.pdf", api_key="...", outdir="out")
print(res.markdown_path)

# batch (supports globs)
batch = convert_files(["docs/*.pdf", "docs/*.png"], api_key="...", outdir="out", workers=4)
print(len(batch.succeeded), len(batch.failed))

Usage

Set your API key:

setx MISTRAL_API_KEY "YOUR_KEY"
setx BIGMODEL_API_KEY "YOUR_KEY"

Run:

2md path\to\file.pdf
2md path\to\image.png
2md path\to\file.docx
2md path\to\slides.pptx
2md docs\*.pdf --workers 4
2md docs\*.png --workers 4
2md path\to\file.pdf -o out
2md path\to\file.pdf --model mistral-ocr-latest
2md path\to\file.pdf --backend glm --model glm-ocr
2md path\to\file.pdf --keep-remote-file
2md path\to\file.pdf                      # default: --table-format html (extract + inline; no tbl-*.html files)
2md path\to\file.pdf --table-format markdown
2md path\to\file.pdf --extract-table --table-format html
2md path\to\file.pdf --extract-table --table-format markdown
2md path\to\file.pdf --no-front-matter --no-page-markers
2md docs\*.pdf --workers 4 --retries 8 --backoff-max-ms 60000

Help:

2md -h
2md help tables
2md help advanced

Table handling behavior:

| Flags | Mistral `table_format` | Markdown output | Sidecar files |
| --- | --- | --- | --- |
| (default) `--table-format html` | `html` | HTML tables are inlined | none |
| `--table-format markdown` | (not sent) | Tables stay inline as markdown | none |
| `--extract-table --table-format html` | `html` | Links to `tbl-*.html` | writes `tbl-*.html` |
| `--extract-table --table-format markdown` | `markdown` | Links to `tbl-*.md` | writes `tbl-*.md` |
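
The four flag combinations can also be read as a small decision function. The sketch below only encodes the rows listed above; `TablePlan` and `plan_tables` are hypothetical names used for illustration and are not part of pdf2md_cli.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class TablePlan:
    api_table_format: Optional[str]  # value sent as Mistral `table_format`; None means "not sent"
    inline: bool                     # True if tables stay inside the main markdown
    sidecar_ext: Optional[str]       # extension of the tbl-* sidecar files, or None


def plan_tables(table_format: str = "html", extract_table: bool = False) -> TablePlan:
    """Hypothetical helper that mirrors the documented flag combinations."""
    if table_format not in ("html", "markdown"):
        raise ValueError(f"unknown table format: {table_format!r}")
    if extract_table:
        # Tables are written to sidecar files and linked from the markdown.
        ext = ".html" if table_format == "html" else ".md"
        return TablePlan(table_format, inline=False, sidecar_ext=ext)
    if table_format == "html":
        # Default: HTML tables are inlined, no sidecar files.
        return TablePlan("html", inline=True, sidecar_ext=None)
    # Plain markdown tables: nothing is sent, tables stay inline.
    return TablePlan(None, inline=True, sidecar_ext=None)
```

For example, `plan_tables("markdown", extract_table=True)` describes the last row: `table_format` is sent as `markdown` and tables land in `tbl-*.md` files.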

Header/footer handling (advanced):

--header {inline,discard,extract,comment} and --footer {inline,discard,extract,comment}
- comment (default): extract and add them back as HTML comments in the markdown
- inline: keep headers/footers in the main markdown
- discard: extract headers/footers but drop them
- extract: extract and write <stem>_headers_footers.md
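
To make the four modes concrete, here is a minimal sketch of what each one does to a single page header. It is an illustration under assumptions: `place_header` is a hypothetical function, and the exact position and formatting of the re-inserted text in the real output may differ.

```python
from typing import Optional, Tuple

MODES = ("inline", "discard", "extract", "comment")


def place_header(body: str, header: str, mode: str = "comment") -> Tuple[str, Optional[str]]:
    """Return (main_markdown, sidecar_text) for one page header.

    Hypothetical illustration of the four modes; not pdf2md_cli's code.
    """
    if mode not in MODES:
        raise ValueError(f"unknown mode: {mode!r}")
    if mode == "inline":
        # Header stays in the main markdown as ordinary text.
        return f"{header}\n\n{body}", None
    if mode == "comment":
        # Header is kept, but hidden from rendered output as an HTML comment.
        return f"<!-- {header} -->\n\n{body}", None
    if mode == "extract":
        # Header is removed from the body and returned for a sidecar file
        # (the CLI writes <stem>_headers_footers.md).
        return body, header
    # discard: header is removed and not stored anywhere.
    return body, None
```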

Supported document formats:

PDF (.pdf)
Word (.docx)
PowerPoint (.pptx)
Text (.txt)
EPUB (.epub)
XML / DocBook / JATS XML (.xml)
RTF (.rtf)
OpenDocument Text (.odt)
BibTeX / BibLaTeX (.bib)
FictionBook (.fb2)
Jupyter Notebooks (.ipynb)
LaTeX (.tex)
OPML (.opml)
Troff (.1, .man)

Supported image formats:

JPEG (.jpg, .jpeg)
PNG (.png)
AVIF (.avif)
TIFF (.tif, .tiff)
GIF (.gif)
HEIC/HEIF (.heic, .heif)
BMP (.bmp)
WebP (.webp)

Backend notes:

--backend mistral (default): set MISTRAL_API_KEY.
--backend glm: set BIGMODEL_API_KEY (or ZHIPUAI_API_KEY).
--backend mock: hidden test backend, only available when PDF2MD_ENABLE_MOCK=1.
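
When calling the library functions, which take `api_key` explicitly, a small lookup along these lines can pick the key for the chosen backend. This is a sketch: `resolve_api_key` is a hypothetical helper, the variable names come from the notes above, and the order in which the two GLM variables are tried is an assumption.

```python
import os
from typing import Mapping, Optional

# Environment variable names per backend, as listed in the backend notes.
# Trying BIGMODEL_API_KEY before ZHIPUAI_API_KEY is an assumption.
KEY_VARS = {
    "mistral": ("MISTRAL_API_KEY",),
    "glm": ("BIGMODEL_API_KEY", "ZHIPUAI_API_KEY"),
}


def resolve_api_key(backend: str = "mistral", env: Optional[Mapping[str, str]] = None) -> str:
    """Hypothetical helper: return the first non-empty key for `backend`."""
    env = os.environ if env is None else env
    if backend not in KEY_VARS:
        raise ValueError(f"unknown backend: {backend!r}")
    names = KEY_VARS[backend]
    for name in names:
        value = env.get(name, "").strip()
        if value:
            return value
    raise RuntimeError(f"no API key found; set {' or '.join(names)}")
```

The result can then be passed on, e.g. `convert_file("docs/sample.pdf", api_key=resolve_api_key("mistral"), outdir="out")`.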

Retries and backoff:

By default the CLI retries transient API failures (e.g. HTTP 429/5xx) with exponential backoff + jitter.
Tweak with --retries, --backoff-initial-ms, --backoff-max-ms, --backoff-multiplier, --backoff-jitter. (For contributors: set PDF2MD_ENABLE_MOCK=1 to enable the hidden mock backend for UX testing.)
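
For intuition about how these knobs interact, the sketch below computes a delay schedule for exponential backoff with jitter. It is not the CLI's implementation: `backoff_delays_ms` is a hypothetical function, the default values are placeholders, and treating the jitter value as a symmetric fraction of each delay is an assumption.

```python
import random
from typing import List, Optional


def backoff_delays_ms(
    retries: int = 5,
    initial_ms: float = 500.0,    # placeholder default, not the CLI's
    max_ms: float = 30000.0,      # placeholder default, not the CLI's
    multiplier: float = 2.0,
    jitter: float = 0.2,          # assumed: fraction of the delay, applied as +/-
    rng: Optional[random.Random] = None,
) -> List[float]:
    """Return the wait before each retry, in milliseconds."""
    if rng is None:
        rng = random.Random()
    delays = []
    base = initial_ms
    for _ in range(retries):
        capped = min(base, max_ms)        # never wait longer than max_ms (before jitter)
        spread = capped * jitter
        delays.append(max(0.0, capped + rng.uniform(-spread, spread)))
        base *= multiplier                # grow exponentially for the next attempt
    return delays
```

With jitter disabled, `backoff_delays_ms(retries=5, max_ms=3000, jitter=0.0)` gives 500, 1000, 2000, 3000, 3000: the delay doubles until it hits the cap. Jitter spreads parallel workers apart so they do not all retry at the same instant after a 429.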

Tips: Remote file cleanup

For document inputs (e.g. .pdf, .docx, .pptx), this CLI uploads the file to Mistral's Files API with purpose="ocr" and deletes the uploaded file after the OCR call completes (best-effort).

For image inputs, the CLI sends a data: URL to the OCR API (no file upload), so there is no remote file to delete.
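
A `data:` URL simply embeds the image bytes, base64-encoded, in the request itself. The sketch below shows one way to build such a URL with the standard library; `image_to_data_url` is a hypothetical helper, and how the CLI constructs its URL internally is not shown in this README.

```python
import base64
import mimetypes
from pathlib import Path


def image_to_data_url(path: str) -> str:
    """Encode an image file as a base64 data: URL (illustrative helper)."""
    mime, _ = mimetypes.guess_type(path)
    if mime is None or not mime.startswith("image/"):
        raise ValueError(f"not a recognized image type: {path}")
    encoded = base64.b64encode(Path(path).read_bytes()).decode("ascii")
    return f"data:{mime};base64,{encoded}"
```

Note that `mimetypes` may not recognize newer extensions such as `.avif` or `.heic` on every Python version, so a real implementation would likely need its own extension table.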

If the CLI's best-effort cleanup fails, you can delete the uploaded file manually:

from mistralai import Mistral
import os

with Mistral(api_key=os.getenv("MISTRAL_API_KEY", "")) as mistral:
    # Replace the example ID below with the ID of the uploaded file to delete.
    res = mistral.files.delete(file_id="3b6d45eb-e30b-416f-8019-f47e2e93d930")
    print(res)
