Author:
None
License:
MIT
Summary:
Best open-source document to markdown extractor for LLM training data. Convert PDF, Word, PowerPoint, Excel, images, URLs to clean markdown, JSON, HTML locally. Alternative to Unstructured, Docling, Marker, MarkItDown, MinerU, PaddleOCR, Tesseract
Latest version:
1.0.4
Required dependencies:
beautifulsoup4 | docling-ibm-models | easyocr | huggingface_hub | lxml | markdownify | numpy | openpyxl | pandas | pdf2image | pillow | pymupdf | pypandoc | python-docx | python-pptx | requests | setuptools | tokenizers | tqdm | transformers | wheel
Optional dependencies:
black | flake8 | mypy | ollama | pytest | pytest-cov
Downloads last day:
9
Downloads last week:
40
Downloads last month:
69
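
The required dependencies listed above can be checked against a local environment before installing or debugging the package. The following is a minimal sketch, not part of the package itself: it uses only the standard-library importlib.metadata module, and the helper names installed_versions and missing are hypothetical names chosen for illustration. The distribution names are copied from the listing above; no version constraints are assumed because the listing gives none.

```python
from importlib import metadata

# Required dependencies, copied from the listing above (PyPI distribution names).
REQUIRED = [
    "beautifulsoup4", "docling-ibm-models", "easyocr", "huggingface_hub",
    "lxml", "markdownify", "numpy", "openpyxl", "pandas", "pdf2image",
    "pillow", "pymupdf", "pypandoc", "python-docx", "python-pptx",
    "requests", "setuptools", "tokenizers", "tqdm", "transformers", "wheel",
]


def installed_versions(names):
    """Map each distribution name to its installed version, or None if absent."""
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


def missing(names):
    """Return the sorted distribution names that are not installed."""
    return sorted(n for n, v in installed_versions(names).items() if v is None)


if __name__ == "__main__":
    # Print one line per required dependency with its local status.
    for name, version in installed_versions(REQUIRED).items():
        print(f"{name}: {version or 'not installed'}")
```

Running the script prints one line per dependency. Passing the optional dependency names (black, flake8, mypy, ollama, pytest, pytest-cov) to the same helpers would report on those as well.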