PyPI page
Home page
Author:
None
Summary:
Production-grade data collection and processing pipeline for training LLMs and multimodal AI
Latest version:
0.1.11
Required dependencies:
click
|
datasets
|
datasketch
|
ftfy
|
huggingface-hub
|
langdetect
|
numpy
|
pillow
|
pyyaml
|
requests
|
rich
|
safetensors
|
scipy
|
sentencepiece
|
soundfile
|
tqdm
|
xxhash
Optional dependencies:
astropy
|
azure-storage-blob
|
black
|
boto3
|
decord
|
extract-msg
|
faiss-cpu
|
google-cloud-storage
|
h5py
|
librosa
|
mlflow
|
mypy
|
opencv-python
|
openpyxl
|
pdfplumber
|
pre-commit
|
psutil
|
pytest
|
pytest-asyncio
|
pytest-cov
|
python-docx
|
python-pptx
|
rarfile
|
ray
|
ruff
|
sentence-transformers
|
striprtf
|
timm
|
transformers
|
tree-sitter
|
tree-sitter-languages
|
wandb
|
warcio
|
zarr
Downloads last day:
28
Downloads last week:
81
Downloads last month:
112