Author:
None
License:
Apache-2.0
Summary:
Toolkit for pre-processing LLM training data.
Latest version:
1.2.1
Required dependencies:
anyascii | blingfire | boto3 | cchardet | charset-normalizer | fasttext-wheel | faust-cchardet | fsspec | jq | jsonpath-ng | msgspec | necessary | nltk | numpy | omegaconf | platformdirs | python-dotenv | pyyaml | requests | rich | s3fs | smart-open | tokenizers | tqdm | uniseg | zstandard
Optional dependencies:
beautifulsoup4 | black | brotli | detect-secrets | dolma | fasttext-wheel | fastwarc | flake8 | flake8-pyi | flake8-pyproject | htmldate | ipdb | ipython | isort | lingua-language-detector | mypy | py3langid | pycld2 | pygments | pytest | regex | resiliparse | trafilatura | types-dateparser | types-pyyaml | url-normalize | w3lib
Downloads last day:
118
Downloads last week:
2,073
Downloads last month:
7,654