Author:
Joseph Jennings, Mostofa Patwary, Sandeep Subramanian, Shrimai Prabhumoye, Ayush Dattagupta, Vibhu Jawa, Jiwei Liu, Ryan Wolf
Summary:
Scalable Data Preprocessing Tool for Training Large Language Models
Latest version:
0.5.0
Required dependencies:
awscli | beautifulsoup4 | charset-normalizer | comment-parser | crossfit | dask | dask-mpi | distributed | fasttext | ftfy | in-place | jieba | justext | lxml-html-clean | mwparserfromhell | nemo-toolkit | numpy | openai | peft | presidio-analyzer | presidio-anonymizer | pycld2 | resiliparse | spacy | unidic-lite | usaddress | warcio | zstandard
Optional dependencies:
cudf-cu12 | cugraph-cu12 | cuml-cu12 | dask-cuda | dask-cudf-cu12 | nvidia-dali-cuda120 | nvidia-nvjpeg2k-cu12 | spacy | timm
Downloads last day:
0
Downloads last week:
8
Downloads last month:
21