
Allow tokenization for gzipped json files in datapreprocess #285

Open
taidnguyen wants to merge 1 commit into mlfoundations:main from taidnguyen:main

Conversation

@taidnguyen

Some datasets, such as Dolma, come in `*.json.gz` format. This adds an option to `smart_open` and tokenize these files in `make_2048.py`. Feel free to close if this is already available elsewhere. Thanks!
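For illustration, here is a minimal standard-library sketch of reading newline-delimited JSON from a `.json.gz` file (the Dolma layout); `smart_open.open` performs the same gzip decompression transparently based on the file suffix, so downstream tokenization code does not change. The filename and the `"text"` key are hypothetical examples, not taken from this repo:

```python
import gzip
import json
import os
import tempfile

# Write a tiny example *.json.gz file: one JSON object per line,
# as in Dolma-style newline-delimited JSON.
records = [{"text": "hello world"}, {"text": "gzipped json"}]
path = os.path.join(tempfile.mkdtemp(), "sample.json.gz")
with gzip.open(path, "wt", encoding="utf-8") as f:
    for rec in records:
        f.write(json.dumps(rec) + "\n")

# Read it back line by line. With smart_open installed, replacing
# gzip.open(path, "rt") by smart_open.open(path) behaves the same
# here, since smart_open picks gzip decompression from the .gz suffix.
texts = []
with gzip.open(path, "rt", encoding="utf-8") as f:
    for line in f:
        texts.append(json.loads(line)["text"])

print(texts)  # the two "text" fields written above
```

The advantage of routing this through `smart_open` in the preprocessing script is that the same call also handles plain `.json` files and remote paths, so gzipped inputs need no separate code path.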

