Releases: thammegowda/mtdata
v5.0.0 - WMT26
What's Changed
- setuptools on python 3.12+, set newest version by @thammegowda in #177
- chore: Add setuptools to dependencies for 3.12+ by @effigies in #176
- WIP v0.5.0 - wmt26 recipes by @thammegowda in #175
New Contributors
Full Changelog: v0.4.3...v5.0.0
v0.4.3: WMT25
What's Changed
- Update typo in README.md by @qpleple in #163
- Added WMT usecases to README by @qpleple in #164
- Huggingface datasets; optional dependency; setup.py -> pyproject.toml by @thammegowda in #165
- Upgrade GH actions to use new python versions by @thammegowda in #167
- "mtdata score" -- score QE metrics; pymarian integration by @thammegowda in #168
- WMT25 by @thammegowda in #166
- remove python 3.8; add python 3.13 by @thammegowda in #171
New Contributors
Full Changelog: v0.4.1...v0.4.3
v0.4.1
- Better parallelization: parallel and mono data are scheduled at once (previously it was one after the other)
- `mtdata cache` added. Improves concurrency by supporting multiple recipes
- Added WMT general test 2022 and 2023
- Added news commentary v18.1 and news crawl 2023
- `mtdata-bcp47`: `-p/--pipe` to map codes from stdin -> stdout
- `mtdata-bcp47`: `--script {suppress-default,suppress-all,express}`
- Uses `pigz` to read and write gzip files by default when `pigz` is in PATH. `export USE_PIGZ=0` to disable
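The pigz behavior above can be sketched roughly as follows; `open_gzip_read` is an illustrative helper, not mtdata's actual function:

```python
import gzip
import os
import shutil
import subprocess

def open_gzip_read(path):
    """Open a .gz file for reading text, preferring pigz (parallel gzip)
    when it is on PATH and not disabled via USE_PIGZ=0."""
    use_pigz = shutil.which("pigz") and os.environ.get("USE_PIGZ", "1") != "0"
    if use_pigz:
        # Stream decompressed text from a pigz subprocess
        proc = subprocess.Popen(["pigz", "-dc", path],
                                stdout=subprocess.PIPE, text=True)
        return proc.stdout
    # Fallback: Python's built-in (single-threaded) gzip module
    return gzip.open(path, "rt")
```

Either branch yields a file-like text stream, so callers need not care which decompressor was used.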
v0.4.0 - monolingual data, mtdata echo, add new datasets at runtime
- Fix: allenai_nllb.json is now included in MANIFEST.in #137. Also fixed CI: Travis -> github actions
- Update ELRC datasets #138. Thanks @AlexUmnov
- Add Jparacrawl Chinese-Japanese subset #143. Thanks @BrightXiaoHan
- Add Flores200 dev and devtests #145. Thanks @ZenBel
- Add support for `mtdata echo <ID>`
- dataset entries only store BibTeX keys and not full citation text
- creates index cache as JSONLine file. (WIP towards dataset statistics)
- Simplified index loading
- simplified compression format handlers. Added support for opening .bz2 files without creating temp files.
- all resources are moved to `mtdata/resource` dir and any new additions to that dir are automatically included in the python package (fail-proof for future issues like #137)
New and exciting features:
- Support for adding new datasets at runtime (`mtdata*.py` from run dir). Note: you have to reindex by calling `mtdata -ri list`
- Monolingual datasets support in progress (currently testing)
- Dataset IDs are now `Group-name-version-lang1-lang2` for bitext and `Group-name-version-lang` for monolingual
- `mtdata list` is updated: `mtdata list -l eng-deu` for bitext and `mtdata list -l eng` for monolingual
- Added: Statmt Newscrawl, news-discussions, Leipzig corpus, ...
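As an illustration of the ID scheme, a small parser can split an ID from both ends, since the middle dataset name may itself contain hyphens. `parse_dataset_id` is a hypothetical helper, not part of mtdata's API, and the monolingual example ID is made up to match the pattern:

```python
def parse_dataset_id(dataset_id, mono=False):
    """Split an ID of the form Group-name-version-lang1-lang2 (bitext)
    or Group-name-version-lang (monolingual). Group is the first field
    and languages are the last one or two, so the remainder in the
    middle is name and version."""
    n_langs = 1 if mono else 2
    parts = dataset_id.split("-")
    group = parts[0]
    langs = parts[-n_langs:]
    version = parts[-n_langs - 1]
    name = "-".join(parts[1:-n_langs - 1])
    return group, name, version, langs
```

For example, `parse_dataset_id("Statmt-commoncrawl-wmt19-fra-deu")` yields group `Statmt`, name `commoncrawl`, version `wmt19`, and languages `fra`/`deu`.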
Skipped 0.3.9 because the changes are too significant and we wanted to bump from 0.3.x -> 0.4.x
0.3.8 - log level, progress bar, refresh OPUS and ELRC; stats
- CLI arg `--log-level` with default set to `WARNING`
- progress bar can be disabled from CLI: `--no-pbar`; default is enabled (`--pbar`)
- `mtdata stats --quick` does HTTP HEAD and shows content length; e.g. `mtdata stats --quick Statmt-commoncrawl-wmt19-fra-deu`
- `python -m mtdata.scripts.recipe_stats` to read stats from output directory
- Security fix with tar extract | Thanks @TrellixVulnTeam
- Added NLLB datasets prepared by AllenAI | Thanks @AlexUmnov
- Opus and ELRC datasets update | Thanks @ZenBel
- default for `fail_on_error` is set to true; returns non-zero exit code on error. Set the `--no-fail` flag to ignore errors during the `mtdata get` command
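The `mtdata stats --quick` mode mentioned above boils down to an HTTP HEAD request that reads Content-Length without downloading the body; a minimal standalone sketch, where `head_content_length` is our own illustrative name:

```python
import urllib.request

def head_content_length(url, opener=urllib.request.urlopen):
    """Issue an HTTP HEAD request and report Content-Length in bytes,
    without fetching the body. `opener` is injectable for testing."""
    req = urllib.request.Request(url, method="HEAD")
    with opener(req) as resp:
        length = resp.headers.get("Content-Length")
    return int(length) if length is not None else None
```

Servers are not required to send Content-Length, hence the `None` fallback for the cases a quick size check cannot answer.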
0.3.7
v0.3.6 : fixes and additions for wmt22
- Fixed KECL-JParaCrawl
- added Paracrawl bonus for ukr-eng
- added Yandex rus-eng corpus
- added Yakut sah-eng
- update recipe for wmt22 constrained eval
disable JW300; add WMT22 recipes; auto generate references.bib
- Parallel download support: `-j/--n-jobs` argument (with default `4`)
- Automatically create references.bib file based on datasets selected
- Add histogram to web search interface (Thanks, @sgowdaks)
- ELRC index updates; (Thanks @kpu)
- Update OPUS index. Use OPUS API to download all datasets
- A lot of new datasets added.
- WARNING: Some OPUS IDs are not backward compatible (version number mismatch)
- Fix: JESC dataset language IDs were wrong
- New datasets:
- jpn-eng: add paracrawl v3, and wmt19 TED
- backtranslation datasets for en2ru and ru2en
- Option to set `MTDATA_RECIPES` dir (default is $PWD). All files matching the glob `${MTDATA_RECIPES}/mtdata.recipes*.yml` are loaded
- WMT22 recipes added
- JW300 is disabled #77
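The parallel download support listed above amounts to fanning fetches out over a worker pool; a generic sketch (not mtdata's actual code), where `fetch` stands in for any URL-downloading callable:

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def download_all(urls, fetch, n_jobs=4):
    """Fetch several URLs concurrently, mirroring a -j/--n-jobs style
    option. Failures are collected per URL rather than aborting the
    whole batch."""
    results, errors = {}, {}
    with ThreadPoolExecutor(max_workers=n_jobs) as pool:
        # Submit every fetch, then harvest in completion order
        futures = {pool.submit(fetch, url): url for url in urls}
        for fut in as_completed(futures):
            url = futures[fut]
            try:
                results[url] = fut.result()
            except Exception as e:
                errors[url] = e
    return results, errors
```

Threads suit this workload because downloads are I/O-bound; a process pool would add serialization overhead for no benefit.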
v0.3.3
- bug fix: XML reading inside tar: ElementTree complains about TarPath
- `mtdata list` has `-g/--groups` and `-ng/--not-groups` as include/exclude filters on group name | closes #91
- `mtdata list` has `-id/--id` flag to print only dataset IDs | closes #91
- add WMT21 tests | closes #90
- add ccaligned datasets wmt21 | closes #89
- add ParIce datasets | closes #88
- add wmt21 en-ha | closes #87
- add wmt21 wikititles v3 | closes #86
- Add train and test sets from StanfordNLP NMT page (large: en-cs, medium: en-de, small: en-vi) | closes #84
- Add support for two URLs for a single dataset (i.e. without zip/tar files)
- Fixed a language match bug #92 / #93
- Fix: language compatibility checks; Closes #94
v0.3.2 - 20211205
- Fix: recipes.yml is missing in the pip installed package
- Add Project Anuvaad: 196 datasets belonging to Indian languages
- add CLI: `mtdata get` has `--fail / --no-fail` arguments to tell whether or not to crash upon errors