Releases · thammegowda/mtdata

13 Apr 18:15

thammegowda

v5.0.0

b33427c

v5.0.0 - WMT26

What's Changed

setuptools on python 3.12+, set newest version by @thammegowda in #177
chore: Add setuptools to dependencies for 3.12+ by @effigies in #176
WIP v0.5.0 - wmt26 recipes by @thammegowda in #175

New Contributors

@effigies made their first contribution in #176

Full Changelog: v0.4.3...v5.0.0

Contributors

effigies and thammegowda

01 Apr 02:12

thammegowda

v0.4.3

fe5e2a2

v0.4.3: WMT25

What's Changed

Update typo in README.md by @qpleple in #163
Added WMT usecases to README by @qpleple in #164
Huggingface datasets; optional dependency; setup.py -> pyproject.toml by @thammegowda in #165
Upgrade GH actions to use new python versions by @thammegowda in #167
"mtdata score" -- score QE metrics; pymarian integration by @thammegowda in #168
WMT25 by @thammegowda in #166
remove python 3.8; add python 3.13 by @thammegowda in #171

New Contributors

@qpleple made their first contribution in #163

Full Changelog: v0.4.1...v0.4.3

Contributors

qpleple and thammegowda

26 Apr 05:08

thammegowda

v0.4.1

9579e11

v0.4.1

Better parallelization: parallel and mono data are scheduled at once (previously it was one after the other)
mtdata cache added. Improves concurrency by supporting multiple recipes
Added WMT general test 2022 and 2023
Added News Commentary 18.1 and News Crawl 2023
mtdata-bcp47 : -p/--pipe to map codes from stdin -> stdout
mtdata-bcp47 : --script {suppress-default,suppress-all,express}
Uses pigz to read and write gzip files by default when pigz is in PATH. export USE_PIGZ=0 to disable
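The pigz behaviour described above can be sketched roughly as follows. This is a minimal illustration, not mtdata's actual implementation: the helper names `pigz_enabled` and `read_gzip_text` are hypothetical, and treating any value other than `0` as "enabled" is an assumption.

```python
import gzip
import os
import shutil
import subprocess


def pigz_enabled() -> bool:
    """Return True when pigz is on PATH and USE_PIGZ is not set to 0."""
    # Assumption: any value other than "0" leaves pigz enabled.
    if os.environ.get("USE_PIGZ", "1") == "0":
        return False
    return shutil.which("pigz") is not None


def read_gzip_text(path: str) -> str:
    """Read a gzip file as UTF-8 text, via pigz when enabled, else the gzip module."""
    if pigz_enabled():
        # `pigz -dc` decompresses to stdout and leaves the input file in place.
        result = subprocess.run(
            ["pigz", "-dc", path], check=True, capture_output=True
        )
        return result.stdout.decode("utf-8")
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        return handle.read()
```

With `USE_PIGZ=0` exported, `read_gzip_text` in this sketch always falls back to Python's built-in `gzip` module.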

27 Mar 04:09

thammegowda

v0.4.0

a50b916

v0.4.0 - monolingual data, mtdata echo, add new datasets at runtime

Fix: allenai_nllb.json is now included in MANIFEST.in #137. Also fixed CI: Travis -> github actions
Update ELRC datasets #138. Thanks @AlexUmnov
Add Jparacrawl Chinese-Japanese subset #143. Thanks @BrightXiaoHan
Add Flores200 dev and devtests #145. Thanks @ZenBel
Add support for mtdata echo <ID>
Dataset entries store only BibTeX keys, not the full citation text
- Creates the index cache as a JSON Lines file (WIP towards dataset statistics)
Simplified index loading
Simplified compression format handlers. Added support for opening .bz2 files without creating temp files.
All resources are moved to the mtdata/resource dir, and any new additions to that dir are automatically included in the Python package (fail-proof against future issues like #137)

New and exciting features:

Support for adding new datasets at runtime (mtdata*.py from run dir). Note: you have to reindex by calling mtdata -ri list
Monolingual datasets support in progress (currently testing)
- Dataset IDs are now Group-name-version-lang1-lang2 for bitext and Group-name-version-lang for monolingual
- mtdata list is updated. mtdata list -l eng-deu for bitext and mtdata list -l eng for monolingual
- Added: Statmt Newscrawl, news-discussions, Leipzig corpus, ...
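To make the new ID scheme concrete, here is a small sketch that splits an ID into its fields. It is not mtdata's own parser: `DatasetID` and `parse_dataset_id` are hypothetical names, and the sketch assumes that no individual field contains a hyphen.

```python
from typing import NamedTuple, Tuple


class DatasetID(NamedTuple):
    group: str
    name: str
    version: str
    langs: Tuple[str, ...]  # two codes for bitext, one for monolingual


def parse_dataset_id(did: str) -> DatasetID:
    """Split 'Group-name-version-lang1[-lang2]' into its fields."""
    # Assumption: fields themselves never contain '-'.
    parts = did.split("-")
    if len(parts) not in (4, 5):
        raise ValueError(f"expected 4 or 5 hyphen-separated fields, got {did!r}")
    group, name, version, *langs = parts
    return DatasetID(group, name, version, tuple(langs))


def is_monolingual(did: DatasetID) -> bool:
    """A monolingual ID carries exactly one language code."""
    return len(did.langs) == 1
```

For example, `parse_dataset_id("Statmt-commoncrawl-wmt19-fra-deu")` yields the group `Statmt`, name `commoncrawl`, version `wmt19`, and the language pair `("fra", "deu")`.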

Skipped 0.3.9: the changes were significant enough to warrant a bump from 0.3.x to 0.4.x

25 Nov 03:31

thammegowda

v0.3.8

d04438d

0.3.8 - log level, progress bar, refresh OPUS and ELRC; stats

CLI arg --log-level with default set to WARNING
The progress bar can be disabled from the CLI with --no-pbar; it is enabled by default (--pbar)
mtdata stats --quick sends an HTTP HEAD request and shows the content length; e.g. mtdata stats --quick Statmt-commoncrawl-wmt19-fra-deu
python -m mtdata.scripts.recipe_stats to read stats from output directory
Security fix with tar extract | Thanks @TrellixVulnTeam
Added NLLB datasets prepared by AllenAI | Thanks @AlexUmnov
Opus and ELRC datasets update | Thanks @ZenBel
The default for fail_on_error is now true, so mtdata get returns a non-zero exit code on error. Pass the --no-fail flag to mtdata get to ignore errors
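The idea behind the quick stats mode, asking the server for a file's size without downloading it, can be sketched with the standard library. This is an illustration only and not the code mtdata uses: `build_head_request` and `remote_size` are hypothetical names, and the 10-second timeout is an arbitrary choice.

```python
import urllib.request
from typing import Optional


def build_head_request(url: str) -> urllib.request.Request:
    """Build a request that asks for headers only (no response body)."""
    return urllib.request.Request(url, method="HEAD")


def remote_size(url: str, timeout: float = 10.0) -> Optional[int]:
    """Return the Content-Length reported by the server, or None if absent."""
    with urllib.request.urlopen(build_head_request(url), timeout=timeout) as response:
        length = response.headers.get("Content-Length")
    return int(length) if length is not None else None
```

Servers are not required to report a Content-Length for HEAD requests, which is why the sketch returns `None` rather than failing in that case.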

Contributors

ZenBel, AlexUmnov, and TrellixVulnTeam

11 Jul 20:43

thammegowda

v0.3.7

b1c0b21

0.3.7

Update ELRC data, including EU acts, which are used for wmt22 (thanks @kpu)

Contributors

kpu

08 Jul 22:37

thammegowda

v0.3.6

f26eda9

v0.3.6 : fixes and additions for wmt22

Fixed KECL-JParaCrawl
added Paracrawl bonus for ukr-eng
added Yandex rus-eng corpus
added Yakut sah-eng
update recipe for wmt22 constrained eval

11 Mar 03:20

thammegowda

v0.3.5

5a9c034

disable JW300; add WMT22 recipes; auto generate references.bib

Parallel download support via the -j/--n-jobs argument (default: 4)
Automatically create references.bib file based on datasets selected
Add histogram to web search interface (Thanks, @sgowdaks)
ELRC index updates; (Thanks @kpu)
Update OPUS index. Use OPUS API to download all datasets
- A lot of new datasets added.
- WARNING: Some OPUS IDs are not backward compatible (version number mismatch)
Fix: JESC dataset language IDs were wrong
New datasets:
- jpn-eng: add paracrawl v3, and wmt19 TED
- backtranslation datasets for en2ru ru2en en2ru
Option to set MTDATA_RECIPES dir (default is $PWD). All files matching the glob ${MTDATA_RECIPES}/mtdata.recipes*.yml are loaded
WMT22 recipes added
JW300 is disabled #77
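The MTDATA_RECIPES lookup described in this release can be approximated as below. This is a sketch under stated assumptions rather than mtdata's implementation: `find_recipe_files` is a hypothetical name, and sorting the matches is a choice made here for determinism.

```python
import os
from pathlib import Path
from typing import List


def find_recipe_files() -> List[Path]:
    """List recipe files matching ${MTDATA_RECIPES}/mtdata.recipes*.yml."""
    # Falls back to the current working directory when the variable is unset.
    recipes_dir = Path(os.environ.get("MTDATA_RECIPES", os.getcwd()))
    return sorted(recipes_dir.glob("mtdata.recipes*.yml"))
```

Under this pattern, files such as `mtdata.recipes.yml` and `mtdata.recipes.wmt22.yml` would both be picked up, while an unrelated `config.yml` in the same directory would not.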

Contributors

kpu and sgowdaks

28 Jan 06:58

thammegowda

v0.3.3

9990d94

v0.3.3

Bug fix: XML reading inside tar, where ElementTree complained about TarPath
mtdata list has -g/--groups and -ng/--not-groups as include/exclude filters on group name | closes #91
mtdata list has an -id/--id flag to print only dataset IDs | closes #91
add WMT21 tests | closes #90
add ccaligned datasets wmt21 | closes #89
add ParIce datasets | closes #88
add wmt21 en-ha | closes #87
add wmt21 wikititles v3 | closes #86
Add train and test sets from StanfordNLP NMT page (large: en-cs, medium: en-de, small: en-vi) | closes #84
- Add support for two URLs for a single dataset (i.e. without zip/tar files)
Fixed a language match bug #92 / #93
Fix: language compatibility checks; Closes #94

06 Dec 17:41

thammegowda

v0.3.2

ca6615a

v0.3.2 - 20211205

Fix: recipes.yml was missing from the pip-installed package
Add Project Anuvaad: 196 datasets belonging to Indian languages
mtdata get now accepts --fail / --no-fail arguments to control whether it crashes or continues upon errors
