Releases: thammegowda/mtdata

v5.0.0 - WMT26

13 Apr 18:15
b33427c

Full Changelog: v0.4.3...v5.0.0

v0.4.3: WMT25

01 Apr 02:12
fe5e2a2

Full Changelog: v0.4.1...v0.4.3

v0.4.1

26 Apr 05:08

  • Better parallelization: parallel and mono data are scheduled at once (previously it was one after the other)
  • mtdata cache added. Improves concurrency by supporting multiple recipes
  • Added WMT general test 2022 and 2023
  • Added News Commentary 18.1 and News Crawl 2023
  • mtdata-bcp47 : -p/--pipe to map codes from stdin -> stdout
  • mtdata-bcp47 : --script {suppress-default,suppress-all,express}
  • Uses pigz to read and write gzip files by default when pigz is in PATH. export USE_PIGZ=0 to disable
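
The pigz behaviour in the last item can be pictured with the sketch below. It is not mtdata's actual implementation: the helper names `pigz_enabled` and `read_gzip_lines` are hypothetical, and only the `USE_PIGZ` variable and the "pigz in PATH" condition come from the notes above. Treating any value other than `0` as "enabled" is an assumption.

```python
import gzip
import os
import shutil
import subprocess


def pigz_enabled() -> bool:
    """Return True when pigz is on PATH and USE_PIGZ is not set to 0."""
    # Assumption: only the literal value "0" disables pigz.
    if os.environ.get("USE_PIGZ", "1") == "0":
        return False
    return shutil.which("pigz") is not None


def read_gzip_lines(path):
    """Yield text lines from a gzip file, through pigz when it is enabled."""
    if pigz_enabled():
        # "pigz -dc" decompresses the file to stdout.
        cmd = ["pigz", "-dc", path]
        with subprocess.Popen(cmd, stdout=subprocess.PIPE,
                              text=True, encoding="utf-8") as proc:
            yield from proc.stdout
    else:
        # Fallback: Python's built-in gzip module.
        with gzip.open(path, "rt", encoding="utf-8") as f:
            yield from f
```

With `export USE_PIGZ=0` in the environment, the sketch always takes the built-in `gzip` path.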

v0.4.0 - monolingual data, mtdata echo, add new datasets at runtime

27 Mar 04:09

  • Fix: allenai_nllb.json is now included in MANIFEST.in #137. Also fixed CI: Travis -> github actions
  • Update ELRC datasets #138. Thanks @AlexUmnov
  • Add Jparacrawl Chinese-Japanese subset #143. Thanks @BrightXiaoHan
  • Add Flores200 dev and devtests #145. Thanks @ZenBel
  • Add support for mtdata echo <ID>
  • Dataset entries now store only BibTeX keys, not the full citation text
    • Creates the index cache as a JSON Lines file (WIP towards dataset statistics)
  • Simplified index loading
  • simplified compression format handlers. Added support for opening .bz2 files without creating temp files.
  • all resources are moved to mtdata/resource dir and any new additions to that dir are automatically included in python package (Fail proof for future issues like #137 )

New and exciting features:

  • Support for adding new datasets at runtime (mtdata*.py from run dir). Note: you have to reindex by calling mtdata -ri list
  • Monolingual datasets support in progress (currently testing)
    • Dataset IDs are now Group-name-version-lang1-lang2 for bitext and Group-name-version-lang for monolingual
    • mtdata list is updated. mtdata list -l eng-deu for bitext and mtdata list -l eng for monolingual
    • Added: Statmt Newscrawl, news-discussions, Leipzig corpus, ...
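
To illustrate the two ID shapes described above, here is a minimal parsing sketch. It is not mtdata's own parser: `DatasetID` and `parse_dataset_id` are hypothetical names, and the sketch assumes that no field contains a hyphen itself.

```python
from typing import NamedTuple, Optional


class DatasetID(NamedTuple):
    group: str
    name: str
    version: str
    lang1: str
    lang2: Optional[str]  # None for monolingual datasets

    @property
    def is_monolingual(self) -> bool:
        return self.lang2 is None


def parse_dataset_id(text: str) -> DatasetID:
    """Split an ID into its fields (assumes fields contain no hyphens)."""
    parts = text.split("-")
    if len(parts) == 5:  # Group-name-version-lang1-lang2 (bitext)
        return DatasetID(*parts)
    if len(parts) == 4:  # Group-name-version-lang (monolingual)
        return DatasetID(*parts, None)
    raise ValueError(f"Expected 4 or 5 hyphen-separated fields: {text!r}")
```

For example, `parse_dataset_id("Statmt-commoncrawl-wmt19-fra-deu")` yields a bitext ID with `lang1="fra"` and `lang2="deu"`, while a four-field ID is treated as monolingual.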

Skipped 0.3.9: the changes were significant enough to warrant a bump from 0.3.x to 0.4.x.

0.3.8 - log level, progress bar, refresh OPUS and ELRC; stats

25 Nov 03:31

  • CLI arg --log-level with default set to WARNING
  • progressbar can be disabled from CLI --no-pbar; default is enabled --pbar
  • mtdata stats --quick does HTTP HEAD and shows content length; e.g. mtdata stats --quick Statmt-commoncrawl-wmt19-fra-deu
  • python -m mtdata.scripts.recipe_stats to read stats from output directory
  • Security fix with tar extract | Thanks @TrellixVulnTeam
  • Added NLLB datasets prepared by AllenAI | Thanks @AlexUmnov
  • Opus and ELRC datasets update | Thanks @ZenBel
  • Default for fail_on_error is now true: mtdata get returns a non-zero exit code on error. Pass the --no-fail flag to ignore errors
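
The idea behind the quick stats item (an HTTP HEAD request that reads the content length without downloading the body) can be sketched with the standard library alone. This is not how mtdata implements it: `head_request`, `parse_content_length`, and `remote_size` are hypothetical helper names.

```python
import urllib.request
from typing import Mapping, Optional


def head_request(url: str) -> urllib.request.Request:
    """Build a HEAD request, so only headers are transferred."""
    return urllib.request.Request(url, method="HEAD")


def parse_content_length(headers: Mapping[str, str]) -> Optional[int]:
    """Return Content-Length as an int, or None if missing or malformed."""
    value = headers.get("Content-Length")
    if value is not None and value.isdigit():
        return int(value)
    return None


def remote_size(url: str, timeout: float = 10.0) -> Optional[int]:
    """Ask the server for the size of a resource without fetching it."""
    with urllib.request.urlopen(head_request(url), timeout=timeout) as resp:
        return parse_content_length(resp.headers)
```

Some servers omit `Content-Length` on HEAD responses, which is why the sketch returns `None` rather than raising in that case.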

0.3.7

11 Jul 20:43

Update ELRC data, including EU acts, which are used for WMT22 (thanks @kpu)

v0.3.6 : fixes and additions for wmt22

08 Jul 22:37

  • Fixed KECL-JParaCrawl
  • added Paracrawl bonus for ukr-eng
  • added Yandex rus-eng corpus
  • added Yakut sah-eng
  • update recipe for wmt22 constrained eval

disable JW300; add WMT22 recipes; auto generate references.bib

11 Mar 03:20

  • Parallel download support -j/--n-jobs argument (with default 4)
  • Automatically create references.bib file based on datasets selected
  • Add histogram to web search interface (Thanks, @sgowdaks)
  • ELRC index updates; (Thanks @kpu)
  • Update OPUS index. Use OPUS API to download all datasets
    • A lot of new datasets added.
    • WARNING: Some OPUS IDs are not backward compatible (version number mismatch)
  • Fix: JESC dataset language IDs were wrong
  • New datasets:
    • jpn-eng: add paracrawl v3, and wmt19 TED
    • backtranslation datasets for en2ru ru2en en2ru
  • Option to set MTDATA_RECIPES dir (default is $PWD). All files matching the glob ${MTDATA_RECIPES}/mtdata.recipes*.yml are loaded
  • WMT22 recipes added
  • JW300 is disabled #77
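
The recipe lookup described above (MTDATA_RECIPES, falling back to the current directory, combined with the glob `mtdata.recipes*.yml`) might look roughly like the sketch below. `find_recipe_files` is a hypothetical helper, not mtdata's API, and the sorted return order is an assumption.

```python
import glob
import os
from typing import List, Mapping, Optional


def find_recipe_files(env: Optional[Mapping[str, str]] = None) -> List[str]:
    """List recipe files under MTDATA_RECIPES (default: current directory)."""
    env = os.environ if env is None else env
    recipes_dir = env.get("MTDATA_RECIPES") or os.getcwd()
    # Escape the directory part so only the file-name pattern is a glob.
    pattern = os.path.join(glob.escape(recipes_dir), "mtdata.recipes*.yml")
    return sorted(glob.glob(pattern))  # sorting is an assumption
```

A file such as `mtdata.recipes.wmt22.yml` in that directory would match the glob, while `other.yml` would not.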

v0.3.3

28 Jan 06:58

  • Bug fix: XML reading inside tar: ElementTree complained about TarPath
  • mtdata list has -g/--groups and -ng/--not-groups as include/exclude filters on group name | closes #91
  • mtdata list has -id/--id flag to print only dataset IDs | closes #91
  • add WMT21 tests | closes #90
  • add ccaligned datasets wmt21 | closes #89
  • add ParIce datasets | closes #88
  • add wmt21 en-ha | closes #87
  • add wmt21 wikititles v3 | closes #86
  • Add train and test sets from StanfordNLP NMT page (large: en-cs, medium: en-de, small: en-vi) | closes #84
    • Add support for two URLs for a single dataset (i.e. without zip/tar files)
  • Fixed a language match bug #92 / #93
  • Fix: language compatibility checks; Closes #94
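
The include/exclude group filters added to mtdata list can be sketched as follows. This is only an illustration, not the tool's code: `filter_by_group` is a hypothetical function, the group is assumed to be the first hyphen-separated field of a dataset ID, and case-insensitive matching is also an assumption.

```python
from typing import Iterable, Iterator, Optional


def filter_by_group(dataset_ids: Iterable[str],
                    groups: Optional[Iterable[str]] = None,
                    not_groups: Optional[Iterable[str]] = None) -> Iterator[str]:
    """Keep IDs whose group is in `groups` and not in `not_groups`."""
    include = {g.lower() for g in groups} if groups else None
    exclude = {g.lower() for g in not_groups} if not_groups else set()
    for did in dataset_ids:
        # Assumption: the group is the first hyphen-separated field.
        group = did.split("-", 1)[0].lower()
        if include is not None and group not in include:
            continue
        if group in exclude:
            continue
        yield did
```

When neither filter is given, every ID passes through unchanged.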

v0.3.2 - 20211205

06 Dec 17:41
ca6615a

  • Fix: recipes.yml is missing in the pip installed package
  • Add Project Anuvaad: 196 datasets belonging to Indian languages
  • CLI: mtdata get now has --fail / --no-fail arguments to control whether it crashes on errors
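
A paired --fail / --no-fail switch of this kind can be built with argparse's `BooleanOptionalAction` (Python 3.9+). The sketch below is a stand-in, not mtdata's actual parser: `build_parser` and `exit_code` are hypothetical, and the default of `True` is taken from the later 0.3.8 note, so it may not match this release.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical stand-in for the `mtdata get` argument parser."""
    parser = argparse.ArgumentParser(prog="mtdata-get-sketch")
    # BooleanOptionalAction creates both --fail and --no-fail.
    parser.add_argument("--fail", dest="fail_on_error", default=True,
                        action=argparse.BooleanOptionalAction,
                        help="crash (non-zero exit) when a download fails")
    return parser


def exit_code(n_errors: int, fail_on_error: bool) -> int:
    """Non-zero only when errors occurred and failing is enabled."""
    return 1 if (n_errors > 0 and fail_on_error) else 0
```

Parsing `["--no-fail"]` sets `fail_on_error` to `False`, so errors would be reported but would not change the exit status.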