This repository contains all of the released PyMUSAS models, and the following documentation:
- Model introduction: useful context for understanding the model naming conventions.
- Model naming conventions
- Overview of the models: useful for comparing the models.
- Issues / bug reports / improving the models
- Development
All of the models are released through GitHub releases as `.whl` and `.tar.gz` files. Each model release contains more detailed information about the model than the naming convention conveys, for example the lexicons used and their URLs.
The models are a mix of rule based taggers and neural taggers, and all output USAS semantic categories at the token level. The rule based taggers rely on lexicons and lexical information to classify each token into semantic categories. The lexicons used in these models all come from the Multilingual USAS GitHub repository. The lexical information used is the text, lemma, and Part Of Speech (POS) tag of the token; this information is used to find the correct entry in the given lexicon(s). Note that not all of this lexical information is required, but the more of it that is available, the more accurate the tagger is likely to be.
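To make the lookup process concrete, here is a minimal sketch of the idea (this is not the actual PyMUSAS API; the lexicon contents and the `lookup` helper are invented for illustration): entries are keyed on the token's surface form and POS, falling back to the lemma when the surface form is not found, which is why richer lexical information improves accuracy.

```python
# Toy single word lexicon: (word form, POS) -> list of USAS semantic categories.
# The entries below are illustrative, not taken from a real USAS lexicon.
TOY_LEXICON = {
    ("bank", "noun"): ["I1.1", "W3"],
    ("run", "verb"): ["M1"],
}

def lookup(text: str, lemma: str, pos: str) -> list:
    """Return USAS categories for a token, trying the surface form first,
    then the lemma; Z99 is the USAS 'unmatched' category."""
    for key in ((text.lower(), pos), (lemma.lower(), pos)):
        if key in TOY_LEXICON:
            return TOY_LEXICON[key]
    return ["Z99"]

print(lookup("Bank", "bank", "noun"))    # matched on the surface form
print(lookup("running", "run", "verb"))  # matched on the lemma
print(lookup("flibber", "flibber", "noun"))  # no match -> Z99
```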
If the model uses a Multi Word Expression (MWE) lexicon then the tagger can identify MWEs and their associated semantic categories. Furthermore, these lexicons can be more than just lookup tables: they can contain a pattern matching syntax, which is described in more detail in these notes. In addition, the POS tagset used in these lexicons can differ from the tagset of the lexical information, so POS mappers are used to map from the lexical POS tagset (most likely determined by the POS tagger used on the text) to the lexicon POS tagset.
As a token (or several tokens, in the MWE case) can match multiple lexicon entries, the rule based tagger uses a ranking system to determine the best match for each token.
For more detailed information on the rule based tagger go to the following PyMUSAS API documentation page.
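The following sketch illustrates the general idea of ranking candidate matches. It is not the real ContextualRuleBasedRanker (whose heuristics are more involved and context aware); the `Match` fields and the ordering criteria here are simplified assumptions, e.g. preferring longer MWE matches over single token matches.

```python
from dataclasses import dataclass, field

@dataclass
class Match:
    """One candidate lexicon entry match for a token (illustrative only)."""
    categories: list = field(default_factory=list)  # USAS categories assigned
    token_span: int = 1       # number of tokens covered; MWE matches cover > 1
    pos_matched: bool = True  # matched via the POS tag rather than a wildcard

def rank(matches):
    """Prefer longer (MWE) matches, then matches found via the POS tag."""
    return sorted(matches, key=lambda m: (m.token_span, m.pos_matched), reverse=True)

candidates = [
    Match(categories=["A1"], token_span=1),
    Match(categories=["T1.3"], token_span=2),  # an MWE match covering 2 tokens
]
best = rank(candidates)[0]
print(best.categories)  # the MWE match wins under this heuristic
```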
The neural taggers use one of the neural models found in the following HuggingFace collection. They only require the text to be tokenized, so they do not need lemma or POS information. In addition, some of the models are highly multilingual, covering many languages. NOTE: the performance of these models varies a lot between languages, so please look at the model performance before choosing a model.
We expect all model packages to follow the naming convention [lang]_[name], where lang is the BCP 47 code of the language, a similar convention to the one spaCy uses. The name is then split into:
- rules used:
  - single: Only a single word lexicon is used.
  - dual: Both a single word and a Multi Word Expression lexicon are used.
  - none: No rules are used (typically a neural only model).
- POS mapper used to map the POS tagset of the tagged text to the POS tagset used in the lexicons of the rule based tagger:
  - upos2usas: Maps UPOS tagged text to the USAS core tagset of the lexicons.
  - basiccorcencc2usas: Maps Basic CorCenCC tagged text to the USAS core tagset of the lexicons.
  - none: No POS mapper is used.
- ranker used to determine the best lexicon entry match for each token:
  - contextual: Uses the ContextualRuleBasedRanker, which ranks based on heuristic rules and then finds the best lexicon match for each token taking into account all other tokens in the text. For more details see the ContextualRuleBasedRanker documentation.
  - none: Does not use a ranker (typically a neural only model).
- neural model used:
  - englishsmallbem: Uses the ucrelnlp/PyMUSAS-Neural-English-Small-BEM neural model, which is 17 million parameters in size.
  - englishbasebem: Uses the ucrelnlp/PyMUSAS-Neural-English-Base-BEM neural model, which is 68 million parameters in size.
  - multilingualsmallbem: Uses the ucrelnlp/PyMUSAS-Neural-Multilingual-Small-BEM neural model, which is 140 million parameters in size.
  - multilingualbasebem: Uses the ucrelnlp/PyMUSAS-Neural-Multilingual-Base-BEM neural model, which is 307 million parameters in size.
  - none: Does not use a neural model.
For example, cy_single_basiccorcencc2usas_contextual_none is a Welsh single word lexicon model that maps the tagged text POS labels from Basic CorCenCC tagset to the USAS core tagset to be compatible with the lexicons used in this rule based tagger and uses the contextual ranker.
en_none_none_none_englishsmallbem is an English model that uses only the Small English BEM neural model (ucrelnlp/PyMUSAS-Neural-English-Small-BEM).
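The convention above can be unpacked mechanically. The helper below is hypothetical (not part of PyMUSAS), but shows how a model package name decomposes into its five components:

```python
def parse_model_name(name: str) -> dict:
    """Split a model package name of the form
    [lang]_[rules]_[posmapper]_[ranker]_[neural] into its components."""
    lang, rules, pos_mapper, ranker, neural = name.split("_")
    return {
        "lang": lang,              # BCP 47 language code
        "rules": rules,            # single / dual / none
        "pos_mapper": pos_mapper,  # upos2usas / basiccorcencc2usas / none
        "ranker": ranker,          # contextual / none
        "neural": neural,          # neural model identifier, or none
    }

print(parse_model_name("cy_single_basiccorcencc2usas_contextual_none"))
print(parse_model_name("en_none_none_none_englishsmallbem"))
```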
Similar to the spaCy models, our model versioning reflects compatibility with PyMUSAS as well as the model version. A model version a.b.c translates to:
- a: PyMUSAS major version. For example, 0 for PyMUSAS v0.x.x.
- b: PyMUSAS minor version. For example, 2 for PyMUSAS v0.2.x.
- c: Model version. Different model configurations, for example using different or updated lexicons.
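Under this scheme, a model is compatible with a PyMUSAS release when the first two version components match PyMUSAS's major and minor version; the third component only tracks the model configuration. A hypothetical helper (not part of PyMUSAS) makes this explicit:

```python
def is_compatible(model_version: str, pymusas_version: str) -> bool:
    """True if a model versioned a.b.c targets the given PyMUSAS version:
    a.b must equal PyMUSAS's major.minor; c is ignored."""
    model_major, model_minor, _model_config = model_version.split(".")
    pymusas_major, pymusas_minor, _patch = pymusas_version.split(".")
    return (model_major, model_minor) == (pymusas_major, pymusas_minor)

print(is_compatible("0.2.1", "0.2.3"))  # True: model built for PyMUSAS v0.2.x
print(is_compatible("0.2.1", "0.3.0"))  # False: PyMUSAS minor version differs
```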
| Language (BCP 47 language code) | Model Name | MWE | POS Mapper | Ranker | Neural Model | File Size |
|---|---|---|---|---|---|---|
| Mandarin Chinese (cmn) | cmn_dual_upos2usas_contextual_none | ✔️ | UPOS 2 USAS | Contextual | ❌ | 1.28MB |
| Mandarin Chinese (cmn) | cmn_single_upos2usas_contextual_none | ❌ | UPOS 2 USAS | Contextual | ❌ | 1.00MB |
| Welsh (cy) | cy_dual_basiccorcencc2usas_contextual_none | ✔️ | Basic CorCenCC 2 USAS | Contextual | ❌ | 1.10MB |
| Welsh (cy) | cy_single_basiccorcencc2usas_contextual_none | ❌ | Basic CorCenCC 2 USAS | Contextual | ❌ | 1.09MB |
| Danish (da) | da_dual_none_contextual_none | ✔️ | None | Contextual | ❌ | 0.82MB |
| Danish (da) | da_single_none_contextual_none | ❌ | None | Contextual | ❌ | 0.63MB |
| English (en) | en_dual_none_contextual_none | ✔️ | None | Contextual | ❌ | 0.86MB |
| English (en) | en_none_none_none_englishsmallbem | ❌ | ❌ | ❌ | ucrelnlp/PyMUSAS-Neural-English-Small-BEM | 60.18MB |
| English (en) | en_none_none_none_englishbasebem | ❌ | ❌ | ❌ | ucrelnlp/PyMUSAS-Neural-English-Base-BEM | 242.06MB |
| English (en) | en_single_none_contextual_none | ❌ | None | Contextual | ❌ | 0.71MB |
| Spanish, Castilian (es) | es_dual_upos2usas_contextual_none | ✔️ | UPOS 2 USAS | Contextual | ❌ | 0.26MB |
| Spanish, Castilian (es) | es_single_upos2usas_contextual_none | ❌ | UPOS 2 USAS | Contextual | ❌ | 0.20MB |
| Finnish (fi) | fi_single_upos2usas_contextual_none | ❌ | UPOS 2 USAS | Contextual | ❌ | 0.64MB |
| French (fr) | fr_single_upos2usas_contextual_none | ❌ | UPOS 2 USAS | Contextual | ❌ | 0.08MB |
| Indonesian (id) | id_single_none_contextual_none | ❌ | None | Contextual | ❌ | 0.24MB |
| Italian (it) | it_dual_upos2usas_contextual_none | ✔️ | UPOS 2 USAS | Contextual | ❌ | 0.50MB |
| Italian (it) | it_single_upos2usas_contextual_none | ❌ | UPOS 2 USAS | Contextual | ❌ | 0.42MB |
| Dutch, Flemish (nl) | nl_single_upos2usas_contextual_none | ❌ | UPOS 2 USAS | Contextual | ❌ | 0.15MB |
| Portuguese (pt) | pt_dual_upos2usas_contextual_none | ✔️ | UPOS 2 USAS | Contextual | ❌ | 0.27MB |
| Portuguese (pt) | pt_single_upos2usas_contextual_none | ❌ | UPOS 2 USAS | Contextual | ❌ | 0.25MB |
| Multilingual (xx) | xx_none_none_none_multilingualsmallbem | ❌ | ❌ | ❌ | ucrelnlp/PyMUSAS-Neural-Multilingual-Small-BEM | 501.45MB |
| Multilingual (xx) | xx_none_none_none_multilingualbasebem | ❌ | ❌ | ❌ | ucrelnlp/PyMUSAS-Neural-Multilingual-Base-BEM | 1089.74MB |
- MWE -- ✔️ means that the model supports identification and tagging of Multi Word Expressions.
The rule based models are not statistical, but they are still error prone, as no set of rules can cover every situation, and in some cases full coverage is not possible. If you are finding a lot of mis-classified tokens, please do file a report on the PyMUSAS issue tracker so that we can try to improve the model. Thank you in advance for your support.
If you are contributing to this repository, please see the CONTRIBUTING.md file to learn how to contribute and to learn more about this repository in general.
The contents of this README are heavily based on the spaCy models repository README; many thanks for writing that great README.