USAS-CSV-Auto-Labeling

Tool that annotates data with USAS labels for human verification in Excel format (CSV format will be added at a later point).

Currently the tool only supports English, Spanish, Danish, Dutch, Hindi, and Igbo, and it requires the data to be in a specific format.

Before you can run the tool, please follow the Setup guide and the Models to install section.

Setup

You can either use the dev container with your favourite editor, e.g. VSCode, or create your setup locally. Below we demonstrate both.

Both approaches share the same tools:

  • uv for Python packaging and development
  • make (OPTIONAL) for automation of tasks, not strictly required but makes life easier.

Dev Container

A dev container uses a docker container to create the required development environment. The Dockerfile we use for this dev container can be found at ./.devcontainer/Dockerfile. Running it locally requires docker to be installed. You can also run it in a cloud based code editor; for a list of supported editors/cloud editors see the following webpage.

To run for the first time on a local VSCode editor (a slightly more detailed guide is available on the VSCode website):

  1. Ensure docker is running.
  2. Ensure the VSCode Dev Containers extension is installed in your VSCode editor.
  3. Open the command palette (CMD + SHIFT + P) and then select Dev Containers: Rebuild and Reopen in Container.

You should now have everything you need to develop: uv, make, and, for VSCode, various extensions such as Pylance.

If you have any trouble, see the VSCode website.

Local

To run locally, first ensure you have the following tools installed:

  • uv for Python packaging and development. (version 0.9.6)
  • make (OPTIONAL) for automation of tasks, not strictly required but makes life easier.
    • Ubuntu: apt-get install make
    • Mac: the Xcode command line tools include make; otherwise you can use brew.
    • Windows: various solutions are proposed in this blog post on how to install make on Windows, including Cygwin and Windows Subsystem for Linux.

When developing on the project you will want to install the Python package locally in editable format with all the extra requirements. This can be done like so:

uv sync --all-extras

Linting

Linting and formatting are done with ruff, a replacement for tools like Flake8, isort, Black etc., and we use ty for type checking.

To run the linting:

make lint

Tests

To run the tests (uses pytest and coverage) and generate a coverage report:

make test

Models to install

Note

Only download models for languages that you are tagging text for.

The models_install.py script can install the required models and lexicons either for all languages or only for the languages you select. With the --describe flag it lists the models that would be installed and then exits. The script's options are shown in its help output:

uv run models_install.py --help

 Usage: models_install.py [OPTIONS]                                                                                                                                                                                                                                                                                                                                                                                                                                          
                                                                                                                                                                                                                                                                                                                                                                                                                                                                             
 Install the language specific models. You can either select the languages you want to install or use the --all flag to install all language specific models.                                                                                                                                                                                                                                                                                                                
                                                                                                                                                                                                                                                                                                                                                                                                                                                                             
 If you want to describe the models that will be installed use the --describe flag.                                                                                                                                                                                                                                                                                                                                                                                          
                                                                                                                                                                                                                                                                                                                                                                                                                                                                             
 Example:                                                                                                                                                                                                                                                                                                                                                                                                                                                                    
                                                                                                                                                                                                                                                                                                                                                                                                                                                                             
 To install all language specific models run:                                                                                                                                                                                                                                                                                                                                                                                                                                
 python models_install.py --all                                                                                                                                                                                                                                                                                                                                                                                                                                              
                                                                                                                                                                                                                                                                                                                                                                                                                                                                             
 To install only the English and Dutch language specific models run:                                                                                                                                                                                                                                                                                                                                                                                                         
 python models_install.py -l English -l Dutch                                                                                                                                                                                                                                                                                                                                                                                                                                
                                                                                                                                                                                                                                                                                                                                                                                                                                                                             
 To describe the models that will be installed run:                                                                                                                                                                                                                                                                                                                                                                                                                          
 python models_install.py --describe                                                                                                                                                                                                                                                                                                                                                                                                                                         
                                                                                                                                                                                                                                                                                                                                                                                                                                                                             
 To describe English specific models:                                                                                                                                                                                                                                                                                                                                                                                                                                        
 python models_install.py -l English --describe                                                                                                                                                                                                                                                                                                                                                                                                                              
                                                                                                                                                                                                                                                                                                                                                                                                                                                                             
╭─ Options ─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
│ --languages           -l      [English|Danish|Dutch|Spanish|Hindi|Igbo]  Install the language specific models for the given languages.                                                                                                                                                                                                                                                                                                                                    │
│ --all                 -a                                                 Install all language specific models.                                                                                                                                                                                                                                                                                                                                                            │
│ --describe            -d                                                 Describe the models that will be installed and exit.                                                                                                                                                                                                                                                                                                                                             │
│ --install-completion                                                     Install completion for the current shell.                                                                                                                                                                                                                                                                                                                                                        │
│ --show-completion                                                        Show completion for the current shell, to copy it or customize the installation.                                                                                                                                                                                                                                                                                                                 │
│ --help                                                                   Show this message and exit.                                                                                                                                                                                                                                                                                                                                                                      │
╰───

The following will download all of the resources (lexicons and neural models) to run the Hybrid USAS tagger and Neural USAS tagger and the relevant Stanza models for each language:

uv run models_install.py --all
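
If you want to drive the installer from another Python script, the command lines shown in the help above can be assembled programmatically. The following is a minimal sketch, not part of the repository: the function name `build_install_command` is hypothetical, and it only produces flag combinations that appear in the help text (`--all`, `-l`, and `--describe`).

```python
# Language names exactly as listed for the --languages option in the help output.
SUPPORTED_LANGUAGES = ("English", "Danish", "Dutch", "Spanish", "Hindi", "Igbo")


def build_install_command(languages=None, describe=False):
    """Return the argument list for a models_install.py invocation.

    `build_install_command` is a hypothetical helper for illustration only.
    With no languages: `--all` to install everything, or a bare `--describe`
    to describe everything. With languages: one `-l` flag per language.
    """
    command = ["uv", "run", "models_install.py"]
    if languages:
        for language in languages:
            if language not in SUPPORTED_LANGUAGES:
                raise ValueError(f"Unsupported language: {language}")
            command.extend(["-l", language])
    elif not describe:
        command.append("--all")
    if describe:
        command.append("--describe")
    return command


if __name__ == "__main__":
    # Print the command rather than running it, so nothing is downloaded here.
    print(" ".join(build_install_command(["English", "Dutch"], describe=True)))
```

The returned list can be passed to `subprocess.run(command, check=True)` from the repository root once you are happy with what `--describe` reports.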

The following languages require Stanza models to tokenise, lemmatise, and Part Of Speech (POS) tag the data. Below we state the language and the license of the model:

Tools

Tag data with USAS labels into Excel format

This tool tags all text files in a given directory (following the format specified in the help shown below) taking into account the language of the text file and outputs an Excel spreadsheet per text file in a given output directory. The Excel spreadsheet will allow annotators to correct the USAS tags and Multi Word Expression (MWE) indexes produced by the USAS tagger, allowing you to create a Gold labelled USAS tagged and MWE indexed dataset that can be used for evaluating and/or training a USAS tagger on the data of your choice.
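
Once annotators have filled in the spreadsheet, the corrected and predicted columns need to be combined into a single gold label per token. The sketch below shows one way this could be done on rows that have already been loaded as dictionaries keyed by the column names listed in the tool's help output. It is not part of the repository: the helper name `gold_labels` is hypothetical, the example rows are invented, and it assumes that an empty `corrected` cell means the annotator accepted the prediction. Check that assumption against your own annotation guidelines.

```python
def gold_labels(row):
    """Return the gold USAS tag and MWE index for one spreadsheet row.

    Hypothetical helper. ASSUMPTION: an empty or missing `corrected` cell
    means the annotator accepted the tagger's prediction.
    """

    def pick(corrected_key, predicted_key):
        corrected = row.get(corrected_key)
        if corrected is None or str(corrected).strip() == "":
            return row[predicted_key]
        return corrected

    return {
        "id": row["id"],
        "token": row["token"],
        "USAS": pick("corrected USAS", "predicted USAS"),
        "MWE": pick("corrected MWE", "predicted MWE"),
    }


if __name__ == "__main__":
    # Invented example rows; the tag and index values are illustrative only.
    rows = [
        {"id": "english|Example|0|0", "token": "The",
         "predicted USAS": "Z5", "predicted MWE": 0,
         "corrected USAS": "", "corrected MWE": ""},
        {"id": "english|Example|0|1", "token": "cat",
         "predicted USAS": "Z99", "predicted MWE": 0,
         "corrected USAS": "L2", "corrected MWE": ""},
    ]
    for row in rows:
        print(gold_labels(row))
```

Loading the rows from the Excel file itself is left out here, since it depends on the spreadsheet library you choose.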

Note

Remember to download the relevant language specific models; see the Models to install section if you need to install any models.

Below is the help guide for the tool:

uv run tag_data_to_excel.py --help
 Usage: tag_data_to_excel.py [OPTIONS] DATA_PATH OUTPUT_PATH                                                                                                                                                                                                                                                                                                                                                                          
                                                                                                                                                                                                                                                                                                                                                                                                                                      
 Tag all of the files in the given data directory (`data_path`) with pre downloaded language taggers and write the results to the given output directory (`output_path`), in the same file structure as the data directory, in excel format.                                                                                                                                                                                          
                                                                                                                                                                                                                                                                                                                                                                                                                                      
 The Excel file has the following columns:                                                                                                                                                                                                                                                                                                                                                                                            
                                                                                                                                                                                                                                                                                                                                                                                                                                      
 | id | sentence id | token id | token | lemma | POS | predicted USAS | predicted MWE | corrected USAS | corrected MWE |                                                                                                                                                                                                                                                                                                              
                                                                                                                                                                                                                                                                                                                                                                                                                                      
 whereby all but the `corrected` columns are filled in by the taggers.                                                                                                                                                                                                                                                                                                                                                                
 The `id` is in the following format `{language}|{wikipedia_article_name}|{sentence_id}|{token_id}`                                                                                                                                                                                                                                                                                                                                   
                                                                                                                                                                                                                                                                                                                                                                                                                                      
 The data directory file structure should be as follows:                                                                                                                                                                                                                                                                                                                                                                              
                                                                                                                                                                                                                                                                                                                                                                                                                                      
 data_path                                                                                                                                                                                                                                                                                                                                                                                                                            
 |                                                                                                                                                                                                                                                                                                                                                                                                                                    
 |__ language                                                                                                                                                                                                                                                                                                                                                                                                                         
 |   |                                                                                                                                                                                                                                                                                                                                                                                                                                
 |   |__ wikipedia_article_name                                                                                                                                                                                                                                                                                                                                                                                                       
 |   |   |                                                                                                                                                                                                                                                                                                                                                                                                                            
 |   |   |__ file_name.txt                                                                                                                                                                                                                                                                                                                                                                                                            
                                                                                                                                                                                                                                                                                                                                                                                                                                      
 Whereby the `language` is used to determine which tagger to use and both                                                                                                                                                                                                                                                                                                                                                             
 the `language` and `wikipedia_article_name` are added to the ID of each token                                                                                                                                                                                                                                                                                                                                                        
 tagged and written to the excel output file.                                                                                                                                                                                                                                                                                                                                                                                         
                                                                                                                                                                                                                                                                                                                                                                                                                                      
 Languages supported:                                                                                                                                                                                                                                                                                                                                                                                                                 
 * english                                                                                                                                                                                                                                                                                                                                                                                                                            
 * dutch                                                                                                                                                                                                                                                                                                                                                                                                                              
 * spanish                                                                                                                                                                                                                                                                                                                                                                                                                            
 * danish                                                                                                                                                                                                                                                                                                                                                                                                                             
 * hindi                                                                                                                                                                                                                                                                                                                                                                                                                              
 * igbo                                                                                                                                                                                                                                                                                                                                                                                                                               
                                                                                                                                                                                                                                                                                                                                                                                                                                      
╭─ Arguments ────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
│ *    data_path        DIRECTORY  Path to the data directory [required]                                                                                                                                                                                                                                                                                                                                                             │
│ *    output_path      PATH       Path to the output directory [required]                                                                                                                                                                                                                                                                                                                                                           │
╰────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
╭─ Options ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
│ --verbose-logging    --no-verbose-logging      Print verbose logging [default: no-verbose-logging]                                                                                                                                                                                                                                                                                                                                 │
│ --overwrite          --no-overwrite            If the output path exists overwrite all files in it [default: no-overwrite]                                                                                                                                                                                                                                                                                                         │
│ --help                                         Show this message and exit.                                                                                                                                                                                                                                                                                                                                                         │
╰────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
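
The directory layout and `id` format described in the help text above can be sketched in Python as follows. This is an illustration only and not code from the repository: the helper names `add_text_file` and `make_token_id`, the article name, and the file name are all hypothetical, and the sentence and token numbers are passed in as plain integers because the help text does not say how they are counted.

```python
import tempfile
from pathlib import Path


def add_text_file(data_path, language, article_name, file_name, text):
    """Write a text file at data_path/language/article_name/file_name.

    Hypothetical helper that mirrors the layout shown in the help text.
    """
    article_dir = Path(data_path) / language / article_name
    article_dir.mkdir(parents=True, exist_ok=True)
    text_file = article_dir / file_name
    text_file.write_text(text, encoding="utf-8")
    return text_file


def make_token_id(language, article_name, sentence_id, token_id):
    """Format an id as {language}|{wikipedia_article_name}|{sentence_id}|{token_id}."""
    return f"{language}|{article_name}|{sentence_id}|{token_id}"


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        data_path = Path(tmp) / "data_path"
        # Language directory names follow the lower-case list in the help text.
        created = add_text_file(
            data_path, "english", "Example_Article", "file_name.txt", "Some text."
        )
        print(created.relative_to(tmp))
        print(make_token_id("english", "Example_Article", 0, 3))
```

A directory built this way can then be passed as `DATA_PATH` to `uv run tag_data_to_excel.py DATA_PATH OUTPUT_PATH`.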
