
wikipedia-USAS-processing

We are going to stream Wikipedia data, for all languages listed in the languages section, from HuggingFaceFW finewiki. As the data is in markdown format, we are going to perform some pre-processing, as outlined in the pre-processing section, so that the data is text only with no additional formatting. Before pre-processing any of the data, we are going to filter it as outlined in the filtering section, which should drastically reduce the number of articles. Once the data has been filtered and pre-processed, we are going to save it.

Setup

You can either use the dev container with your favourite editor (e.g. VSCode), or you can create your setup locally; below we demonstrate both.

In both cases the same tools are used:

  • uv for Python packaging and development
  • make (OPTIONAL) for automation of tasks, not strictly required but makes life easier.

Dev Container

A dev container uses a Docker container to create the required development environment; the Dockerfile we use for this dev container can be found at ./.devcontainer/Dockerfile. Running it locally requires Docker to be installed, but you can also run it in a cloud-based code editor; for a list of supported editors/cloud editors see the following webpage.

To run for the first time on a local VSCode editor (a slightly more detailed and better guide on the VSCode website):

  1. Ensure docker is running.
  2. Ensure the VSCode Dev Containers extension is installed in your VSCode editor.
  3. Open the command palette (CMD + SHIFT + P) and then select Dev Containers: Rebuild and Reopen in Container.

You should now have everything you need to develop: uv, make, and, for VSCode, various extensions like Pylance.

If you have any trouble, see the VSCode website.

Local

To run locally, first ensure you have the following tools installed locally:

  • uv for Python packaging and development. (version 0.9.6)
  • make (OPTIONAL) for automation of tasks, not strictly required but makes life easier.
    • Ubuntu: apt-get install make
    • Mac: Xcode command line tools includes make else you can use brew.
    • Windows: various solutions are proposed in this blog post on how to install make on Windows, including Cygwin and Windows Subsystem for Linux.

When developing on the project you will want to install the Python package locally in editable format with all the extra requirements; this can be done like so:

uv sync

Linting

Linting and formatting are done with ruff, a replacement for tools like Flake8, isort, and Black, and we use ty for type checking.

To run the linting:

make lint

Tests

To run the tests (uses pytest and coverage) and generate a coverage report:

make test

Data

Any data that is required to be saved locally will be saved within the ./data folder.

Languages

| Language   | ISO 639-3 |
| ---------- | --------- |
| English    | eng       |
| Dutch      | nld       |
| Spanish    | spa       |
| Hindi      | hin       |
| Danish     | dan       |
| Korean     | kor       |
| Italian    | ita       |
| Portuguese | por       |
| Chinese    | zho       |
| Finnish    | fin       |
| Irish      | gle       |
| Welsh      | cym       |

Filtering

Before Pre-Processing

All articles have to be rated as either "Good Articles" (GA) or "Featured Articles" (FA) by an editor; we hope that this will remove articles that might be incomplete or require additional editing. This filtering is inspired by Conia et al. 2024, who found that training on data from only "featured" and "good" articles performed similarly to training on the far larger set of Wikipedia articles that also contained non-good and non-featured articles, showing that training on smaller amounts of data is as effective and more efficient. "Featured" and "Good" articles can be defined differently for each Wikipedia language site, as stated in the English site definition within the following article.

To get the metadata on which articles are "Good" or "Featured" we will use the "page_props" table; the data for English from the 1st of March 2026 can be downloaded from https://dumps.wikimedia.org/enwiki/20260301/enwiki-20260301-page_props.sql.gz (following this link will start the download).

The following SQL, generated by Claude (an AI), might be of use:

-- Featured Articles
SELECT pp_page FROM page_props WHERE pp_propname = 'featured_article';

-- Good Articles
SELECT pp_page FROM page_props WHERE pp_propname = 'good_article';
-- Page titles for both Featured and Good Articles
SELECT p.page_title, pp.pp_propname
FROM page_props pp
JOIN page p ON pp.pp_page = p.page_id
WHERE pp.pp_propname IN ('featured_article', 'good_article');
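If loading the dump into a MySQL server is not desirable, the page IDs can also be pulled straight from the compressed SQL dump. The sketch below is a minimal, untested-against-a-real-dump approach: it assumes the dump's INSERT statements list rows as tuples whose first two fields are `pp_page` and a quoted `pp_propname` (matching the page_props schema), which should be verified against an actual dump file.

```python
import gzip
import re

# Matches the start of a row tuple in an INSERT statement, e.g.
# (12345,'good_article', ... ). The exact row layout is an assumption
# based on the page_props schema (pp_page, pp_propname, pp_value, ...).
ROW_PATTERN = re.compile(r"\((\d+),'(featured_article|good_article)'")

def rated_page_ids(dump_path):
    """Yield (page_id, rating) pairs from a gzipped page_props SQL dump."""
    with gzip.open(dump_path, "rt", encoding="utf-8", errors="replace") as dump:
        for line in dump:
            # Data rows in MediaWiki dumps live in long INSERT statements.
            if not line.startswith("INSERT INTO"):
                continue
            for page_id, rating in ROW_PATTERN.findall(line):
                yield int(page_id), rating
```

The resulting page IDs can then be matched against the articles streamed from finewiki to keep only "Good" and "Featured" pages.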

After Pre-Processing

If an article's text is less than 200 words, it will be removed.
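This length filter can be sketched as a simple whitespace-based word count; note that a naive count like this under-counts languages written without spaces (e.g. Chinese), so a smarter tokeniser may be needed for some of the languages listed above.

```python
MIN_WORDS = 200  # threshold from the filtering rule above

def passes_length_filter(text: str, min_words: int = MIN_WORDS) -> bool:
    """Keep an article only if it has at least `min_words`
    whitespace-separated tokens."""
    return len(text.split()) >= min_words
```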

Pre-Processing

We will remove the following:

  1. Titles and Headers
  2. Tables
  3. Equations

Once those have been removed we will remove the markdown formatting.
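The steps above can be sketched with a few line-based and regex rules; this is an illustrative assumption about the markdown found in finewiki (ATX `#` headers, `|`-prefixed table rows, `$$`-delimited equations), not the project's final implementation.

```python
import re

def strip_markdown(text: str) -> str:
    """Rough sketch of the pre-processing steps: drop headers, tables,
    and equations, then strip the remaining markdown formatting."""
    lines = []
    in_equation = False
    for line in text.splitlines():
        stripped = line.strip()
        if stripped == "$$":            # 3. display-equation delimiters
            in_equation = not in_equation
            continue
        if in_equation:                 # 3. display-equation body
            continue
        if stripped.startswith("#"):    # 1. titles and headers
            continue
        if stripped.startswith("|"):    # 2. table rows
            continue
        lines.append(line)
    plain = "\n".join(lines)
    plain = re.sub(r"\$[^$\n]+\$", "", plain)               # inline equations
    plain = re.sub(r"\[([^\]]+)\]\([^)]*\)", r"\1", plain)  # links -> link text
    plain = re.sub(r"[*_`]", "", plain)                     # emphasis/code markers
    return plain
```

A library such as a markdown parser may handle edge cases (nested formatting, HTML fragments) more robustly than regexes; the above only shows the intended effect.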

License

The code is licensed under Apache License Version 2.0.

About

Process data from Wikipedia to tag using USAS taggers
