
wikipedia-USAS-processing

We are going to stream Wikipedia data, for all languages listed in the languages section, from HuggingFaceFW finewiki. As the data is in markdown format, we are going to perform some pre-processing, as outlined in the pre-processing section, so that the data is text only with no additional formatting. Before pre-processing any of the data, we are going to filter it as outlined in the filtering section, which should drastically reduce the number of articles. Once the data has been filtered and pre-processed, we are going to save it.

Setup

You can either use the dev container with your favourite editor (e.g. VSCode), or you can create your setup locally; below we demonstrate both.

In both cases the same tools are used:

  • uv for Python packaging and development
  • make (OPTIONAL) for automation of tasks, not strictly required but makes life easier.

Dev Container

A dev container uses a Docker container to create the required development environment; the Dockerfile we use for this dev container can be found at ./.devcontainer/Dockerfile. Running it locally requires Docker to be installed, but you can also run it in a cloud-based code editor; for a list of supported editors/cloud editors see the following webpage.

To run for the first time on a local VSCode editor (a slightly more detailed and better guide on the VSCode website):

  1. Ensure docker is running.
  2. Ensure the VSCode Dev Containers extension is installed in your VSCode editor.
  3. Open the command palette (CMD + SHIFT + P) and then select Dev Containers: Rebuild and Reopen in Container.

You should now have everything you need to develop: uv, make, and, for VSCode, various extensions like Pylance.

If you have any trouble, see the VSCode website.

Local

To run locally, first ensure you have the following tools installed locally:

  • uv for Python packaging and development. (version 0.9.6)
  • make (OPTIONAL) for automation of tasks, not strictly required but makes life easier.
    • Ubuntu: apt-get install make
    • Mac: Xcode command line tools includes make else you can use brew.
    • Windows: various solutions are proposed in this blog post on how to install make on Windows, including Cygwin and Windows Subsystem for Linux.

When developing on the project you will want to install the Python package locally in editable format with all the extra requirements; this can be done like so:

uv sync

Linting

Linting and formatting are done with ruff, a replacement for tools like Flake8, isort, and Black, and we use ty for type checking.

To run the linting:

make lint

Tests

To run the tests (uses pytest and coverage) and generate a coverage report:

make test

Data

Any data that is required to be saved locally will be saved within the ./data folder.

Languages

| Language   | ISO 639-3 |
| ---------- | --------- |
| English    | eng       |
| Dutch      | nld       |
| Spanish    | spa       |
| Hindi      | hin       |
| Danish     | dan       |
| Korean     | kor       |
| Italian    | ita       |
| Portuguese | por       |
| Chinese    | zho       |
| Finnish    | fin       |
| Irish      | gle       |
| Welsh      | cym       |

Filtering

Before Pre-Processing

All articles have to be rated as either "Good Articles" (GA) or "Featured Articles" (FA) by an editor; we hope that this will remove articles that might be incomplete or require additional editing. This filtering is inspired by Conia et al. 2024, who found that training on data from only "featured" and "good" articles performed similarly to training on the far larger set of Wikipedia articles that also contained non-good and non-featured articles, showing that training on smaller amounts of data is as effective and more efficient. "Featured" and "Good" articles can be defined differently for each Wikipedia language site, as stated in the English site definition within the following article.

To get the metadata on which articles are "Good" or "Featured" we will use the "page_props" table; the data for English from the 1st of March 2026 can be downloaded from https://dumps.wikimedia.org/enwiki/20260301/enwiki-20260301-page_props.sql.gz (following this link will start the download).

The following SQL, generated by Claude (an AI), might be of use:

-- Featured Articles
SELECT pp_page FROM page_props WHERE pp_propname = 'featured_article';

-- Good Articles
SELECT pp_page FROM page_props WHERE pp_propname = 'good_article';
-- Page titles for both Featured and Good Articles
SELECT p.page_title, pp.pp_propname
FROM page_props pp
JOIN page p ON pp.pp_page = p.page_id
WHERE pp.pp_propname IN ('featured_article', 'good_article');
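If loading the dump into a MySQL server is not desirable, the page IDs can also be pulled straight from the compressed SQL dump. The sketch below is a minimal, untested-against-a-real-dump approach: it assumes the dump's INSERT statements list rows as tuples whose first two fields are `pp_page` and a quoted `pp_propname` (matching the page_props schema), which should be verified against an actual dump file.

```python
import gzip
import re

# Matches the start of a row tuple in an INSERT statement, e.g.
# (12345,'good_article', ... ). The exact row layout is an assumption
# based on the page_props schema (pp_page, pp_propname, pp_value, ...).
ROW_PATTERN = re.compile(r"\((\d+),'(featured_article|good_article)'")

def rated_page_ids(dump_path):
    """Yield (page_id, rating) pairs from a gzipped page_props SQL dump."""
    with gzip.open(dump_path, "rt", encoding="utf-8", errors="replace") as dump:
        for line in dump:
            # Data rows in MediaWiki dumps live in long INSERT statements.
            if not line.startswith("INSERT INTO"):
                continue
            for page_id, rating in ROW_PATTERN.findall(line):
                yield int(page_id), rating
```

The resulting page IDs can then be matched against the articles streamed from finewiki to keep only "Good" and "Featured" pages.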

After Pre-Processing

If an article's text is less than 200 words, it will be removed.
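This length filter can be sketched as a simple whitespace-based word count; note that a naive count like this under-counts languages written without spaces (e.g. Chinese), so a smarter tokeniser may be needed for some of the languages listed above.

```python
MIN_WORDS = 200  # threshold from the filtering rule above

def passes_length_filter(text: str, min_words: int = MIN_WORDS) -> bool:
    """Keep an article only if it has at least `min_words`
    whitespace-separated tokens."""
    return len(text.split()) >= min_words
```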

Pre-Processing

We will remove the following:

  1. Titles and Headers
  2. Tables
  3. Equations

Once those have been removed we will remove the markdown formatting.
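The steps above can be sketched with a few line-based and regex rules; this is an illustrative assumption about the markdown found in finewiki (ATX `#` headers, `|`-prefixed table rows, `$$`-delimited equations), not the project's final implementation.

```python
import re

def strip_markdown(text: str) -> str:
    """Rough sketch of the pre-processing steps: drop headers, tables,
    and equations, then strip the remaining markdown formatting."""
    lines = []
    in_equation = False
    for line in text.splitlines():
        stripped = line.strip()
        if stripped == "$$":            # 3. display-equation delimiters
            in_equation = not in_equation
            continue
        if in_equation:                 # 3. display-equation body
            continue
        if stripped.startswith("#"):    # 1. titles and headers
            continue
        if stripped.startswith("|"):    # 2. table rows
            continue
        lines.append(line)
    plain = "\n".join(lines)
    plain = re.sub(r"\$[^$\n]+\$", "", plain)               # inline equations
    plain = re.sub(r"\[([^\]]+)\]\([^)]*\)", r"\1", plain)  # links -> link text
    plain = re.sub(r"[*_`]", "", plain)                     # emphasis/code markers
    return plain
```

A library such as a markdown parser may handle edge cases (nested formatting, HTML fragments) more robustly than regexes; the above only shows the intended effect.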

License

The code is licensed under Apache License Version 2.0.

About

Process data from Wikipedia to tag using USAS taggers
