We are going to stream Wikipedia data from HuggingFaceFW finewiki for all languages listed in the languages section. As the data is in markdown format, we are going to perform some pre-processing, as outlined in the pre-processing section, so that the data is text only with no additional formatting. Before pre-processing any of the data we are going to filter it, as outlined in the filtering section, which should drastically reduce the number of articles. Once the data has been filtered and pre-processed we are going to save it to the ./data folder.
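A minimal sketch of the filter-then-process loop described above (the `text` field name, and the `datasets` streaming call shown in the comment, are assumptions about the dataset's schema; check the dataset card):

```python
def keep_article(record: dict, min_words: int = 200) -> bool:
    """Filtering step: drop any article whose text has fewer than min_words words."""
    return len(record.get("text", "").split()) >= min_words


def filtered_stream(records, min_words: int = 200):
    """Apply the filter lazily, so the full dump never sits in memory."""
    for record in records:
        if keep_article(record, min_words):
            yield record


# With the real dataset this would be driven by a streaming load, e.g.
# (dataset id and config names are assumptions -- check the dataset card):
#
#   from datasets import load_dataset
#   stream = load_dataset("HuggingFaceFW/finewiki", "eng", streaming=True, split="train")
#   for article in filtered_stream(stream):
#       ...  # pre-process the markdown, then save under ./data
```

Keeping the filter as a pure function over an iterable means it can be unit-tested without any network access.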
You can either use the dev container with your favourite editor, e.g. VSCode, or create your setup locally; below we demonstrate both. Both share the same tools:
- uv for Python packaging and development
- make (OPTIONAL) for automation of tasks; not strictly required, but makes life easier.
A dev container uses a Docker container to create the required development environment; the Dockerfile we use for this dev container can be found at ./.devcontainer/Dockerfile. Running it locally requires Docker to be installed, or you can run it in a cloud-based code editor; for a list of supported editors/cloud editors see the following webpage.
To run for the first time in a local VSCode editor (a slightly more detailed guide is available on the VSCode website):
- Ensure Docker is running.
- Ensure the VSCode Dev Containers extension is installed in your VSCode editor.
- Open the command palette (`CMD + SHIFT + P`) and then select `Dev Containers: Rebuild and Reopen in Container`.
You should now have everything you need to develop: uv, make, and various VSCode extensions such as Pylance.
If you have any trouble, see the VSCode website.
To run locally, first ensure you have the following tools installed locally:
- uv for Python packaging and development (version 0.9.6).
- make (OPTIONAL) for automation of tasks; not strictly required, but makes life easier.
  - Ubuntu: `apt-get install make`
  - Mac: the Xcode command line tools include `make`, else you can use brew.
  - Windows: various solutions are proposed in this blog post on how to install on Windows, including `Cygwin` and `Windows Subsystem for Linux`.
When developing on the project you will want to install the Python package locally in editable format with all the extra requirements; this can be done like so:

```shell
uv sync
```

Linting and formatting is done with ruff (a replacement for tools like Flake8, isort, and Black), and we use ty for type checking.
To run the linting:

```shell
make lint
```

To run the tests (uses pytest and coverage) and generate a coverage report:

```shell
make test
```

Any data that is required to be saved locally will be saved within the ./data folder.
| Language | ISO 639-3 |
|---|---|
| English | eng |
| Dutch | nld |
| Spanish | spa |
| Hindi | hin |
| Danish | dan |
| Korean | kor |
| Italian | ita |
| Portuguese | por |
| Chinese | zho |
| Finnish | fin |
| Irish | gle |
| Welsh | cym |
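For use in the pipeline code, the table above can be kept as a simple mapping (a sketch; whether the streaming loader expects these ISO 639-3 codes or some other config name is an assumption to verify against the dataset card):

```python
# Languages to process, mapped to their ISO 639-3 codes (from the table above).
LANGUAGES: dict[str, str] = {
    "English": "eng", "Dutch": "nld", "Spanish": "spa", "Hindi": "hin",
    "Danish": "dan", "Korean": "kor", "Italian": "ita", "Portuguese": "por",
    "Chinese": "zho", "Finnish": "fin", "Irish": "gle", "Welsh": "cym",
}
```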
All articles have to be rated as either "Good Articles" (GA) or "Featured Articles" (FA) by an editor; we hope that this will remove articles that might be incomplete or require additional editing. This filtering is inspired by Conia et al. 2024, who found that training on data from only "featured" and "good" articles performed similarly to training on the far larger set of Wikipedia articles that also contained non-good and non-featured articles, showing that training on smaller amounts of data is as effective and more efficient. "Featured" and "Good" articles can be defined differently for each Wikipedia language site, as stated in the English site definition within the following article.
To get the metadata on which articles are "Good" or "Featured" we will use the "page_props" table; the data for English from the 1st of March 2026 can be downloaded from https://dumps.wikimedia.org/enwiki/20260301/enwiki-20260301-page_props.sql.gz (note: opening this link will start downloading the data).
SQL code generated by Claude, an AI, that might be of use:

```sql
-- Featured Articles
SELECT pp_page FROM page_props WHERE pp_propname = 'featured_article';

-- Good Articles
SELECT pp_page FROM page_props WHERE pp_propname = 'good_article';

-- Both, joined with page titles
SELECT p.page_title, pp.pp_propname
FROM page_props pp
JOIN page p ON pp.pp_page = p.page_id
WHERE pp.pp_propname IN ('featured_article', 'good_article');
```

If an article's text is less than 200 words it will be removed.
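Rather than loading the dump into a MySQL server to run queries like the above, the page ids can also be scraped directly from the dump's `INSERT` statements. A sketch, assuming `pp_page` and `pp_propname` are the first two columns of each tuple (which matches the page_props schema, but should be verified against the dump):

```python
import gzip
import re

# Matches tuples inside the dump's INSERT statements, e.g. (12,'featured_article',...
# Assumes pp_page is the first column and pp_propname the second.
TUPLE_RE = re.compile(rb"\((\d+),'(featured_article|good_article)'")


def rated_page_ids(dump_path: str) -> dict[str, set[int]]:
    """Scan a gzipped page_props SQL dump and collect page ids flagged GA or FA."""
    ids: dict[str, set[int]] = {"featured_article": set(), "good_article": set()}
    with gzip.open(dump_path, "rb") as fh:
        for line in fh:
            for page_id, prop in TUPLE_RE.findall(line):
                ids[prop.decode()].add(int(page_id))
    return ids
```

Scanning line by line keeps memory flat even though the uncompressed dump is large; other `pp_propname` values (e.g. `wikibase_item`) simply never match the pattern.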
We will remove the following:
- Titles and Headers
- Tables
- Equations
Once those have been removed we will remove the markdown formatting.
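The two pre-processing passes above (drop headers, tables and equations, then strip the remaining markdown) could look roughly like this; the regexes are illustrative assumptions about how these elements appear in the source text, not a definitive implementation:

```python
import re


def strip_markdown(text: str) -> str:
    """Pre-processing sketch: remove headers, tables and equations,
    then strip the remaining markdown formatting."""
    # Titles and headers: lines beginning with one to six '#'
    text = re.sub(r"^#{1,6} .*$", "", text, flags=re.MULTILINE)
    # Tables: lines that start with a pipe
    text = re.sub(r"^\|.*$", "", text, flags=re.MULTILINE)
    # Equations: display ($$ ... $$) and inline ($ ... $) maths
    text = re.sub(r"\$\$.*?\$\$", "", text, flags=re.DOTALL)
    text = re.sub(r"\$[^$\n]+\$", "", text)
    # Remaining markdown formatting: links and emphasis
    text = re.sub(r"\[([^\]]*)\]\([^)]*\)", r"\1", text)   # [text](url) -> text
    text = re.sub(r"[*_]{1,3}([^*_]+)[*_]{1,3}", r"\1", text)  # bold/italic
    # Collapse the blank lines left behind
    return re.sub(r"\n{3,}", "\n\n", text).strip()
```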
The code is licensed under Apache License Version 2.0.