LiNT-II is a readability assessment tool for Dutch. The library (a) calculates a readability score for a text using the LiNT-II formula, and (b) provides an analysis per sentence, based on the 4 features that are used in the formula.
LiNT-II is a new implementation of the original LiNT tool (see Original LiNT). The main differences between LiNT and LiNT-II are:
- The NLP tools used to extract linguistic features from the text. LiNT has T-Scan under the hood, while LiNT-II uses spaCy.
- The coefficients (weights) used in the formula. Since the features are calculated differently, a new linear regression model was fitted on the original reading comprehension data from LiNT. This resulted in new coefficients. The performance of the LiNT-II model is the same as that of the original LiNT: adjusted R² = 0.74, meaning that the model explains 74% of the variance in the comprehension data.
For more information, please refer to What is LiNT-II? and the LiNT-II documentation.
pip install lint_ii
python -m spacy download nl_core_news_lg

>>> from lint_ii import ReadabilityAnalysis
>>> text = """De Oudegracht is het sfeervolle hart van de stad.
In de middeleeuwen was het hier een drukte van belang met de aan- en afvoer van goederen.
Nu is het een prachtige plek om te winkelen en te lunchen of te dineren in de oude stadskastelen."""
>>> analysis = ReadabilityAnalysis.from_text(text)
Loading Dutch language model from spaCy... ✓ nl_core_news_lg

NOTE: LiNT-II can process plain text or Markdown. Other formats (e.g., HTML) or very "unclean" text might produce inaccurate results due to segmentation issues.
You can see the score and difficulty level for the whole document and/or per sentence:
>>> analysis.lint.score
48.20593518603563
>>> analysis.lint.level
3
>>> analysis.lint_scores_per_sentence
[18.511612982419507, 54.27056340066443, 63.24402181810589]

For a detailed analysis, use the get_detailed_analysis() method:
>>> detailed_analysis = analysis.get_detailed_analysis()
>>> detailed_analysis.keys()
dict_keys(['document_stats', 'sentence_stats'])
>>> detailed_analysis['document_stats']
{'sentence_count': 3,
'document_lint_score': 48.20593518603563,
'document_difficulty_level': 3,
'min_lint_score': 18.511612982419507,
'max_lint_score': 63.24402181810589}
>>> detailed_analysis['sentence_stats'][0]
{'text': 'De Oudegracht is het sfeervolle hart van de stad.',
'score': 18.511612982419507,
'level': 1,
'mean_log_word_frequency': 5.364349123825101,
'top_n_least_freq_words': [('hart', 5.293120582960477),
('stad', 5.435577664689725)],
'proportion_concrete_nouns': 0.5,
'concrete_nouns': ['stad'],
'abstract_nouns': [],
'undefined_nouns': ['hart'],
'unknown_nouns': ['oudegracht'],
'max_sdl': 3,
'sdls': [{'token': 'de', 'dep_length': 0, 'heads': ['Oudegracht']},
{'token': 'oudegracht', 'dep_length': 3, 'heads': ['hart']},
{'token': 'is', 'dep_length': 2, 'heads': ['hart']},
{'token': 'het', 'dep_length': 1, 'heads': ['hart']},
{'token': 'sfeervolle', 'dep_length': 0, 'heads': ['hart']},
{'token': 'hart', 'dep_length': 0, 'heads': ['hart']},
{'token': 'van', 'dep_length': 1, 'heads': ['stad']},
{'token': 'de', 'dep_length': 0, 'heads': ['stad']},
{'token': 'stad', 'dep_length': 2, 'heads': ['hart']},
{'token': '.', 'dep_length': 0, 'heads': ['hart']}],
'content_words_per_clause': 4.0,
'content_words': ['oudegracht', 'sfeervolle', 'hart', 'stad'],
'finite_verbs': ['is']}

All the linguistic features used in the analysis can be accessed at the text level and at the sentence level. For example:
Getting the mean word frequency for the whole text:
>>> analysis.mean_log_word_frequency
4.208347333820788

Getting the list of content words in each sentence:
>>> for sent in analysis.sentences:
    print(sent.content_words)
['oudegracht', 'sfeervolle', 'hart', 'stad']
['middeleeuwen', 'drukte', 'belang', 'afvoer', 'goederen']
['prachtige', 'plek', 'winkelen', 'lunchen', 'dineren', 'oude', 'stadskastelen']

For a list of available properties, refer to the documentation in readability_analysis.py and sentence_analysis.py.
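The per-sentence dictionaries in `detailed_analysis['sentence_stats']` can be post-processed with plain Python. The sketch below is a hypothetical helper, not part of the library: the name `hardest_sentences` is introduced here, and it relies only on the `text` and `score` keys shown in the example output above. The sample data reuses the three scores from the example; pairing the second and third scores with the second and third lines of the example text is an assumption.

```python
from typing import Any


def hardest_sentences(
    sentence_stats: list[dict[str, Any]], n: int = 3
) -> list[tuple[float, str]]:
    """Return the n highest-scoring (most difficult) sentences as (score, text)."""
    ranked = sorted(sentence_stats, key=lambda s: s["score"], reverse=True)
    return [(s["score"], s["text"]) for s in ranked[:n]]


# Stand-in for detailed_analysis['sentence_stats'], reduced to two keys.
example_stats = [
    {"text": "De Oudegracht is het sfeervolle hart van de stad.",
     "score": 18.511612982419507},
    {"text": "In de middeleeuwen was het hier een drukte van belang "
             "met de aan- en afvoer van goederen.",
     "score": 54.27056340066443},
    {"text": "Nu is het een prachtige plek om te winkelen en te lunchen "
             "of te dineren in de oude stadskastelen.",
     "score": 63.24402181810589},
]

for score, text in hardest_sentences(example_stats, n=2):
    print(f"{score:5.1f}  {text}")
```

With real output, the same function could be called as `hardest_sentences(detailed_analysis['sentence_stats'])`.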
To visualize your readability analysis, you can use this notebook.
Note: The Binder notebook takes a while to build (~2 minutes). Alternatively, you can download the repo and set up a conda environment to run the notebook locally.
LiNT-II is a Python implementation of LiNT (Leesbaarheidsinstrument voor Nederlandse Teksten), a readability assessment tool that analyzes Dutch texts and estimates their difficulty.
LiNT-II outputs a readability score based on 4 features:
| Feature | Description |
|---|---|
| word frequency | Mean word frequency of all the content words in the text (excluding proper nouns). ➡ Less frequent words make a text more difficult. |
| syntactic dependency length | Syntactic dependency length (SDL) is the number of words between a syntactic head and its dependent (e.g., verb-subject). We take the largest SDL in each sentence and calculate the mean of these values over the whole text. ➡ Larger SDLs make a text more difficult. |
| content words per clause | Mean number of content words per clause. ➡ A larger number of content words indicates denser information and makes a text more difficult. |
| proportion concrete nouns | Mean proportion of concrete nouns out of all the nouns in the text. ➡ A smaller proportion of concrete nouns (i.e., many abstract nouns) makes a text more difficult. |
- Content words are words that possess semantic content and contribute to the meaning of the sentence. We consider a word a content word if it belongs to one of the following parts of speech (POS): nouns (NOUN), proper nouns (PROPN), lexical verbs (VERB), adjectives (ADJ), or if it is a manner adverb (based on a custom list).
- A clause is a group of words that contains a subject and a verb, functioning as a part of a sentence. In this library, the number of clauses is determined by the number of finite verbs (= verbs that show tense) in the sentence.
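To make the definitions of SDL and content words per clause concrete, here is a minimal pure-Python sketch. It is not the library's implementation (LiNT-II derives these values from a spaCy parse). The function names are hypothetical, the head indices are copied from the `heads` field in the usage example's output, and the POS tags are assigned by hand. Two behaviors are assumptions: punctuation gets a length of 0 (inferred from the example output), and a sentence without a finite verb is counted as one clause. The custom list of manner adverbs is left out.

```python
CONTENT_POS = {"NOUN", "PROPN", "VERB", "ADJ"}  # manner adverbs omitted here


def dependency_lengths(head_indices: list[int], is_punct: list[bool]) -> list[int]:
    """Count the tokens standing between each token and its syntactic head."""
    lengths = []
    for i, head in enumerate(head_indices):
        if is_punct[i] or head == i:  # assumption: punctuation and root get 0
            lengths.append(0)
        else:
            lengths.append(abs(head - i) - 1)
    return lengths


def content_words_per_clause(pos_tags: list[str], n_finite_verbs: int) -> float:
    """Divide the number of content words by the number of finite verbs."""
    n_content = sum(tag in CONTENT_POS for tag in pos_tags)
    return n_content / max(n_finite_verbs, 1)  # assumption: at least one clause


# "De Oudegracht is het sfeervolle hart van de stad."
tokens = ["De", "Oudegracht", "is", "het", "sfeervolle", "hart", "van", "de", "stad", "."]
heads = [1, 5, 5, 5, 5, 5, 8, 8, 5, 5]  # index of each token's head
punct = [False] * 9 + [True]
pos = ["DET", "PROPN", "AUX", "DET", "ADJ", "NOUN", "ADP", "DET", "NOUN", "PUNCT"]

sdls = dependency_lengths(heads, punct)
print(sdls, max(sdls))                   # per-token lengths and the sentence maximum
print(content_words_per_clause(pos, 1))  # one finite verb: "is"
```

For this sentence the sketch reproduces the `dep_length` values, `max_sdl` of 3, and `content_words_per_clause` of 4.0 shown in the usage example.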
The readability score is calculated based on the following formula:
LiNT-II score =
100 - (
- 4.21
+ (17.28 * word frequency)
- (1.62 * syntactic dependency length)
- (2.54 * content words per clause)
+ (16.00 * proportion concrete nouns)
)
The formula's coefficients were estimated using a linear regression model fitted on empirical reading comprehension data from high school students.
For more information about the empirical study (done for the original LiNT), please refer to the sources listed in Original LiNT.
For more information about the LiNT-II model, please refer to the LiNT-II documentation.
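The formula in this section can be written as a small function. This is a sketch using the rounded coefficients printed above, not the library's code; the function and parameter names are introduced here for illustration. It is applied to the feature values of the first example sentence from the usage section (mean log word frequency 5.364349123825101, maximum SDL 3, 4.0 content words per clause, proportion of concrete nouns 0.5).

```python
def lint_ii_score(
    word_frequency: float,
    syntactic_dependency_length: float,
    content_words_per_clause: float,
    proportion_concrete_nouns: float,
) -> float:
    """Apply the LiNT-II formula with the rounded coefficients from this section."""
    predicted = (
        -4.21
        + 17.28 * word_frequency
        - 1.62 * syntactic_dependency_length
        - 2.54 * content_words_per_clause
        + 16.00 * proportion_concrete_nouns
    )
    return 100 - predicted


score = lint_ii_score(5.364349123825101, 3, 4.0, 0.5)
print(round(score, 2))
```

The result is about 18.53, slightly above the 18.51 that the library reports for this sentence. The small gap is presumably due to the coefficients being rounded to two decimals in the formula as printed.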
LiNT-II scores are mapped to 4 difficulty levels. For each level, it is estimated how many adult Dutch readers have difficulty understanding texts on this level.
| Score | Difficulty level | Proportion of adults who have difficulty understanding this level |
|---|---|---|
| [0-34) | 1 | 14% |
| [34-46) | 2 | 29% |
| [46-58) | 3 | 53% |
| [58-100] | 4 | 78% |
For more information about how this estimation was done for the original LiNT, please refer to the sources listed in Original LiNT.
For more information about how the estimation was adapted for LiNT-II, please refer to the LiNT-II documentation.
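The score-to-level table above translates into a simple lookup. The sketch below is an illustration with hypothetical names, not the library's code. It treats the lower bound of each interval as inclusive, as the table's notation indicates, and it maps scores below 0 to level 1 and scores above 100 to level 4, which is an assumption about out-of-range values.

```python
from bisect import bisect_right

LEVEL_BOUNDARIES = [34, 46, 58]  # lower bounds of levels 2, 3 and 4
PROPORTION_WITH_DIFFICULTY = {1: 0.14, 2: 0.29, 3: 0.53, 4: 0.78}


def difficulty_level(score: float) -> int:
    """Map a LiNT-II score to a difficulty level from 1 to 4."""
    return bisect_right(LEVEL_BOUNDARIES, score) + 1


for s in (18.511612982419507, 48.20593518603563, 63.24402181810589):
    level = difficulty_level(s)
    print(f"{s:.1f} -> level {level} ({PROPORTION_WITH_DIFFICULTY[level]:.0%})")
```

The document score of 48.2 from the usage example falls in level 3, and the first sentence's score of 18.5 falls in level 1, matching the output shown there.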
LiNT-II was developed by Jenia Kim (Hogeschool Utrecht, VU Amsterdam), in collaboration with Henk Pander Maat (Utrecht University).
If you use this library, please cite as follows:
@software{lint_ii,
author = {Kim, Jenia and Pander Maat, Henk},
title = {{LiNT-II: readability assessment for Dutch}},
year = {2025},
url = {https://github.com/vanboefer/lint_ii},
version = {0.1.0},
note = {Python package}
}
- Special thanks to Antal van den Bosch (Utrecht University) for setting up and facilitating the collaboration.
- Special thanks to Lawrence Vriend for his work on the LiNT-II Visualizer and other help with the code.
- The code for LiNT-II was inspired by a spaCy implementation of LiNT by the City of Amsterdam: alletaal-lint.
The first version of LiNT was developed in the NWO project Toward a validated reading level tool for Dutch (2012-2017). Later versions were developed in the Digital Humanities Lab of Utrecht University.
More details about the original LiNT can be found on:
The readability research on which LiNT is based is described in the PhD thesis of Suzanne Kleijn (English) and in Pander Maat et al. 2023 (Dutch). Please cite as follows:
@article{pander2023lint,
title={{LiNT}: een leesbaarheidsformule en een leesbaarheidsinstrument},
author={Pander Maat, Henk and Kleijn, Suzanne and Frissen, Servaas},
journal={Tijdschrift voor Taalbeheersing},
volume={45},
number={1},
pages={2--39},
year={2023},
publisher={Amsterdam University Press Amsterdam}
}
@phdthesis{kleijn2018clozing,
title={Clozing in on readability: How linguistic features affect and predict text comprehension and on-line processing},
author={Kleijn, Suzanne},
year={2018},
school={Utrecht University}
}