LiNT-II is a readability assessment tool for Dutch. It is a new Python implementation of the original LiNT tool developed by Utrecht University.

vanboefer/lint_ii

LiNT-II: readability assessment for Dutch

License: EUPL v1.2 Python 3.11 Binder

Table of contents

  1. Introduction
  2. Quick Start
  3. What is LiNT-II?
  4. References and Credits

Introduction

LiNT-II is a readability assessment tool for Dutch. The library (a) calculates a readability score for a text using the LiNT-II formula, and (b) provides an analysis per sentence, based on the 4 features that are used in the formula.

LiNT-II is a new implementation of the original LiNT tool (see Original LiNT). The main differences between LiNT and LiNT-II are:

  • The NLP tools used to extract linguistic features from the text. LiNT has T-Scan under the hood, while LiNT-II uses spaCy.
  • The coefficients (weights) used in the formula. Since the features are calculated differently, a new linear regression model was fitted on the original reading comprehension data from LiNT. This resulted in new coefficients. The LiNT-II model performs as well as the original LiNT: adjusted R² = 0.74, meaning that the model explains 74% of the variance in the comprehension data.

For more information, please refer to What is LiNT-II? and LiNT-II documentation.

Quick Start

Installation

pip install lint_ii
python -m spacy download nl_core_news_lg

Usage in Python

Create a ReadabilityAnalysis object from text

>>> from lint_ii import ReadabilityAnalysis

>>> text = """De Oudegracht is het sfeervolle hart van de stad.
In de middeleeuwen was het hier een drukte van belang met de aan- en afvoer van goederen. 
Nu is het een prachtige plek om te winkelen en te lunchen of te dineren in de oude stadskastelen."""

>>> analysis = ReadabilityAnalysis.from_text(text)
Loading Dutch language model from spaCy... ✓ nl_core_news_lg

NOTE: LiNT-II can process plain text or Markdown. Other formats (e.g., HTML) or very "unclean" text might produce inaccurate results due to segmentation issues.

Get LiNT-II scores

You can see the score and difficulty level for the whole document and/or per sentence:

>>> analysis.lint.score
48.20593518603563

>>> analysis.lint.level
3

>>> analysis.lint_scores_per_sentence
[18.511612982419507, 54.27056340066443, 63.24402181810589]

Get detailed analysis

For a detailed analysis, use the get_detailed_analysis() method:

>>> detailed_analysis = analysis.get_detailed_analysis()

>>> detailed_analysis.keys()
dict_keys(['document_stats', 'sentence_stats'])

>>> detailed_analysis['document_stats']
{'sentence_count': 3,
 'document_lint_score': 48.20593518603563,
 'document_difficulty_level': 3,
 'min_lint_score': 18.511612982419507,
 'max_lint_score': 63.24402181810589}

>>> detailed_analysis['sentence_stats'][0]
{'text': 'De Oudegracht is het sfeervolle hart van de stad.',
 'score': 18.511612982419507,
 'level': 1,
 'mean_log_word_frequency': 5.364349123825101,
 'top_n_least_freq_words': [('hart', 5.293120582960477),
  ('stad', 5.435577664689725)],
 'proportion_concrete_nouns': 0.5,
 'concrete_nouns': ['stad'],
 'abstract_nouns': [],
 'undefined_nouns': ['hart'],
 'unknown_nouns': ['oudegracht'],
 'max_sdl': 3,
 'sdls': [{'token': 'de', 'dep_length': 0, 'heads': ['Oudegracht']},
  {'token': 'oudegracht', 'dep_length': 3, 'heads': ['hart']},
  {'token': 'is', 'dep_length': 2, 'heads': ['hart']},
  {'token': 'het', 'dep_length': 1, 'heads': ['hart']},
  {'token': 'sfeervolle', 'dep_length': 0, 'heads': ['hart']},
  {'token': 'hart', 'dep_length': 0, 'heads': ['hart']},
  {'token': 'van', 'dep_length': 1, 'heads': ['stad']},
  {'token': 'de', 'dep_length': 0, 'heads': ['stad']},
  {'token': 'stad', 'dep_length': 2, 'heads': ['hart']},
  {'token': '.', 'dep_length': 0, 'heads': ['hart']}],
 'content_words_per_clause': 4.0,
 'content_words': ['oudegracht', 'sfeervolle', 'hart', 'stad'],
 'finite_verbs': ['is']}
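
The dictionary returned by get_detailed_analysis() can be post-processed with plain Python. The following is a minimal sketch that assumes only the keys visible in the output above ('sentence_stats', 'text', 'score', 'level'). The helper name hardest_sentences is hypothetical and not part of lint_ii.

```python
def hardest_sentences(detailed_analysis, n=1):
    """Return the n sentences with the highest LiNT-II score.

    Hypothetical helper, not part of lint_ii. It relies only on the
    'sentence_stats' list and the 'text', 'score' and 'level' keys
    shown in the example output.
    """
    ranked = sorted(
        detailed_analysis["sentence_stats"],
        key=lambda stats: stats["score"],
        reverse=True,  # higher score = more difficult
    )
    return [(s["text"], s["score"], s["level"]) for s in ranked[:n]]
```

Calling hardest_sentences(detailed_analysis, n=2) on the result from above would list the two most difficult sentences first, which can be a convenient starting point for revising a text.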

Access properties on sentence-level and text-level

All the linguistic features used in the analysis can be accessed at the text level and at the sentence level. For example:

Getting the mean word frequency for the whole text:

>>> analysis.mean_log_word_frequency
4.208347333820788

Getting the list of content words in each sentence:

>>> for sent in analysis.sentences:
...     print(sent.content_words)
['oudegracht', 'sfeervolle', 'hart', 'stad']
['middeleeuwen', 'drukte', 'belang', 'afvoer', 'goederen']
['prachtige', 'plek', 'winkelen', 'lunchen', 'dineren', 'oude', 'stadskastelen']

For a list of available properties, refer to the documentation in readability_analysis.py and sentence_analysis.py.

Visualization in Jupyter Notebook (Binder)

To visualize your readability analysis, you can use this notebook.

Note: The Binder notebook takes a while to build (~2 minutes). Alternatively, you can download the repo and set up a conda environment to run the notebook locally.

What is LiNT-II?

LiNT-II is a Python implementation of LiNT (Leesbaarheidsinstrument voor Nederlandse Teksten), a readability assessment tool that analyzes Dutch texts and estimates their difficulty.

LiNT-II outputs a readability score based on 4 features:

  • word frequency: Mean word frequency of all the content words in the text (excluding proper nouns).
    ➡ Less frequent words make a text more difficult.
  • syntactic dependency length: Syntactic dependency length (SDL) is the number of words between a syntactic head and its dependent (e.g., verb-subject). We take the largest SDL in each sentence and average these maxima over the whole text.
    ➡ Larger SDLs make a text more difficult.
  • content words per clause: Mean number of content words per clause.
    ➡ A larger number of content words indicates dense information and makes a text more difficult.
  • proportion concrete nouns: Mean proportion of concrete nouns out of all the nouns in the text.
    ➡ A smaller proportion of concrete nouns (i.e., many abstract nouns) makes a text more difficult.
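
The aggregation described for syntactic dependency length (largest SDL per sentence, then the mean over the text) can be sketched as follows. This is an illustration under stated assumptions rather than the library's code: the function name mean_max_sdl is hypothetical, and the input is assumed to be one list of dependency lengths per sentence, such as the dep_length values in the detailed analysis output.

```python
from statistics import mean


def mean_max_sdl(dep_lengths_per_sentence):
    """Average the largest dependency length of each sentence.

    Hypothetical helper, not part of lint_ii. Expects one non-empty
    list of integer dependency lengths per sentence.
    """
    return mean(max(lengths) for lengths in dep_lengths_per_sentence)


# First list: dep_length values of the example sentence in Quick Start.
# Second list: made-up values for an imaginary second sentence.
example = [
    [0, 3, 2, 1, 0, 0, 1, 0, 2, 0],
    [0, 1, 5, 0],
]
text_level_sdl = mean_max_sdl(example)  # mean of the maxima 3 and 5
```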

Definitions

  • Content words are words that carry semantic content and contribute to the meaning of the sentence. We consider a word a content word if it belongs to one of the following part-of-speech (POS) categories: nouns (NOUN), proper nouns (PROPN), lexical verbs (VERB), adjectives (ADJ), or if it is a manner adverb (based on a custom list).
  • Clause: A clause is a group of words that contains a subject and a verb, functioning as a part of a sentence. In this library, the number of clauses is determined by the number of finite verbs (= verbs that show tense) in the sentence.
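
The two definitions above can be combined into a small sketch of the content-words-per-clause feature. It works on hand-tagged tokens instead of spaCy output, leaves out the custom manner-adverb list (its contents are not given here), and treats a sentence without a finite verb as a single clause; all three choices are assumptions, and the function name is hypothetical.

```python
# POS categories counted as content words; manner adverbs are omitted
# because the custom list is not reproduced here.
CONTENT_POS = {"NOUN", "PROPN", "VERB", "ADJ"}


def content_words_per_clause(tagged_tokens):
    """Divide the number of content words by the number of finite verbs.

    Hypothetical helper, not part of lint_ii. tagged_tokens is a list of
    (text, pos, is_finite_verb) tuples. A sentence without any finite
    verb is counted as one clause (an assumption of this sketch).
    """
    n_content = sum(1 for _, pos, _ in tagged_tokens if pos in CONTENT_POS)
    n_clauses = sum(1 for _, _, is_finite in tagged_tokens if is_finite)
    return n_content / max(n_clauses, 1)


# Hand-assigned tags for the first example sentence from Quick Start.
sentence = [
    ("De", "DET", False), ("Oudegracht", "PROPN", False),
    ("is", "AUX", True), ("het", "DET", False),
    ("sfeervolle", "ADJ", False), ("hart", "NOUN", False),
    ("van", "ADP", False), ("de", "DET", False),
    ("stad", "NOUN", False), (".", "PUNCT", False),
]
ratio = content_words_per_clause(sentence)  # 4 content words / 1 finite verb
```

With these tags the sketch yields 4.0, in line with the content_words_per_clause value reported for this sentence in the detailed analysis.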

LiNT-II score

The readability score is calculated based on the following formula:

LiNT-II score = 

  100 - (
      - 4.21
      + (17.28 * word frequency)
      - (1.62  * syntactic dependency length)
      - (2.54  * content words per clause)
      + (16.00 * proportion concrete nouns)
  )
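
As a worked illustration, the formula above can be written as a plain Python function. This is a sketch using the rounded coefficients printed here, not the library's own implementation; the function and parameter names are assumptions.

```python
def lint_ii_score(word_frequency, sdl, content_words_per_clause,
                  proportion_concrete_nouns):
    """Apply the LiNT-II formula with the rounded published coefficients.

    Hypothetical helper, not part of lint_ii.
    """
    linear_term = (
        -4.21
        + 17.28 * word_frequency
        - 1.62 * sdl
        - 2.54 * content_words_per_clause
        + 16.00 * proportion_concrete_nouns
    )
    return 100 - linear_term


# Feature values of the first example sentence from Quick Start.
score = lint_ii_score(
    word_frequency=5.364349123825101,
    sdl=3,
    content_words_per_clause=4.0,
    proportion_concrete_nouns=0.5,
)
```

For these inputs the sketch gives roughly 18.53, close to the 18.51 reported by the library for the same sentence. The small gap is presumably due to the coefficients being rounded to two decimals here.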

The formula's coefficients were estimated using a linear regression model fitted on empirical reading comprehension data from high school students.

For more information about the empirical study (done for the original LiNT), please refer to the sources listed in Original LiNT.

For more information about the LiNT-II model, please refer to the LiNT-II documentation.

Difficulty levels

LiNT-II scores are mapped to 4 difficulty levels. For each level, it is estimated how many adult Dutch readers have difficulty understanding texts on this level.

Score       Difficulty level   Proportion of adults who have difficulty understanding this level
[0-34)      1                  14%
[34-46)     2                  29%
[46-58)     3                  53%
[58-100]    4                  78%
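
The mapping from score to difficulty level can be sketched with a sorted list of lower bounds. This is an illustration of the intervals listed above, not the library's code; the names are hypothetical, and scores outside 0-100 are not handled specially.

```python
from bisect import bisect_right

# Lower bounds of levels 2, 3 and 4, taken from the intervals above.
LEVEL_BOUNDARIES = [34, 46, 58]


def difficulty_level(score):
    """Map a LiNT-II score to a difficulty level from 1 to 4.

    Hypothetical helper, not part of lint_ii. Each interval includes
    its lower bound, so a score of exactly 34 falls in level 2.
    """
    return bisect_right(LEVEL_BOUNDARIES, score) + 1
```

For the example text in Quick Start, difficulty_level(48.20593518603563) returns 3, matching the value of analysis.lint.level.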

For more information about how this estimation was done for the original LiNT, please refer to the sources listed in Original LiNT.

For more information about how the estimation was adapted for LiNT-II, please refer to the LiNT-II documentation.

References and Credits

LiNT-II

LiNT-II was developed by Jenia Kim (Hogeschool Utrecht, VU Amsterdam), in collaboration with Henk Pander Maat (Utrecht University).

If you use this library, please cite as follows:

@software{lint_ii,
  author = {Kim, Jenia and Pander Maat, Henk},
  title = {{LiNT-II: readability assessment for Dutch}},
  year = {2025},
  url = {https://github.com/vanboefer/lint_ii},
  version = {0.1.0},
  note = {Python package}
}
  • Special thanks to Antal van den Bosch (Utrecht University) for setting up and facilitating the collaboration.
  • Special thanks to Lawrence Vriend for his work on the LiNT-II Visualizer and other help with the code.
  • The code for LiNT-II was inspired by a spaCy implementation of LiNT by the City of Amsterdam: alletaal-lint.

Original LiNT

The first version of LiNT was developed in the NWO project Toward a validated reading level tool for Dutch (2012-2017). Later versions were developed in the Digital Humanities Lab of Utrecht University.

More details about the original LiNT can be found on:

The readability research on which LiNT is based is described in the PhD thesis of Suzanne Kleijn (English) and in Pander Maat et al. 2023 (Dutch). Please cite as follows:

@article{pander2023lint,
  title={{LiNT}: een leesbaarheidsformule en een leesbaarheidsinstrument},
  author={Pander Maat, Henk and Kleijn, Suzanne and Frissen, Servaas},
  journal={Tijdschrift voor Taalbeheersing},
  volume={45},
  number={1},
  pages={2--39},
  year={2023},
  publisher={Amsterdam University Press Amsterdam}
}
@phdthesis{kleijn2018clozing,
  title={Clozing in on readability: How linguistic features affect and predict text comprehension and on-line processing},
  author={Kleijn, Suzanne},
  year={2018},
  school={Utrecht University}
}
