fix: improve GDPR data quality and simplify article structure #9

Merged
KevinRabun merged 1 commit into main from bugfix/gdpr-data-quality on Feb 9, 2026

@KevinRabun
Owner

Summary

This PR fixes data quality issues in the bundled GDPR text data and simplifies the article structure for improved usability.

Changes

Data Quality Fixes

  • Remove navigation artifacts from all 99 articles and 173 recitals (patterns like ← Art. N GDPR, Table of contents, Report error, ← Recital N Recital M → All recitals)
  • Fix recital 1 edge case: the first recital has a different navigation pattern, with no leading ← Recital N link
  • Strip standalone paragraph numbers (1, 2, etc.) from article text for clean prose
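
The cleanup described in the list above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: the regular expressions are assumptions modelled on the artifact examples named above, and `clean_text` is a hypothetical helper name.

```python
import re

# Hypothetical patterns modelled on the artifacts listed in this PR;
# the expressions actually used in the repository are not shown here.
NAV_PATTERNS = [
    re.compile(r"^←?\s*Art\.\s*\d+\s*GDPR.*$"),
    re.compile(r"^Table of contents$"),
    re.compile(r"^Report error$"),
    # Both link groups are optional, so this also covers recital 1,
    # which has no leading "← Recital N" link.
    re.compile(r"^(←\s*Recital\s+\d+\s*)?(Recital\s+\d+\s*→\s*)?All recitals$"),
]
STANDALONE_NUMBER = re.compile(r"^\d+$")


def clean_text(text: str) -> str:
    """Drop navigation lines, standalone paragraph numbers, and blank lines."""
    kept = []
    for line in text.splitlines():
        stripped = line.strip()
        if not stripped:
            continue  # blank line
        if STANDALONE_NUMBER.match(stripped):
            continue  # paragraph marker such as "1" or "2" on its own line
        if any(p.match(stripped) for p in NAV_PATTERNS):
            continue  # navigation artifact
        kept.append(stripped)
    return "\n".join(kept)


raw = "← Art. 32 GDPR\n1\nIn the case of a breach.\n2\nWhere the notification is late.\nReport error"
print(clean_text(raw))
```

Matching whole lines rather than substrings keeps the risk of deleting legitimate prose low, although a line consisting only of digits would still be removed even if it were real content.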

Structure Simplification

  • Remove paragraphs array from articles - the source website (gdpr-info.eu) doesn't cleanly separate main paragraph numbers from sub-paragraph numbers, making reliable extraction impossible
  • Simplify article structure to just: number, title, text
  • Reduce JSON size from ~500KB to ~370KB (26% reduction)

Before

{
  "number": 33,
  "title": "Notification of a personal data breach...",
  "text": "1\nIn the case of...\n2\nWhere the notification...",
  "paragraphs": [{"number": 1, "text": "..."}, {"number": 2, "text": "..."}]
}

After

{
  "number": 33,
  "title": "Notification of a personal data breach...", 
  "text": "In the case of a personal data breach, the controller shall..."
}
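
The structural change between the two shapes above amounts to keeping three keys and dropping `paragraphs`. A minimal sketch, assuming articles are plain dictionaries; `simplify_article` and `json_size_bytes` are hypothetical helper names, not functions from this repository, and the text cleanup itself is not covered here.

```python
import json

# The three fields retained by the simplified structure.
KEPT_KEYS = ("number", "title", "text")


def simplify_article(article: dict) -> dict:
    """Return a copy of the article with only the retained fields."""
    return {key: article[key] for key in KEPT_KEYS}


def json_size_bytes(data) -> int:
    """Size of the UTF-8 encoded JSON serialization, in bytes."""
    return len(json.dumps(data, ensure_ascii=False).encode("utf-8"))


old_article = {
    "number": 33,
    "title": "Notification of a personal data breach...",
    "text": "In the case of a personal data breach, the controller shall...",
    "paragraphs": [{"number": 1, "text": "..."}, {"number": 2, "text": "..."}],
}
new_article = simplify_article(old_article)

print(sorted(new_article))  # ['number', 'text', 'title']
print(json_size_bytes(old_article) > json_size_bytes(new_article))  # True
```

The reported reduction follows from the rounded sizes: (500 − 370) / 500 = 0.26, or 26%.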

Testing

  • All 294 tests pass
  • Verified all articles and recitals are artifact-free
  • Verified definitions extraction still works from Article 4
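
An artifact-free check like the one described above could look roughly like this. It is a sketch under assumptions: the marker list is taken from the artifact examples in this PR, `find_problems` is a hypothetical name, and the project's actual tests are not shown.

```python
import re

# Assumed markers of leftover navigation text, based on the examples in this PR.
ARTIFACT_MARKERS = re.compile(r"←|→|Table of contents|Report error|All recitals")
ALLOWED_ARTICLE_KEYS = {"number", "title", "text"}


def find_problems(articles: list) -> list:
    """Return human-readable descriptions of any data quality problems found."""
    problems = []
    for article in articles:
        label = f"article {article.get('number')}"
        if set(article) != ALLOWED_ARTICLE_KEYS:
            problems.append(f"{label}: unexpected structure")
        text = article.get("text", "")
        if ARTIFACT_MARKERS.search(text):
            problems.append(f"{label}: navigation artifact in text")
        if any(line.strip().isdigit() for line in text.splitlines()):
            problems.append(f"{label}: standalone paragraph number in text")
    return problems


clean = {"number": 33, "title": "Notification...", "text": "In the case of a breach."}
dirty = {
    "number": 34,
    "title": "Communication...",
    "text": "← Art. 33 GDPR\n1\nWhen the breach is likely...",
    "paragraphs": [],
}
print(find_problems([clean, dirty]))
```

Returning a list of problems instead of raising on the first one makes it easier to see every affected article in a single test run.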

- Remove navigation artifacts from all articles and recitals
- Fix recital text extraction (clean ← Recital N → All recitals patterns)
- Remove unreliable 'paragraphs' array from articles (source website doesn't cleanly separate paragraph numbers from sub-paragraph numbers)
- Simplify article structure to just: number, title, text
- Strip standalone paragraph number markers (1, 2, etc.) from text for clean prose
- Reduce JSON file size from ~500KB to ~370KB (26% reduction)

All 294 tests pass.
@KevinRabun KevinRabun merged commit 339ddd0 into main Feb 9, 2026
9 checks passed