Add script to extract required phrases from rule files by Kaushik-Kumar-CEG · Pull Request #5105 · aboutcode-org/scancode-toolkit

Kaushik-Kumar-CEG · 2026-06-03T15:59:33Z

references #5077

Adds a script that extracts required phrase annotations ({{ }} markers) from .RULE files and outputs them as JSONL for NER model training.

The script handles:

marker extraction with correct character positions
phrase normalization (HTML entities, xml tags, backticks, whitespace)
unicode and line ending normalization
position validation
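For readers who want a concrete picture of the steps listed above, here is a minimal sketch. It is not the script from this PR: the `MARKER` regex, the function names, and the `text`/`phrases`/`start`/`end` field names are all assumptions made for illustration, and positions are reported relative to the rule text with the {{ }} markers stripped.

```python
import html
import json
import re
import unicodedata

# Hypothetical pattern: non-greedy match of anything between {{ and }}.
MARKER = re.compile(r"\{\{(.*?)\}\}", re.DOTALL)


def normalize_text(text):
    """Normalize unicode to NFC and line endings to a single newline."""
    text = unicodedata.normalize("NFC", text)
    return text.replace("\r\n", "\n").replace("\r", "\n")


def normalize_phrase(phrase):
    """Drop XML-like tags, unescape HTML entities, drop backticks, collapse whitespace."""
    phrase = re.sub(r"<[^>]+>", " ", phrase)
    phrase = html.unescape(phrase)
    phrase = phrase.replace("`", "")
    return " ".join(phrase.split())


def extract_phrases(rule_text):
    """Return (plain_text, phrases); positions index into plain_text."""
    rule_text = normalize_text(rule_text)
    parts, phrases = [], []
    cursor = 0      # read position in the marked-up text
    plain_len = 0   # length of the marker-free text built so far
    for match in MARKER.finditer(rule_text):
        before = rule_text[cursor:match.start()]
        inner = match.group(1)
        parts.extend([before, inner])
        start = plain_len + len(before)
        end = start + len(inner)
        plain_len = end
        phrases.append({"start": start, "end": end, "phrase": normalize_phrase(inner)})
        cursor = match.end()
    parts.append(rule_text[cursor:])
    plain_text = "".join(parts)
    # Position validation: each span must reproduce its phrase after normalization.
    for item in phrases:
        if normalize_phrase(plain_text[item["start"]:item["end"]]) != item["phrase"]:
            raise ValueError("span does not match phrase: %r" % item)
    return plain_text, phrases


def to_jsonl_line(rule_text):
    """Serialize one rule as a single JSONL record."""
    plain_text, phrases = extract_phrases(rule_text)
    return json.dumps({"text": plain_text, "phrases": phrases})
```

Calling `to_jsonl_line` once per rule and writing each result on its own line would give a JSONL file in this assumed shape.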

more features (BIOES labels, train/val/test split, plain text field) coming in follow-up commits

disclosure: used Claude to help review and clean up a few bugs in the script

Tasks

Reviewed contribution guidelines
PR is descriptively titled and links the original issue above
Tests pass
Commits are in uniquely-named feature branch and has no merge conflicts
Updated documentation pages (if applicable)
Updated CHANGELOG.rst (if applicable)


Kaushik-Kumar-CEG · 2026-06-03T16:04:27Z

@AyanSinhaMahapatra ready for review when you get a chance. I'll make the follow-up changes to the script as well

pombredanne

@Kaushik-Kumar-CEG why not reuse the existing code for that in https://github.com/aboutcode-org/scancode-toolkit/tree/develop/src/licensedcode like in

scancode-toolkit/src/licensedcode/required_phrases.py

Line 218 in ea42c1d

def collect_is_required_phrase_from_rules(rules_by_expression, verbose=False):

? This script feels like redundant work, and it is also missing tests.

Kaushik-Kumar-CEG · 2026-06-07T11:32:27Z

thanks for the review @pombredanne

collect_is_required_phrase_from_rules is for rules with is_required_phrase: yes which is a different case from what this script handles (inline {{ }} phrases which is what the NER model trains on)

skipping is_required_phrase: yes rules is intentional. those rules already function as required phrases and don't need {{ }} annotation, so they're not in scope for what the model predicts at inference. also tested it in my prototype: including them in training made the model mark entire rule texts as the phrase instead of finding the actual span
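To make the selection rule above concrete, here is a small sketch. It assumes each rule has already been loaded into a plain dict with hypothetical `is_required_phrase` and `text` keys; this is not how the PR's script or scancode's own rule objects are structured, only an illustration of the filter.

```python
def select_training_rules(rules):
    """Keep rules with inline {{ }} markers that are not whole-rule required phrases."""
    selected = []
    for rule in rules:
        # Whole-rule required phrases are out of scope for span prediction.
        if rule.get("is_required_phrase"):
            continue
        # Only rules that carry at least one inline marker pair are useful examples.
        if "{{" in rule["text"] and "}}" in rule["text"]:
            selected.append(rule)
    return selected
```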

on the redundancy, you're right. I'm dropping the local regex in the next commit and using get_required_phrase_verbatim for phrase extraction. For BIOES labels in a later commit, required_phrase_tokenizer yields {{ and }} as tokens, which is the natural fit, so I'll use it as well
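As an illustration of why a token stream that includes {{ and }} fits BIOES labeling, here is a minimal sketch. The `tokenize` function below is a hypothetical stand-in written for this example, not scancode's required_phrase_tokenizer, and `bioes_labels` is likewise an assumed helper name.

```python
import re

# Hypothetical stand-in tokenizer: yields "{{", "}}", and word tokens.
TOKEN = re.compile(r"\{\{|\}\}|\w+")


def tokenize(text):
    return TOKEN.findall(text)


def bioes_labels(tokens):
    """Return (words, labels) with the marker tokens consumed, not emitted."""
    words, labels = [], []
    span = None  # None outside a phrase, else the word indexes inside it
    for tok in tokens:
        if tok == "{{":
            span = []
        elif tok == "}}":
            if span:
                if len(span) == 1:
                    labels[span[0]] = "S"   # single-token phrase
                else:
                    labels[span[0]] = "B"   # first token of the phrase
                    labels[span[-1]] = "E"  # last token of the phrase
            span = None
        else:
            words.append(tok)
            if span is None:
                labels.append("O")
            else:
                span.append(len(words) - 1)
                labels.append("I")  # provisional; ends are fixed at "}}"
    # An unclosed "{{" is not repaired here: its tokens keep the "I" label.
    return words, labels
```

For example, `bioes_labels(tokenize("under the {{MIT License}}"))` pairs the words `under`, `the`, `MIT`, `License` with the labels `O`, `O`, `B`, `E`.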

I'll add the required tests once the feature commits land. thanks :)


Add script to extract required phrase annotations from rule files

5477224

Parses .RULE files for {{ }} markers and outputs JSONL with character positions and normalized phrases for NER training. Signed-off-by: Kaushik <kaushikrjpm10@gmail.com>

pombredanne requested changes Jun 5, 2026


refactor and add metadata fields to dataset output

68e94b8

Drop the local {{ }} regex and use scancode's get_required_phrase_verbatim. Add identifier, rule_type, text fields. Drop per-phrase start/end since they aren't needed for training. References aboutcode-org#5077

Kaushik-Kumar-CEG wants to merge 2 commits into aboutcode-org:develop from Kaushik-Kumar-CEG:gsoc/dataset-extraction-script
