Add script to extract required phrases from rule files#5105

Open
Kaushik-Kumar-CEG wants to merge 2 commits into aboutcode-org:develop from Kaushik-Kumar-CEG:gsoc/dataset-extraction-script

Kaushik-Kumar-CEG commented Jun 3, 2026

References #5077

Adds a script that extracts required phrase annotations ({{ }} markers) from .RULE files and writes them out as JSONL for NER model training.

The script handles:

  • marker extraction with correct character positions
  • phrase normalization (HTML entities, XML tags, backticks, whitespace)
  • Unicode and line-ending normalization
  • position validation
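
For readers following along, the steps listed above can be sketched roughly as follows. This is a minimal illustration, not the script from this PR: the `MARKER` regex, the function names, and the choice to report positions relative to the text with the `{{ }}` markers removed are all assumptions made for the example.

```python
import html
import json
import re
import unicodedata

# Hypothetical pattern: non-greedy match of the text between {{ and }}.
MARKER = re.compile(r"\{\{(.*?)\}\}", re.DOTALL)


def normalize_text(text):
    """Apply Unicode NFC normalization and convert CRLF/CR line endings to LF."""
    text = unicodedata.normalize("NFC", text)
    return text.replace("\r\n", "\n").replace("\r", "\n")


def normalize_phrase(phrase):
    """Drop XML-like tags, unescape HTML entities, drop backticks, collapse whitespace."""
    phrase = re.sub(r"<[^>]+>", " ", phrase)
    phrase = html.unescape(phrase)
    phrase = phrase.replace("`", "")
    return " ".join(phrase.split())


def extract_phrases(text):
    """Return (plain_text, phrases); positions index into the marker-free plain_text."""
    text = normalize_text(text)
    plain_parts = []
    phrases = []
    cursor = 0  # read position in the marked-up text
    offset = 0  # length of the plain text built so far
    for match in MARKER.finditer(text):
        before = text[cursor:match.start()]
        inner = match.group(1)
        start = offset + len(before)
        end = start + len(inner)
        plain_parts.extend([before, inner])
        phrases.append({"start": start, "end": end, "phrase": normalize_phrase(inner)})
        offset = end
        cursor = match.end()
    plain_parts.append(text[cursor:])
    plain = "".join(plain_parts)

    # Position validation: every span must slice back to the recorded phrase.
    for item in phrases:
        if normalize_phrase(plain[item["start"]:item["end"]]) != item["phrase"]:
            raise ValueError(f"span {item['start']}:{item['end']} does not match its phrase")
    return plain, phrases


def to_jsonl(phrases):
    """Serialize one phrase record per line."""
    return "\n".join(json.dumps(item, ensure_ascii=False) for item in phrases)
```

Calling `extract_phrases` on a rule text returns the marker-free text plus one record per `{{ }}` span, and `to_jsonl` turns those records into one JSON object per line.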

More features (BIOES labels, train/val/test split, plain text field) are coming in follow-up commits.

Disclosure: I used Claude to help review the script and clean up a few bugs.

Tasks

  • Reviewed contribution guidelines
  • PR is descriptively titled and links the original issue above
  • Tests pass
  • Commits are in uniquely-named feature branch and has no merge conflicts
  • Updated documentation pages (if applicable)
  • Updated CHANGELOG.rst (if applicable)

Parses .RULE files for {{ }} markers and outputs JSONL with character positions and normalized phrases for NER training.

Signed-off-by: Kaushik <kaushikrjpm10@gmail.com>

Kaushik-Kumar-CEG (Author) commented

@AyanSinhaMahapatra ready for review when you get a chance. I'll make the follow-up changes to the script as well.

pombredanne (Member) left a comment

@Kaushik-Kumar-CEG why not reuse the existing code for this in https://github.com/aboutcode-org/scancode-toolkit/tree/develop/src/licensedcode, for example

def collect_is_required_phrase_from_rules(rules_by_expression, verbose=False):

? This script feels like redundant work, and it is also missing tests.

Kaushik-Kumar-CEG (Author) commented Jun 7, 2026

Thanks for the review @pombredanne

collect_is_required_phrase_from_rules is for rules flagged is_required_phrase: yes, which is a different case from what this script handles: inline {{ }} phrases, which are what the NER model trains on.

Skipping is_required_phrase: yes rules is intentional. Those rules already function as required phrases and don't need {{ }} annotation, so they're out of scope for what the model predicts at inference. I also tested this in my prototype: including them in training made the model mark entire rule texts as the phrase instead of finding the actual span.

On the redundancy, you're right. I'm dropping the local regex in the next commit and using get_required_phrase_verbatim for phrase extraction. For BIOES labels in a later commit, required_phrase_tokenizer yields {{ and }} as tokens, which is the natural fit, so I'll use it as well.
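
To make the BIOES idea concrete, here is a rough sketch of how a token stream that includes `{{` and `}}` as tokens could be turned into labels. It deliberately does not call scancode: `tokens_with_markers` is a hypothetical stand-in for a marker-aware tokenizer such as required_phrase_tokenizer, and the labeling scheme shown (B/I/E for multi-token spans, S for single-token spans, O elsewhere) is the generic BIOES convention, not necessarily what the follow-up commit will do.

```python
import re

# Stand-in tokenizer (NOT scancode's): yields lowercased word tokens plus
# the literal "{{" and "}}" markers as separate tokens.
TOKEN = re.compile(r"\{\{|\}\}|[A-Za-z0-9]+")


def tokens_with_markers(text):
    for match in TOKEN.finditer(text):
        yield match.group().lower()


def bioes_labels(tokens):
    """Return (token, label) pairs; marker tokens are consumed, not labeled.

    Assumes well-formed, non-nested markers. Tokens of a span that is
    never closed fall back to the "O" label.
    """
    labeled = []
    span = None  # tokens collected inside an open {{ ... }} span, else None
    for token in tokens:
        if token == "{{":
            span = []
        elif token == "}}":
            if span:
                if len(span) == 1:
                    labeled.append((span[0], "S"))
                else:
                    labeled.append((span[0], "B"))
                    labeled.extend((inner, "I") for inner in span[1:-1])
                    labeled.append((span[-1], "E"))
            span = None
        elif span is not None:
            span.append(token)
        else:
            labeled.append((token, "O"))
    if span:
        # Unclosed span: treat its tokens as outside any required phrase.
        labeled.extend((inner, "O") for inner in span)
    return labeled
```

Because the markers arrive as ordinary tokens, span boundaries fall out of the token stream directly and no character offsets are needed at this stage.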

I'll add the required tests once the feature commits land. Thanks :)

Drop the local {{ }} regex and use scancode's get_required_phrase_verbatim.
Add identifier, rule_type, and text fields. Drop per-phrase start/end
since they aren't needed for training.

References aboutcode-org#5077
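
As an illustration of the record shape this commit message describes, the sketch below builds one JSONL line with identifier, rule_type, and text fields and no per-phrase start/end. The `required_phrases` key, the helper names, and the sample values are assumptions for the example; the actual field layout is whatever the script in this PR emits.

```python
import json


def build_record(identifier, rule_type, text, phrases):
    """Assemble one training record; 'required_phrases' is a hypothetical key name."""
    return {
        "identifier": identifier,
        "rule_type": rule_type,
        "text": text,
        "required_phrases": list(phrases),
    }


def to_jsonl_line(record):
    """Serialize a record as a single JSON line (no embedded raw newlines)."""
    return json.dumps(record, ensure_ascii=False)


# Hypothetical sample values, for illustration only.
example = build_record(
    identifier="example_1.RULE",
    rule_type="notice",
    text="Licensed under the {{MIT License}}",
    phrases=["MIT License"],
)
print(to_jsonl_line(example))
```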