Add script to extract required phrases from rule files #5105
Kaushik-Kumar-CEG wants to merge 2 commits into
Parses .RULE files for {{ }} markers and outputs JSONL with character positions and normalized phrases for NER training.
Signed-off-by: Kaushik <kaushikrjpm10@gmail.com>
@AyanSinhaMahapatra ready for review when you get a chance. I'll make the follow-up changes to the script as well.
pombredanne
left a comment
@Kaushik-Kumar-CEG why not reuse the existing code for that in https://github.com/aboutcode-org/scancode-toolkit/tree/develop/src/licensedcode like in
? This script feels like redundant work, and it is also missing tests.
Thanks for the review @pombredanne
On the redundancy, you're right. I'm dropping the local regex in the next commit and using the existing scancode code instead. I'll add the required tests once the feature commits land. Thanks :)
Drop local {{ }} regex and use scancode's get_required_phrase_verbatim.
Add identifier, rule_type, text fields. Drop per-phrase start/end
since they aren't needed for training.
References aboutcode-org#5077
references #5077
Adds a script to extract required phrase annotations ({{ }} markers) from .RULE files and output them as JSONL for NER model training.
The script handles:
more features (BIOES labels, train/val/test split, plain text field) coming in follow-up commits
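For readers unfamiliar with the format, here is a minimal sketch of the kind of extraction described above. It is not the script from this PR: it uses a local regex purely to stay self-contained (the PR itself moves to scancode's own helper), it takes the rule text as a string rather than reading a .RULE file, and the names `REQUIRED_PHRASE`, `normalize`, `extract_required_phrases`, `to_jsonl_record` and the `required_phrases` field are hypothetical. The normalization shown (collapse whitespace, lowercase) is also an assumption.

```python
import json
import re

# Assumption: a required phrase is any text wrapped in {{ }} markers.
# Non-greedy so that two markers on one line are matched separately.
REQUIRED_PHRASE = re.compile(r"\{\{(.*?)\}\}", re.DOTALL)


def normalize(phrase):
    """Collapse runs of whitespace and lowercase (hypothetical normalization)."""
    return " ".join(phrase.split()).lower()


def extract_required_phrases(rule_text):
    """Return (plain_text, phrases) for one rule text.

    plain_text is the rule text with the {{ }} markers removed;
    phrases is the list of normalized required phrases, in order.
    """
    phrases = [normalize(m.group(1)) for m in REQUIRED_PHRASE.finditer(rule_text)]
    plain_text = REQUIRED_PHRASE.sub(lambda m: m.group(1), rule_text)
    return plain_text, phrases


def to_jsonl_record(identifier, rule_text):
    """Serialize one rule as a single JSONL line."""
    plain_text, phrases = extract_required_phrases(rule_text)
    return json.dumps(
        {"identifier": identifier, "text": plain_text, "required_phrases": phrases}
    )
```

Calling `to_jsonl_record` once per rule and writing each result on its own line would produce the JSONL output. The `rule_type` field mentioned in the second commit is left out here, since how it is derived is not shown in this conversation.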
Disclosure: used Claude to help review and clean up a few bugs in the script.