
Worked on issue 33: Extract and store ML features for each finding #41

Merged
ionfwsrijan merged 2 commits into ionfwsrijan:main from Krish-Mishra:main on Jun 5, 2026


@Krish-Mishra
Contributor

Linked issue

Closes #33

What this PR does

This PR resolves issue #33, "Extract and store ML features for each finding."
It adds a features column (JSON blob) to the findings table and populates it with the required data.

It meets all the acceptance criteria mentioned in the issue.

Type of change

  • Bug fix
  • New feature
  • [✅] ML model / training pipeline
  • Refactor (no behaviour change)
  • Documentation
  • Tests only

ML tier (if applicable)

  • [✅] Tier 1 — Triage
  • Tier 2 — Predictive
  • Tier 3 — Autonomous
  • Not ML-related

Changes

Backend

  • Added a centralized extract_features utility in app/utils/ml_features.py to calculate 7 structured ML features per finding (cwe_category, file_extension, path_depth, scanner, raw_severity, is_test_file, rule_id_prefix).
  • Injected the feature extraction logic into the data ingestion pipelines for gitleaks.py, osv.py, and semgrep.py prior to model validation.
  • Updated the Finding Pydantic model in app/models.py to include an optional features dictionary, which automatically exposes the extracted data to the /jobs/{job_id}/findings API endpoint payload.
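
As a rough illustration of the kind of utility described above (this is not the code in app/utils/ml_features.py), the sketch below derives the seven listed features from a raw finding. The function signature, the input field names, the test-file heuristic, and the "unknown" fallbacks are all assumptions made for the example.

```python
import re
from pathlib import PurePosixPath
from typing import Any, Dict, Optional

# Assumed heuristic: directory names that mark test code.
_TEST_DIRS = ("test", "tests", "spec", "__tests__")


def extract_features(
    scanner: str,
    file_path: str,
    rule_id: str,
    severity: Optional[str] = None,
    cwe: Optional[str] = None,
) -> Dict[str, Any]:
    """Illustrative only: build the seven-feature dict for one finding."""
    path = PurePosixPath(file_path.replace("\\", "/"))
    parts = [p for p in path.parts if p not in ("/", ".")]
    stem = path.stem.lower()

    # A file counts as a test file if it sits under a test directory
    # or follows the test_*.ext / *_test.ext naming pattern.
    is_test = (
        any(p.lower() in _TEST_DIRS for p in parts[:-1])
        or stem.startswith("test_")
        or stem.endswith("_test")
    )

    # Normalize strings such as "CWE-95: Eval Injection" to "CWE-95".
    match = re.search(r"CWE-(\d+)", cwe or "", re.IGNORECASE)

    return {
        "cwe_category": f"CWE-{match.group(1)}" if match else "unknown",
        "file_extension": path.suffix.lower().lstrip("."),
        "path_depth": max(len(parts) - 1, 0),  # number of parent directories
        "scanner": scanner,
        "raw_severity": (severity or "unknown").lower(),
        "is_test_file": is_test,
        # First token of the rule id, split on common separators.
        "rule_id_prefix": re.split(r"[.\-_/]", rule_id, maxsplit=1)[0] if rule_id else "",
    }


example = extract_features(
    scanner="semgrep",
    file_path="src/tests/test_auth.py",
    rule_id="python.lang.security.audit.eval",
    severity="ERROR",
    cwe="CWE-95: Eval Injection",
)
```

Keeping the logic in one pure function like this is what lets each scanner pipeline call it the same way before the finding is validated.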

Frontend

New dependencies

Database / schema changes

  • Modified the Finding Pydantic schema in app/models.py to include features: Optional[Dict[str, Any]] = Field(default_factory=dict) to store the JSON blob at insert time without breaking backward compatibility for older records.
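
To show why a default_factory keeps older records loadable, here is a minimal sketch that uses a standard-library dataclass as a stand-in for the Pydantic model, so it runs without extra dependencies. The class name and the rule_id and file_path fields are hypothetical; only the features field mirrors the change described above.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Optional


@dataclass
class FindingSketch:
    """Stand-in for the Finding model; only `features` mirrors this PR."""

    rule_id: str  # hypothetical field
    file_path: str  # hypothetical field
    # default_factory gives every instance its own empty dict, so records
    # stored before this change still load and never share state.
    features: Optional[Dict[str, Any]] = field(default_factory=dict)


# A record written before the features field existed (assumed shape).
old_record = {"rule_id": "generic-api-key", "file_path": "config/settings.py"}
# A record written after the change.
new_record = {**old_record, "features": {"scanner": "gitleaks", "path_depth": 1}}

old_finding = FindingSketch(**old_record)
new_finding = FindingSketch(**new_record)
```

Here old_finding.features is an empty dict rather than a missing attribute, so code that reads the field does not need a special case for older rows.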

Testing

How did you test this?

  • Executed the existing automated test suite (pytest tests/) to verify that the API response serialization remains intact with the new schema update.
  • Conducted a local end-to-end test by spinning up the backend (uvicorn) and running dummy scans to verify the features object populates correctly in the JSON response of the /jobs/{job_id}/findings endpoint.
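
The manual check above could also be expressed as a small helper. The sketch below assumes the findings endpoint returns a list of JSON objects, each with an optional features object; that response shape and the helper name are assumptions, not something taken from the codebase.

```python
from typing import Any, Dict, List

# The seven feature names listed in this PR.
EXPECTED_FEATURE_KEYS = frozenset(
    {
        "cwe_category",
        "file_extension",
        "path_depth",
        "scanner",
        "raw_severity",
        "is_test_file",
        "rule_id_prefix",
    }
)


def missing_feature_keys(findings: List[Dict[str, Any]]) -> Dict[int, List[str]]:
    """Map each finding's index to the sorted feature keys it lacks.

    Findings that carry all seven keys are left out of the result, so an
    empty dict means every finding is fully populated.
    """
    problems: Dict[int, List[str]] = {}
    for index, finding in enumerate(findings):
        features = finding.get("features") or {}
        missing = sorted(EXPECTED_FEATURE_KEYS.difference(features))
        if missing:
            problems[index] = missing
    return problems
```

A test could call this on the decoded response body and assert that the result is empty.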

Checklist

  • [✅] Tested locally end-to-end (upload ZIP or GitHub URL → scan → findings returned correctly)
  • [✅] New ML model falls back gracefully when model file is absent
  • [✅] No new console.error or unhandled Python exceptions introduced
  • [✅] Added or updated tests where applicable
  • [✅] requirements.txt / package.json updated if new dependencies added
  • [✅] New model files (.pkl, .pt, etc.) are gitignored, not committed

Anything reviewers should focus on

Screenshots (if UI changed)

@Tushar-sonawane06

@ionfwsrijan PR #41 is ready to merge! All changes from issue #33 have been implemented and tested locally. The feature extraction system is working as expected with all 7 ML features being populated correctly. Just need your approval to trigger the workflow and complete the merge.

@ionfwsrijan
Owner

@Krish-Mishra Pls ruff format the code to fix these failing checks.

@Krish-Mishra
Contributor Author

> @Krish-Mishra Pls ruff format the code to fix these failing checks.

ok will do

@Krish-Mishra
Contributor Author

> @Krish-Mishra Pls ruff format the code to fix these failing checks.

Done, now please check

Thank You

@ionfwsrijan
Owner

> > @Krish-Mishra Pls ruff format the code to fix these failing checks.
>
> Done, now please check
>
> Thank You

LGTM merging it now

@ionfwsrijan merged commit 43203dd into ionfwsrijan:main on Jun 5, 2026
6 checks passed

Development

Successfully merging this pull request may close these issues.

Extract and store ML features for each finding

3 participants