Merged
…rence selection to set it
This function queries the Ensembl API with exponential backoff as needed, returning a list of features which overlap the passed region.
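A rough sketch of what retrying with exponential backoff can look like is shown below. It is not the project's implementation: `with_backoff`, `TransientError`, and `overlap_url` are hypothetical names, the base URL is assumed to be the public Ensembl REST server, and the HTTP call itself is injected as a callable so the retry logic stays independent of any particular client library.

```python
import random
import time
from typing import Any, Callable


class TransientError(Exception):
    """Hypothetical marker for retryable failures (e.g. HTTP 429 or 5xx)."""


def overlap_url(base_url: str, species: str, chrom: str, start: int, end: int) -> str:
    """Build an Ensembl REST overlap-by-region URL restricted to gene features."""
    return f"{base_url}/overlap/region/{species}/{chrom}:{start}-{end}?feature=gene"


def with_backoff(
    fetch: Callable[[], Any],
    max_attempts: int = 5,
    base_delay: float = 0.5,
    jitter: float = 0.1,
    sleep: Callable[[float], None] = time.sleep,
) -> Any:
    """Call fetch(), doubling the wait after each transient failure."""
    for attempt in range(max_attempts):
        try:
            return fetch()
        except TransientError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the error to the caller
            delay = base_delay * 2**attempt
            # A little random jitter keeps concurrent clients from retrying in lockstep.
            sleep(delay * (1 + random.uniform(0, jitter)))
```

In practice `fetch` would wrap a GET request to something like `overlap_url("https://rest.ensembl.org", "human", "7", 140424943, 140624564)` and raise `TransientError` when the server asks the client to slow down.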
See #66 for details on computing this information for regulatory targets, which will be included in a future release.
This was referenced Dec 13, 2025
sallybg (Collaborator) approved these changes on Jan 30, 2026 and left a comment:
This is great! The only comment I have is that there are some changes in the router function (`src/api/routers/map.py`) which we should also recreate in the corresponding command line function (`save_mapped_output_json`, which is in `src/dcd_mapping/annotate.py`). I think we just need to add the `layers` and `gene_info` properties to the `reference_sequences` dict when accessing the mapper from the command line, unless you had a reason for not including them there.
Computes a new `gene_info` property for all mapped targets. This property is defined by an `hgnc_symbol` and a `selection_method`. The `hgnc_symbol` is the HGNC symbol of the gene to which the target relates. The `selection_method` records how that symbol was selected and may be:

- `tx_selection`: via the selected transcript
- `alignment_max_covered_bases`: based on the gene 'feature' (via Ensembl) which covered the most bases of the aligned target
- `variants_max_covered_bases`: same as `alignment_max_covered_bases`, but based on variant bases rather than aligned bases
- `target_metadata`: based on parsing the target metadata the user supplied
- `target_category`: no gene info was selected because the target was not protein coding (see #66)

Various helpers were added to `dcd_mapping.annotate` to support this calculation. Gene info selection should not cause job failures: if selection fails, the target is simply left without gene info.
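To make the "max covered bases" idea concrete, here is a minimal sketch of one way to score gene features by how many target bases they cover. The function names and the half-open `(start, end)` interval convention are assumptions made for illustration (Ensembl itself reports 1-based inclusive coordinates), and the helpers in `dcd_mapping.annotate` may well be structured differently.

```python
from typing import Dict, List, Optional, Tuple

Interval = Tuple[int, int]  # assumed half-open: [start, end)


def merge_intervals(intervals: List[Interval]) -> List[Interval]:
    """Merge overlapping or touching intervals so no base is counted twice."""
    merged: List[Interval] = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def covered_bases(target: List[Interval], feature: Interval) -> int:
    """Count the target bases that fall inside the feature's span."""
    f_start, f_end = feature
    return sum(
        max(0, min(end, f_end) - max(start, f_start))
        for start, end in merge_intervals(target)
    )


def gene_with_max_coverage(
    target: List[Interval], genes: Dict[str, Interval]
) -> Optional[str]:
    """Return the symbol whose span covers the most target bases, or None."""
    best_symbol: Optional[str] = None
    best_coverage = 0
    for symbol, span in genes.items():
        coverage = covered_bases(target, span)
        if coverage > best_coverage:
            best_symbol, best_coverage = symbol, coverage
    return best_symbol
```

The same scoring would apply to `variants_max_covered_bases` by passing variant spans instead of aligned spans as `target`. Ties go to whichever gene is seen first here, which is an arbitrary choice of this sketch.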
5d52e0c to 9669507
The addition of gene info to the command line output looks good!
This pull request introduces a new mechanism for inferring and attaching gene symbol information to target annotations, with provenance, in the mapping pipeline. It adds the `compute_target_gene_info` function, which determines a single gene symbol per target using a prioritized approach, and integrates this logic into the mapping API. Additionally, it restructures the `reference_sequences` data structure to use a new `TargetAnnotation` class, and makes several supporting improvements and bug fixes.

Gene symbol inference and annotation improvements:

- Added the `compute_target_gene_info` async function in `annotate.py`, which determines a single gene symbol per target using a prioritized strategy (selected transcript, alignment overlap, variant spans, or fallback to metadata), and returns a `GeneInfo` object with provenance. Supporting helper functions for overlap-based inference and interval merging were also added.
- `map_scoreset` API route: for each target, the computed gene info is attached to its `TargetAnnotation` in the response.

Data structure and schema changes:

- Restructured the `reference_sequences` structure in `map_scoreset` to use `TargetAnnotation` objects, which now include a `layers` attribute and a new `gene_info` field. Adjusted all code paths to reference the new structure.
- Added `TargetAnnotation` and `GeneInfo` imports to relevant modules, ensuring the new schema types are properly used throughout.

Supporting infrastructure and environment:

- Added `ENSEMBL_API_URL` to the `.env.dev` settings file to support Ensembl API queries for gene overlap.
- `request_with_backoff` and `ENSEMBL_API_URL` in `lookup.py`, to enable robust gene feature queries.
- Added an `Any` import in `lookup.py` for type hinting.

Bug fixes and code improvements:

- Checks now use `is not None` instead of truthiness in multiple locations (`annotate.py`).

These changes collectively improve the accuracy and transparency of gene symbol assignment in the API, and lay the groundwork for robust, provenance-aware gene annotation in downstream analyses.
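The prioritized strategy described in this summary can be sketched roughly as follows. This is an illustration only: `select_gene_info` and the `candidates` dict are hypothetical, the `GeneInfo` dataclass merely mirrors the two fields named in the PR description, and the real `compute_target_gene_info` is async and gathers its candidates from transcript selection, Ensembl overlap queries, and target metadata. The `target_category` case for non-protein-coding targets is left out.

```python
from dataclasses import dataclass
from typing import Dict, Optional

# Selection methods in priority order, using the names from the PR description.
PRIORITY = (
    "tx_selection",
    "alignment_max_covered_bases",
    "variants_max_covered_bases",
    "target_metadata",
)


@dataclass(frozen=True)
class GeneInfo:
    """Stand-in for the schema type: a symbol plus how it was chosen."""

    hgnc_symbol: str
    selection_method: str


def select_gene_info(candidates: Dict[str, Optional[str]]) -> Optional[GeneInfo]:
    """Return the first available symbol in priority order, with provenance."""
    for method in PRIORITY:
        symbol = candidates.get(method)
        # Explicit None check rather than truthiness, so only a genuinely
        # missing candidate falls through to the next method.
        if symbol is not None:
            return GeneInfo(hgnc_symbol=symbol, selection_method=method)
    return None  # nothing selected; the job continues without gene info
```

For example, a target with no selected transcript but an alignment-derived symbol would come back tagged `alignment_max_covered_bases`, even if the user-supplied metadata names a different gene.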