Description
Base editor data is a type of functional assay data that is similar to MAVE data in many respects. In these experiments, engineered CRISPR base editors convert target bases without creating double-strand breaks, introducing mutations within a given window determined by its position relative to the CRISPR guide RNA binding location.
One important difference is that in a typical analysis, only the guide RNAs are quantified and used to calculate variant scores. The actual mutations introduced are inferred from the guide location, the relative position of the editing window, and the base editor's conversion profile. This is typically done by assuming that all bases of the appropriate type in the window are converted with 100% efficiency and calculating the variant DNA sequence (and variant amino acid sequence, if relevant) that way. Because of this, two different guide RNAs could introduce the same sequence variant if they bind near each other.
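The inference step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `infer_edits`, the 4-8 editing window, the C-to-T conversion profile, and the reference sequence are all hypothetical, and the sketch ignores strand and PAM handling entirely.

```python
def infer_edits(reference, guide_start, window=(4, 8),
                convert_from="C", convert_to="T"):
    """Infer edits for one guide, assuming 100% conversion in the window.

    `reference` is a forward-strand DNA string and `guide_start` is the
    0-based index where the guide binds. `window` is 1-based and inclusive,
    counted from the first base of the guide (an assumed convention).
    Returns a list of (position, ref_base, alt_base) tuples.
    """
    first, last = window
    edits = []
    for offset in range(first - 1, last):
        pos = guide_start + offset
        # Every base of the target type inside the window is assumed edited.
        if pos < len(reference) and reference[pos] == convert_from:
            edits.append((pos, convert_from, convert_to))
    return edits


# Hypothetical reference with a single C at index 10.
reference = "GATTGAAGTACGATTGAAGTAGGATTGAAGTAGG"

# Two different 20-nt guides that bind two bases apart.
guide_a = reference[3:23]
guide_b = reference[5:25]

edits_a = infer_edits(reference, guide_start=3)
edits_b = infer_edits(reference, guide_start=5)
```

Here `guide_a` and `guide_b` are distinct sequences, yet both editing windows cover the same C, so the inferred variant is identical for the two guides.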
To address this and more faithfully represent the way the data was generated and analyzed, we should add a new score set type in MaveDB specifically for base editor data. Key features and implementation notes are outlined below:
- Base editors should be able to use existing experiment and experiment set records with no changes.
- Instead of an `hgvs_` index column, base editor score sets use a new column `guide_sequence` as the index column.
  - Guides should be validated to ensure that they are unique, like all index columns.
  - Guides should be validated to ensure that they contain only `ACGT`.
- `hgvs_nt` is required but does not need to be unique. `hgvs_pro` is optional.
- Base editor score sets do not have a target sequence. All `hgvs_` columns should be specified with a versioned chromosome scaffold, transcript, or protein identifier from RefSeq as was introduced for the SGE data.
  - Many base editor experiments hit multiple functional elements (many genes) so this should be supported.
- Base editor experiments may have multiple score sets, with each score set describing the variants with a different editing window width.
- Many base editor variants will be multi-nucleotide variants, since the assumption is that all nucleotides in the editing window have been changed.
- It will be useful to generate VRS objects for the base editor variants, but this can wait if the mapping service will struggle with this type of multi-nucleotide and multi-amino acid data.
- The most extensive changes will be required in the validator code to accept this new type of uploaded data table.
- We may be able to auto-detect whether a score set is a base editor score set simply by the presence of the `guide_sequence` column, but it is probably better not to do this and instead force an explicit choice. We may want to use the same column name for CRISPR-based technologies like SGE, where including the guide is valuable metadata but not an index value.
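The column rules in the list above could be checked along the following lines. This is a rough sketch, not the MaveDB validator: the function name `validate_base_editor_rows`, the row-as-dict representation, the uppercase-only guide rule, and the coarse accession pattern are all assumptions made for illustration, and the HGVS strings are made-up examples.

```python
import re
from collections import Counter

# Guides may contain only A, C, G and T (uppercase is assumed here).
GUIDE_PATTERN = re.compile(r"[ACGT]+")
# Coarse, assumed check for a versioned RefSeq-style accession prefix,
# e.g. "NC_000001.11:". A real validator would parse the HGVS string.
ACCESSION_PREFIX = re.compile(r"[A-Z]{2}_\d+\.\d+:")


def validate_base_editor_rows(rows):
    """Return a list of error messages for a table given as a list of dicts."""
    errors = []
    for i, row in enumerate(rows):
        guide = row.get("guide_sequence")
        if not guide:
            errors.append(f"row {i}: guide_sequence is missing")
        elif not GUIDE_PATTERN.fullmatch(guide):
            errors.append(f"row {i}: guide_sequence contains characters other than ACGT")

        hgvs_nt = row.get("hgvs_nt")
        if not hgvs_nt:
            errors.append(f"row {i}: hgvs_nt is required")
        elif not ACCESSION_PREFIX.match(hgvs_nt):
            errors.append(f"row {i}: hgvs_nt lacks a versioned accession prefix")

        # hgvs_pro is optional, so it is only checked when present.
        hgvs_pro = row.get("hgvs_pro")
        if hgvs_pro and not ACCESSION_PREFIX.match(hgvs_pro):
            errors.append(f"row {i}: hgvs_pro lacks a versioned accession prefix")

    # guide_sequence is the index column, so it must be unique.
    # hgvs_nt is deliberately not checked for uniqueness.
    counts = Counter(row.get("guide_sequence") for row in rows)
    for guide, n in counts.items():
        if guide and n > 1:
            errors.append(f"guide_sequence {guide} appears {n} times")
    return errors


# Two guides mapping to the same variant are accepted.
ok_rows = [
    {"guide_sequence": "ACGTACGTACGTACGTACGT", "hgvs_nt": "NC_000001.11:g.100C>T"},
    {"guide_sequence": "TTGTACGTACGTACGTACGA", "hgvs_nt": "NC_000001.11:g.100C>T"},
]
bad_rows = [
    {"guide_sequence": "ACGTNACGT", "hgvs_nt": "NC_000001.11:g.100C>T"},
    {"guide_sequence": "ACGTACGT", "hgvs_nt": "g.100C>T"},
    {"guide_sequence": "ACGTACGT", "hgvs_nt": "NC_000001.11:g.105C>T"},
]
ok_errors = validate_base_editor_rows(ok_rows)
bad_errors = validate_base_editor_rows(bad_rows)
```

In this sketch `ok_rows` passes even though both rows share one `hgvs_nt` value, while `bad_rows` is rejected for an invalid character, a missing accession prefix, and a duplicated guide.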