
Commit 36ae780

Feature/unified variants (#7)
* **chore(variants): Remove unused legacy variant group and alias code**
  - Deleted templates, models, repositories, and controllers related to old variant group and alias handling.
  - Refactored dependent views and logic to remove references to deprecated components.
  - Cleaned up accompanying documentation and comments.
* Template for maintenance page
* DU Naming Authority API
* **refactor(variants): Update forms and backend for simplified reference genome and variant metadata handling**
  - Replaced `genbankContigId` with `refGenome` and `contig` in forms and backend logic.
  - Enhanced `editForm` and `createForm` to support clearer metadata fields and additional variant types.
  - Made genomic coordinates immutable, restricting edits to metadata fields only.
  - Added improved validation and context for editable and optional fields.
  - Streamlined templates and controller logic to align with the `VariantV2` schema.
* **chore(curator): Remove legacy Cytoband and STR Marker templates and models**
  - Deleted unused templates (`createForm`, `editForm`, `detailPanel`) for Cytobands and STR Markers.
  - Removed associated Scala models (`Cytoband`, `StrMarker`), streamlining codebase maintenance.
  - Cleaned up controllers and routes referencing deprecated Cytoband and STR Marker logic.
* **refactor(views/utils): Centralize badge and variant rendering utilities**
  - Extracted reusable badge class and variant formatting logic into `CuratorViewUtils` and `VariantViewUtils`.
  - Updated multiple views to use the new utilities, reducing redundancy and improving maintainability.
  - Replaced inline helper methods with calls to the centralized utility functions across curator and variant templates.
* **refactor(views): Extract and reuse breadcrumb, flash message, and search input components**
  - Added reusable templates for breadcrumbs, flash messages, and search inputs in `fragments`.
  - Updated variant browser and curator views to use the new components, reducing redundancy.
  - Simplified and centralized layout structures for improved maintainability.
* feat(tree): add user opt-in preference for block layout
  - Allows users to toggle between the standard tree layout and the block layout on the main tree pages. The preference is saved in a cookie and respected across sessions.
* **refactor(sql): Optimize variant migration script for performance and clarity**
  - Added a temporary index to speed up grouping and joining during variant migration.
  - Combined variant name and coordinate insertion into a single pass.
  - Simplified alias aggregation using filtered JSONB operations.
  - Improved maintainability by removing redundant steps and optimizing query structure.
  - Dropped the temporary index after the migration steps to clean up.
* **refactor(views/controllers): Simplify variant and haplogroup pagination with dynamic fragment loading**
  - Replaced server-side pagination logic for variants and haplogroups with HTMX-driven fragment updates for improved responsiveness.
  - Simplified controller actions and templates, removing unused parameters and reducing complexity.
  - Added a loading spinner and placeholder text for a better user experience during content updates.
* **chore(models): Remove deprecated STR Marker model and table**
  - Deleted the `StrMarker` model and its corresponding table definition.
  - Cleaned up related dependencies and imports, reflecting the shift away from legacy STR marker handling.
* **feat(variants): Add smart ingestion and deduplication for YBrowse variants**
  - Introduced a `findMatches` method in `VariantV2Repository` for matching variants by coordinates or aliases.
  - Enhanced `YBrowseVariantIngestionService` to support GFF3 ingestion with coordinate normalization, alias grouping, and liftover handling.
  - Added `smartUpsertVariant` logic to merge or create variants based on matching results.
  - Improved metadata merging for aliases, coordinates, evidence, and primers during the upsert process.
  - Included support for parsing and processing Y-DNA variants from both VCF and GFF files, with batch processing for efficiency.
* **refactor(sql): Optimize and consolidate variant migration script for better performance and maintainability**
  - Enhanced the migration with GIN index management for faster inserts.
  - Redesigned aggregation queries using Common Table Expressions (CTEs) for aliases and coordinates.
  - Simplified deduplication steps for `haplogroup_variant` FK updates.
  - Added verification and manual cleanup steps for improved integrity and safety.
* Fixing some SQL issues.
* Enhance the block tree layout to use SVG and a design language similar to the original.
* Search the tree by SNP name as well as subclade name.
* Using the more correct terminology for cladograms.
* Fixing the download location; it was still trying to refer to the VCF.
* Reducing the number of connections from the pool that we allow the variant ingestion to use, prioritizing the user experience.
* Fixing some UI oddities.
* Remove dead code.
* Split the DB-side OR to live in the application code; it was causing problems with PostgreSQL's planner and inducing full table scans (see the sketch after this list).
* Try optimizing the queries.
* Comment out the logging to reduce noise. It needs rethinking so we still see progress, but without a flood of "0 new variants" messages, since that is the default case.
* Back to logging total records examined, since the perf issues seem resolved.
* Split the universal variant details into smaller chunks, retaining the original as a link source to the final status.
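The `findMatches` and planner-related items above describe the same underlying change: instead of asking PostgreSQL to evaluate a single "coordinates-match OR alias-match" query, which defeated the planner and induced full table scans, the lookup runs as two narrow, index-friendly queries whose results are merged in application code. The sketch below is only an illustration of that pattern; the `Variant` shape, the query functions, and the `distinctBy`-based deduplication are assumptions, not the actual `VariantV2Repository` code.

import scala.concurrent.{ExecutionContext, Future}

// Hypothetical, simplified shapes: the real VariantV2 model and repository differ.
final case class Variant(id: Long, contig: String, position: Long, aliases: Set[String])

class VariantLookup(
    byCoordinates: (String, Long) => Future[Seq[Variant]], // assumed to hit a (contig, position) index
    byAlias: String => Future[Seq[Variant]]                // assumed to hit an alias (e.g. GIN) index
)(implicit ec: ExecutionContext) {

  // Two separate queries instead of one "... OR ..." query: each branch can use
  // its own index, and the union is computed here rather than in the database.
  def findMatches(contig: String, position: Long, alias: String): Future[Seq[Variant]] = {
    val coordHits = byCoordinates(contig, position)
    val aliasHits = byAlias(alias)
    for {
      byCoord <- coordHits
      byName  <- aliasHits
    } yield (byCoord ++ byName).distinctBy(_.id) // dedupe variants matched by both paths
  }
}

A `smartUpsertVariant`-style caller can then merge metadata into whichever existing variant comes back, or create a new one when both lists are empty.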
1 parent fec7079 commit 36ae780

File tree

131 files changed: +5936 −7643 lines


app/actors/YBrowseVariantUpdateActor.scala

Lines changed: 55 additions & 155 deletions
@@ -5,10 +5,9 @@ import org.apache.pekko.actor.Actor
 import play.api.Logging
 import services.genomics.YBrowseVariantIngestionService
 
-import java.io.{BufferedInputStream, BufferedReader, FileOutputStream, InputStreamReader}
+import java.io.{BufferedInputStream, FileOutputStream}
 import java.net.{HttpURLConnection, URI}
 import java.nio.file.Files
-import java.util.zip.{GZIPInputStream, GZIPOutputStream}
 import scala.concurrent.{ExecutionContext, Future}
 import scala.util.{Failure, Success, Try}
 
@@ -68,180 +67,81 @@ class YBrowseVariantUpdateActor @javax.inject.Inject()(
 
   private def runUpdate(): Future[UpdateResult] = {
     Future {
-      downloadVcfFile()
+      downloadGffFile()
     }.flatMap {
       case Success(_) =>
-        logger.info("VCF file downloaded successfully, sanitizing VCF")
-        Future(sanitizeVcfFile()).flatMap {
-          case Success(skipped) =>
-            logger.info(s"VCF sanitized (removed $skipped malformed records), starting ingestion")
-            ingestionService.ingestVcf(genomicsConfig.ybrowseVcfStoragePath).map { count =>
-              UpdateResult(success = true, variantsIngested = count, s"Successfully ingested $count variants (skipped $skipped malformed records)")
-            }
-          case Failure(ex) =>
-            Future.successful(UpdateResult(success = false, variantsIngested = 0, s"Sanitization failed: ${ex.getMessage}"))
+        logger.info("GFF file downloaded successfully, starting ingestion")
+        ingestionService.ingestGff(genomicsConfig.ybrowseGffStoragePath).map { count =>
+          UpdateResult(success = true, variantsIngested = count, s"Successfully ingested $count variants from GFF")
         }
       case Failure(ex) =>
         Future.successful(UpdateResult(success = false, variantsIngested = 0, s"Download failed: ${ex.getMessage}"))
     }
   }
 
-  private def downloadVcfFile(): Try[Unit] = Try {
-    val url = URI.create(genomicsConfig.ybrowseVcfUrl).toURL
-    val targetFile = genomicsConfig.ybrowseVcfStoragePath
-
-    // Ensure parent directory exists
-    val parentDir = targetFile.getParentFile
-    if (parentDir != null && !parentDir.exists()) {
-      Files.createDirectories(parentDir.toPath)
-      logger.info(s"Created directory: ${parentDir.getAbsolutePath}")
-    }
-
-    // Download to a temp file first, then rename (atomic operation)
-    val tempFile = new java.io.File(targetFile.getAbsolutePath + ".tmp")
-
-    logger.info(s"Downloading VCF from ${genomicsConfig.ybrowseVcfUrl} to ${tempFile.getAbsolutePath}")
+  private def downloadGffFile(): Try[Unit] = Try {
+    val url = URI.create(genomicsConfig.ybrowseGffUrl).toURL
+    val targetFile = genomicsConfig.ybrowseGffStoragePath
+
+    // Check for fresh local file (cache for 24 hours)
+    val cacheDuration = 24 * 60 * 60 * 1000L // 24 hours in millis
+    if (targetFile.exists() && (System.currentTimeMillis() - targetFile.lastModified() < cacheDuration)) {
+      logger.info(s"Local GFF file is fresh (< 24 hours old), skipping download: ${targetFile.getAbsolutePath}")
+    } else {
+      // Ensure parent directory exists
+      val parentDir = targetFile.getParentFile
+      if (parentDir != null && !parentDir.exists()) {
+        Files.createDirectories(parentDir.toPath)
+        logger.info(s"Created directory: ${parentDir.getAbsolutePath}")
+      }
 
-    val connection = url.openConnection().asInstanceOf[HttpURLConnection]
-    connection.setRequestMethod("GET")
-    connection.setConnectTimeout(30000) // 30 seconds
-    connection.setReadTimeout(300000) // 5 minutes for large file
+      // Download to a temp file first, then rename (atomic operation)
+      val tempFile = new java.io.File(targetFile.getAbsolutePath + ".tmp")
 
-    try {
-      val responseCode = connection.getResponseCode
-      if (responseCode != HttpURLConnection.HTTP_OK) {
-        throw new RuntimeException(s"HTTP request failed with status $responseCode")
-      }
+      logger.info(s"Downloading GFF from ${genomicsConfig.ybrowseGffUrl} to ${tempFile.getAbsolutePath}")
 
-      val inputStream = new BufferedInputStream(connection.getInputStream)
-      val outputStream = new FileOutputStream(tempFile)
+      val connection = url.openConnection().asInstanceOf[HttpURLConnection]
+      connection.setRequestMethod("GET")
+      connection.setConnectTimeout(30000) // 30 seconds
+      connection.setReadTimeout(300000) // 5 minutes for large file
 
       try {
-        val buffer = new Array[Byte](8192)
-        var bytesRead = 0
-        var totalBytes = 0L
-
-        while ({ bytesRead = inputStream.read(buffer); bytesRead != -1 }) {
-          outputStream.write(buffer, 0, bytesRead)
-          totalBytes += bytesRead
+        val responseCode = connection.getResponseCode
+        if (responseCode != HttpURLConnection.HTTP_OK) {
+          throw new RuntimeException(s"HTTP request failed with status $responseCode")
        }
 
-        logger.info(s"Downloaded $totalBytes bytes")
-      } finally {
-        inputStream.close()
-        outputStream.close()
-      }
-
-      // Atomic rename
-      if (targetFile.exists()) {
-        targetFile.delete()
-      }
-      if (!tempFile.renameTo(targetFile)) {
-        throw new RuntimeException(s"Failed to rename temp file to ${targetFile.getAbsolutePath}")
-      }
+        val inputStream = new BufferedInputStream(connection.getInputStream)
+        val outputStream = new FileOutputStream(tempFile)
 
-      logger.info(s"VCF file saved to ${targetFile.getAbsolutePath}")
-    } finally {
-      connection.disconnect()
-    }
-  }
+        try {
+          val buffer = new Array[Byte](8192)
+          var bytesRead = 0
+          var totalBytes = 0L
 
-  /**
-   * Sanitizes the VCF file by removing malformed records that HTSJDK cannot parse.
-   * Specifically filters out records with duplicate alleles (REF == ALT or duplicate ALT alleles).
-   *
-   * @return Try containing the number of skipped records
-   */
-  private def sanitizeVcfFile(): Try[Int] = Try {
-    val sourceFile = genomicsConfig.ybrowseVcfStoragePath
-    val tempFile = new java.io.File(sourceFile.getAbsolutePath + ".sanitized.tmp")
-
-    logger.info(s"Sanitizing VCF file: ${sourceFile.getAbsolutePath}")
-
-    val inputStream = new BufferedReader(
-      new InputStreamReader(
-        new GZIPInputStream(
-          new BufferedInputStream(
-            new java.io.FileInputStream(sourceFile)
-          )
-        )
-      )
-    )
-
-    val outputStream = new java.io.PrintWriter(
-      new java.io.OutputStreamWriter(
-        new GZIPOutputStream(
-          new FileOutputStream(tempFile)
-        )
-      )
-    )
-
-    var skippedCount = 0
-    var lineNumber = 0
-
-    try {
-      var line: String = null
-      while ({ line = inputStream.readLine(); line != null }) {
-        lineNumber += 1
-        if (line.startsWith("#")) {
-          // Header line - pass through
-          outputStream.println(line)
-        } else {
-          // Data line - check for duplicate alleles
-          if (isValidVcfDataLine(line)) {
-            outputStream.println(line)
-          } else {
-            skippedCount += 1
-            if (skippedCount <= 10) {
-              logger.warn(s"Skipping malformed VCF record at line $lineNumber: ${line.take(100)}...")
-            }
+          while ({ bytesRead = inputStream.read(buffer); bytesRead != -1 }) {
+            outputStream.write(buffer, 0, bytesRead)
+            totalBytes += bytesRead
          }
+
+          logger.info(s"Downloaded $totalBytes bytes")
+        } finally {
+          inputStream.close()
+          outputStream.close()
        }
-      }
 
-      if (skippedCount > 10) {
-        logger.warn(s"Skipped ${skippedCount - 10} additional malformed records (warnings suppressed)")
-      }
-    } finally {
-      inputStream.close()
-      outputStream.close()
-    }
+        // Atomic rename
+        if (targetFile.exists()) {
+          targetFile.delete()
+        }
+        if (!tempFile.renameTo(targetFile)) {
+          throw new RuntimeException(s"Failed to rename temp file to ${targetFile.getAbsolutePath}")
+        }
 
-    // Replace original with sanitized version
-    if (sourceFile.exists()) {
-      sourceFile.delete()
-    }
-    if (!tempFile.renameTo(sourceFile)) {
-      throw new RuntimeException(s"Failed to rename sanitized file to ${sourceFile.getAbsolutePath}")
+        logger.info(s"GFF file saved to ${targetFile.getAbsolutePath}")
+      } finally {
+        connection.disconnect()
+      }
    }
-
-    logger.info(s"VCF sanitization complete. Processed $lineNumber lines, skipped $skippedCount malformed records.")
-    skippedCount
-  }
-
-  /**
-   * Validates a VCF data line for common issues that break HTSJDK parsing.
-   * Checks for:
-   * - Duplicate alleles (REF appearing in ALT, or duplicate ALT alleles)
-   * - Empty required fields
-   */
-  private def isValidVcfDataLine(line: String): Boolean = {
-    val fields = line.split("\t", 6) // Only need first 5 fields: CHROM, POS, ID, REF, ALT
-    if (fields.length < 5) return false
-
-    val ref = fields(3).toUpperCase
-    val altField = fields(4)
-
-    // Handle missing ALT (just ".")
-    if (altField == ".") return true
-
-    val alts = altField.split(",").map(_.toUpperCase)
-
-    // Check for duplicate alleles
-    val allAlleles = ref +: alts
-    val uniqueAlleles = allAlleles.distinct
-
-    // If we have fewer unique alleles than total, there are duplicates
-    uniqueAlleles.length == allAlleles.length
   }
 }

app/config/FeatureFlags.scala

Lines changed: 5 additions & 0 deletions
@@ -17,4 +17,9 @@ class FeatureFlags @Inject()(config: Configuration) {
   * Disabled by default until age data is populated.
   */
  val showBranchAgeEstimates: Boolean = featuresConfig.getOptional[Boolean]("tree.showBranchAgeEstimates").getOrElse(false)
+
+  /**
+   * Show the alternative "Block Layout" (ytree.net style) for the tree.
+   */
+  val showVerticalTree: Boolean = featuresConfig.getOptional[Boolean]("tree.showVerticalTree").getOrElse(false)
 }
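The `showVerticalTree` flag above gates the block layout described in the commit message, where the user's choice is saved in a cookie and respected across sessions. The actual controller and template changes are not part of this excerpt, so what follows is a minimal sketch under stated assumptions: the controller name, the `/tree` route, and the `treeLayout` cookie name are hypothetical; only `config.FeatureFlags.showVerticalTree` comes from the diff.

import javax.inject.Inject
import play.api.mvc.{AbstractController, Cookie, ControllerComponents}
import config.FeatureFlags

// Hypothetical sketch: read the opt-in from a cookie, but only when the feature flag allows it.
class TreeLayoutController @Inject()(cc: ControllerComponents, features: FeatureFlags)
    extends AbstractController(cc) {

  private val LayoutCookie = "treeLayout" // assumed cookie name

  def tree() = Action { implicit request =>
    val useBlockLayout =
      features.showVerticalTree &&
        request.cookies.get(LayoutCookie).exists(_.value == "block")
    Ok(s"rendering the ${if (useBlockLayout) "block" else "standard"} tree layout")
  }

  // Persist the preference for roughly a year so it survives future sessions.
  def setLayout(layout: String) = Action {
    Redirect("/tree").withCookies(Cookie(LayoutCookie, layout, maxAge = Some(365 * 24 * 60 * 60)))
  }
}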

app/config/GenomicsConfig.scala

Lines changed: 2 additions & 2 deletions
@@ -22,8 +22,8 @@ class GenomicsConfig @Inject()(config: Configuration) {
   }
 
   // YBrowse configuration
-  val ybrowseVcfUrl: String = genomicsConfig.get[String]("ybrowse.vcf_url")
-  val ybrowseVcfStoragePath: File = new File(genomicsConfig.get[String]("ybrowse.vcf_storage_path"))
+  val ybrowseGffUrl: String = genomicsConfig.get[String]("ybrowse.gff_url")
+  val ybrowseGffStoragePath: File = new File(genomicsConfig.get[String]("ybrowse.gff_storage_path"))
 
   /**
    * Retrieves the path to a liftover chain file for a given source and target genome.
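For context on the "liftover handling" mentioned in the commit message: the hunk above keeps `GenomicsConfig`'s lookup of liftover chain files, and the project already depends on HTSJDK (the removed VCF sanitizer referenced it). A chain file like the one this config resolves is typically applied with HTSJDK's `LiftOver`; the snippet below shows that API in isolation, with a placeholder chain-file path and example coordinates rather than anything taken from this repository.

import java.io.File

import htsjdk.samtools.liftover.LiftOver
import htsjdk.samtools.util.Interval

object LiftoverSketch {
  def main(args: Array[String]): Unit = {
    // Placeholder path; in the application this would come from GenomicsConfig's chain-file lookup.
    val chainFile = new File("/data/liftover/hg19ToHg38.over.chain.gz")
    val liftOver  = new LiftOver(chainFile)

    // Lift an example Y-chromosome position from the source build to the target build.
    val source = new Interval("chrY", 2781479, 2781479)
    Option(liftOver.liftOver(source)) match { // liftOver returns null when the interval cannot be mapped
      case Some(target) => println(s"${source.getContig}:${source.getStart} -> ${target.getContig}:${target.getStart}")
      case None         => println(s"No mapping for ${source.getContig}:${source.getStart}")
    }
  }
}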

0 commit comments
