Version 1.0.0#4
Conversation
Welcome to Codecov 🎉Once you merge this PR into your default branch, you're all set! Codecov will compare coverage reports and display results in all future pull requests. Thanks for integrating Codecov - We've got you covered ☂️ |
There was a problem hiding this comment.
Pull request overview
This PR releases v1.0.0 by switching STR panel handling to a tabix-backed workflow (prepared as sorted + BGZF + tabix index) and adding parallel directory annotation via a new jobs option, with supporting API/CLI updates and expanded tests.
Changes:
- Replace in-memory STR reference loading with tabix-indexed STR panel access during annotation and statistics.
- Add multi-process directory processing (
jobs, with an auto-estimated default) to speed up batch VCF annotation. - Bump version to 1.0.0 and update dependencies, docs, CI, and tests accordingly.
Reviewed changes
Copilot reviewed 18 out of 21 changed files in this pull request and generated 13 comments.
Show a summary per file
| File | Description |
|---|---|
| src/strvcf_annotator/core/str_reference.py | Introduces tabix validation/sorting/caching utilities and tabix-backed overlap queries. |
| src/strvcf_annotator/core/vcf_processor.py | Switches annotation to query STRs via tabix and adds process-pool based directory processing with auto job sizing. |
| src/strvcf_annotator/api.py | Updates public API to use the tabix panel path and computes stats by scanning tabix. |
| src/strvcf_annotator/cli.py | Adds --jobs and passes it through to directory processing. |
| src/strvcf_annotator/init.py | Version bump to 1.0.0. |
| src/strvcf_annotator/core/init.py | Reorders exports to include updated processor functions. |
| tests/test_str_reference.py | Updates STR reference tests to validate gz+tbi caching and tabix-backed lookups. |
| tests/unit/test_vcf_processor.py | Updates unit tests to use tabix-indexed STR panels and adds RAM estimation tests. |
| tests/unit/test_vcf_utils.py | Adds new unit tests for VCF utility helpers. |
| tests/test_performance.py | Updates performance assertions to validate caching/indexing behavior. |
| tests/integration/test_vcf_pipeline.py | Adds integration test comparing serial vs parallel directory processing outputs. |
| tests/conftest.py | Adds zip-based VCF directory fixture and changes output directory handling. |
| tests/cli/test_cli_commands.py | Updates --version expectation to 1.0.0. |
| setup.py | Adds psutil dependency and updates project classifiers. |
| requirements.txt | Adds psutil requirement. |
| pyproject.toml | Adds psutil dependency. |
| docs/conf.py | Updates Sphinx release to 1.0.0. |
| CHANGELOG.md | Adds 1.0.0 entry describing tabix + jobs features. |
| .gitignore | Ignores generated sorted STR panel artifacts in tests/data. |
| .github/workflows/ci.yml | Installs system libs needed for building/using pysam/htslib dependencies. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| return output_path | ||
|
|
||
|
|
||
| def load_str_reference(str_path: str) -> pd.DataFrame: |
There was a problem hiding this comment.
load_str_reference now returns a string path to the bgzip+tabix panel, but the function signature still declares -> pd.DataFrame. This breaks type checking and is misleading for API users; update the return annotation (and any downstream typings) to str (or PathLike[str]) to match the new behavior.
| def load_str_reference(str_path: str) -> pd.DataFrame: | |
| def load_str_reference(str_path: str) -> str: |
| try: | ||
| str_chrom = parts[0] | ||
| str_start = int(parts[1]) | ||
| str_end = int(parts[2]) | ||
| period = int(parts[3]) | ||
| ru = parts[4] | ||
| count = int(parts[5]) if len(parts) > 5 else None | ||
| except ValueError: | ||
| continue | ||
|
|
||
| # Check true overlap in 1-based coordinates | ||
| if str_start <= end and str_end >= pos: | ||
| result = { | ||
| "CHROM": str_chrom, | ||
| "START": str_start, | ||
| "END": str_end, |
There was a problem hiding this comment.
find_overlapping_str is comparing 1-based VCF positions (pos/end) against BED/tabix coordinates read from the file, but str_start is parsed directly from column 2 without converting from 0-based BED to 1-based. This creates an off-by-one error (and affects boundary behavior). Convert str_start to int(parts[1]) + 1 before overlap checks (and ensure the returned START/END are in the same coordinate system expected by callers).
| try: | ||
| tbx = pysam.TabixFile(str_panel_gz) | ||
|
|
||
| for row in tbx.fetch(chrom, query_start, query_end): | ||
| parts = row.rstrip("\n").split("\t") | ||
| if len(parts) < 5: | ||
| continue | ||
|
|
||
| try: | ||
| str_chrom = parts[0] | ||
| str_start = int(parts[1]) | ||
| str_end = int(parts[2]) | ||
| period = int(parts[3]) | ||
| ru = parts[4] | ||
| count = int(parts[5]) if len(parts) > 5 else None | ||
| except ValueError: | ||
| continue | ||
|
|
||
| # Check true overlap in 1-based coordinates | ||
| if str_start <= end and str_end >= pos: | ||
| result = { | ||
| "CHROM": str_chrom, | ||
| "START": str_start, | ||
| "END": str_end, | ||
| "PERIOD": period, | ||
| "RU": ru, | ||
| } | ||
| if count is not None: | ||
| result["COUNT"] = count | ||
| return result |
There was a problem hiding this comment.
find_overlapping_str opens pysam.TabixFile but does not close it on the success path (it returns from inside the fetch loop). This can leak file descriptors in long-running processes. Use a context manager (with pysam.TabixFile(...) as tbx:) or try/finally to always close before returning.
| for future in as_completed(future_to_task): | ||
| vcf_path, out_path = future_to_task[future] | ||
| try: | ||
| produced = future.result() | ||
| logger.info(f"Done: {Path(vcf_path).name} → {produced}") | ||
| except Exception: | ||
| logger.exception(f"Failed: {Path(vcf_path).name} → {out_path}") |
There was a problem hiding this comment.
In process_directory, worker failures are caught and logged but not propagated. This is a behavioral change from the serial implementation (which would raise) and can lead to silent partial outputs with a successful exit code. Consider tracking failed tasks and raising an exception (or returning a failure indicator) after the pool completes.
| # Parallel should not be slower | ||
| assert parallel_time <= serial_time, ( | ||
| f"Parallel run slower than expected: serial={serial_time:.3f}s " | ||
| f"parallel={parallel_time:.3f}s" | ||
| ) |
There was a problem hiding this comment.
This test asserts parallel_time <= serial_time, which is inherently flaky in CI due to process spawn overhead, CPU contention, and scheduling noise (parallel can be slightly slower even if correct). Consider removing the timing assertion or replacing it with a tolerant check, and keep the test focused on output equivalence.
| # Parallel should not be slower | |
| assert parallel_time <= serial_time, ( | |
| f"Parallel run slower than expected: serial={serial_time:.3f}s " | |
| f"parallel={parallel_time:.3f}s" | |
| ) | |
| # Note: We intentionally do not assert on timing here, since runtime can | |
| # vary in CI due to scheduling and resource contention. Functional | |
| # equivalence of outputs is tested below. |
| Notes | ||
| ----- | ||
| - Expects STR reference (STR_DF) and worker configuration (WORKER_CONFIG) | ||
| to be initialized once per worker via `worker_init`. | ||
| - Instantiates a parser inside the worker to avoid pickling issues. | ||
| """ | ||
| global WORKER_CONFIG | ||
|
|
||
| vcf_path, output_path = task | ||
|
|
||
| annotate_vcf_to_file( | ||
| vcf_path=vcf_path, | ||
| str_panel_gz=WORKER_CONFIG.str_panel_gz, | ||
| output_path=output_path, | ||
| parser=WORKER_CONFIG.parser, | ||
| somatic_mode=WORKER_CONFIG.somatic_mode, |
There was a problem hiding this comment.
annotate_one_vcf claims it “instantiates a parser inside the worker to avoid pickling issues”, but it actually passes WORKER_CONFIG.parser through the WorkerConfig initializer. Custom parser instances may not be picklable, which can break ProcessPoolExecutor on spawn-based platforms. Consider passing a parser factory/import path (or just always create GenericParser() inside the worker) and keep WorkerConfig limited to picklable primitives.
| Ensures cleanup after the test. | ||
| """ | ||
| outdir = OUTPUT_DIR / f"{request.node.name}_dir" | ||
| test_name = os.environ.get('PYTEST_CURRENT_TEST').split(':')[-1].split(' ')[0] |
There was a problem hiding this comment.
output_dir is session-scoped but derives test_name from PYTEST_CURRENT_TEST without a fallback. That env var is not guaranteed to be set during session-scoped fixture setup, so this can raise AttributeError on None.split(...). Add a safe default (e.g., use request.node.name/request.session info or a timestamp/uuid) and/or avoid per-test naming for a session fixture.
| test_name = os.environ.get('PYTEST_CURRENT_TEST').split(':')[-1].split(' ')[0] | |
| current_test = os.environ.get("PYTEST_CURRENT_TEST") | |
| if current_test: | |
| test_name = current_test.split(":")[-1].split(" ")[0] | |
| else: | |
| # Fallback for session-scoped fixture when PYTEST_CURRENT_TEST is not set | |
| test_name = getattr(request, "node", None).name if getattr(request, "node", None) else "session" |
| Attributes | ||
| ---------- | ||
| str_bed_path : str | ||
| Path to STR BED file | ||
| str_df : pd.DataFrame | ||
| Loaded STR reference data | ||
| parser : BaseVCFParser | ||
| Parser for genotype extraction | ||
| somatic_mode : bool | ||
| Whether somatic filtering is enabled | ||
|
|
||
| Examples | ||
| -------- | ||
| >>> annotator = STRAnnotator('str_regions.bed') | ||
| >>> annotator.annotate_vcf_file('input.vcf', 'output.vcf') | ||
|
|
||
| >>> # Batch process directory | ||
| >>> annotator.process_directory('input_dir/', 'output_dir/') | ||
|
|
||
| >>> # Stream processing | ||
| >>> vcf_in = pysam.VariantFile('input.vcf') | ||
| >>> for record in annotator.annotate_vcf_stream(vcf_in): | ||
| ... print(record.info['RU']) | ||
| """ | ||
|
|
||
| def __init__( | ||
| self, | ||
| str_bed_path: str, | ||
| parser: Optional[BaseVCFParser] = None, | ||
| somatic_mode: bool = False, | ||
| ignore_mismatch_warnings: bool = False, | ||
| mismatch_truth: str = "panel", # "panel" | "vcf" | "skip" | ||
| ): | ||
| validate_str_bed_file(str_bed_path) | ||
| self.str_bed_path = str_bed_path | ||
| self.str_df = load_str_reference(str_bed_path) | ||
| self.str_panel_gz = load_str_reference(str_bed_path) | ||
|
|
||
| self.parser = parser if parser is not None else GenericParser() | ||
| self.somatic_mode = somatic_mode | ||
|
|
||
| self.ignore_mismatch_warnings = ignore_mismatch_warnings | ||
| self.mismatch_truth = mismatch_truth | ||
|
|
There was a problem hiding this comment.
The class docstring/attribute documentation still references str_bed_path and str_df, but the implementation now only keeps str_panel_gz (and no longer stores the BED path or a DataFrame). Please update the docstring to reflect the current public attributes to avoid confusing API consumers.
| bed_path = Path(str_path) | ||
| cache_dir_path = bed_path.parent | ||
|
|
||
| # If input is already gz + tbi, trust and return. | ||
| if bed_path.suffix == ".gz" and bed_path.exists() and is_valid_tabix(str(bed_path)): | ||
| return str(bed_path) | ||
|
|
||
| # Determine output names | ||
| # Use the base name without trailing .gz if present. | ||
| base_name = bed_path.name | ||
| if base_name.endswith(".gz"): | ||
| base_name = base_name[:-3] | ||
|
|
||
| # Convert from 0-based BED to 1-based VCF coordinates | ||
| df["START"] = df["START"] + 1 | ||
| sorted_bed = cache_dir_path / f"{base_name}.sorted" | ||
| gz_bed = cache_dir_path / f"{base_name}.sorted.gz" | ||
|
|
||
| # Calculate number of repeat units | ||
| df["COUNT"] = (df["END"] - df["START"] + 1) / df["PERIOD"] | ||
| # If cached gz + tbi exist, reuse. | ||
| if gz_bed.exists() and is_valid_tabix(str(gz_bed)): | ||
| return str(gz_bed) | ||
|
|
||
| # Add chromosome order column for proper sorting | ||
| df["CHROM_ORDER"] = df["CHROM"].apply(chrom_to_order) | ||
| # Ensure we have a sorted BED file to compress/index. | ||
| sort_bed_file(str(bed_path), str(sorted_bed)) | ||
|
|
||
| # Sort by chromosome (natural order) and position for efficient lookups | ||
| df.sort_values(by=["CHROM_ORDER", "START"], inplace=True) | ||
| df.drop(columns="CHROM_ORDER", inplace=True) | ||
| return df | ||
| # BGZF compress | ||
| pysam.tabix_compress(str(sorted_bed), str(gz_bed), force=True) | ||
|
|
||
| # Tabix index (BED preset expects chrom/start/end in the first columns) | ||
| pysam.tabix_index(str(gz_bed), preset="bed", force=True) | ||
|
|
||
| def find_overlapping_str(str_df: pd.DataFrame, chrom: str, pos: int, end: int) -> Optional[Dict]: | ||
| """Find STR region overlapping with variant coordinates. | ||
| return str(gz_bed) |
There was a problem hiding this comment.
load_str_reference writes the sorted/cached panel (<input>.sorted / <input>.sorted.gz + .tbi) into bed_path.parent. If users pass a BED from a read-only location (e.g., installed package data, shared reference directory), this will fail with a permissions error. Consider adding a configurable cache directory (e.g., under output_dir / XDG cache) or falling back to a temp dir when the input directory isn’t writable.
| @@ -18,6 +18,7 @@ jobs: | |||
| python-version: ${{ matrix.python-version }} | |||
| - name: Install dependencies | |||
| run: | | |||
There was a problem hiding this comment.
The workflow runs apt-get install without an apt-get update first. On GitHub-hosted runners this can intermittently fail due to stale package indexes. Consider running sudo apt-get update before installing system packages.
| run: | | |
| run: | | |
| sudo apt-get update |
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 18 out of 21 changed files in this pull request and generated 7 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| try: | ||
| str_chrom = parts[0] | ||
| str_start = int(parts[1]) | ||
| str_end = int(parts[2]) | ||
| period = int(parts[3]) | ||
| ru = parts[4] | ||
| count = (str_end - str_start + 1) / period | ||
| except ValueError: | ||
| continue | ||
|
|
||
| # Check true overlap in 1-based coordinates | ||
| if str_start < end and str_end >= pos: | ||
| result = { | ||
| "CHROM": str_chrom, | ||
| "START": str_start, | ||
| "END": str_end, | ||
| "PERIOD": period, | ||
| "RU": ru, | ||
| "COUNT": count, | ||
| } |
There was a problem hiding this comment.
find_overlapping_str is mixing BED/tabix coordinates with 1-based VCF coordinates. parts[1] is a BED start (0-based) and should be converted to 1-based (e.g., start_1based = int(parts[1]) + 1) before doing the overlap check and returning START. As written, the overlap test and returned START/COUNT are off-by-one for standard BED inputs, and COUNT is computed from the wrong coordinate system.
| try: | ||
| tbx = pysam.TabixFile(str_panel_gz) | ||
|
|
||
| for row in tbx.fetch(chrom, query_start, query_end): | ||
| parts = row.rstrip("\n").split("\t") | ||
| if len(parts) < 5: | ||
| continue | ||
|
|
||
| try: | ||
| str_chrom = parts[0] | ||
| str_start = int(parts[1]) | ||
| str_end = int(parts[2]) | ||
| period = int(parts[3]) | ||
| ru = parts[4] | ||
| count = (str_end - str_start + 1) / period | ||
| except ValueError: | ||
| continue | ||
|
|
||
| # Check true overlap in 1-based coordinates | ||
| if str_start < end and str_end >= pos: | ||
| result = { | ||
| "CHROM": str_chrom, | ||
| "START": str_start, | ||
| "END": str_end, | ||
| "PERIOD": period, | ||
| "RU": ru, | ||
| "COUNT": count, | ||
| } | ||
| return result | ||
|
|
||
| if chrom_df.empty: | ||
| return None | ||
|
|
||
| # Find overlapping regions | ||
| # Overlap occurs when: variant_end >= str_start AND variant_start <= str_end | ||
| overlapping = chrom_df[(chrom_df["START"] <= end) & (chrom_df["END"] >= pos)] | ||
|
|
||
| if overlapping.empty: | ||
| except ValueError: | ||
| # Chromosome not present in index | ||
| return None | ||
|
|
||
| # Return first overlapping region | ||
| return overlapping.iloc[0].to_dict() | ||
| finally: | ||
| tbx.close() |
There was a problem hiding this comment.
The finally: tbx.close() will raise UnboundLocalError if pysam.TabixFile(str_panel_gz) fails before tbx is assigned (e.g., missing/corrupt BGZF). Initialize tbx = None before the try and guard the close (or use a context manager) so the error path returns cleanly.
| tbx = pysam.TabixFile(self.str_panel_gz) | ||
|
|
||
| total_regions = 0 | ||
| chromosomes = set() | ||
| repeat_units = set() | ||
| period_counter = Counter() | ||
| counts = [] | ||
|
|
||
| # Iterate through all records in the file | ||
| for line in tbx.fetch(): | ||
| parts = line.rstrip("\n").split("\t") | ||
| if len(parts) < 5: | ||
| continue | ||
|
|
||
| try: | ||
| chrom = parts[0] | ||
| period = int(parts[3]) | ||
| ru = parts[4] | ||
| count = int(parts[5]) if len(parts) > 5 else None | ||
| except ValueError: | ||
| continue | ||
|
|
||
| total_regions += 1 | ||
| chromosomes.add(chrom) | ||
| repeat_units.add(ru) | ||
| period_counter[period] += 1 | ||
| if count is not None: | ||
| counts.append(count) | ||
|
|
||
| tbx.close() |
There was a problem hiding this comment.
get_statistics assumes an explicit COUNT column exists in the tabix panel (parts[5]), but load_str_reference currently preserves the original BED columns (often only 5 columns). This makes mean/median counts come back as None for typical panels. Compute COUNT from start/end/period when the column is missing (and parse it as float when present). Also consider wrapping TabixFile usage in try/finally (or a context manager) to ensure it closes on exceptions.
| tbx = pysam.TabixFile(self.str_panel_gz) | |
| total_regions = 0 | |
| chromosomes = set() | |
| repeat_units = set() | |
| period_counter = Counter() | |
| counts = [] | |
| # Iterate through all records in the file | |
| for line in tbx.fetch(): | |
| parts = line.rstrip("\n").split("\t") | |
| if len(parts) < 5: | |
| continue | |
| try: | |
| chrom = parts[0] | |
| period = int(parts[3]) | |
| ru = parts[4] | |
| count = int(parts[5]) if len(parts) > 5 else None | |
| except ValueError: | |
| continue | |
| total_regions += 1 | |
| chromosomes.add(chrom) | |
| repeat_units.add(ru) | |
| period_counter[period] += 1 | |
| if count is not None: | |
| counts.append(count) | |
| tbx.close() | |
| total_regions = 0 | |
| chromosomes = set() | |
| repeat_units = set() | |
| period_counter = Counter() | |
| counts = [] | |
| tbx = pysam.TabixFile(self.str_panel_gz) | |
| try: | |
| # Iterate through all records in the file | |
| for line in tbx.fetch(): | |
| parts = line.rstrip("\n").split("\t") | |
| if len(parts) < 5: | |
| continue | |
| try: | |
| chrom = parts[0] | |
| start = int(parts[1]) | |
| end = int(parts[2]) | |
| period = int(parts[3]) | |
| ru = parts[4] | |
| count: Optional[float] | |
| if len(parts) > 5 and parts[5] != "": | |
| # Use explicit COUNT column if present | |
| count = float(parts[5]) | |
| else: | |
| # Derive COUNT from start, end, and period when possible | |
| if period > 0 and end > start: | |
| count = (end - start) / period | |
| else: | |
| count = None | |
| except ValueError: | |
| continue | |
| total_regions += 1 | |
| chromosomes.add(chrom) | |
| repeat_units.add(ru) | |
| period_counter[period] += 1 | |
| if count is not None: | |
| counts.append(count) | |
| finally: | |
| tbx.close() |
| @dataclass(frozen=True) | ||
| class WorkerConfig: | ||
| """Configuration container for worker processes. | ||
|
|
||
| Stores parameters required by each worker to annotate VCF files. | ||
| The configuration is passed once during worker initialization and | ||
| reused for all tasks processed by that worker. | ||
|
|
||
| Attributes | ||
| ---------- | ||
| str_panel_gz : str | ||
| Path to the BGZF-compressed, tabix-indexed STR reference file. | ||
| somatic_mode : bool, optional | ||
| Enable somatic filtering. When True, skips variants where both samples | ||
| have identical genotypes. Default is False. | ||
| ignore_mismatch_warnings : bool, optional | ||
| If True, suppresses warnings about reference mismatches between the | ||
| STR panel and VCF REF alleles. Default is False. | ||
| mismatch_truth : str, optional | ||
| Specifies which source to consider as ground truth for mismatches. | ||
| Options are "panel", "vcf", or "skip". Default is "panel". | ||
|
|
||
| Notes | ||
| ----- | ||
| - The dataclass is frozen to ensure the configuration remains | ||
| immutable once workers are initialized. | ||
| - Instances of this class are passed to `worker_init`, which loads | ||
| the STR reference and exposes these settings to worker tasks. | ||
| """ | ||
|
|
||
| str_panel_gz: str | ||
| somatic_mode: bool | ||
| ignore_mismatch_warnings: bool | ||
| mismatch_truth: str | ||
| parser: BaseVCFParser |
There was a problem hiding this comment.
WorkerConfig includes a concrete parser instance that must be picklable to be sent to ProcessPoolExecutor workers (especially on platforms using the spawn start method). This can break process_directory for any custom parser implementations. Safer options are to pass a parser type / import path and instantiate inside worker_init, or restrict parser to known built-in parsers and document that requirement.
| @pytest.fixture(scope="session") | ||
| def output_dir(request): | ||
| """ | ||
| Provide a unique output directory per test under tests/output/. | ||
| Ensures cleanup after the test. | ||
| """ | ||
| outdir = OUTPUT_DIR / f"{request.node.name}_dir" | ||
| test_name = os.environ.get('PYTEST_CURRENT_TEST').split(':')[-1].split(' ')[0] | ||
| outdir = OUTPUT_DIR / f"{test_name}_dir" | ||
| if outdir.exists(): | ||
| shutil.rmtree(outdir) | ||
| outdir.mkdir(parents=True, exist_ok=True) | ||
| yield outdir | ||
| if outdir.exists(): | ||
| shutil.rmtree(outdir) |
There was a problem hiding this comment.
output_dir is declared as scope="session", but the implementation tries to derive a per-test directory name from PYTEST_CURRENT_TEST. Session-scoped fixtures run once, so this will (a) not be unique per test and (b) can fail if PYTEST_CURRENT_TEST isn’t set at fixture-creation time. Make this fixture function-scoped and use request.node.name (or request.node.nodeid) directly instead of relying on an env var.
| content = """chr1\t101\t115\t3\tCAG\t5 | ||
| chr1\t201\t212\t4\tATCG\t3 | ||
| chr2\t301\t318\t3\tGAT\t6 | ||
| """ | ||
| with tempfile.TemporaryDirectory() as tmp: | ||
| bed_path = Path(tmp) / "panel.bed" | ||
| bed_path.write_text(content, encoding="utf-8") | ||
|
|
||
| gz_path = load_str_reference(str(bed_path)) | ||
| yield gz_path | ||
|
|
||
| def test_exact_overlap(self, tabix_panel): | ||
| """Test exact position overlap.""" | ||
| result = find_overlapping_str(str_df, "chr1", 101, 115) | ||
| result = find_overlapping_str(tabix_panel, "chr1", 101, 115) | ||
|
|
||
| assert result is not None | ||
| assert result["START"] == 101 | ||
| assert result["END"] == 115 | ||
| assert result["RU"] == "CAG" | ||
|
|
There was a problem hiding this comment.
The overlap/position tests appear to mix BED coordinates with 1-based expectations. For example, the tabix panel rows use START=101 but the assertions expect result["START"] == 101 and treat position 101 as outside the STR later. If the panel is BED (0-based start), the fixture should use START=100 to represent a 1-based region starting at 101, and the expected START/“outside” positions should be adjusted accordingly.
| """Test position outside STR.""" | ||
| result = get_str_at_position(str_df, "chr1", 150) | ||
|
|
||
| result = get_str_at_position(tabix_panel, "chr1", 101) |
There was a problem hiding this comment.
test_position_outside_str currently queries position 101 and expects None, but with the preceding test_position_in_str and fixture content this is inconsistent (101 is at/near the region start depending on whether the panel is interpreted as BED vs 1-based). Once the coordinate convention is clarified, update this test to use an unambiguously outside position (e.g., between loci).
| result = get_str_at_position(tabix_panel, "chr1", 101) | |
| result = get_str_at_position(tabix_panel, "chr1", 150) |
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 18 out of 21 changed files in this pull request and generated 13 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/), | ||
| and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html). | ||
|
|
||
| ## [1.0.0] - 2026-02-25 |
There was a problem hiding this comment.
Future release date in CHANGELOG.
The CHANGELOG lists the release date as "2026-02-25" (line 8), but the PR metadata shows the current date is February 26, 2026. The release date should either be:
- Updated to today's date (2026-02-26) if releasing today
- Set to a future date if the release is planned for later
- Marked as "[Unreleased]" if the version hasn't been released yet
Based on PR title "Version 1.0.0", this appears to be a release PR, so the date should match the actual release date.
| ## [1.0.0] - 2026-02-25 | |
| ## [1.0.0] - 2026-02-26 |
| def test_exact_overlap(self, str_df): | ||
| def tabix_panel(self): | ||
| """Create a small bgzip+tabix STR panel for overlap tests.""" | ||
| content = """chr1\t101\t115\t3\tCAG\t5 |
There was a problem hiding this comment.
Test data uses incorrect 1-based coordinates in BED file, masking a coordinate conversion bug.
BED format uses 0-based half-open coordinates where START is 0-based. However, this test creates BED content with START=101, which is already 1-based. This masks the fact that find_overlapping_str doesn't convert coordinates from 0-based to 1-based (see bug in str_reference.py:197-198).
The test should use "chr1\t100\t115" to properly test BED-to-VCF coordinate conversion. With the current test data, the bug in find_overlapping_str goes undetected because the input is already in 1-based coordinates.
| content = """chr1\t101\t115\t3\tCAG\t5 | |
| content = """chr1\t100\t115\t3\tCAG\t5 |
| content = """chr1\t101\t115\t3\tCAG\t5 | ||
| chr1\t201\t212\t4\tATCG\t3 | ||
| chr2\t301\t318\t3\tGAT\t6 | ||
| """ |
There was a problem hiding this comment.
Test data uses incorrect 1-based coordinates in BED file, masking the coordinate conversion bug.
Same issue as in the tabix_panel fixture above. This test data uses START=101 which is already 1-based, but BED format requires 0-based START coordinates. The test should use "chr1\t100\t115" instead.
| @pytest.fixture(scope="session") | ||
| def output_dir(request): | ||
| """ | ||
| Provide a unique output directory per test under tests/output/. | ||
| Ensures cleanup after the test. | ||
| """ | ||
| outdir = OUTPUT_DIR / f"{request.node.name}_dir" | ||
| test_name = os.environ.get('PYTEST_CURRENT_TEST').split(':')[-1].split(' ')[0] | ||
| outdir = OUTPUT_DIR / f"{test_name}_dir" |
There was a problem hiding this comment.
Potential issue with fixture scope and environment variable access.
The output_dir fixture was changed from function scope to session scope, but line 53 accesses os.environ.get('PYTEST_CURRENT_TEST') which may not work reliably at session scope. The PYTEST_CURRENT_TEST environment variable is set during test execution, and at session scope it might not reflect the correct test name.
This could cause:
- Multiple tests to share the same output directory when they shouldn't
- Intermittent failures if the environment variable isn't set during fixture initialization
- Directory collision if tests run in parallel
Consider either:
- Reverting to function scope with
request.node.name(as it was before) - If session scope is needed, use a different naming strategy that doesn't depend on
PYTEST_CURRENT_TEST
| def test_position_outside_str(self, tabix_panel): | ||
| """Test position outside STR.""" | ||
| result = get_str_at_position(str_df, "chr1", 150) | ||
|
|
||
| result = get_str_at_position(tabix_panel, "chr1", 101) | ||
| assert result is None |
There was a problem hiding this comment.
Test expectation appears incorrect for boundary behavior.
This test expects that position 101 is outside the STR region chr1:101-115, but if the region truly spans 101-115 (inclusive), then position 101 should be inside, not outside.
This test may be expecting different semantics (e.g., that START is exclusive), which would be inconsistent with VCF conventions where positions are inclusive. The test name says "position_outside_str" but uses the START coordinate itself, which is contradictory.
Either:
- The test expectation is wrong (should return result, not None)
- Or the test name is misleading
- Or there's an intentional semantic where STR regions exclude their start position
This needs clarification about the intended boundary semantics.
|
|
||
| try: | ||
| str_chrom = parts[0] | ||
| str_start = int(parts[1]) |
There was a problem hiding this comment.
Coordinate conversion inconsistency between find_overlapping_str and parse_str_line in vcf_processor.py.
The BED format standard uses 0-based half-open coordinates (start is 0-based, end is 1-based exclusive). However:
-
In
find_overlapping_str(lines 197-198), START is read directly without conversion:str_start = int(parts[1]) -
In
vcf_processor.py:223, START is converted to 1-based:"START": int(parts[1]) + 1
This means if a real BED file has a region "chr1\t100\t115" (meaning VCF positions 101-115):
find_overlapping_strwill return START=100 (incorrect, off by 1)parse_str_lineinvcf_processor.pywill return START=101 (correct)
The tests don't catch this because they use test data with 1-based coordinates in the BED file (e.g., "chr1\t101\t115"), which violates BED format standards. This will cause incorrect behavior with real-world BED files.
To fix: Add + 1 to line 197 to convert from 0-based BED to 1-based VCF coordinates.
| str_start = int(parts[1]) | |
| # BED start is 0-based; convert to 1-based for internal use | |
| str_start = int(parts[1]) + 1 |
| @pytest.fixture(scope="session") | ||
| def vcf_dir(data_dir: str, output_dir: str) -> str: | ||
| """ | ||
| Unpack tests/data/vcfs/test_input.zip into tests/output/vcfs | ||
| and return the directory containing VCF files. | ||
|
|
||
| If the directory already exists (from a previous run), it is reused. | ||
| """ | ||
| data_path = Path(data_dir) | ||
| zip_path = data_path / "vcfs" / "test_input.zip" | ||
| assert zip_path.is_file(), f"Missing test input zip: {zip_path}" | ||
|
|
||
| vcf_root = Path(output_dir) / "vcfs" | ||
| vcf_root.mkdir(parents=True, exist_ok=True) |
There was a problem hiding this comment.
vcf_dir fixture incorrectly depends on output_dir with session scope.
The vcf_dir fixture has session scope and depends on output_dir, but creates subdirectories within output_dir. This creates a coupling issue where:
- A session-scoped fixture depends on test-specific output directories
- The directory path will use the first test's name for the entire session
This should likely use a dedicated session-level directory (not tied to individual test names) or the data_dir for extraction. Consider extracting to a fixed session-level path like OUTPUT_DIR / "vcfs" without depending on the test-specific output_dir.
| config = WorkerConfig( | ||
| str_panel_gz=str_panel_gz, | ||
| somatic_mode=somatic_mode, | ||
| ignore_mismatch_warnings=ignore_mismatch_warnings, | ||
| mismatch_truth=mismatch_truth, | ||
| parser=parser, | ||
| ) |
There was a problem hiding this comment.
Parser object may not be serializable for multiprocessing.
The WorkerConfig dataclass includes a parser field (line 614) which is a BaseVCFParser instance. When using ProcessPoolExecutor, the config is pickled and sent to worker processes. If the parser contains unpicklable objects (file handles, locks, etc.), this will fail at runtime.
The parser should either:
- Be ensured to be picklable (no file handles or other non-picklable state)
- Be reconstructed in each worker process instead of being passed via config
- Documentation should clearly state that custom parsers must be picklable
Consider reconstructing the parser in worker_init instead of passing the instance through the config.
| "pysam>=0.22.0", | ||
| "pandas>=2.0.0", | ||
| "trtools>=5.0.0", | ||
| "psutil", |
There was a problem hiding this comment.
psutil dependency should have a version constraint.
The psutil dependency is added without a version constraint. This can lead to:
- Installation failures on some platforms if an incompatible version is selected
- Unpredictable behavior across different environments
- Potential breaking changes from future psutil versions
Consider adding a minimum version constraint like psutil>=5.9.0 to ensure compatibility and provide a stable baseline for users.
| "psutil", | |
| "psutil>=5.9.0", |
| - Expects STR reference (STR_DF) and worker configuration (WORKER_CONFIG) | ||
| to be initialized once per worker via `worker_init`. |
There was a problem hiding this comment.
Outdated documentation reference to STR_DF.
The docstring mentions "STR_DF" (line 358) which was the old in-memory DataFrame. With the tabix-based implementation, the STR reference is no longer loaded into a global variable. This comment should be updated to reflect that workers access the STR panel via the str_panel_gz path in WORKER_CONFIG, not a preloaded DataFrame.
Update the note to: "Expects WORKER_CONFIG to be initialized once per worker via worker_init."
| - Expects STR reference (STR_DF) and worker configuration (WORKER_CONFIG) | |
| to be initialized once per worker via `worker_init`. | |
| - Expects WORKER_CONFIG to be initialized once per worker via `worker_init`. |
Tabix-based STR panel handling
STR reference is no longer loaded fully into memory.
The panel is now prepared automatically:
During annotation, STR regions are queried directly from the tabix index,
enabling fast genomic lookups and significantly reducing memory usage.
Improves scalability and allows safe multi-worker execution.
Parallel directory processing (
jobs)jobsoption to control parallel annotation of VCF files.jobsis not provided, the tool now estimates an optimal number of workersbased on available CPU cores and system memory.