Version 1.0.0 by KondratievaOlesya · Pull Request #4 · acg-team/strvcf_annotator

KondratievaOlesya · 2026-02-25T14:49:34Z

Tabix-based STR panel handling
- STR reference is no longer loaded fully into memory.
- The panel is now prepared automatically:
  - sorted by genomic chromosome order,
  - BGZF-compressed,
  - tabix-indexed.
- During annotation, STR regions are queried directly from the tabix index,
  enabling fast genomic lookups and significantly reducing memory usage.
- Improves scalability and allows safe multi-worker execution.
Parallel directory processing (jobs)
- Added new jobs option to control parallel annotation of VCF files.
- Each worker processes one VCF file independently.
- If jobs is not provided, the tool now estimates an optimal number of workers
  based on available CPU cores and system memory.
- This can substantially speed up processing of large VCF directories.

codecov · 2026-02-25T14:58:26Z

Welcome to Codecov 🎉

Once you merge this PR into your default branch, you're all set! Codecov will compare coverage reports and display results in all future pull requests.

Thanks for integrating Codecov - We've got you covered ☂️

Copilot

Pull request overview

This PR releases v1.0.0 by switching STR panel handling to a tabix-backed workflow (prepared as sorted + BGZF + tabix index) and adding parallel directory annotation via a new jobs option, with supporting API/CLI updates and expanded tests.

Changes:

Replace in-memory STR reference loading with tabix-indexed STR panel access during annotation and statistics.
Add multi-process directory processing (jobs, with an auto-estimated default) to speed up batch VCF annotation.
Bump version to 1.0.0 and update dependencies, docs, CI, and tests accordingly.

Reviewed changes

Copilot reviewed 18 out of 21 changed files in this pull request and generated 13 comments.

Show a summary per file

File	Description
src/strvcf_annotator/core/str_reference.py	Introduces tabix validation/sorting/caching utilities and tabix-backed overlap queries.
src/strvcf_annotator/core/vcf_processor.py	Switches annotation to query STRs via tabix and adds process-pool based directory processing with auto job sizing.
src/strvcf_annotator/api.py	Updates public API to use the tabix panel path and computes stats by scanning tabix.
src/strvcf_annotator/cli.py	Adds `--jobs` and passes it through to directory processing.
src/strvcf_annotator/init.py	Version bump to 1.0.0.
src/strvcf_annotator/core/init.py	Reorders exports to include updated processor functions.
tests/test_str_reference.py	Updates STR reference tests to validate gz+tbi caching and tabix-backed lookups.
tests/unit/test_vcf_processor.py	Updates unit tests to use tabix-indexed STR panels and adds RAM estimation tests.
tests/unit/test_vcf_utils.py	Adds new unit tests for VCF utility helpers.
tests/test_performance.py	Updates performance assertions to validate caching/indexing behavior.
tests/integration/test_vcf_pipeline.py	Adds integration test comparing serial vs parallel directory processing outputs.
tests/conftest.py	Adds zip-based VCF directory fixture and changes output directory handling.
tests/cli/test_cli_commands.py	Updates `--version` expectation to 1.0.0.
setup.py	Adds `psutil` dependency and updates project classifiers.
requirements.txt	Adds `psutil` requirement.
pyproject.toml	Adds `psutil` dependency.
docs/conf.py	Updates Sphinx release to 1.0.0.
CHANGELOG.md	Adds 1.0.0 entry describing tabix + jobs features.
.gitignore	Ignores generated sorted STR panel artifacts in tests/data.
.github/workflows/ci.yml	Installs system libs needed for building/using pysam/htslib dependencies.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copilot · 2026-02-25T15:08:01Z

+    return output_path
+
+
 def load_str_reference(str_path: str) -> pd.DataFrame:


load_str_reference now returns a string path to the bgzip+tabix panel, but the function signature still declares -> pd.DataFrame. This breaks type checking and is misleading for API users; update the return annotation (and any downstream typings) to str (or PathLike[str]) to match the new behavior.

Suggested change

def load_str_reference(str_path: str) -> pd.DataFrame:

def load_str_reference(str_path: str) -> str:

Copilot · 2026-02-25T15:08:02Z

+            try:
+                str_chrom = parts[0]
+                str_start = int(parts[1])
+                str_end = int(parts[2])
+                period = int(parts[3])
+                ru = parts[4]
+                count = int(parts[5]) if len(parts) > 5 else None
+            except ValueError:
+                continue
+
+            # Check true overlap in 1-based coordinates
+            if str_start <= end and str_end >= pos:
+                result = {
+                    "CHROM": str_chrom,
+                    "START": str_start,
+                    "END": str_end,


find_overlapping_str is comparing 1-based VCF positions (pos/end) against BED/tabix coordinates read from the file, but str_start is parsed directly from column 2 without converting from 0-based BED to 1-based. This creates an off-by-one error (and affects boundary behavior). Convert str_start to int(parts[1]) + 1 before overlap checks (and ensure the returned START/END are in the same coordinate system expected by callers).

Copilot · 2026-02-25T15:08:02Z

+    try:
+        tbx = pysam.TabixFile(str_panel_gz)
+
+        for row in tbx.fetch(chrom, query_start, query_end):
+            parts = row.rstrip("\n").split("\t")
+            if len(parts) < 5:
+                continue
+
+            try:
+                str_chrom = parts[0]
+                str_start = int(parts[1])
+                str_end = int(parts[2])
+                period = int(parts[3])
+                ru = parts[4]
+                count = int(parts[5]) if len(parts) > 5 else None
+            except ValueError:
+                continue
+
+            # Check true overlap in 1-based coordinates
+            if str_start <= end and str_end >= pos:
+                result = {
+                    "CHROM": str_chrom,
+                    "START": str_start,
+                    "END": str_end,
+                    "PERIOD": period,
+                    "RU": ru,
+                }
+                if count is not None:
+                    result["COUNT"] = count
+                return result


find_overlapping_str opens pysam.TabixFile but does not close it on the success path (it returns from inside the fetch loop). This can leak file descriptors in long-running processes. Use a context manager (with pysam.TabixFile(...) as tbx:) or try/finally to always close before returning.

Copilot · 2026-02-25T15:08:02Z

+        for future in as_completed(future_to_task):
+            vcf_path, out_path = future_to_task[future]
+            try:
+                produced = future.result()
+                logger.info(f"Done: {Path(vcf_path).name} → {produced}")
+            except Exception:
+                logger.exception(f"Failed: {Path(vcf_path).name} → {out_path}")


In process_directory, worker failures are caught and logged but not propagated. This is a behavioral change from the serial implementation (which would raise) and can lead to silent partial outputs with a successful exit code. Consider tracking failed tasks and raising an exception (or returning a failure indicator) after the pool completes.

Copilot · 2026-02-25T15:08:03Z

+        # Parallel should not be slower
+        assert parallel_time <= serial_time, (
+            f"Parallel run slower than expected: serial={serial_time:.3f}s "
+            f"parallel={parallel_time:.3f}s"
+        )


This test asserts parallel_time <= serial_time, which is inherently flaky in CI due to process spawn overhead, CPU contention, and scheduling noise (parallel can be slightly slower even if correct). Consider removing the timing assertion or replacing it with a tolerant check, and keep the test focused on output equivalence.

Suggested change

# Parallel should not be slower

assert parallel_time <= serial_time, (

f"Parallel run slower than expected: serial={serial_time:.3f}s "

f"parallel={parallel_time:.3f}s"

)

# Note: We intentionally do not assert on timing here, since runtime can

# vary in CI due to scheduling and resource contention. Functional

# equivalence of outputs is tested below.

Copilot · 2026-02-25T15:08:04Z

+    Notes
+    -----
+    - Expects STR reference (STR_DF) and worker configuration (WORKER_CONFIG)
+      to be initialized once per worker via `worker_init`.
+    - Instantiates a parser inside the worker to avoid pickling issues.
+    """
+    global WORKER_CONFIG
+
+    vcf_path, output_path = task
+
+    annotate_vcf_to_file(
+        vcf_path=vcf_path,
+        str_panel_gz=WORKER_CONFIG.str_panel_gz,
+        output_path=output_path,
+        parser=WORKER_CONFIG.parser,
+        somatic_mode=WORKER_CONFIG.somatic_mode,


annotate_one_vcf claims it “instantiates a parser inside the worker to avoid pickling issues”, but it actually passes WORKER_CONFIG.parser through the WorkerConfig initializer. Custom parser instances may not be picklable, which can break ProcessPoolExecutor on spawn-based platforms. Consider passing a parser factory/import path (or just always create GenericParser() inside the worker) and keep WorkerConfig limited to picklable primitives.

Copilot · 2026-02-25T15:08:04Z

    Ensures cleanup after the test.
    """
-    outdir = OUTPUT_DIR / f"{request.node.name}_dir"
+    test_name = os.environ.get('PYTEST_CURRENT_TEST').split(':')[-1].split(' ')[0]


output_dir is session-scoped but derives test_name from PYTEST_CURRENT_TEST without a fallback. That env var is not guaranteed to be set during session-scoped fixture setup, so this can raise AttributeError on None.split(...). Add a safe default (e.g., use request.node.name/request.session info or a timestamp/uuid) and/or avoid per-test naming for a session fixture.

Suggested change

test_name = os.environ.get('PYTEST_CURRENT_TEST').split(':')[-1].split(' ')[0]

current_test = os.environ.get("PYTEST_CURRENT_TEST")

if current_test:

test_name = current_test.split(":")[-1].split(" ")[0]

else:

# Fallback for session-scoped fixture when PYTEST_CURRENT_TEST is not set

test_name = getattr(request, "node", None).name if getattr(request, "node", None) else "session"

Copilot · 2026-02-25T15:08:04Z

    Attributes
    ----------
    str_bed_path : str
        Path to STR BED file
    str_df : pd.DataFrame
        Loaded STR reference data
    parser : BaseVCFParser
        Parser for genotype extraction
    somatic_mode : bool
        Whether somatic filtering is enabled

    Examples
    --------
    >>> annotator = STRAnnotator('str_regions.bed')
    >>> annotator.annotate_vcf_file('input.vcf', 'output.vcf')

    >>> # Batch process directory
    >>> annotator.process_directory('input_dir/', 'output_dir/')

    >>> # Stream processing
    >>> vcf_in = pysam.VariantFile('input.vcf')
    >>> for record in annotator.annotate_vcf_stream(vcf_in):
    ...     print(record.info['RU'])
    """

    def __init__(
        self,
        str_bed_path: str,
        parser: Optional[BaseVCFParser] = None,
        somatic_mode: bool = False,
        ignore_mismatch_warnings: bool = False,
        mismatch_truth: str = "panel",  # "panel" | "vcf" | "skip"
    ):
        validate_str_bed_file(str_bed_path)
-        self.str_bed_path = str_bed_path
-        self.str_df = load_str_reference(str_bed_path)
+        self.str_panel_gz = load_str_reference(str_bed_path)

        self.parser = parser if parser is not None else GenericParser()
        self.somatic_mode = somatic_mode

        self.ignore_mismatch_warnings = ignore_mismatch_warnings
        self.mismatch_truth = mismatch_truth



The class docstring/attribute documentation still references str_bed_path and str_df, but the implementation now only keeps str_panel_gz (and no longer stores the BED path or a DataFrame). Please update the docstring to reflect the current public attributes to avoid confusing API consumers.

Copilot · 2026-02-25T15:08:04Z

+    bed_path = Path(str_path)
+    cache_dir_path = bed_path.parent
+
+    # If input is already gz + tbi, trust and return.
+    if bed_path.suffix == ".gz" and bed_path.exists() and is_valid_tabix(str(bed_path)):
+        return str(bed_path)
+
+    # Determine output names
+    # Use the base name without trailing .gz if present.
+    base_name = bed_path.name
+    if base_name.endswith(".gz"):
+        base_name = base_name[:-3]

-    # Convert from 0-based BED to 1-based VCF coordinates
-    df["START"] = df["START"] + 1
+    sorted_bed = cache_dir_path / f"{base_name}.sorted"
+    gz_bed = cache_dir_path / f"{base_name}.sorted.gz"

-    # Calculate number of repeat units
-    df["COUNT"] = (df["END"] - df["START"] + 1) / df["PERIOD"]
+    # If cached gz + tbi exist, reuse.
+    if gz_bed.exists() and is_valid_tabix(str(gz_bed)):
+        return str(gz_bed)

-    # Add chromosome order column for proper sorting
-    df["CHROM_ORDER"] = df["CHROM"].apply(chrom_to_order)
+    # Ensure we have a sorted BED file to compress/index.
+    sort_bed_file(str(bed_path), str(sorted_bed))

-    # Sort by chromosome (natural order) and position for efficient lookups
-    df.sort_values(by=["CHROM_ORDER", "START"], inplace=True)
-    df.drop(columns="CHROM_ORDER", inplace=True)
-    return df
+    # BGZF compress
+    pysam.tabix_compress(str(sorted_bed), str(gz_bed), force=True)

+    # Tabix index (BED preset expects chrom/start/end in the first columns)
+    pysam.tabix_index(str(gz_bed), preset="bed", force=True)

-def find_overlapping_str(str_df: pd.DataFrame, chrom: str, pos: int, end: int) -> Optional[Dict]:
-    """Find STR region overlapping with variant coordinates.
+    return str(gz_bed)


load_str_reference writes the sorted/cached panel (<input>.sorted / <input>.sorted.gz + .tbi) into bed_path.parent. If users pass a BED from a read-only location (e.g., installed package data, shared reference directory), this will fail with a permissions error. Consider adding a configurable cache directory (e.g., under output_dir / XDG cache) or falling back to a temp dir when the input directory isn’t writable.

Copilot · 2026-02-25T15:08:05Z

@@ -18,6 +18,7 @@ jobs:
        python-version: ${{ matrix.python-version }}
    - name: Install dependencies
      run: |


The workflow runs apt-get install without an apt-get update first. On GitHub-hosted runners this can intermittently fail due to stale package indexes. Consider running sudo apt-get update before installing system packages.

Suggested change

run: |

run: |

sudo apt-get update

Copilot

Pull request overview

Copilot reviewed 18 out of 21 changed files in this pull request and generated 7 comments.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copilot · 2026-02-26T10:09:28Z

+            try:
+                str_chrom = parts[0]
+                str_start = int(parts[1])
+                str_end = int(parts[2])
+                period = int(parts[3])
+                ru = parts[4]
+                count = (str_end - str_start + 1) / period
+            except ValueError:
+                continue
+
+            # Check true overlap in 1-based coordinates
+            if str_start < end and str_end >= pos:
+                result = {
+                    "CHROM": str_chrom,
+                    "START": str_start,
+                    "END": str_end,
+                    "PERIOD": period,
+                    "RU": ru,
+                    "COUNT": count,
+                }


find_overlapping_str is mixing BED/tabix coordinates with 1-based VCF coordinates. parts[1] is a BED start (0-based) and should be converted to 1-based (e.g., start_1based = int(parts[1]) + 1) before doing the overlap check and returning START. As written, the overlap test and returned START/COUNT are off-by-one for standard BED inputs, and COUNT is computed from the wrong coordinate system.

Copilot · 2026-02-26T10:09:28Z

+    try:
+        tbx = pysam.TabixFile(str_panel_gz)
+
+        for row in tbx.fetch(chrom, query_start, query_end):
+            parts = row.rstrip("\n").split("\t")
+            if len(parts) < 5:
+                continue
+
+            try:
+                str_chrom = parts[0]
+                str_start = int(parts[1])
+                str_end = int(parts[2])
+                period = int(parts[3])
+                ru = parts[4]
+                count = (str_end - str_start + 1) / period
+            except ValueError:
+                continue
+
+            # Check true overlap in 1-based coordinates
+            if str_start < end and str_end >= pos:
+                result = {
+                    "CHROM": str_chrom,
+                    "START": str_start,
+                    "END": str_end,
+                    "PERIOD": period,
+                    "RU": ru,
+                    "COUNT": count,
+                }
+                return result

-    if chrom_df.empty:
        return None

-    # Find overlapping regions
-    # Overlap occurs when: variant_end >= str_start AND variant_start <= str_end
-    overlapping = chrom_df[(chrom_df["START"] <= end) & (chrom_df["END"] >= pos)]
-
-    if overlapping.empty:
+    except ValueError:
+        # Chromosome not present in index
        return None
-
-    # Return first overlapping region
-    return overlapping.iloc[0].to_dict()
+    finally:
+        tbx.close()


The finally: tbx.close() will raise UnboundLocalError if pysam.TabixFile(str_panel_gz) fails before tbx is assigned (e.g., missing/corrupt BGZF). Initialize tbx = None before the try and guard the close (or use a context manager) so the error path returns cleanly.

Copilot · 2026-02-26T10:09:29Z

+        tbx = pysam.TabixFile(self.str_panel_gz)
+
+        total_regions = 0
+        chromosomes = set()
+        repeat_units = set()
+        period_counter = Counter()
+        counts = []
+
+        # Iterate through all records in the file
+        for line in tbx.fetch():
+            parts = line.rstrip("\n").split("\t")
+            if len(parts) < 5:
+                continue
+
+            try:
+                chrom = parts[0]
+                period = int(parts[3])
+                ru = parts[4]
+                count = int(parts[5]) if len(parts) > 5 else None
+            except ValueError:
+                continue
+
+            total_regions += 1
+            chromosomes.add(chrom)
+            repeat_units.add(ru)
+            period_counter[period] += 1
+            if count is not None:
+                counts.append(count)
+
+        tbx.close()


get_statistics assumes an explicit COUNT column exists in the tabix panel (parts[5]), but load_str_reference currently preserves the original BED columns (often only 5 columns). This makes mean/median counts come back as None for typical panels. Compute COUNT from start/end/period when the column is missing (and parse it as float when present). Also consider wrapping TabixFile usage in try/finally (or a context manager) to ensure it closes on exceptions.

Suggested change

tbx = pysam.TabixFile(self.str_panel_gz)

total_regions = 0

chromosomes = set()

repeat_units = set()

period_counter = Counter()

counts = []

# Iterate through all records in the file

for line in tbx.fetch():

parts = line.rstrip("\n").split("\t")

if len(parts) < 5:

continue

try:

chrom = parts[0]

period = int(parts[3])

ru = parts[4]

count = int(parts[5]) if len(parts) > 5 else None

except ValueError:

continue

total_regions += 1

chromosomes.add(chrom)

repeat_units.add(ru)

period_counter[period] += 1

if count is not None:

counts.append(count)

tbx.close()

total_regions = 0

chromosomes = set()

repeat_units = set()

period_counter = Counter()

counts = []

tbx = pysam.TabixFile(self.str_panel_gz)

try:

# Iterate through all records in the file

for line in tbx.fetch():

parts = line.rstrip("\n").split("\t")

if len(parts) < 5:

continue

try:

chrom = parts[0]

start = int(parts[1])

end = int(parts[2])

period = int(parts[3])

ru = parts[4]

count: Optional[float]

if len(parts) > 5 and parts[5] != "":

# Use explicit COUNT column if present

count = float(parts[5])

else:

# Derive COUNT from start, end, and period when possible

if period > 0 and end > start:

count = (end - start) / period

else:

count = None

except ValueError:

continue

total_regions += 1

chromosomes.add(chrom)

repeat_units.add(ru)

period_counter[period] += 1

if count is not None:

counts.append(count)

finally:

tbx.close()

Copilot · 2026-02-26T10:09:29Z

+@dataclass(frozen=True)
+class WorkerConfig:
+    """Configuration container for worker processes.
+
+    Stores parameters required by each worker to annotate VCF files.
+    The configuration is passed once during worker initialization and
+    reused for all tasks processed by that worker.
+
+    Attributes
+    ----------
+    str_panel_gz : str
+        Path to the BGZF-compressed, tabix-indexed STR reference file.
+    somatic_mode : bool, optional
+        Enable somatic filtering. When True, skips variants where both samples
+        have identical genotypes. Default is False.
+    ignore_mismatch_warnings : bool, optional
+        If True, suppresses warnings about reference mismatches between the
+        STR panel and VCF REF alleles. Default is False.
+    mismatch_truth : str, optional
+        Specifies which source to consider as ground truth for mismatches.
+        Options are "panel", "vcf", or "skip". Default is "panel".
+
+    Notes
+    -----
+    - The dataclass is frozen to ensure the configuration remains
+      immutable once workers are initialized.
+    - Instances of this class are passed to `worker_init`, which loads
+      the STR reference and exposes these settings to worker tasks.
+    """
+
+    str_panel_gz: str
+    somatic_mode: bool
+    ignore_mismatch_warnings: bool
+    mismatch_truth: str
+    parser: BaseVCFParser


WorkerConfig includes a concrete parser instance that must be picklable to be sent to ProcessPoolExecutor workers (especially on platforms using the spawn start method). This can break process_directory for any custom parser implementations. Safer options are to pass a parser type / import path and instantiate inside worker_init, or restrict parser to known built-in parsers and document that requirement.

Copilot · 2026-02-26T10:09:29Z

+@pytest.fixture(scope="session")
 def output_dir(request):
    """
    Provide a unique output directory per test under tests/output/.
    Ensures cleanup after the test.
    """
-    outdir = OUTPUT_DIR / f"{request.node.name}_dir"
+    test_name = os.environ.get('PYTEST_CURRENT_TEST').split(':')[-1].split(' ')[0]
+    outdir = OUTPUT_DIR / f"{test_name}_dir"
    if outdir.exists():
        shutil.rmtree(outdir)
    outdir.mkdir(parents=True, exist_ok=True)
    yield outdir
    if outdir.exists():
        shutil.rmtree(outdir)


output_dir is declared as scope="session", but the implementation tries to derive a per-test directory name from PYTEST_CURRENT_TEST. Session-scoped fixtures run once, so this will (a) not be unique per test and (b) can fail if PYTEST_CURRENT_TEST isn’t set at fixture-creation time. Make this fixture function-scoped and use request.node.name (or request.node.nodeid) directly instead of relying on an env var.

Copilot · 2026-02-26T10:09:30Z

+        content = """chr1\t101\t115\t3\tCAG\t5
+chr1\t201\t212\t4\tATCG\t3
+chr2\t301\t318\t3\tGAT\t6
+"""
+        with tempfile.TemporaryDirectory() as tmp:
+            bed_path = Path(tmp) / "panel.bed"
+            bed_path.write_text(content, encoding="utf-8")
+
+            gz_path = load_str_reference(str(bed_path))
+            yield gz_path
+
+    def test_exact_overlap(self, tabix_panel):
        """Test exact position overlap."""
-        result = find_overlapping_str(str_df, "chr1", 101, 115)
+        result = find_overlapping_str(tabix_panel, "chr1", 101, 115)

        assert result is not None
        assert result["START"] == 101
+        assert result["END"] == 115
        assert result["RU"] == "CAG"



The overlap/position tests appear to mix BED coordinates with 1-based expectations. For example, the tabix panel rows use START=101 but the assertions expect result["START"] == 101 and treat position 101 as outside the STR later. If the panel is BED (0-based start), the fixture should use START=100 to represent a 1-based region starting at 101, and the expected START/“outside” positions should be adjusted accordingly.

Copilot · 2026-02-26T10:09:30Z

        """Test position outside STR."""
-        result = get_str_at_position(str_df, "chr1", 150)
-
+        result = get_str_at_position(tabix_panel, "chr1", 101)


test_position_outside_str currently queries position 101 and expects None, but with the preceding test_position_in_str and fixture content this is inconsistent (101 is at/near the region start depending on whether the panel is interpreted as BED vs 1-based). Once the coordinate convention is clarified, update this test to use an unambiguously outside position (e.g., between loci).

Suggested change

result = get_str_at_position(tabix_panel, "chr1", 101)

result = get_str_at_position(tabix_panel, "chr1", 150)

Copilot

Pull request overview

Copilot reviewed 18 out of 21 changed files in this pull request and generated 13 comments.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copilot · 2026-02-26T10:53:16Z

 The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
 and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).

+## [1.0.0] - 2026-02-25


Future release date in CHANGELOG.

The CHANGELOG lists the release date as "2026-02-25" (line 8), but the PR metadata shows the current date is February 26, 2026. The release date should either be:

Updated to today's date (2026-02-26) if releasing today

Set to a future date if the release is planned for later

Marked as "[Unreleased]" if the version hasn't been released yet

Based on PR title "Version 1.0.0", this appears to be a release PR, so the date should match the actual release date.

Suggested change

## [1.0.0] - 2026-02-25

## [1.0.0] - 2026-02-26

Copilot · 2026-02-26T10:53:17Z

-    def test_exact_overlap(self, str_df):
+    def tabix_panel(self):
+        """Create a small bgzip+tabix STR panel for overlap tests."""
+        content = """chr1\t101\t115\t3\tCAG\t5


Test data uses incorrect 1-based coordinates in BED file, masking a coordinate conversion bug.

BED format uses 0-based half-open coordinates where START is 0-based. However, this test creates BED content with START=101, which is already 1-based. This masks the fact that find_overlapping_str doesn't convert coordinates from 0-based to 1-based (see bug in str_reference.py:197-198).

The test should use "chr1\t100\t115" to properly test BED-to-VCF coordinate conversion. With the current test data, the bug in find_overlapping_str goes undetected because the input is already in 1-based coordinates.

Suggested change

content = """chr1\t101\t115\t3\tCAG\t5

content = """chr1\t100\t115\t3\tCAG\t5

Copilot · 2026-02-26T10:53:17Z

+        content = """chr1\t101\t115\t3\tCAG\t5
+chr1\t201\t212\t4\tATCG\t3
+chr2\t301\t318\t3\tGAT\t6
+"""


Test data uses incorrect 1-based coordinates in BED file, masking the coordinate conversion bug.

Same issue as in the tabix_panel fixture above. This test data uses START=101 which is already 1-based, but BED format requires 0-based START coordinates. The test should use "chr1\t100\t115" instead.

Copilot · 2026-02-26T10:53:17Z

+@pytest.fixture(scope="session")
 def output_dir(request):
    """
    Provide a unique output directory per test under tests/output/.
    Ensures cleanup after the test.
    """
-    outdir = OUTPUT_DIR / f"{request.node.name}_dir"
+    test_name = os.environ.get('PYTEST_CURRENT_TEST').split(':')[-1].split(' ')[0]
+    outdir = OUTPUT_DIR / f"{test_name}_dir"


Potential issue with fixture scope and environment variable access.

The output_dir fixture was changed from function scope to session scope, but line 53 accesses os.environ.get('PYTEST_CURRENT_TEST') which may not work reliably at session scope. The PYTEST_CURRENT_TEST environment variable is set during test execution, and at session scope it might not reflect the correct test name.

This could cause:

Multiple tests to share the same output directory when they shouldn't

Intermittent failures if the environment variable isn't set during fixture initialization

Directory collision if tests run in parallel

Consider either:

Reverting to function scope with request.node.name (as it was before)

If session scope is needed, use a different naming strategy that doesn't depend on PYTEST_CURRENT_TEST

Copilot · 2026-02-26T10:53:17Z

+    def test_position_outside_str(self, tabix_panel):
        """Test position outside STR."""
-        result = get_str_at_position(str_df, "chr1", 150)
-
+        result = get_str_at_position(tabix_panel, "chr1", 101)
        assert result is None


Test expectation appears incorrect for boundary behavior.

This test expects that position 101 is outside the STR region chr1:101-115, but if the region truly spans 101-115 (inclusive), then position 101 should be inside, not outside.

This test may be expecting different semantics (e.g., that START is exclusive), which would be inconsistent with VCF conventions where positions are inclusive. The test name says "position_outside_str" but uses the START coordinate itself, which is contradictory.

Either:

The test expectation is wrong (should return result, not None)

Or the test name is misleading

Or there's an intentional semantic where STR regions exclude their start position

This needs clarification about the intended boundary semantics.

Copilot · 2026-02-26T10:53:18Z

+
+            try:
+                str_chrom = parts[0]
+                str_start = int(parts[1])


Coordinate conversion inconsistency between find_overlapping_str and parse_str_line in vcf_processor.py.

The BED format standard uses 0-based half-open coordinates (start is 0-based, end is 1-based exclusive). However:

In find_overlapping_str (lines 197-198), START is read directly without conversion: str_start = int(parts[1])

In vcf_processor.py:223, START is converted to 1-based: "START": int(parts[1]) + 1

This means if a real BED file has a region "chr1\t100\t115" (meaning VCF positions 101-115):

find_overlapping_str will return START=100 (incorrect, off by 1)

parse_str_line in vcf_processor.py will return START=101 (correct)

The tests don't catch this because they use test data with 1-based coordinates in the BED file (e.g., "chr1\t101\t115"), which violates BED format standards. This will cause incorrect behavior with real-world BED files.

To fix: Add + 1 to line 197 to convert from 0-based BED to 1-based VCF coordinates.

Suggested change

str_start = int(parts[1])

# BED start is 0-based; convert to 1-based for internal use

str_start = int(parts[1]) + 1

Copilot · 2026-02-26T10:53:18Z

+@pytest.fixture(scope="session")
+def vcf_dir(data_dir: str, output_dir: str) -> str:
+    """
+    Unpack tests/data/vcfs/test_input.zip into tests/output/vcfs
+    and return the directory containing VCF files.
+
+    If the directory already exists (from a previous run), it is reused.
+    """
+    data_path = Path(data_dir)
+    zip_path = data_path / "vcfs" / "test_input.zip"
+    assert zip_path.is_file(), f"Missing test input zip: {zip_path}"
+
+    vcf_root = Path(output_dir) / "vcfs"
+    vcf_root.mkdir(parents=True, exist_ok=True)


vcf_dir fixture incorrectly depends on output_dir with session scope.

The vcf_dir fixture has session scope and depends on output_dir, but creates subdirectories within output_dir. This creates a coupling issue where:

A session-scoped fixture depends on test-specific output directories

The directory path will use the first test's name for the entire session

This should likely use a dedicated session-level directory (not tied to individual test names) or the data_dir for extraction. Consider extracting to a fixed session-level path like OUTPUT_DIR / "vcfs" without depending on the test-specific output_dir.

Copilot · 2026-02-26T10:53:19Z

+    config = WorkerConfig(
+        str_panel_gz=str_panel_gz,
+        somatic_mode=somatic_mode,
+        ignore_mismatch_warnings=ignore_mismatch_warnings,
+        mismatch_truth=mismatch_truth,
+        parser=parser,
+    )


Parser object may not be serializable for multiprocessing.

The WorkerConfig dataclass includes a parser field (line 614) which is a BaseVCFParser instance. When using ProcessPoolExecutor, the config is pickled and sent to worker processes. If the parser contains unpicklable objects (file handles, locks, etc.), this will fail at runtime.

The parser should either:

Be ensured to be picklable (no file handles or other non-picklable state)

Be reconstructed in each worker process instead of being passed via config

Documentation should clearly state that custom parsers must be picklable

Consider reconstructing the parser in worker_init instead of passing the instance through the config.

Copilot · 2026-02-26T10:53:19Z

    "pysam>=0.22.0",
    "pandas>=2.0.0",
+    "trtools>=5.0.0",
+    "psutil",


psutil dependency should have a version constraint.

The psutil dependency is added without a version constraint. This can lead to:

Installation failures on some platforms if an incompatible version is selected

Unpredictable behavior across different environments

Potential breaking changes from future psutil versions

Consider adding a minimum version constraint like psutil>=5.9.0 to ensure compatibility and provide a stable baseline for users.

Suggested change

"psutil",

"psutil>=5.9.0",

Copilot · 2026-02-26T10:53:19Z

+    - Expects STR reference (STR_DF) and worker configuration (WORKER_CONFIG)
+      to be initialized once per worker via `worker_init`.


Outdated documentation reference to STR_DF.

The docstring mentions "STR_DF" (line 358) which was the old in-memory DataFrame. With the tabix-based implementation, the STR reference is no longer loaded into a global variable. This comment should be updated to reflect that workers access the STR panel via the str_panel_gz path in WORKER_CONFIG, not a preloaded DataFrame.

Update the note to: "Expects WORKER_CONFIG to be initialized once per worker via worker_init."

Suggested change

- Expects STR reference (STR_DF) and worker configuration (WORKER_CONFIG)

to be initialized once per worker via `worker_init`.

- Expects WORKER_CONFIG to be initialized once per worker via `worker_init`.

KondratievaOlesya added 6 commits February 25, 2026 13:55

New option for jobs to run in parallel and STR panel tabix handle

254cf9e

CI python3.8 cyvcf2 error fix and test update

9adc350

psutil dependecy

6151e12

list typo

e233350

list typo

98fa1a3

List typo

be13f72

Copilot AI review requested due to automatic review settings February 25, 2026 14:49

Copilot started reviewing on behalf of KondratievaOlesya February 25, 2026 14:50 View session

Copilot AI reviewed Feb 25, 2026

View reviewed changes

fix small typos

349bc3d

KondratievaOlesya requested a review from Copilot February 26, 2026 10:00

Copilot started reviewing on behalf of KondratievaOlesya February 26, 2026 10:01 View session

Copilot AI reviewed Feb 26, 2026

View reviewed changes

fix pull request

059a3d1

KondratievaOlesya requested a review from Copilot February 26, 2026 10:45

Copilot started reviewing on behalf of KondratievaOlesya February 26, 2026 10:46 View session

Copilot AI reviewed Feb 26, 2026

View reviewed changes

KondratievaOlesya merged commit f5b70fc into main Feb 26, 2026
12 checks passed

KondratievaOlesya deleted the parallel_run branch February 26, 2026 12:01

		return output_path


		def load_str_reference(str_path: str) -> pd.DataFrame:

-    test_name = os.environ.get('PYTEST_CURRENT_TEST').split(':')[-1].split(' ')[0]
+    current_test = os.environ.get("PYTEST_CURRENT_TEST")
+    if current_test:
+        test_name = current_test.split(":")[-1].split(" ")[0]
+    else:
+        # Fallback for session-scoped fixture when PYTEST_CURRENT_TEST is not set
+        test_name = getattr(request, "node", None).name if getattr(request, "node", None) else "session"

	result = get_str_at_position(tabix_panel, "chr1", 101)
	result = get_str_at_position(tabix_panel, "chr1", 150)

	content = """chr1\t101\t115\t3\tCAG\t5
	content = """chr1\t100\t115\t3\tCAG\t5

	str_start = int(parts[1])
	# BED start is 0-based; convert to 1-based for internal use
	str_start = int(parts[1]) + 1

		- Expects STR reference (STR_DF) and worker configuration (WORKER_CONFIG)
		to be initialized once per worker via `worker_init`.

	- Expects STR reference (STR_DF) and worker configuration (WORKER_CONFIG)
	to be initialized once per worker via `worker_init`.
	- Expects WORKER_CONFIG to be initialized once per worker via `worker_init`.

Conversation

KondratievaOlesya commented Feb 25, 2026

Uh oh!

codecov Bot commented Feb 25, 2026

Welcome to Codecov 🎉

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Copilot AI Feb 25, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Feb 25, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Feb 25, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Feb 25, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Feb 25, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Feb 25, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Feb 25, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Feb 25, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Feb 25, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Feb 25, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Copilot AI Feb 26, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Feb 26, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Feb 26, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Feb 26, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Feb 26, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Feb 26, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Feb 26, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Copilot AI Feb 26, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Feb 26, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Feb 26, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Feb 26, 2026