Added submodule to read vcf files. by pfarndt · Pull Request #78 · ArndtLab/IBSpector.jl

pfarndt · 2026-04-10T17:00:59Z

Added test vcf files and corresponding tests.

Adds a few test vcf files and additional dependencies from DataFrames, CSV ...

Added test vcf files and corresponding tests.

Copilot

Pull request overview

Adds a new VCF submodule to IBSpector for reading VCFs into a DataFrame and computing IBS tract lengths, along with new VCF-focused tests and updated dependencies.

Changes:

Introduces src/VCF.jl (masking utilities, VCF reader, and IBS tract-length computation).
Wires the new submodule into the main package module/export surface (src/IBSpector.jl) and adds new dependencies (CSV, DataFrames, DataFramesMeta).
Adds a dedicated VCF test file and test fixtures, and makes the existing “fitting procedure” tests skippable via OMIT_LONG_TESTS.

Reviewed changes

Copilot reviewed 6 out of 13 changed files in this pull request and generated 8 comments.

Show a summary per file

File	Description
`src/VCF.jl`	New VCF submodule implementation (read/masking/IBS tract lengths).
`src/IBSpector.jl`	Includes and exports the new `VCF` submodule.
`Project.toml`	Adds VCF-related dependencies and compat bounds.
`test/runtest-VCF.jl`	New VCF test suite using bundled `.gz` fixtures.
`test/runtests.jl`	Adds env-gated skipping for long “fitting procedure” tests.
`test/Project.toml`	Adds `DataFrames` as a test dependency.
`test/Manifest.toml`	New pinned test environment manifest.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copilot · 2026-04-10T17:12:48Z

+# Test the fitting procedure on multiple simulated datasets. 
+# This is a long test, so we allow it to be skipped by setting 
+# the environment variable OMIT_LONG_TESTS.
+
+if haskey(ENV, "OMIT_LONG_TESTS")
+    @info "Omitting long tests"
+else
+
+    @testset verbose = true "fitting procedure" begin


The new VCF test suite in test/runtest-VCF.jl is not included anywhere, so it will not run under the standard Pkg.test entrypoint (test/runtests.jl). Add an include("runtest-VCF.jl") (likely near the other includes at the top) so these tests execute in CI.

Copilot · 2026-04-10T17:12:48Z

+        segments = dictify(df, segments)
+        check_keyypes(df, segments)
+        segments = map(keys(segments), values(segments)) do k, v
+            if v == :refnotN
+                v = generate_mask(map(i -> uppercase(i) == 'N', collect(refs[k])))
+            end


segments = :refnotN dereferences refs[k] but refs can be nothing (e.g., if the caller requests segments=:refnotN without providing refs). This currently throws a low-level error; it should explicitly validate refs is provided and raise a clear ArgumentError when it is required for :refnotN.

Copilot · 2026-04-10T17:12:48Z

+
+function isphased(vcf::VCFdata, indv::AbstractString)
+    c = colofindividual(vcf, indv)
+    phased = ["1|0", "0|1", "1|1", "0|0", "0/0", "1/1"]


isphased currently returns true for unphased genotypes like 0/0 and 1/1 because they are included in the allowed set. This makes the function name/behavior inconsistent; if the intent is to check phasing, it should reject / genotypes (or be renamed to reflect what it actually validates).

Suggested change

phased = ["1|0", "0|1", "1|1", "0|0", "0/0", "1/1"]

phased = ["1|0", "0|1", "1|1", "0|0"]

Copilot · 2026-04-10T17:12:49Z

+function check_keyypes(df, dict)
+    if isnothing(dict)
+        return
+    end
+    if !(
+        (keytype(dict) <: AbstractString) && (eltype(df[!, 1]) <: AbstractString) ||
+        (keytype(dict) <: Integer) && (eltype(df[!, 1]) <: Integer)
+        )
+        throw(ArgumentError("dict has not compatible key type with chromosome names in VCF."))
+    end


Function name check_keyypes looks like a typo (missing 't'), and the thrown message is grammatically incorrect ("has not compatible"). Consider renaming to check_keytypes and adjusting the error text for clarity.

Copilot · 2026-04-10T17:12:49Z

+    # generate and read small file
+    file2 = joinpath(vcfexamplesdir, "1kGP_high_coverage_Illumina.chr1.filtered.SNV_INDEL_SV_phased_panel.first100000.small.gz")
+    if !isfile(file2)
+        run(pipeline(file1, Cmd(`gzip -dc`, ignorestatus = true), ` head -2000`, `gzip -c`, file2))
+    end


These tests assume external CLI tools (gzip, head) are available on PATH. That can make the test suite fail on platforms/environments without those utilities (e.g., Windows CI). Prefer Julia-native decompression/IO (e.g., CodecZlib/GzipDecompressorStream) or gate these tests behind an explicit opt-in env var.

Copilot · 2026-04-10T17:12:49Z

+    vcfexamplesdir = "vcf-examples"
+
+    file1 = joinpath(vcfexamplesdir, "1kGP_high_coverage_Illumina.chr1.filtered.SNV_INDEL_SV_phased_panel.first100000.vcf.gz")
+    @test isfile(file1)
+


The vcf-examples directory is referenced via a relative path string. Test working directory is not guaranteed, so this can be brittle. Prefer anchoring to the test file location (e.g., joinpath(@__DIR__, "vcf-examples")) so tests can run from any CWD.

Copilot · 2026-04-10T17:12:49Z

+@testset "Double check with data we know already" begin
+    using DataFrames
+
+    vcfexamplesdir = "vcf-examples"
+
+    file1 = joinpath(vcfexamplesdir, "1kGP_high_coverage_Illumina.chr1.filtered.SNV_INDEL_SV_phased_panel.first100000.vcf.gz")
+    @test isfile(file1)
+
+    seq = read(`gzip -dc $(joinpath(vcfexamplesdir, "seq.chr1-40M.txt.gz"))`, String)
+    @test length(seq) == 40_000_000
+
+    mask = read(`gzip -dc $(joinpath(vcfexamplesdir, "mask.chr1-40M.txt.gz"))`, String)
+    @test length(mask) == 40_000_000
+
+    ibs = read(`gzip -dc $(joinpath(vcfexamplesdir, "ilsHG00096.txt.gz"))`, String) |> split |> x -> parse.(Int, x)
+    @test ibs[1:15] == [84, 57, 76, 1829, 1487, 279, 146, 3113, 542, 1895, 860, 193, 162, 1177, 4050]
+
+    #generate mask from string
+    gmask = VCF.generate_mask(map(==('P'), collect(mask)))
+
+    vcf = VCF.read(file1, 
+        refs = seq, 
+        masks = gmask,
+        segments = :refnotN,
+        select = 1:10
+        )
+    @test nrow(vcf.df) > 1000
+    @test "HG00096" in VCF.individuals(vcf)
+
+
+    cibs = VCF.IBStractlength(vcf, "HG00096")
+    ncibs = length(cibs)
+    @test cibs == ibs[1:ncibs]
+
+    indvs = VCF.individuals(vcf)
+    @test length(indvs) == 1
+
+    cibs = VCF.IBStractlength(vcf, [indvs[1], indvs[1]])
+    @test cibs == vcat(ibs[1:ncibs], ibs[1:ncibs])
+
+    cibs = VCF.IBStractlength(vcf)
+    @test cibs == ibs[1:ncibs]


This block decompresses and loads very large fixtures (40M-character reference/mask strings) and reads a large VCF, which can significantly slow CI and increase memory usage. Consider marking this as a long test (similar to OMIT_LONG_TESTS handling in runtests.jl) or reducing fixture size while still exercising the logic.

Suggested change

@testset "Double check with data we know already" begin

using DataFrames

vcfexamplesdir = "vcf-examples"

file1 = joinpath(vcfexamplesdir, "1kGP_high_coverage_Illumina.chr1.filtered.SNV_INDEL_SV_phased_panel.first100000.vcf.gz")

@test isfile(file1)

seq = read(`gzip -dc $(joinpath(vcfexamplesdir, "seq.chr1-40M.txt.gz"))`, String)

@test length(seq) == 40_000_000

mask = read(`gzip -dc $(joinpath(vcfexamplesdir, "mask.chr1-40M.txt.gz"))`, String)

@test length(mask) == 40_000_000

ibs = read(`gzip -dc $(joinpath(vcfexamplesdir, "ilsHG00096.txt.gz"))`, String) |> split |> x -> parse.(Int, x)

@test ibs[1:15] == [84, 57, 76, 1829, 1487, 279, 146, 3113, 542, 1895, 860, 193, 162, 1177, 4050]

#generate mask from string

gmask = VCF.generate_mask(map(==('P'), collect(mask)))

vcf = VCF.read(file1,

refs = seq,

masks = gmask,

segments = :refnotN,

select = 1:10

)

@test nrow(vcf.df) > 1000

@test "HG00096" in VCF.individuals(vcf)

cibs = VCF.IBStractlength(vcf, "HG00096")

ncibs = length(cibs)

@test cibs == ibs[1:ncibs]

indvs = VCF.individuals(vcf)

@test length(indvs) == 1

cibs = VCF.IBStractlength(vcf, [indvs[1], indvs[1]])

@test cibs == vcat(ibs[1:ncibs], ibs[1:ncibs])

cibs = VCF.IBStractlength(vcf)

@test cibs == ibs[1:ncibs]

if get(ENV, "OMIT_LONG_TESTS", "false") in ("true", "1", "yes")

@info "Skipping long VCF fixture test because OMIT_LONG_TESTS is enabled"

else

@testset "Double check with data we know already" begin

using DataFrames

vcfexamplesdir = "vcf-examples"

file1 = joinpath(vcfexamplesdir, "1kGP_high_coverage_Illumina.chr1.filtered.SNV_INDEL_SV_phased_panel.first100000.vcf.gz")

@test isfile(file1)

seq = read(`gzip -dc $(joinpath(vcfexamplesdir, "seq.chr1-40M.txt.gz"))`, String)

@test length(seq) == 40_000_000

mask = read(`gzip -dc $(joinpath(vcfexamplesdir, "mask.chr1-40M.txt.gz"))`, String)

@test length(mask) == 40_000_000

ibs = read(`gzip -dc $(joinpath(vcfexamplesdir, "ilsHG00096.txt.gz"))`, String) |> split |> x -> parse.(Int, x)

@test ibs[1:15] == [84, 57, 76, 1829, 1487, 279, 146, 3113, 542, 1895, 860, 193, 162, 1177, 4050]

#generate mask from string

gmask = VCF.generate_mask(map(==('P'), collect(mask)))

vcf = VCF.read(file1,

refs = seq,

masks = gmask,

segments = :refnotN,

select = 1:10

)

@test nrow(vcf.df) > 1000

@test "HG00096" in VCF.individuals(vcf)

cibs = VCF.IBStractlength(vcf, "HG00096")

ncibs = length(cibs)

@test cibs == ibs[1:ncibs]

indvs = VCF.individuals(vcf)

@test length(indvs) == 1

cibs = VCF.IBStractlength(vcf, [indvs[1], indvs[1]])

@test cibs == vcat(ibs[1:ncibs], ibs[1:ncibs])

cibs = VCF.IBStractlength(vcf)

@test cibs == ibs[1:ncibs]

end

Copilot · 2026-04-10T17:12:50Z

+# This file is machine-generated - editing it directly is not advised
+
+julia_version = "1.12.5"
+manifest_format = "2.0"
+project_hash = "5d86b394662752ad5c13db72e4bca135fb91dea1"
+


test/Manifest.toml appears out of sync with the package/test dependency set: it does not contain entries for CSV / DataFramesMeta even though IBSpector now includes VCF.jl which does using CSV/DataFramesMeta. If this Manifest is meant to be committed and used for CI/reproducible tests, it should be regenerated so Pkg.instantiate --project=test brings in all required deps (or consider not committing a test Manifest at all).

Added submodule to read vcf files.

4676368

Added test vcf files and corresponding tests.

pfarndt requested review from TommasoStentella and Copilot April 10, 2026 17:00

Copilot started reviewing on behalf of pfarndt April 10, 2026 17:01 View session

Copilot AI reviewed Apr 10, 2026

View reviewed changes

TommasoStentella closed this Apr 14, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Added submodule to read vcf files.#78

Added submodule to read vcf files.#78
pfarndt wants to merge 1 commit intomainfrom
adding-vcf-support

pfarndt commented Apr 10, 2026

Uh oh!

Copilot AI left a comment

Uh oh!

Copilot AI Apr 10, 2026

Uh oh!

Copilot AI Apr 10, 2026

Uh oh!

Copilot AI Apr 10, 2026

Uh oh!

Copilot AI Apr 10, 2026

Uh oh!

Copilot AI Apr 10, 2026

Uh oh!

Copilot AI Apr 10, 2026

Uh oh!

Copilot AI Apr 10, 2026

Uh oh!

Copilot AI Apr 10, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

	phased = ["1\|0", "0\|1", "1\|1", "0\|0", "0/0", "1/1"]
	phased = ["1\|0", "0\|1", "1\|1", "0\|0"]

Conversation

pfarndt commented Apr 10, 2026

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Copilot AI Apr 10, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Apr 10, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Apr 10, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Apr 10, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Apr 10, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Apr 10, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Apr 10, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Apr 10, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants