Skip to content

Added submodule to read vcf files.#78

Closed
pfarndt wants to merge 1 commit intomainfrom
adding-vcf-support
Closed

Added submodule to read vcf files.#78
pfarndt wants to merge 1 commit intomainfrom
adding-vcf-support

Conversation

@pfarndt
Copy link
Copy Markdown
Member

@pfarndt pfarndt commented Apr 10, 2026

Added test vcf files and corresponding tests.

Adds a few test vcf files and additional dependencies from DataFrames, CSV ...

Added test vcf files and corresponding tests.
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds a new VCF submodule to IBSpector for reading VCFs into a DataFrame and computing IBS tract lengths, along with new VCF-focused tests and updated dependencies.

Changes:

  • Introduces src/VCF.jl (masking utilities, VCF reader, and IBS tract-length computation).
  • Wires the new submodule into the main package module/export surface (src/IBSpector.jl) and adds new dependencies (CSV, DataFrames, DataFramesMeta).
  • Adds a dedicated VCF test file and test fixtures, and makes the existing “fitting procedure” tests skippable via OMIT_LONG_TESTS.

Reviewed changes

Copilot reviewed 6 out of 13 changed files in this pull request and generated 8 comments.

Show a summary per file
File Description
src/VCF.jl New VCF submodule implementation (read/masking/IBS tract lengths).
src/IBSpector.jl Includes and exports the new VCF submodule.
Project.toml Adds VCF-related dependencies and compat bounds.
test/runtest-VCF.jl New VCF test suite using bundled .gz fixtures.
test/runtests.jl Adds env-gated skipping for long “fitting procedure” tests.
test/Project.toml Adds DataFrames as a test dependency.
test/Manifest.toml New pinned test environment manifest.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread test/runtests.jl
Comment on lines +159 to +167
# Test the fitting procedure on multiple simulated datasets.
# This is a long test, so we allow it to be skipped by setting
# the environment variable OMIT_LONG_TESTS.

if haskey(ENV, "OMIT_LONG_TESTS")
@info "Omitting long tests"
else

@testset verbose = true "fitting procedure" begin
Copy link

Copilot AI Apr 10, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The new VCF test suite in test/runtest-VCF.jl is not included anywhere, so it will not run under the standard Pkg.test entrypoint (test/runtests.jl). Add an include("runtest-VCF.jl") (likely near the other includes at the top) so these tests execute in CI.

Copilot uses AI. Check for mistakes.
Comment thread src/VCF.jl
Comment on lines +61 to +66
segments = dictify(df, segments)
check_keyypes(df, segments)
segments = map(keys(segments), values(segments)) do k, v
if v == :refnotN
v = generate_mask(map(i -> uppercase(i) == 'N', collect(refs[k])))
end
Copy link

Copilot AI Apr 10, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

segments = :refnotN dereferences refs[k] but refs can be nothing (e.g., if the caller requests segments=:refnotN without providing refs). This currently throws a low-level error; it should explicitly validate refs is provided and raise a clear ArgumentError when it is required for :refnotN.

Copilot uses AI. Check for mistakes.
Comment thread src/VCF.jl

function isphased(vcf::VCFdata, indv::AbstractString)
c = colofindividual(vcf, indv)
phased = ["1|0", "0|1", "1|1", "0|0", "0/0", "1/1"]
Copy link

Copilot AI Apr 10, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

isphased currently returns true for unphased genotypes like 0/0 and 1/1 because they are included in the allowed set. This makes the function name/behavior inconsistent; if the intent is to check phasing, it should reject / genotypes (or be renamed to reflect what it actually validates).

Suggested change
phased = ["1|0", "0|1", "1|1", "0|0", "0/0", "1/1"]
phased = ["1|0", "0|1", "1|1", "0|0"]

Copilot uses AI. Check for mistakes.
Comment thread src/VCF.jl
Comment on lines +98 to +107
function check_keyypes(df, dict)
if isnothing(dict)
return
end
if !(
(keytype(dict) <: AbstractString) && (eltype(df[!, 1]) <: AbstractString) ||
(keytype(dict) <: Integer) && (eltype(df[!, 1]) <: Integer)
)
throw(ArgumentError("dict has not compatible key type with chromosome names in VCF."))
end
Copy link

Copilot AI Apr 10, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Function name check_keyypes looks like a typo (missing 't'), and the thrown message is grammatically incorrect ("has not compatible"). Consider renaming to check_keytypes and adjusting the error text for clarity.

Copilot uses AI. Check for mistakes.
Comment thread test/runtest-VCF.jl
Comment on lines +33 to +37
# generate and read small file
file2 = joinpath(vcfexamplesdir, "1kGP_high_coverage_Illumina.chr1.filtered.SNV_INDEL_SV_phased_panel.first100000.small.gz")
if !isfile(file2)
run(pipeline(file1, Cmd(`gzip -dc`, ignorestatus = true), ` head -2000`, `gzip -c`, file2))
end
Copy link

Copilot AI Apr 10, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These tests assume external CLI tools (gzip, head) are available on PATH. That can make the test suite fail on platforms/environments without those utilities (e.g., Windows CI). Prefer Julia-native decompression/IO (e.g., CodecZlib/GzipDecompressorStream) or gate these tests behind an explicit opt-in env var.

Copilot uses AI. Check for mistakes.
Comment thread test/runtest-VCF.jl
Comment on lines +28 to +32
vcfexamplesdir = "vcf-examples"

file1 = joinpath(vcfexamplesdir, "1kGP_high_coverage_Illumina.chr1.filtered.SNV_INDEL_SV_phased_panel.first100000.vcf.gz")
@test isfile(file1)

Copy link

Copilot AI Apr 10, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The vcf-examples directory is referenced via a relative path string. Test working directory is not guaranteed, so this can be brittle. Prefer anchoring to the test file location (e.g., joinpath(@__DIR__, "vcf-examples")) so tests can run from any CWD.

Copilot uses AI. Check for mistakes.
Comment thread test/runtest-VCF.jl
Comment on lines +188 to +229
@testset "Double check with data we know already" begin
using DataFrames

vcfexamplesdir = "vcf-examples"

file1 = joinpath(vcfexamplesdir, "1kGP_high_coverage_Illumina.chr1.filtered.SNV_INDEL_SV_phased_panel.first100000.vcf.gz")
@test isfile(file1)

seq = read(`gzip -dc $(joinpath(vcfexamplesdir, "seq.chr1-40M.txt.gz"))`, String)
@test length(seq) == 40_000_000

mask = read(`gzip -dc $(joinpath(vcfexamplesdir, "mask.chr1-40M.txt.gz"))`, String)
@test length(mask) == 40_000_000

ibs = read(`gzip -dc $(joinpath(vcfexamplesdir, "ilsHG00096.txt.gz"))`, String) |> split |> x -> parse.(Int, x)
@test ibs[1:15] == [84, 57, 76, 1829, 1487, 279, 146, 3113, 542, 1895, 860, 193, 162, 1177, 4050]

#generate mask from string
gmask = VCF.generate_mask(map(==('P'), collect(mask)))

vcf = VCF.read(file1,
refs = seq,
masks = gmask,
segments = :refnotN,
select = 1:10
)
@test nrow(vcf.df) > 1000
@test "HG00096" in VCF.individuals(vcf)


cibs = VCF.IBStractlength(vcf, "HG00096")
ncibs = length(cibs)
@test cibs == ibs[1:ncibs]

indvs = VCF.individuals(vcf)
@test length(indvs) == 1

cibs = VCF.IBStractlength(vcf, [indvs[1], indvs[1]])
@test cibs == vcat(ibs[1:ncibs], ibs[1:ncibs])

cibs = VCF.IBStractlength(vcf)
@test cibs == ibs[1:ncibs]
Copy link

Copilot AI Apr 10, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This block decompresses and loads very large fixtures (40M-character reference/mask strings) and reads a large VCF, which can significantly slow CI and increase memory usage. Consider marking this as a long test (similar to OMIT_LONG_TESTS handling in runtests.jl) or reducing fixture size while still exercising the logic.

Suggested change
@testset "Double check with data we know already" begin
using DataFrames
vcfexamplesdir = "vcf-examples"
file1 = joinpath(vcfexamplesdir, "1kGP_high_coverage_Illumina.chr1.filtered.SNV_INDEL_SV_phased_panel.first100000.vcf.gz")
@test isfile(file1)
seq = read(`gzip -dc $(joinpath(vcfexamplesdir, "seq.chr1-40M.txt.gz"))`, String)
@test length(seq) == 40_000_000
mask = read(`gzip -dc $(joinpath(vcfexamplesdir, "mask.chr1-40M.txt.gz"))`, String)
@test length(mask) == 40_000_000
ibs = read(`gzip -dc $(joinpath(vcfexamplesdir, "ilsHG00096.txt.gz"))`, String) |> split |> x -> parse.(Int, x)
@test ibs[1:15] == [84, 57, 76, 1829, 1487, 279, 146, 3113, 542, 1895, 860, 193, 162, 1177, 4050]
#generate mask from string
gmask = VCF.generate_mask(map(==('P'), collect(mask)))
vcf = VCF.read(file1,
refs = seq,
masks = gmask,
segments = :refnotN,
select = 1:10
)
@test nrow(vcf.df) > 1000
@test "HG00096" in VCF.individuals(vcf)
cibs = VCF.IBStractlength(vcf, "HG00096")
ncibs = length(cibs)
@test cibs == ibs[1:ncibs]
indvs = VCF.individuals(vcf)
@test length(indvs) == 1
cibs = VCF.IBStractlength(vcf, [indvs[1], indvs[1]])
@test cibs == vcat(ibs[1:ncibs], ibs[1:ncibs])
cibs = VCF.IBStractlength(vcf)
@test cibs == ibs[1:ncibs]
if get(ENV, "OMIT_LONG_TESTS", "false") in ("true", "1", "yes")
@info "Skipping long VCF fixture test because OMIT_LONG_TESTS is enabled"
else
@testset "Double check with data we know already" begin
using DataFrames
vcfexamplesdir = "vcf-examples"
file1 = joinpath(vcfexamplesdir, "1kGP_high_coverage_Illumina.chr1.filtered.SNV_INDEL_SV_phased_panel.first100000.vcf.gz")
@test isfile(file1)
seq = read(`gzip -dc $(joinpath(vcfexamplesdir, "seq.chr1-40M.txt.gz"))`, String)
@test length(seq) == 40_000_000
mask = read(`gzip -dc $(joinpath(vcfexamplesdir, "mask.chr1-40M.txt.gz"))`, String)
@test length(mask) == 40_000_000
ibs = read(`gzip -dc $(joinpath(vcfexamplesdir, "ilsHG00096.txt.gz"))`, String) |> split |> x -> parse.(Int, x)
@test ibs[1:15] == [84, 57, 76, 1829, 1487, 279, 146, 3113, 542, 1895, 860, 193, 162, 1177, 4050]
#generate mask from string
gmask = VCF.generate_mask(map(==('P'), collect(mask)))
vcf = VCF.read(file1,
refs = seq,
masks = gmask,
segments = :refnotN,
select = 1:10
)
@test nrow(vcf.df) > 1000
@test "HG00096" in VCF.individuals(vcf)
cibs = VCF.IBStractlength(vcf, "HG00096")
ncibs = length(cibs)
@test cibs == ibs[1:ncibs]
indvs = VCF.individuals(vcf)
@test length(indvs) == 1
cibs = VCF.IBStractlength(vcf, [indvs[1], indvs[1]])
@test cibs == vcat(ibs[1:ncibs], ibs[1:ncibs])
cibs = VCF.IBStractlength(vcf)
@test cibs == ibs[1:ncibs]
end

Copilot uses AI. Check for mistakes.
Comment thread test/Manifest.toml
Comment on lines +1 to +6
# This file is machine-generated - editing it directly is not advised

julia_version = "1.12.5"
manifest_format = "2.0"
project_hash = "5d86b394662752ad5c13db72e4bca135fb91dea1"

Copy link

Copilot AI Apr 10, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

test/Manifest.toml appears out of sync with the package/test dependency set: it does not contain entries for CSV / DataFramesMeta even though IBSpector now includes VCF.jl which does using CSV/DataFramesMeta. If this Manifest is meant to be committed and used for CI/reproducible tests, it should be regenerated so Pkg.instantiate --project=test brings in all required deps (or consider not committing a test Manifest at all).

Copilot uses AI. Check for mistakes.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants