Conversation
Added test vcf files and corresponding tests.
There was a problem hiding this comment.
Pull request overview
Adds a new VCF submodule to IBSpector for reading VCFs into a DataFrame and computing IBS tract lengths, along with new VCF-focused tests and updated dependencies.
Changes:
- Introduces
src/VCF.jl(masking utilities, VCF reader, and IBS tract-length computation). - Wires the new submodule into the main package module/export surface (
src/IBSpector.jl) and adds new dependencies (CSV,DataFrames,DataFramesMeta). - Adds a dedicated VCF test file and test fixtures, and makes the existing “fitting procedure” tests skippable via
OMIT_LONG_TESTS.
Reviewed changes
Copilot reviewed 6 out of 13 changed files in this pull request and generated 8 comments.
Show a summary per file
| File | Description |
|---|---|
src/VCF.jl |
New VCF submodule implementation (read/masking/IBS tract lengths). |
src/IBSpector.jl |
Includes and exports the new VCF submodule. |
Project.toml |
Adds VCF-related dependencies and compat bounds. |
test/runtest-VCF.jl |
New VCF test suite using bundled .gz fixtures. |
test/runtests.jl |
Adds env-gated skipping for long “fitting procedure” tests. |
test/Project.toml |
Adds DataFrames as a test dependency. |
test/Manifest.toml |
New pinned test environment manifest. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| # Test the fitting procedure on multiple simulated datasets. | ||
| # This is a long test, so we allow it to be skipped by setting | ||
| # the environment variable OMIT_LONG_TESTS. | ||
|
|
||
| if haskey(ENV, "OMIT_LONG_TESTS") | ||
| @info "Omitting long tests" | ||
| else | ||
|
|
||
| @testset verbose = true "fitting procedure" begin |
There was a problem hiding this comment.
The new VCF test suite in test/runtest-VCF.jl is not included anywhere, so it will not run under the standard Pkg.test entrypoint (test/runtests.jl). Add an include("runtest-VCF.jl") (likely near the other includes at the top) so these tests execute in CI.
| segments = dictify(df, segments) | ||
| check_keyypes(df, segments) | ||
| segments = map(keys(segments), values(segments)) do k, v | ||
| if v == :refnotN | ||
| v = generate_mask(map(i -> uppercase(i) == 'N', collect(refs[k]))) | ||
| end |
There was a problem hiding this comment.
segments = :refnotN dereferences refs[k] but refs can be nothing (e.g., if the caller requests segments=:refnotN without providing refs). This currently throws a low-level error; it should explicitly validate refs is provided and raise a clear ArgumentError when it is required for :refnotN.
|
|
||
| function isphased(vcf::VCFdata, indv::AbstractString) | ||
| c = colofindividual(vcf, indv) | ||
| phased = ["1|0", "0|1", "1|1", "0|0", "0/0", "1/1"] |
There was a problem hiding this comment.
isphased currently returns true for unphased genotypes like 0/0 and 1/1 because they are included in the allowed set. This makes the function name/behavior inconsistent; if the intent is to check phasing, it should reject / genotypes (or be renamed to reflect what it actually validates).
| phased = ["1|0", "0|1", "1|1", "0|0", "0/0", "1/1"] | |
| phased = ["1|0", "0|1", "1|1", "0|0"] |
| function check_keyypes(df, dict) | ||
| if isnothing(dict) | ||
| return | ||
| end | ||
| if !( | ||
| (keytype(dict) <: AbstractString) && (eltype(df[!, 1]) <: AbstractString) || | ||
| (keytype(dict) <: Integer) && (eltype(df[!, 1]) <: Integer) | ||
| ) | ||
| throw(ArgumentError("dict has not compatible key type with chromosome names in VCF.")) | ||
| end |
There was a problem hiding this comment.
Function name check_keyypes looks like a typo (missing 't'), and the thrown message is grammatically incorrect ("has not compatible"). Consider renaming to check_keytypes and adjusting the error text for clarity.
| # generate and read small file | ||
| file2 = joinpath(vcfexamplesdir, "1kGP_high_coverage_Illumina.chr1.filtered.SNV_INDEL_SV_phased_panel.first100000.small.gz") | ||
| if !isfile(file2) | ||
| run(pipeline(file1, Cmd(`gzip -dc`, ignorestatus = true), ` head -2000`, `gzip -c`, file2)) | ||
| end |
There was a problem hiding this comment.
These tests assume external CLI tools (gzip, head) are available on PATH. That can make the test suite fail on platforms/environments without those utilities (e.g., Windows CI). Prefer Julia-native decompression/IO (e.g., CodecZlib/GzipDecompressorStream) or gate these tests behind an explicit opt-in env var.
| vcfexamplesdir = "vcf-examples" | ||
|
|
||
| file1 = joinpath(vcfexamplesdir, "1kGP_high_coverage_Illumina.chr1.filtered.SNV_INDEL_SV_phased_panel.first100000.vcf.gz") | ||
| @test isfile(file1) | ||
|
|
There was a problem hiding this comment.
The vcf-examples directory is referenced via a relative path string. Test working directory is not guaranteed, so this can be brittle. Prefer anchoring to the test file location (e.g., joinpath(@__DIR__, "vcf-examples")) so tests can run from any CWD.
| @testset "Double check with data we know already" begin | ||
| using DataFrames | ||
|
|
||
| vcfexamplesdir = "vcf-examples" | ||
|
|
||
| file1 = joinpath(vcfexamplesdir, "1kGP_high_coverage_Illumina.chr1.filtered.SNV_INDEL_SV_phased_panel.first100000.vcf.gz") | ||
| @test isfile(file1) | ||
|
|
||
| seq = read(`gzip -dc $(joinpath(vcfexamplesdir, "seq.chr1-40M.txt.gz"))`, String) | ||
| @test length(seq) == 40_000_000 | ||
|
|
||
| mask = read(`gzip -dc $(joinpath(vcfexamplesdir, "mask.chr1-40M.txt.gz"))`, String) | ||
| @test length(mask) == 40_000_000 | ||
|
|
||
| ibs = read(`gzip -dc $(joinpath(vcfexamplesdir, "ilsHG00096.txt.gz"))`, String) |> split |> x -> parse.(Int, x) | ||
| @test ibs[1:15] == [84, 57, 76, 1829, 1487, 279, 146, 3113, 542, 1895, 860, 193, 162, 1177, 4050] | ||
|
|
||
| #generate mask from string | ||
| gmask = VCF.generate_mask(map(==('P'), collect(mask))) | ||
|
|
||
| vcf = VCF.read(file1, | ||
| refs = seq, | ||
| masks = gmask, | ||
| segments = :refnotN, | ||
| select = 1:10 | ||
| ) | ||
| @test nrow(vcf.df) > 1000 | ||
| @test "HG00096" in VCF.individuals(vcf) | ||
|
|
||
|
|
||
| cibs = VCF.IBStractlength(vcf, "HG00096") | ||
| ncibs = length(cibs) | ||
| @test cibs == ibs[1:ncibs] | ||
|
|
||
| indvs = VCF.individuals(vcf) | ||
| @test length(indvs) == 1 | ||
|
|
||
| cibs = VCF.IBStractlength(vcf, [indvs[1], indvs[1]]) | ||
| @test cibs == vcat(ibs[1:ncibs], ibs[1:ncibs]) | ||
|
|
||
| cibs = VCF.IBStractlength(vcf) | ||
| @test cibs == ibs[1:ncibs] |
There was a problem hiding this comment.
This block decompresses and loads very large fixtures (40M-character reference/mask strings) and reads a large VCF, which can significantly slow CI and increase memory usage. Consider marking this as a long test (similar to OMIT_LONG_TESTS handling in runtests.jl) or reducing fixture size while still exercising the logic.
| @testset "Double check with data we know already" begin | |
| using DataFrames | |
| vcfexamplesdir = "vcf-examples" | |
| file1 = joinpath(vcfexamplesdir, "1kGP_high_coverage_Illumina.chr1.filtered.SNV_INDEL_SV_phased_panel.first100000.vcf.gz") | |
| @test isfile(file1) | |
| seq = read(`gzip -dc $(joinpath(vcfexamplesdir, "seq.chr1-40M.txt.gz"))`, String) | |
| @test length(seq) == 40_000_000 | |
| mask = read(`gzip -dc $(joinpath(vcfexamplesdir, "mask.chr1-40M.txt.gz"))`, String) | |
| @test length(mask) == 40_000_000 | |
| ibs = read(`gzip -dc $(joinpath(vcfexamplesdir, "ilsHG00096.txt.gz"))`, String) |> split |> x -> parse.(Int, x) | |
| @test ibs[1:15] == [84, 57, 76, 1829, 1487, 279, 146, 3113, 542, 1895, 860, 193, 162, 1177, 4050] | |
| #generate mask from string | |
| gmask = VCF.generate_mask(map(==('P'), collect(mask))) | |
| vcf = VCF.read(file1, | |
| refs = seq, | |
| masks = gmask, | |
| segments = :refnotN, | |
| select = 1:10 | |
| ) | |
| @test nrow(vcf.df) > 1000 | |
| @test "HG00096" in VCF.individuals(vcf) | |
| cibs = VCF.IBStractlength(vcf, "HG00096") | |
| ncibs = length(cibs) | |
| @test cibs == ibs[1:ncibs] | |
| indvs = VCF.individuals(vcf) | |
| @test length(indvs) == 1 | |
| cibs = VCF.IBStractlength(vcf, [indvs[1], indvs[1]]) | |
| @test cibs == vcat(ibs[1:ncibs], ibs[1:ncibs]) | |
| cibs = VCF.IBStractlength(vcf) | |
| @test cibs == ibs[1:ncibs] | |
| if get(ENV, "OMIT_LONG_TESTS", "false") in ("true", "1", "yes") | |
| @info "Skipping long VCF fixture test because OMIT_LONG_TESTS is enabled" | |
| else | |
| @testset "Double check with data we know already" begin | |
| using DataFrames | |
| vcfexamplesdir = "vcf-examples" | |
| file1 = joinpath(vcfexamplesdir, "1kGP_high_coverage_Illumina.chr1.filtered.SNV_INDEL_SV_phased_panel.first100000.vcf.gz") | |
| @test isfile(file1) | |
| seq = read(`gzip -dc $(joinpath(vcfexamplesdir, "seq.chr1-40M.txt.gz"))`, String) | |
| @test length(seq) == 40_000_000 | |
| mask = read(`gzip -dc $(joinpath(vcfexamplesdir, "mask.chr1-40M.txt.gz"))`, String) | |
| @test length(mask) == 40_000_000 | |
| ibs = read(`gzip -dc $(joinpath(vcfexamplesdir, "ilsHG00096.txt.gz"))`, String) |> split |> x -> parse.(Int, x) | |
| @test ibs[1:15] == [84, 57, 76, 1829, 1487, 279, 146, 3113, 542, 1895, 860, 193, 162, 1177, 4050] | |
| #generate mask from string | |
| gmask = VCF.generate_mask(map(==('P'), collect(mask))) | |
| vcf = VCF.read(file1, | |
| refs = seq, | |
| masks = gmask, | |
| segments = :refnotN, | |
| select = 1:10 | |
| ) | |
| @test nrow(vcf.df) > 1000 | |
| @test "HG00096" in VCF.individuals(vcf) | |
| cibs = VCF.IBStractlength(vcf, "HG00096") | |
| ncibs = length(cibs) | |
| @test cibs == ibs[1:ncibs] | |
| indvs = VCF.individuals(vcf) | |
| @test length(indvs) == 1 | |
| cibs = VCF.IBStractlength(vcf, [indvs[1], indvs[1]]) | |
| @test cibs == vcat(ibs[1:ncibs], ibs[1:ncibs]) | |
| cibs = VCF.IBStractlength(vcf) | |
| @test cibs == ibs[1:ncibs] | |
| end |
| # This file is machine-generated - editing it directly is not advised | ||
|
|
||
| julia_version = "1.12.5" | ||
| manifest_format = "2.0" | ||
| project_hash = "5d86b394662752ad5c13db72e4bca135fb91dea1" | ||
|
|
There was a problem hiding this comment.
test/Manifest.toml appears out of sync with the package/test dependency set: it does not contain entries for CSV / DataFramesMeta even though IBSpector now includes VCF.jl which does using CSV/DataFramesMeta. If this Manifest is meant to be committed and used for CI/reproducible tests, it should be regenerated so Pkg.instantiate --project=test brings in all required deps (or consider not committing a test Manifest at all).
Added test vcf files and corresponding tests.
Adds a few test vcf files and additional dependencies from DataFrames, CSV ...