There are several types of genomics data file formats. (Bold: readable format)
- Variant call format:
.vcf- may be paired with an index file (
.tbior.csi)
- may be paired with an index file (
- Binary variant call format:
.bcf - PLINK format:
.pedand.map - PLINK binary format:
.bed,.bim, and.fam - Oxford format:
.bgen- may be paired with an index file (
.bgi)
- may be paired with an index file (
- Sparse allele vectors:
.sav- may be paired with an index file (
.s1r)
- may be paired with an index file (
- PLINK
- Tabix
- qctool
- bcftools
- bgenix
- and many other softwares...
Today's genomics data contains tens (or hundreds) of thousands of samples.
Therefore, files with binary format are becoming popular.
However, some data (such as KoGES genotype data) are still distributed in the form of .vcf.
If we want to convert a single .vcf file to a PLINK binary dataset, we can use the command:
plink2 \
--vcf /media/leelabsg-storage0/KBN/유전정보/KCHIP_72298/CHR1_annoINFO_filINFO0.8_72K.vcf.gz \
--make-bed \
--out /media/leelabsg-storage0/GWAS_tutorial/convert_test/CHR1_test
We can use a shell script to convert multiple files with certain patterns.
#!/bin/bash
for (( i=1; i<=22; i++ ))
do
nohup plink2 \
--vcf /media/leelabsg-storage0/KBN/유전정보/KCHIP_72298/CHR${i}_annoINFO_filINFO0.8_72K.vcf.gz \
--make-bed \
--out /media/leelabsg-storage0/GWAS_tutorial/convert_test/CHR${i}_test &
done
If we want to convert our genotype data to Oxford formatted files (.bgen), we can use the command:
#!/bin/bash
for (( i=1; i<=22; i++ ))
do
nohup plink2 \
--vcf /media/leelabsg-storage0/KBN/유전정보/KCHIP_72298/CHR${i}_annoINFO_filINFO0.8_72K.vcf.gz dosage=DS \
--export bgen-1.2 ref-first bits=8 \
--out /media/leelabsg-storage0/GWAS_tutorial/convert_test/CHR${i}_test_bgen &
done
Sometimes the index file for corresponding genotype data file is missing. We can generate the index files using softwares.