Split Vcf File Using Gatk
11.9 years ago
Rubal7 ▴ 830

Hi all,

Just a quick question, input is appreciated:

Why does the GATK SelectVariants tool require a reference genome to split a VCF file by individual? Surely this is essentially just a file-parsing problem?

Cheers!

genome gatk vcf parsing • 4.2k views

Can't you just use vcftools to split the file?


The command in vcftools is called vcf-subset, with the -c option giving the subset of samples (columns) to keep. I found it a bit slow, so I wrote a quick Perl script that splits each row as in a tab-separated file and selects the right columns.
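
The column-selection idea described above can be sketched in Python. This is not the poster's Perl script, just a minimal illustration under stated assumptions: the input is an uncompressed, tab-separated VCF with genotype columns, and the function name `subset_vcf_samples` is made up for this example.

```python
def subset_vcf_samples(lines, keep):
    """Yield VCF lines restricted to the sample columns named in `keep`.

    Hypothetical helper: assumes a tab-separated VCF whose #CHROM header
    has the 9 fixed columns (CHROM..FORMAT) followed by sample names.
    """
    keep = list(keep)
    col_idx = None  # column indices to retain, set once the #CHROM line is seen
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        if line.startswith("##"):
            yield line  # meta-information lines pass through unchanged
        elif line.startswith("#CHROM"):
            header = line.split("\t")
            samples = header[9:]
            missing = [s for s in keep if s not in samples]
            if missing:
                raise ValueError(f"samples not in VCF: {missing}")
            col_idx = list(range(9)) + [9 + samples.index(s) for s in keep]
            yield "\t".join(header[i] for i in col_idx)
        else:
            if col_idx is None:
                raise ValueError("data line found before the #CHROM header")
            fields = line.split("\t")
            yield "\t".join(fields[i] for i in col_idx)
```

Note that pure column selection like this leaves the INFO field untouched, so per-site annotations that depend on the full sample set (allele counts, for example) are not recalculated, and sites that are no longer variant in the chosen samples are still emitted.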


Thanks, yes, vcf-subset works great.


It probably requires one so that it knows the max chromosomal position.

11.9 years ago
Johan ▴ 890

I believe that this has to do with the central dogma of GATK:

"All datasets (reads, alignments, quality scores, variants, dbSNP information, gene tracks, interval lists - everything) must be sorted in order of one of the canonical references sequences."

The motivation for this is nicely explained in their FAQ: http://www.broadinstitute.org/gsa/wiki/index.php/Frequently_Asked_Questions#What_is_the_Central_Dogma_of_the_GATK.3F
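
To make the ordering requirement concrete, here is a rough Python sketch of what "sorted in order of a reference" means for a VCF. It is not GATK's own code; the function name `first_out_of_order` is hypothetical, and the contig order is assumed to be supplied by the caller (for instance, read from the reference's index or sequence dictionary).

```python
def first_out_of_order(lines, contig_order):
    """Return the 1-based line number of the first record that breaks
    reference order, or None if all records are in order.

    `contig_order` is an assumed input: contig names in the order they
    appear in the reference.
    """
    rank = {name: i for i, name in enumerate(contig_order)}
    prev = None
    for n, line in enumerate(lines, 1):
        if line.startswith("#") or not line.strip():
            continue  # skip header and blank lines
        chrom, pos = line.split("\t", 2)[:2]
        if chrom not in rank:
            raise ValueError(f"line {n}: contig {chrom!r} not in reference")
        key = (rank[chrom], int(pos))  # order by contig rank, then position
        if prev is not None and key < prev:
            return n
        prev = key
    return None
```

Without the reference there is no way to tell whether, say, contig "10" should come before or after contig "2", which is one reason a tool built around this rule would ask for a reference even for a task that looks like plain parsing.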

