Question: Split VCF File Using GATK
6.9 years ago, Rubal7 wrote:

Hi all,

Just a quick question, input is appreciated:

Why does the GATK SelectVariants tool require a reference genome to split a VCF file by individuals? Surely this is essentially just a file parsing problem?

Cheers!

genome gatk vcf parsing • 2.4k views

Can't you just use vcftools to split the file?

written 6.9 years ago by Giovanni M Dall'Olio

In vcftools the command is called vcf-subset, with the -c option giving the subset of the panel to keep. I found it a bit slow, so I wrote a quick Perl script that just splits each row on tabs, as in any tab-separated file, and selects the right columns.
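
For readers who want to try the column-selection idea, here is a minimal sketch in Python. It is not the Perl script mentioned above, and the function name subset_vcf is made up for illustration. It assumes an uncompressed, tab-separated VCF whose first nine columns are the fixed fields (CHROM through FORMAT), followed by one column per sample.

```python
def subset_vcf(lines, samples):
    """Yield VCF lines keeping the 9 fixed columns plus the requested samples.

    `lines` is any iterable of VCF text lines; `samples` is a list of sample
    names exactly as they appear in the #CHROM header line.
    """
    keep = None  # column indices to retain, set once the header is seen
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith("##"):
            # Meta-information lines are passed through unchanged.
            yield line
        elif line.startswith("#CHROM"):
            header = line.split("\t")
            missing = [s for s in samples if s not in header[9:]]
            if missing:
                raise ValueError(f"samples not in VCF header: {missing}")
            # Columns 0-8 are CHROM..FORMAT; sample columns start at index 9.
            keep = list(range(9)) + [header.index(s, 9) for s in samples]
            yield "\t".join(header[i] for i in keep)
        elif line:
            if keep is None:
                raise ValueError("data line found before the #CHROM header")
            fields = line.split("\t")
            yield "\t".join(fields[i] for i in keep)
```

Feeding it an open file and printing each yielded line writes the subset VCF. Note that this only slices columns: INFO fields such as allele counts are left as they were, and sites where the chosen samples carry no variant are not dropped.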

written 6.9 years ago by Michael Dondrup

Thanks, yes, vcf-subset works great.

written 6.9 years ago by Rubal7

It probably requires one so that it knows the max chromosomal position.

written 6.9 years ago by Zev.Kronenberg
6.9 years ago, Johan (Sweden) wrote:

I believe that this has to do with the central dogma of GATK:

"All datasets (reads, alignments, quality scores, variants, dbSNP information, gene tracks, interval lists - everything) must be sorted in order of one of the canonical references sequences."

The motivation for this is nicely explained in their FAQ: http://www.broadinstitute.org/gsa/wiki/index.php/Frequently_Asked_Questions#What_is_the_Central_Dogma_of_the_GATK.3F
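
To make the sorting rule concrete, here is a small Python sketch of what "sorted in reference order" means for variant records. It is only an illustration and not how GATK implements the check; the function names are hypothetical. It assumes the contig order is read from the reference's .fai index, where the first tab-separated column of each line is the contig name.

```python
def contigs_from_fai(lines):
    """Return contig names in reference order from the lines of a .fai index."""
    return [line.split("\t")[0] for line in lines if line.strip()]


def check_reference_order(records, contig_order):
    """Return the first (chrom, pos) that breaks reference order, or None.

    `records` is an iterable of (chrom, pos) pairs in file order;
    `contig_order` is the list of contig names as ordered in the reference.
    """
    rank = {name: i for i, name in enumerate(contig_order)}
    prev = None
    for chrom, pos in records:
        if chrom not in rank:
            raise KeyError(f"contig {chrom!r} is not in the reference")
        key = (rank[chrom], pos)
        # A record is out of order if it sorts before the previous one.
        if prev is not None and key < prev:
            return chrom, pos
        prev = key
    return None
```

The point of the sketch is that the ordering of contigs is taken from the reference rather than from the records themselves, which may be part of why a tool that enforces this rule asks for the reference even for a simple subsetting job.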
