Hi all,
Just a quick question, input is appreciated:
Why does GATK SelectVariants require a reference genome to split a VCF file by individual? Surely this is essentially just a file parsing problem?
Cheers!
I believe that this has to do with the central dogma of GATK:
"All datasets (reads, alignments, quality scores, variants, dbSNP information, gene tracks, interval lists - everything) must be sorted in order of one of the canonical reference sequences."
The motivation for this is nicely explained in their FAQ: http://www.broadinstitute.org/gsa/wiki/index.php/Frequently_Asked_Questions#What_is_the_Central_Dogma_of_the_GATK.3F
Can't you just use vcftools to split the file?
The vcftools command is vcf-subset; its -c option specifies the samples (columns) to keep. I found it a bit slow, so I wrote a quick Perl script that splits each row as a tab-separated line and selects the right columns.
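The column-selection idea can be sketched roughly as follows. This is in Python rather than Perl, and it is not the script mentioned above (which was not posted); the function name subset_vcf is made up for illustration. It assumes a tab-delimited VCF with a FORMAT column, so that sample columns start at the tenth field, and it does not recompute INFO fields such as AC or AN, nor drop sites where the kept samples carry no variant.

```python
def subset_vcf(lines, samples):
    """Yield VCF lines with the 9 fixed columns plus only the requested samples.

    Hypothetical helper: assumes tab-delimited input with a FORMAT column,
    so sample columns begin at index 9.
    """
    keep = None  # column indices to retain, set once the #CHROM line is seen
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        if line.startswith("##"):
            yield line  # meta-information lines pass through unchanged
            continue
        fields = line.split("\t")
        if line.startswith("#CHROM"):
            header_samples = fields[9:]
            missing = [s for s in samples if s not in header_samples]
            if missing:
                raise ValueError("samples not in header: " + ", ".join(missing))
            keep = list(range(9)) + [9 + header_samples.index(s) for s in samples]
        elif keep is None:
            raise ValueError("data line found before the #CHROM header line")
        yield "\t".join(fields[i] for i in keep)
```

To use it on a file, pass an open file handle and a list of sample names, then write each yielded line to the output followed by a newline.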
Thanks, yes, vcf-subset works great.
It probably requires one so that it knows the max chromosomal position.