split 1000genomes .vcf per individual
6.9 years ago
User6891 ▴ 290

I want to make a .vcf.gz for every individual in the 1000 Genomes data. I've downloaded the .vcf.gz files for all chromosomes and merged them into one big .vcf.gz. Now I want to create a separate .vcf for every individual.

I normally use GATK SelectVariants to do that. However, it also requires a human reference genome, and I think that's where I introduced a problem, since all the single-sample .vcf.gz files come out empty (except for the header). Is there another option besides GATK? I have used vcf-tools for other purposes before, but I noticed that it sometimes gets the allele frequency wrong when it splits a multi-sample file. Alternatively, which reference genome should I use to make GATK SelectVariants work with the 1000 Genomes data?

NGS 1000genomes vcf • 2.6k views
6.9 years ago

According to the VCF format documentation, each individual occupies a single column after the first 9 (#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT), which contain the shared variant information, so a simple cut would do:

N=$(zcat single.vcf.gz | grep '^#CHROM' | awk '{print NF-9}')
for i in $(seq 1 "$N"); do
  # keep the 9 fixed columns plus the i-th sample column
  zcat single.vcf.gz | cut -f1-9,$((i+9)) > sample$i.vcf
done

6.8 years ago
User6891 ▴ 290

I'm not sure this will work. Won't the allele frequencies and depth need to be recalculated if you just do a simple cut? We have already run into this problem with vcf-tools.


Sure, all the INFO column values that are computed across all samples would have to be recalculated in order to be informative.
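As a rough illustration of what that recalculation could look like, here is a minimal sketch that extracts one sample column and rebuilds AC and AN from that sample's genotype. The function name split_sample is hypothetical, and the sketch assumes that GT is the first FORMAT key, that the input is uncompressed VCF on stdin, and that it is acceptable to discard every other INFO annotation and to drop sites where the sample carries no ALT allele. It is not a substitute for a dedicated VCF tool.

```shell
# Hypothetical helper: split_sample IDX
# Reads an uncompressed VCF on stdin and writes a single-sample VCF for the
# IDX-th sample (1-based) to stdout, with INFO replaced by recomputed AC/AN.
split_sample() {
  awk -v idx="$1" 'BEGIN { FS = OFS = "\t" }
    /^##/ { print; next }                      # meta lines pass through unchanged
    { s = $(9 + idx) }                         # the selected sample column
    /^#CHROM/ { print $1, $2, $3, $4, $5, $6, $7, $8, $9, s; next }
    {
      nalt = split($5, alt, ",")               # number of ALT alleles at this site
      split("", cnt)                           # reset per-allele counts
      split(s, fld, ":")                       # assumption: GT is the first FORMAT key
      gt = fld[1]; gsub(/\|/, "/", gt)         # treat phased and unphased alike
      n = split(gt, al, "/")
      an = 0; nonref = 0
      for (k = 1; k <= n; k++) {
        if (al[k] == ".") continue             # missing alleles are not counted
        an++
        if (al[k] != "0") { cnt[al[k]]++; nonref++ }
      }
      if (nonref == 0) next                    # sample is not variant here
      ac = cnt[1] + 0
      for (j = 2; j <= nalt; j++) ac = ac "," (cnt[j] + 0)
      print $1, $2, $3, $4, $5, $6, $7, "AC=" ac ";AN=" an, $9, s
    }'
}

# Example use (not run here):
#   zcat single.vcf.gz | split_sample 1 > sample1.vcf
```

The ##INFO header lines for the discarded annotations are left in place, which is harmless but slightly untidy. Depth and other per-site values are not recomputed in this sketch.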

By the way, this should be a comment, not an answer.
