Split 1000 Genomes .vcf per individual
6.9 years ago
User6891 ▴ 290

I want to make a .vcf.gz for every individual in the 1000 Genomes data. I downloaded the .vcf.gz files for all chromosomes and merged them into one big .vcf.gz. Now I want to create a separate .vcf for every individual.

I normally use GATK SelectVariants for this. However, that tool also requires a human reference genome, and I think that is where I introduced a problem, since all the single-sample .vcf.gz files come out empty (except for the header). Is there another option besides GATK? I have used vcf-tools for other purposes before, but I noticed that it sometimes gets the allele frequency wrong when it splits a multi-sample file. Alternatively, which reference genome should I use to make GATK SelectVariants work with the 1000 Genomes data?

NGS 1000genomes vcf • 2.6k views
6.9 years ago

According to the VCF format documentation, each individual occupies a single column after the first nine (#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT), which contain the shared variant information, so a simple cut would do:

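# number of samples = columns in the #CHROM line minus the 9 fixed ones
N=$(zcat single.vcf.gz | grep -m1 '^#CHROM' | awk -F'\t' '{print NF-9}')
# keep the 9 fixed columns plus the i-th sample column
for i in $(seq 1 "$N"); do zcat single.vcf.gz | cut -f1-9,$((i+9)) > sample$i.vcf; done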
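If bcftools is available, the same per-sample split could be sketched as below. This is only an assumption about the setup, not something tested on the 1000 Genomes files: the function name `split_per_sample` and the output layout are made up for illustration, and the input is assumed to be a bgzip-compressed multi-sample VCF. As far as I know, `bcftools view` updates INFO/AC and INFO/AN for the selected subset unless told otherwise, but other INFO keys are left as they were.

```shell
# Hypothetical helper: write one bgzipped VCF per sample using bcftools.
# Usage: split_per_sample merged.vcf.gz outdir
split_per_sample() {
  in=$1
  outdir=$2
  mkdir -p "$outdir" || return 1
  # bcftools query -l prints one sample name per line
  bcftools query -l "$in" | while read -r sample; do
    # -s keeps only the named sample; -Oz writes bgzip-compressed output
    bcftools view -s "$sample" -Oz -o "$outdir/$sample.vcf.gz" "$in"
  done
}
```

Naming the files after the sample IDs rather than a running index makes it easier to check that each output holds the expected individual.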

6.8 years ago
User6891 ▴ 290

I'm not sure this will work. Won't there be a problem with the allele frequencies and depth, which need to be recalculated, if you just do a simple cut? We have already run into this problem with vcf-tools.


Sure, all the INFO column values that summarize all samples should be recalculated in order to be informative.
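As a rough illustration of such a recalculation, the sketch below recomputes only AC and AN for a single-sample VCF produced by the cut above. It assumes that GT is the first FORMAT key and that the sample sits in column 10; the function name `recalc_ac_an` is hypothetical. Other cohort-wide keys such as AF or DP are passed through unchanged and would still be stale.

```shell
# Hypothetical helper: recompute AC and AN for a single-sample VCF read from stdin.
# Assumes GT is the first FORMAT key and the sample is in column 10.
recalc_ac_an() {
  awk 'BEGIN { FS = OFS = "\t" }
  /^#/ { print; next }                       # pass header lines through unchanged
  {
    n_alt = split($5, alts, ",")             # number of ALT alleles at this site
    for (a = 1; a <= n_alt; a++) count[a] = 0
    an = 0
    split($10, fields, ":")                  # GT is the first sample subfield
    n_gt = split(fields[1], gt, "[/|]")      # handles phased and unphased calls
    for (k = 1; k <= n_gt; k++) {
      if (gt[k] == ".") continue             # missing alleles do not count
      an++
      if (gt[k] > 0) count[gt[k] + 0]++      # allele index 0 is REF
    }
    ac = count[1]
    for (a = 2; a <= n_alt; a++) ac = ac "," count[a]
    # rebuild INFO: replace AC and AN, keep every other key as it was
    n_info = split($8, info, ";")
    out = "AC=" ac ";AN=" an
    for (k = 1; k <= n_info; k++)
      if (info[k] !~ /^(AC|AN)=/ && info[k] != ".") out = out ";" info[k]
    $8 = out
    print
  }'
}
```

It could be chained after the split, for example `recalc_ac_an < sample1.vcf > sample1.fixed.vcf`.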

By the way, this should be a comment, not an answer.