Question: How to get each individual's chromosome sequence in FASTA format from a vcf.gz file and its vcf.gz.tbi index from the 1000 Genomes Project?
3.3 years ago, David Simmons wrote:


I am new to the field of bioinformatics. I have downloaded the file "ALL.chr1.phase1_release_v3.20101123.snps_indels_svs.genotypes.vcf.gz" and its tabix index "ALL.chr1.phase1_release_v3.20101123.snps_indels_svs.genotypes.vcf.gz.tbi" from 

1) The above vcf.gz contains the compressed chromosome 1 genotype data of 1092 individuals. How can I get the 1092 separate sequences in FASTA format?

I read about the vcf-consensus script, but I am confused about how to use it here:

cat ref.fa | vcf-consensus file.vcf.gz > out.fa

Does the phase 1 release of the 1000 Genomes Project use the following reference genome (as mentioned in

2) How can I get the entire genome sequence in FASTA format for HG00096, HG00097, etc.?

Thank you in advance.

3.3 years ago, geek_y wrote:

You can get the VCF for each individual by:

vcf-subset --exclude-ref -c sample_ID in.vcf > sample_ID.vcf

To run this for every sample in parallel (replace <int> with the number of jobs to run at once), something like:

parallel --jobs <int> "vcf-subset --exclude-ref -c {} in.vcf > {}.vcf" ::: $(grep -m1 "^#CHROM" in.vcf | cut -f 10-)
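The sample list fed to parallel comes from the "#CHROM" header line of the VCF: the first nine columns are the fixed VCF fields and every column from the tenth onward is a sample ID. A minimal sketch of that step on its own, assuming an uncompressed VCF (vcf_samples is a hypothetical helper name, not part of vcftools):

```shell
# vcf_samples FILE.vcf
# Print one sample ID per line, taken from the "#CHROM" header line.
# Columns 1-9 are the fixed VCF fields; sample columns start at 10.
vcf_samples() {
  grep -m1 '^#CHROM' "$1" | cut -f 10- | tr '\t' '\n'
}

# Example (hypothetical file name):
#   vcf_samples in.vcf | head -n 3
```

For the bgzip-compressed 1000 Genomes file you would need to decompress on the fly first (for example with gzip -dc) and have the helper read from that stream instead of a file.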

Once you have the individual VCF files, you can use GATK's FastaAlternateReferenceMaker to generate the alternate genome for each individual.

parallel --jobs <int> "java -jar GenomeAnalysisTK.jar -T FastaAlternateReferenceMaker -R reference.fasta -o {}.fasta -V {}.vcf" ::: $(grep -m1 "^#CHROM" in.vcf | cut -f 10-)
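For intuition about what "making an alternate reference" means, here is a toy sketch in shell/awk. It is not how vcf-consensus or FastaAlternateReferenceMaker is implemented. apply_snps is a hypothetical helper that assumes a FASTA with a single sequence, a single-sample VCF on that same sequence with GT as the first FORMAT field, and SNPs only (indels and symbolic alleles are skipped). It rebuilds the whole sequence string for every variant, so it is only suitable for small test sequences, not a full chromosome.

```shell
# apply_snps REF.fa SAMPLE.vcf HAP
# Toy sketch: write the reference with the SNP alleles of one haplotype
# (HAP = 1 or 2) of a single-sample VCF substituted in.
apply_snps() {
  awk -v hap="$3" '
    BEGIN { FS = "\t" }
    # First file (the FASTA): remember the header, join the sequence lines.
    NR == FNR {
      if ($0 ~ /^>/) header = $0; else seq = seq $0
      next
    }
    /^#/ { next }                          # skip VCF header lines
    {
      if (length($4) != 1) next            # REF is not a single base
      split($5, alts, ",")                 # ALT alleles, numbered from 1
      split($10, fmt, ":")                 # GT is the first FORMAT field
      split(fmt[1], gt, "[|/]")            # e.g. "1|0" -> gt[1]="1", gt[2]="0"
      a = gt[hap]
      if (a == "" || a == "." || a == "0") next   # missing or reference allele
      alt = alts[a]
      if (length(alt) != 1) next           # indel or symbolic allele: skipped
      seq = substr(seq, 1, $2 - 1) alt substr(seq, $2 + 1)
    }
    END { print header; print seq }        # sequence is written on one line
  ' "$1" "$2"
}

# Example (hypothetical file names):
#   apply_snps chr1.fa HG00096.vcf 1 > HG00096_hap1.fa
```

Note that picking a haplotype only makes sense for phased genotypes (written with "|" in the GT field).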

Hope this is what you are looking for.


Thanks, vcf-subset will work to get the individual samples in VCF format. After getting the individual samples, which sequence should I use as the reference to extract the FASTA sequence from the resulting VCF files? Is it the entire genome mentioned for phase 1 of the 1000 Genomes Project "" or are there per-chromosome references provided by the 1000 Genomes Project?

Reply written 3.3 years ago by David Simmons

The human_g1k_v37.fasta.gz should be used, as the variants were called against that genome. You can split the FASTA into individual chromosomes if you only need chr1. There are several posts about splitting a FASTA file by chromosome.
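A minimal sketch of pulling one chromosome out of a multi-sequence FASTA with awk, assuming the chromosome name is the first word of the header line (extract_chrom is a hypothetical helper name). As far as I know, human_g1k_v37 names its chromosomes "1", "2", ... rather than "chr1", which also has to match the CHROM column of the VCF.

```shell
# extract_chrom REF.fa NAME
# Print only the FASTA record whose header starts with ">NAME".
extract_chrom() {
  awk -v name="$2" '
    /^>/ {
      id = substr($1, 2)       # first word of the header, minus ">"
      keep = (id == name)      # exact match, so "1" does not match "10"
    }
    keep                       # print lines while inside the wanted record
  ' "$1"
}

# Example (hypothetical output name; "-" makes awk read standard input):
#   gzip -dc human_g1k_v37.fasta.gz | extract_chrom - 1 > chr1.fa
```

If samtools is installed, indexing the reference and using samtools faidx to fetch a region is the more usual route; the awk version just avoids the extra dependency.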

Reply written 3.3 years ago by geek_y