Question: How to get individual chromosome sequences in FASTA format from a 1000 Genomes Project vcf.gz file and its vcf.gz.tbi index?
23 months ago, David Simmons wrote:

Hi,

I am new to the field of bioinformatics. I have downloaded the file "ALL.chr1.phase1_release_v3.20101123.snps_indels_svs.genotypes.vcf.gz" and its tabix index "ALL.chr1.phase1_release_v3.20101123.snps_indels_svs.genotypes.vcf.gz.tbi" from ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/release/20110521/.

1) The above vcf.gz contains the compressed chromosome 1 data for 1092 individuals. How can I get 1092 separate sequences in FASTA format?

I read about the vcf-consensus script, but I am confused about how to use it here:

cat ref.fa | vcf-consensus file.vcf.gz > out.fa

Does the phase 1 release of the 1000 Genomes Project use the following reference genome (as mentioned in ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/release/20110521/README.phase1_integrated_release_version3_20120430)?

ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/human_g1k_v37.fasta.gz

2) How can I get the entire genome sequence in FASTA format for HG00096, HG00097, etc.?

Thanking you in advance.

23 months ago, geek_y (Barcelona/London) wrote:

You can get the VCF for each individual by:

vcf-subset --exclude-ref -c sample_ID in.vcf > sample_ID.vcf

To do this for every sample at once, something like the following (replace <int> with the number of parallel jobs):

parallel --jobs <int> "vcf-subset --exclude-ref -c {} in.vcf > {}.vcf" ::: $(grep -m1 "^#CHROM" in.vcf | cut -f 10-)
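For reference, the grep/cut part of that command just pulls the sample IDs out of the #CHROM header line (columns 10 onward). A minimal sketch on a made-up file, where toy.vcf, samples.txt and the single variant record are assumptions for illustration only:

```shell
# Build a tiny illustrative VCF (hypothetical file name and contents).
printf '##fileformat=VCFv4.1\n' > toy.vcf
printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tHG00096\tHG00097\tHG00099\n' >> toy.vcf
printf '1\t10583\trs58108140\tG\tA\t100\tPASS\t.\tGT\t0|0\t0|1\t1|1\n' >> toy.vcf

# Take the #CHROM header line, keep columns 10 onward (the sample names),
# and write one sample ID per line.
grep -m1 '^#CHROM' toy.vcf | cut -f 10- | tr '\t' '\n' > samples.txt
cat samples.txt
```

Checking the resulting list before launching parallel is a cheap way to confirm that you will get one job per individual.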

Once you have the individual VCF files, you could use GATK's FastaAlternateReferenceMaker to get the alternate genome for each individual.

parallel --jobs <int> "java -jar GenomeAnalysisTK.jar -T FastaAlternateReferenceMaker -R reference.fasta -o {}.fasta -V {}.vcf" ::: $(grep -m1 "^#CHROM" in.vcf | cut -f 10-)

Hope this is what you are looking for.


Thanks, vcf-subset will work to get the individual samples in VCF format. After getting the individual samples, which sequence should I use as the reference to extract the FASTA sequence from each resulting VCF file? Is it the whole-genome reference mentioned for phase 1 of the 1000 Genomes Project, "ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/human_g1k_v37.fasta.gz", or does the 1000 Genomes Project provide per-chromosome references?

Reply written 23 months ago by David Simmons

The human_g1k_v37.fasta.gz should be used, as the variants were called against that genome. You can split the FASTA into individual chromosomes if you are only interested in chr1. There are several posts about splitting a FASTA file by chromosome.
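As a rough sketch of the splitting step, plain awk can pull one record out of a multi-record FASTA. The file toy_ref.fa and its sequences are made up for illustration; the header style assumes, as I believe is the case for human_g1k_v37, that chromosome 1 is named "1" with no "chr" prefix:

```shell
# Toy multi-record FASTA (hypothetical contents, two short "chromosomes").
printf '>1 dna:chromosome\nACGTACGT\nACGT\n>2 dna:chromosome\nTTTTGGGG\n' > toy_ref.fa

# On each header line, decide whether the record name (first word after '>')
# matches the requested chromosome; print lines while the flag is set.
awk -v chr=1 '
  /^>/ { keep = (substr($1, 2) == chr) }
  keep
' toy_ref.fa > chr1.fa
cat chr1.fa
```

On the real reference, after decompressing it, `samtools faidx human_g1k_v37.fasta 1 > chr1.fa` should give the same kind of result and is faster once the index exists.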

Reply written 23 months ago by geek_y (modified 23 months ago)