How to get individual chromosome sequences in FASTA format from a vcf.gz and its vcf.gz.tbi file from the 1000 Genomes Project?
8.3 years ago

Hi,

I am new to the field of Bioinformatics. I have downloaded files ALL.chr1.phase1_release_v3.20101123.snps_indels_svs.genotypes.vcf.gz and its tabix file ALL.chr1.phase1_release_v3.20101123.snps_indels_svs.genotypes.vcf.gz.tbi from ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/release/20110521/.

1) The above vcf.gz contains the compressed chromosome 1 variant calls for 1092 individuals. How can I get 1092 separate sequences in FASTA format?

I read about the vcf-consensus script, but I am not sure how to use it here:

cat ref.fa | vcf-consensus file.vcf.gz > out.fa

Does the phase 1 release of the 1000 Genomes Project use the following reference genome (as mentioned in ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/release/20110521/README.phase1_integrated_release_version3_20120430): ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/human_g1k_v37.fasta.gz?

2) How can I get the entire genome sequence in FASTA format for HG00096, HG00097, etc.?

Thanking you in advance

8.3 years ago

You can get the VCF for each individual by:

vcf-subset --exclude-ref -c sample_ID in.vcf > sample_ID.vcf

To do this for every sample at once, something like the following (replace <int> with the number of parallel jobs):

parallel --jobs <int> "vcf-subset --exclude-ref -c {} in.vcf > {}.vcf" ::: `grep -m1 "^#CHROM" in.vcf | cut -f 10-`
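The sample list passed to parallel comes from the #CHROM header line of the VCF, where sample IDs start at column 10. It can be worth checking that list before launching any jobs. A minimal sketch, assuming plain-text VCF input; the function name list_samples is hypothetical and not part of vcftools:

```shell
# list_samples: print one sample ID per line from a VCF header.
# Hypothetical helper; reads the named file(s), or stdin if none is given.
list_samples() {
    # The #CHROM line has 9 fixed columns; sample IDs start at column 10.
    awk -F'\t' '/^#CHROM/ { for (i = 10; i <= NF; i++) print $i; exit }' "$@"
}
```

For a compressed file, something like `gunzip -c in.vcf.gz | list_samples | wc -l` should report the number of samples, which the question gives as 1092 for the phase 1 release.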

Once you have the individual VCF files, you could use GATK's FastaAlternateReferenceMaker to get the alternate genome for each individual:

parallel --jobs <int> "java -jar GenomeAnalysisTK.jar -T FastaAlternateReferenceMaker -R reference.fasta -o {}.fasta -V {}.vcf" ::: `grep -m1 "^#CHROM" in.vcf | cut -f 10-`

Hope this is what you are looking for.


Thanks, vcf-subset will work to get the individual samples in VCF format. After getting the individual samples, which sequence should I use as the reference to extract the FASTA sequence from each VCF file? Is it the entire genome mentioned for phase 1 of the 1000 Genomes Project, "ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/human_g1k_v37.fasta.gz", or does the 1000 Genomes Project provide per-chromosome references?


human_g1k_v37.fasta.gz should be used, as the variants were called against that genome. You can split the FASTA into individual chromosomes if you only need chr1. There are several posts about splitting a FASTA file by chromosome.
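As a rough illustration of pulling one chromosome out of a multi-FASTA reference, here is a minimal awk-based sketch. The function name extract_chrom is hypothetical, and the sketch assumes the record name is the first word after ">" on each header line (in human_g1k_v37 the chromosomes are named "1", "2", ... without a "chr" prefix, as far as I know):

```shell
# extract_chrom: print a single record from a multi-FASTA file.
# Hypothetical helper; usage: extract_chrom NAME [FILE]  (reads stdin if no FILE)
extract_chrom() {
    name=$1
    shift
    awk -v name="$name" '
        # On each header line, take the first word after ">" as the record name
        # and decide whether to keep this record (forced string comparison).
        /^>/ { split(substr($0, 2), h, /[ \t]/); keep = ((h[1] "") == (name "")) }
        # Print header and sequence lines while inside the wanted record.
        keep
    ' "$@"
}
```

For example, `gunzip -c human_g1k_v37.fasta.gz | extract_chrom 1 > chr1.fa` should write only chromosome 1. If samtools is installed, `samtools faidx` on the uncompressed reference is a common alternative.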
