Haplotype frequencies from 1000 genomes
9.6 years ago
whittlemr ▴ 40

I'm trying to pull out haplotype frequencies from the 1000 genomes dataset. Suppose I have the interval Chr21: 20,548,907 - 20,549,196 in which there are about 10 SNPs. I wish to identify all the different phased haplotypes in the dataset (1092 individuals, or a subset of them) for this 300 bp region and then count them so as to determine their frequencies.

I've downloaded the Chr21 data (ALL.chr21.integrated_phase1_v3.20101123.snps_indels_svs.genotypes.vcf.gz) and can visualize the genotypes using IGV or Genome Browse but do not know how to manipulate them so as to count the different haplotypes.

Any help on this would be great, thanks.

snp genome • 5.5k views

You should use 1000 Genomes phase 3 instead of phase 1 data for this, since it includes phased haplotype information.


Many thanks. I've now downloaded three equivalent vcf files (phase 3, phase 1, phase 1 no SHAPEIT) from 1000 genomes and am looking at one SNP using GenomeBrowse (Golden Helix), see attached xls. Is it not columns C, G and K which hold the phase information? So samples HG00097 and HG00106 are heterozygous at this SNP and the phase information is different in the three vcf files. But if phase 1 data does not contain the phased haplotypes, then columns G and K shouldn't even exist... Can you elaborate? Thanks.


Sorry, can't seem to attach xls file...


Thanks. The key portion of your comprehensive answer concerns the writing of the code.....


This should be a comment on donfreed's answer, not an answer of its own. Be more careful please.


I used samtools phase to divide 1000 genomes bam files into phase 0 and 1. For one snp genotyped as GC, does phase 0 correspond to G and phase 1 correspond to C?


This is not an answer, it should be a comment on a relevant post. I'm moving it to a comment now, please be more careful in the future.


If G is the reference allele and C is the alternate allele, then the answer to your question is 'yes'.


This belongs as a comment on @azmanr's post, not as an answer. Please be more mindful in the future.


Okay, so when I use samtools phase to get two consensus sequences for each phase, does phase 0 correspond to the reference sequence while phase 1 is the alternate consensus sequence? It is my understanding that if I have a phased bam file I can get genotypes specific to a haplotype and can use samtools phase to get the consensus sequences for each haplotype.

Thanks, Azman

9.6 years ago
donfreed ★ 1.6k

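The 1000 Genomes VCFs are large, but an example dataset with three individuals and three SNPs might help. From column 10 on (the sample columns that follow the FORMAT column), the genotypes in your VCF file should be represented like this, with one row per SNP and one column per individual: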

0|1   1|1   0|0
1|0   1|1   1|1
1|0   0|0   1|1

Over a specific region of the genome, you can count the haplotypes that are present by transposing the data: for each individual in turn (starting with the first), read down the alleles on the left of the '|' to get one haplotype and down the alleles on the right to get the other. Since each SNP is either reference or variant, you can represent reference calls as '0' and variant calls as '1', giving you the haplotypes: 011, 100, 110, 110, 011, 011. For a VCF from phase 1 of 1000 Genomes, this would give you 2184 haplotypes (two from each of the 1092 individuals).

From here it is easy to see that there are three haplotypes present in our example dataset: 011, 100 and 110. Haplotype 011 is present 3 times in 2 individuals, haplotype 100 is present only once while haplotype 110 is present twice in a single individual. To get the frequencies of these haplotypes, just divide by the total number of haplotypes in your dataset:

011 = 3/6 = 0.5
100 = 1/6 = 0.1667
110 = 2/6 = 0.3333
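
The transposition and counting described above can be sketched in a few lines of Python. This is only a minimal sketch: it assumes the genotype columns have already been split into plain 'a|b' strings (one list per SNP, one entry per individual), and the function name `count_haplotypes` is made up for this example, not part of any library.

```python
from collections import Counter


def count_haplotypes(genotype_rows):
    """Count phased haplotypes from rows of 'a|b' genotype strings.

    genotype_rows: one list per SNP, one 'a|b' string per individual.
    Returns a Counter mapping haplotype string -> number of copies.
    """
    if not genotype_rows:
        return Counter()
    n_individuals = len(genotype_rows[0])
    # Two haplotypes per individual: alleles left and right of the '|'.
    haplotypes = [[] for _ in range(2 * n_individuals)]
    for row in genotype_rows:
        if len(row) != n_individuals:
            raise ValueError("every SNP row needs one genotype per individual")
        for i, gt in enumerate(row):
            if "|" not in gt:
                raise ValueError(f"unphased or malformed genotype: {gt!r}")
            left, right = gt.split("|")
            haplotypes[2 * i].append(left)
            haplotypes[2 * i + 1].append(right)
    return Counter("".join(h) for h in haplotypes)


# The three-SNP, three-individual example from this answer.
rows = [
    ["0|1", "1|1", "0|0"],
    ["1|0", "1|1", "1|1"],
    ["1|0", "0|0", "1|1"],
]
counts = count_haplotypes(rows)
total = sum(counts.values())  # 2 haplotypes per individual
frequencies = {hap: n / total for hap, n in counts.items()}
for hap in sorted(frequencies):
    print(hap, counts[hap], round(frequencies[hap], 4))
```

On the example data this prints counts of 3, 1 and 2 for 011, 100 and 110, matching the frequencies listed above.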

I am not aware of publicly available software that does exactly this, so you will probably need to write the code yourself.
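
As a starting point for that code, here is a rough sketch that works directly on VCF text lines and restricts the count to an interval. It is written under several assumptions: the chromosome is named "21" (no "chr" prefix) in the file, GT is the first subfield of each sample column, every record in the interval is phased and diploid, and the function name `region_haplotype_counts` as well as the toy records below are invented for illustration.

```python
from collections import Counter


def region_haplotype_counts(vcf_lines, chrom, start, end):
    """Count phased haplotypes over records with start <= POS <= end."""
    per_sample = None  # one (left_alleles, right_alleles) pair per sample
    for line in vcf_lines:
        if line.startswith("#"):
            continue  # skip meta-information and the #CHROM header line
        fields = line.rstrip("\n").split("\t")
        if fields[0] != chrom or not (start <= int(fields[1]) <= end):
            continue
        samples = fields[9:]  # sample columns follow the FORMAT column
        if per_sample is None:
            per_sample = [([], []) for _ in samples]
        for (left, right), entry in zip(per_sample, samples):
            gt = entry.split(":")[0]  # assumes GT is the first subfield
            if "|" not in gt:
                raise ValueError(
                    f"unphased genotype {gt!r} at {fields[0]}:{fields[1]}"
                )
            a, b = gt.split("|")
            left.append(a)
            right.append(b)
    counts = Counter()
    for left, right in per_sample or []:
        counts["".join(left)] += 1
        counts["".join(right)] += 1
    return counts


# Toy records (invented positions and alleles), three inside the interval
# from the question and one outside it.
demo_lines = [
    "##fileformat=VCFv4.1",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tS1\tS2\tS3",
    "21\t20548910\t.\tA\tG\t.\tPASS\t.\tGT\t0|1\t1|1\t0|0",
    "21\t20549000\t.\tC\tT\t.\tPASS\t.\tGT:DS\t1|0:1.0\t1|1:2.0\t1|1:2.0",
    "21\t20549100\t.\tG\tA\t.\tPASS\t.\tGT\t1|0\t0|0\t1|1",
    "21\t20550000\t.\tT\tC\t.\tPASS\t.\tGT\t1|1\t1|1\t1|1",
]
counts = region_haplotype_counts(demo_lines, "21", 20548907, 20549196)
total = sum(counts.values())
for hap, n in counts.most_common():
    print(hap, n, round(n / total, 4))
```

For the downloaded file, a text handle from `gzip.open(path, "rt")` could be passed as `vcf_lines`. That streams the whole chromosome, so pulling out the interval first with an indexed tool such as tabix would be much faster. The sketch filters on POS only, so any indels or structural variants starting in the interval are included as extra sites.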

Note that the quality of your haplotype counts depends on both the quality of the called variants and the quality of the phasing.
