Question: Haplotype frequencies from 1000 genomes
4.6 years ago, whittlemr wrote:

I'm trying to pull out haplotype frequencies from the 1000 genomes
dataset. Suppose I have the interval Chr21: 20,548,907 - 20,549,196 in
which there are about 10 SNPs. I wish to identify all the different
phased haplotypes in the dataset (1092 individuals, or a subset of them)
for this 300 bp region and then count them so as to determine their
frequencies.

I've downloaded the Chr21 data
(ALL.chr21.integrated_phase1_v3.20101123.snps_indels_svs.genotypes.vcf.gz)
and can visualize the genotypes using IGV or GenomeBrowse but do not know how to
manipulate them so as to count the different haplotypes.

Any help on this would be great, thanks.

snp genome • 3.6k views
modified 14 months ago • written 4.6 years ago by whittlemr

You should use 1000 Genomes phase 3 instead of phase 1 data for this, since it includes phased haplotype information.

written 4.6 years ago by chrchang523

Many thanks. I've now downloaded three equivalent vcf files (phase 3, phase 1, phase 1 no SHAPEIT) from 1000 genomes and am looking at one SNP using GenomeBrowse (Golden Helix), see attached xls. Is it not columns C, G and K which hold the phase information? So samples HG00097 and HG00106 are heterozygous at this SNP and the phase information is different in the three vcf files. But if phase 1 data does not contain the phased haplotypes, then columns G and K shouldn't even exist... Can you elaborate? Thanks.

written 4.6 years ago by whittlemr

Sorry, can't seem to attach xls file...

written 4.6 years ago by whittlemr

Thanks. The key portion of your comprehensive answer concerns the writing of the code.....

written 4.6 years ago by whittlemr

This should be a comment on donfreed's answer, not an answer of its own. Be more careful please.

written 14 months ago by RamRS

I used samtools phase to divide 1000 genomes bam files into phase 0 and 1. For one snp genotyped as GC, does phase 0 correspond to G and phase 1 correspond to C?

written 14 months ago by azmanr

This is not an answer, it should be a comment on a relevant post. I'm moving it to a comment now, please be more careful in the future.

written 14 months ago by RamRS

If G is the reference allele and C is the alternate allele, then the answer to your question is 'yes'.

written 14 months ago by whittlemr

This belongs as a comment on @azmanr's post, not as an answer. Please be more mindful in the future.

written 14 months ago by RamRS

Okay, so when I use samtools phase to get two consensus sequences for each phase, does phase 0 correspond to the reference sequence while phase 1 is the alternate consensus sequence? It is my understanding that if I have a phased bam file I can get genotypes specific to a haplotype and can use samtools phase to get the consensus sequences for each haplotype.

Thanks, Azman

written 14 months ago by azmanr

4.6 years ago, donfreed (Mountain View, CA) wrote:

The 1000 Genomes VCFs are large, but an example dataset with three individuals might help. From column 10 on (column 9 is the FORMAT field), the phased genotype calls in your VCF file should be represented like this:

0|1   1|1   0|0

1|0   1|1   1|1

1|0   0|0   1|1

Over a specific region of the genome, you could count the haplotypes which are present by transposing the data (starting with the first individual): read down each individual's column, taking the allele to the left of each '|' as one haplotype and the allele to the right as the other. Since the SNPs are either variant or reference, you can represent reference calls as '0' and variant calls as '1', giving you the haplotypes: 011, 100, 110, 110, 011, 011. For a VCF from phase 1 of 1000 Genomes, this would give you 2184 haplotypes (two from each of the 1092 individuals).
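The transposition described above could be sketched in Python roughly as follows. This is only an illustration on the three-individual example: the function name `haplotypes_from_rows` is made up here, and it assumes the genotypes have already been pulled out of the VCF as plain 'a|b' strings, one row per SNP.

```python
def haplotypes_from_rows(rows):
    """Turn per-SNP rows of phased genotypes into per-individual haplotypes.

    rows: list of lists, one inner list per SNP, holding one 'a|b' string
    per individual. Returns a flat list with two haplotype strings per
    individual, in sample order.
    """
    n_samples = len(rows[0])
    haplotypes = []
    for i in range(n_samples):
        # Split every genotype of individual i into (left allele, right allele).
        alleles = [row[i].split("|") for row in rows]
        haplotypes.append("".join(a[0] for a in alleles))  # first chromosome copy
        haplotypes.append("".join(a[1] for a in alleles))  # second chromosome copy
    return haplotypes


# The example dataset from this answer: 3 SNPs (rows) x 3 individuals (columns).
rows = [
    ["0|1", "1|1", "0|0"],
    ["1|0", "1|1", "1|1"],
    ["1|0", "0|0", "1|1"],
]
print(haplotypes_from_rows(rows))
# ['011', '100', '110', '110', '011', '011']
```

Joining allele codes into a string only stays unambiguous while every allele index is a single digit, which holds for biallelic SNPs.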

From here it is easy to see that there are three haplotypes present in our example dataset: 011, 100 and 110. Haplotype 011 is present 3 times in 2 individuals, haplotype 100 is present only once while haplotype 110 is present twice in a single individual. To get the frequencies of these haplotypes, just divide by the total number of haplotypes in your dataset:

011 = 3/6 = 0.5

100 = 1/6 = 0.1667

110 = 2/6 = 0.3333
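The counting step is short enough to sketch with Python's standard `collections.Counter`. The function name `haplotype_frequencies` is just a placeholder, and the input is assumed to be the flat list of haplotype strings from the example.

```python
from collections import Counter


def haplotype_frequencies(haplotypes):
    """Return {haplotype: frequency} for a list of haplotype strings."""
    counts = Counter(haplotypes)
    total = len(haplotypes)  # two haplotypes per individual
    return {hap: n / total for hap, n in counts.items()}


# The six haplotypes from the three-individual example.
haps = ["011", "100", "110", "110", "011", "011"]
for hap, freq in sorted(haplotype_frequencies(haps).items()):
    print(hap, round(freq, 4))
```

To restrict the calculation to a subset of individuals, you would keep only their haplotypes in the list before counting, so that the denominator shrinks accordingly.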

I'm not aware of publicly available software that does exactly this, so you may need to write the code yourself.

Note that the quality of your haplotype counts is dependent upon both the quality of the called variants and the quality of the phasing.
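One cheap sanity check along these lines is to drop any individual who has an unphased or missing call in the region. In VCF, '|' separates phased alleles, '/' separates unphased ones, and '.' marks a missing allele. The sketch below is an assumption-laden example, not a tested pipeline: the function name `phased_haplotypes`, the sample names S1 to S3 and the positions are invented, and it expects the region's lines (including the `#CHROM` header line) to have been extracted already.

```python
def phased_haplotypes(vcf_lines):
    """Map sample name -> (haplotype 1, haplotype 2) from VCF text lines.

    Samples with any unphased ('/') or missing ('.') genotype are dropped.
    Assumes the '#CHROM' header line precedes the data lines.
    """
    samples = []
    per_sample = []  # list of (left, right) allele pairs, or None once rejected
    for line in vcf_lines:
        if line.startswith("##"):
            continue  # meta-information lines
        fields = line.rstrip("\n").split("\t")
        if line.startswith("#CHROM"):
            samples = fields[9:]  # sample names start at column 10
            per_sample = [[] for _ in samples]
            continue
        for i, entry in enumerate(fields[9:]):
            if per_sample[i] is None:
                continue  # sample already rejected at an earlier SNP
            gt = entry.split(":")[0]  # GT is the first subfield when present
            if "|" not in gt or "." in gt:
                per_sample[i] = None
            else:
                per_sample[i].append(tuple(gt.split("|")))
    return {
        name: ("".join(p[0] for p in pairs), "".join(p[1] for p in pairs))
        for name, pairs in zip(samples, per_sample)
        if pairs is not None
    }


# Made-up lines; S3 has an unphased call at the first SNP and is dropped.
example = [
    "##fileformat=VCFv4.1",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tS1\tS2\tS3",
    "21\t100\t.\tA\tG\t.\tPASS\t.\tGT\t0|1\t1|1\t0/0",
    "21\t200\t.\tC\tT\t.\tPASS\t.\tGT\t1|0\t1|1\t1|1",
    "21\t300\t.\tG\tA\t.\tPASS\t.\tGT\t1|0\t0|0\t1|1",
]
print(phased_haplotypes(example))
```

This only catches calls that are flagged as unphased or missing; it says nothing about phasing errors in calls that are marked as phased.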

modified 4.6 years ago • written 4.6 years ago by donfreed