What Are Phased And Unphased Genotypes?
4
40
11.1 years ago
Nick ▴ 370

As the title implies, "what are phased and unphased genotypes?" I am playing with 1000 genomes data and am not sure if I should be handling phased/unphased genotypes differently.

Documentation on the internet seems to be quite sparse...

genotyping genome • 65k views
40
11.1 years ago

Phased data are ordered along one chromosome and so from these data you know the haplotype. Unphased data are simply the genotypes without regard to which one of the pair of chromosomes holds that allele.

15

A biallelic genotype comes from two chromosomes. Phased means I know not only the genotypes but which chromosome each genotype call came from. This lets you interpret which sets of genotypes are being inherited together; google haplotype if this isn't clear.
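To make the distinction concrete, here is a minimal sketch in Python. It assumes VCF-style GT strings, where `/` separates unphased alleles and `|` separates phased ones, and it assumes no missing calls (`.`). The function names `parse_gt` and `same_genotype` are made up for this illustration and do not come from any library.

```python
def parse_gt(gt):
    """Split a VCF-style GT string into (alleles, is_phased).

    Assumes a diploid call with integer allele codes and no missing alleles.
    """
    phased = "|" in gt
    sep = "|" if phased else "/"
    alleles = tuple(int(a) for a in gt.split(sep))
    return alleles, phased


def same_genotype(gt_a, gt_b):
    """Compare two GT strings; allele order only counts when both are phased."""
    alleles_a, phased_a = parse_gt(gt_a)
    alleles_b, phased_b = parse_gt(gt_b)
    if phased_a and phased_b:
        return alleles_a == alleles_b
    return sorted(alleles_a) == sorted(alleles_b)


print(same_genotype("0/1", "1/0"))  # True: unphased, order carries no meaning
print(same_genotype("0|1", "1|0"))  # False: phased, the alleles sit on different chromosomes
```

Note that the order in a phased call is only meaningful relative to other phased sites in the same block; on its own, a single heterozygous site tells you nothing more when phased than when unphased.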

4

No. A lot of depth is not needed to call major and minor allele. First, there is no such thing as major/minor for an individual; those are population values. Allele calls for an individual's sample are based on sequence quality - so two reads can do it, one with an A and one with a G. If high quality, the subject is a heterozygote. SNPs from the 1000G data are in dbSNP 132, I believe.

0

The genotype at hand would need a lot of depth and allele counts to determine the major and minor alleles then, right? Plus, even though the phased data is "ordered," the order of the bases doesn't really matter, right (Aa is the same as aA)?

0

Sorry I wasn't clear with the 2nd question. Say I have a genotype called as 1|0. This is the same as 0|1 right? Also, from the example you provided above, supposing one of the reads was horrible (we aren't sure if the called G is really a G), then instead of having a "phased" AG genotype we would have an "unphased" AG genotype?

17
11.1 years ago
Genotepes ▴ 950

Hi

Actually (I think) phased or unphased status is not related to any measure of quality. For each individual, there are two chromosomes, labelled paternal and maternal (arbitrarily, when you do not have the genotypes of the parents). The names are self-explanatory.

For a heterozygous genotype at a SNP position (which is called conditional on some quality score), you may know which allele is on the maternal chromosome and which one is on the paternal chromosome. The genotype is then "ordered". If, for a heterozygous call at another SNP position (still conditional on the quality), you are also able to assign which allele is on the paternal chromosome and which one is on the maternal, then you are able to phase these two SNPs, or more precisely, to phase the alleles at these SNPs. You then get a haplotype, or a series of "ordered" SNPs.

In this context, having ordered 0/1 at SNP 1 and 1/0 at SNP 2 is not the same as having ordered 0/1 at SNP 1 and 0/1 at SNP 2, even though both individuals are heterozygous at both SNPs.

The first gives:        while the second gives:

    0   1                   0   0
    -----                   -----
    1   0                   1   1
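The two cases in the diagram above can be sketched in a few lines of Python. This is only an illustration: the helper name `haplotypes` and the tuple encoding (allele on the first chromosome, allele on the second chromosome) are assumptions of the sketch, not a standard format.

```python
def haplotypes(ordered_gts):
    """Turn per-SNP ordered genotypes into the two haplotypes they imply.

    ordered_gts: list of (allele_on_first_chromosome, allele_on_second_chromosome),
    one tuple per SNP, in SNP order.
    """
    first = tuple(gt[0] for gt in ordered_gts)
    second = tuple(gt[1] for gt in ordered_gts)
    return first, second


case_1 = [(0, 1), (1, 0)]  # ordered 0/1 at SNP 1, 1/0 at SNP 2
case_2 = [(0, 1), (0, 1)]  # ordered 0/1 at SNP 1, 0/1 at SNP 2

print(haplotypes(case_1))  # ((0, 1), (1, 0))
print(haplotypes(case_2))  # ((0, 0), (1, 1))

# Dropping the order (unphased view) makes the two cases indistinguishable:
unphased_1 = [tuple(sorted(gt)) for gt in case_1]
unphased_2 = [tuple(sorted(gt)) for gt in case_2]
print(unphased_1 == unphased_2)  # True
```

Both individuals are heterozygous at both SNPs, so the unphased genotypes are identical; only the ordered version tells you which alleles travel together.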


Now, one could use some pre-estimated phase information from a panel population - typically different from the population where you call your alleles - to help call an allele when the quality is low. This is what BEAGLECALL does, usually in a chip genotyping context.

As for the 1000G data, having the phased data helps to get a better estimate of linkage disequilibrium. It also means that the file format may differ, so you need to take care when you use it as input. But apart from the input format and the extra information about LD, the ways you would use phased and unphased data here are not really different.

Christian

PS: sorry if I went too far into the basics

0

I realise the format is not what I expected

Trying to re-display this.

Genotype 1 would be 1 0 / 0 1 while genotype 2 would be 0 0 / 1 1

0

Hi, I've read about the concept of phased haplotypes and ordered genotypes but never worked with any data. When the OP says they have a genotype called as 0|1, what are the numbers? Is it paternal allele / maternal allele, or is the paternal allele always 0 and the maternal allele always 1?

0

Hi, I've read about the concept of phased haplotypes and ordered genotypes but never worked with any data. When the OP says they have a genotype called as 0|1, what are the numbers? Is it paternal allele / maternal allele, so paternal allele = 0 and maternal allele = 1 for this SNP?

0

Hi - sorry to bump this ancient thread. I've been scratching my head over this for some time now.
What does 01/00 mean? If an individual in 1000G (Phase 3) has the 00 haplotype for SNP1, what does that tell us?

Thanks.

0

My experience is with phased data from 1000 genomes for imputation programs (so not vcf files). There, you have one line per chromosome (in a .haplo type file) - I think paternal is the first one. There, the 0 and 1 refer to a code from a descriptive marker file.

Let's say rs1 has alleles A and G and rs2 has alleles C and T. Then

ind1 0 1
ind1 0 0

means that ind1 bears haplotypes

A - T
A - C

If the convention is paternal/maternal, then 0/0 - 0/1.

Could you tell us which file you are using? What I was referring to is a 1000G processed file intended for software like IMPUTE or MACH.

0

Excuse me:

In the phased genotype, the paternal allele is the first. Are there any materials that could confirm this?

Many thanks.

0

Hi. I have never used any of this data. I was just reading the question out of general interest and wasn't familiar with the notation :)

0

If the paternal haplotype is A-T and the maternal haplotype is A-C, why isn't the second notation 0/0 - 1/0 or, using the bases, A/A - T/C?

0

Let me check... You're right, -1 for me. It was after a long night. Apologies. Let me rephrase (although I am sure you understand): if ind1 is 0 1 at rs1 and 0 0 at rs2, he will have the haplotypes (00) / (10). Or (AT)/(GT).

I really screwed up the example, but it is not easy to do haplotype things...

Apologies

6
4.5 years ago
Jerry Zhu ▴ 80
5
11.1 years ago

If you are analysing the 1000G data taking each SNP as an independent data point, you most probably don't need phased data. If what you are studying is a correlation between, say, pairs of SNPs, something that can be influenced by recombination, like linkage disequilibrium or selective sweeps, then you need phased data.
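As a rough illustration of why pairwise statistics need phase, here is a small Python sketch that computes the usual LD measures D and r² for two biallelic SNPs directly from haplotypes. The function name `ld_stats` and the toy samples are made up for this example; real analyses would normally go through a dedicated tool rather than hand-rolled code.

```python
def ld_stats(haplotypes):
    """Return (D, r_squared) for two biallelic SNPs.

    haplotypes: list of (allele_at_snp1, allele_at_snp2) pairs coded 0/1,
    one pair per chromosome (so two pairs per diploid individual).
    """
    n = len(haplotypes)
    p_a = sum(h[0] for h in haplotypes) / n  # frequency of allele 1 at SNP 1
    p_b = sum(h[1] for h in haplotypes) / n  # frequency of allele 1 at SNP 2
    p_ab = sum(1 for h in haplotypes if h[0] == 1 and h[1] == 1) / n
    d = p_ab - p_a * p_b
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    r2 = d * d / denom if denom > 0 else float("nan")
    return d, r2


# Two toy samples of two individuals each. Every individual is heterozygous
# at both SNPs, so the unphased genotypes are identical in both samples.
coupling = [(0, 0), (1, 1), (0, 0), (1, 1)]
mixed = [(0, 0), (1, 1), (0, 1), (1, 0)]

print(ld_stats(coupling))  # (0.25, 1.0)
print(ld_stats(mixed))     # (0.0, 0.0)
```

The two samples look the same at the genotype level but give completely different LD once the haplotypes are known, which is the kind of information that is lost without phase.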