Question: How Do Heterozygotes And Somatic Mutations Manifest In Sequencing Projects
5.9 years ago by
User 6659920
User 6659920 wrote:


I have only read about DNA sequencing and never seen the actual results from a sequencing project. I'm wondering how heterozygotes and somatic mutations show up in sequencing results. This is my understanding of a sequencing project:

1) Extract DNA, typically from blood cells.
2) Make a clone library. There is a formula that works out how many clones you need to make sure all of the DNA of a heterozygous individual is represented in a clone (by all of the DNA I mean both copies of a chromosome).
3) Sequence the clones. The sequencing project has an overall coverage. On a genome basis, this means that, on average, each base has been sequenced a certain number of times (10X, 20X...). For a specific nucleotide, it is the number of reads that cover that nucleotide.

If the individual is heterozygous at a locus you will see 2 alleles at that position. You would expect to see each allele in approximately 50% of the sequencing reads. However, is it correct that nothing stops the clone library from overrepresenting one chromosome, so that you do not get a 50:50 distribution of the alleles?

Considering somatic mutations: it is possible that one of your blood cells has a spontaneous mutation at a particular locus, and that a DNA fragment from this cell is inserted into the clone library. Whilst I imagine this is very rare, is it possible? How would this show up in your sequencing results? Let's say a locus has 25x coverage and only one of those reads shows a different allele from the others because of the somatic mutation. Would it be classed as a sequencing error, or would you class the locus as heterozygous? If that locus was already heterozygous, you could in theory get 3 alleles there, I presume?

thanks a lot

ADD COMMENTlink modified 5.2 years ago by Malachi Griffith14k • written 5.9 years ago by User 6659920
5.2 years ago by
Washington University School of Medicine, St. Louis, USA
Malachi Griffith14k wrote:

When thinking about somatic variants and the allele frequencies of those variants it is important to consider the purity and heterogeneity (aka clonality) of the samples being sequenced. In cancer genome sequencing it is rare that a tumor sample is 100% pure (no normal cells) and is composed of identical tumor cells each containing the same mutations. Tumors change over time and a single tumor may contain multiple 'sub-clones'. Until we are sequencing DNA of individual cells we need to remember that we are sampling from a population of entities that are not guaranteed to be identical.

For the following discussion:

Variant allele frequency = read count supporting mutant base / total read count at that position

For heterozygous somatic variants we can expect a tumor variant allele frequency of 50% and a normal variant allele frequency of 0%. Of course there will always be some variability due to some of the technical artifacts already discussed (random sequencing errors, etc.).
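As a minimal sketch of the definition above, the variant allele frequency can be computed directly from read counts. The function name `vaf` and the read counts are made up for illustration.

```r
# Variant allele frequency from read counts at one position.
# `vaf`, `alt_reads` and `total_reads` are hypothetical names for this sketch.
vaf <- function(alt_reads, total_reads) {
  stopifnot(all(total_reads > 0),
            all(alt_reads >= 0),
            all(alt_reads <= total_reads))
  alt_reads / total_reads
}

# Made-up counts: 12 of 25 tumor reads and 0 of 30 normal reads carry the variant.
tumor_vaf  <- vaf(12, 25)   # 0.48, close to the 50% expected for a heterozygous variant
normal_vaf <- vaf(0, 30)    # 0
```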

Imagine you are identifying somatic variants in tumor DNA by comparison to normal DNA. The simplest case is where the tumor sample is pure (contains 0 contaminating normal cells) and is perfectly homogeneous (all tumor cells contain exactly the same set of mutations). In this case, we should see a tumor variant allele frequency close to 50%.

Effect of poor tumor purity:

If the tumor sample contains contaminating normal cells, this will reduce the observed tumor variant allele frequency. Similarly, if the 'normal' sample was adjacent to the tumor it may contain some tumor cells and the observed normal variant allele frequency may be higher than 0%. Even if blood is used to obtain constitutional (germline) DNA for identifying somatic variation, it may still contain circulating tumor cells or cell-free DNA that came from tumor cells.

Effect of complex tumor heterogeneity:

Even if a tumor is 'pure' with no contaminating normal cells/DNA we may still see somatic variants that have a tumor variant allele frequency of less than 50%. For example, imagine our tumor consists of a founder clone and a sub-clone derived from it. Perhaps the sub-clone comprises 40% of the cells in the overall tumor mass that we extracted and sequenced. In this case somatic variants that exist only in the sub-clone will have a substantially reduced tumor variant allele frequency: a heterozygous variant present in 40% of the cells of a diploid tumor would be expected at about 20%.

When both purity and heterogeneity are in play, interpreting tumor variant allele frequencies just gets more and more complicated ...
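To make the combined effect concrete, here is a rough sketch of the expected tumor variant allele frequency for a heterozygous somatic variant. It assumes a diploid region with no copy number change and exactly one mutant copy per mutated cell; the function name `expected_vaf` and the example purity of 70% are illustrative assumptions.

```r
# Expected VAF of a heterozygous somatic variant.
# Assumptions: diploid region, one mutant copy per mutated cell.
# purity         = fraction of sequenced cells that are tumor cells
# clone_fraction = fraction of tumor cells carrying the variant
expected_vaf <- function(purity, clone_fraction = 1) {
  purity * clone_fraction * 0.5
}

expected_vaf(1)          # pure, homogeneous tumor: 0.5
expected_vaf(0.7)        # 70% purity: 0.35
expected_vaf(1, 0.4)     # pure tumor, variant only in a 40% sub-clone: 0.2
expected_vaf(0.7, 0.4)   # both effects together: 0.14
```

The two effects simply multiply in this simplified model, which is why a low observed frequency alone cannot tell you whether purity or sub-clonality is responsible.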

ADD COMMENTlink written 5.2 years ago by Malachi Griffith14k

I think things get even more complicated if you factor in ploidy (many cancers are not diploid).

I have been thinking of a formula that computes the cancer cell fraction (CCF) from measured variant allele frequencies (VAF), taking into account known ploidy and purity. Do you happen to know how to do this?

In Figure 1 of Landau et al., they present an example with VAF = 0.125, ploidy = 3, purity = 67%, and a resulting CCF = 0.5. However, my own naive calculation, 0.125 x 3 / 0.67 = 0.56, is slightly off. What am I missing?

ADD REPLYlink modified 3.0 years ago • written 3.0 years ago by Christian2.4k

Think I figured it out:

f = variant allelic frequency = 0.125
t = ploidy = 3
a = purity = 0.67

CCF = f * (a * t + 2(1-a)) / a = 0.4981343
ADD REPLYlink modified 3.0 years ago • written 3.0 years ago by Christian2.4k

Also, in case anyone is wondering (it took me a minute to notice this), the figure shows the case where you literally have 3 cells: two are tumor and one is normal. So the purity of 0.67 is a rounded version of 2/3 in their example. If you use more significant figures for the purity, you get closer to 0.5 with your equation.
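A short R sketch of the formula from this thread, showing the effect of the rounding. The function name `ccf` and its argument names are chosen here for illustration.

```r
# CCF from VAF, purity and tumor ploidy, using the formula given in this thread:
# CCF = f * (a * t + 2 * (1 - a)) / a
ccf <- function(f, purity, ploidy) {
  f * (purity * ploidy + 2 * (1 - purity)) / purity
}

ccf(0.125, 0.67, 3)   # about 0.498 with the rounded purity
ccf(0.125, 2/3, 3)    # 0.5 with the exact purity of 2/3
```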


Nice, that gives you the MLE. The probability distribution given lowish coverage (used here) can be modeled like this in R:

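# Expected VAF for a given CCF, purity and absolute copy number
# (the CCF formula above, solved for the VAF)
f <- function(CCF, purity, absCN){
    CCF * purity / (purity * absCN + 2 * (1 - purity))
}

# Illustrative read counts (assumed; not given in the post)
coverage=40
alt_alleles=5
CCFs=seq(0.01,1,by=0.01)

# Binomial likelihood of the observed reads at each candidate CCF
probs=sapply(CCFs, function(c){
    dbinom(alt_alleles, coverage, f(c, 2/3, 3))
})

plot(CCFs, probs)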




ADD REPLYlink modified 2.7 years ago • written 2.7 years ago by John St. John940

Very nice! Now I can even put 95% confidence intervals to my CCF estimates.
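One possible way to get such an interval, sketched under assumptions: evaluate the binomial likelihood on a grid of CCF values, normalize it (which amounts to a flat prior over the grid, so strictly speaking this gives a credible interval), and read off the 2.5% and 97.5% quantiles. The function names and the read counts (5 of 40) are illustrative, not taken from the paper.

```r
# Expected VAF for a given CCF (the thread's CCF formula solved for VAF).
expected_vaf <- function(ccf, purity, abs_cn) {
  ccf * purity / (purity * abs_cn + 2 * (1 - purity))
}

# Grid-based interval for the CCF. Assumes a flat prior over `grid`.
ccf_interval <- function(alt_reads, coverage, purity, abs_cn,
                         grid = seq(0.01, 1, by = 0.01), level = 0.95) {
  lik  <- dbinom(alt_reads, coverage, expected_vaf(grid, purity, abs_cn))
  post <- lik / sum(lik)          # normalized likelihood over the grid
  cdf  <- cumsum(post)
  tail <- (1 - level) / 2
  c(lower = grid[which(cdf >= tail)[1]],
    mle   = grid[which.max(lik)],
    upper = grid[which(cdf >= 1 - tail)[1]])
}

# Illustrative counts: 5 variant reads out of 40 (VAF = 0.125).
res <- ccf_interval(alt_reads = 5, coverage = 40, purity = 2/3, abs_cn = 3)
print(res)
```

Restricting the grid to values up to 1 truncates the upper end of the interval, so a wider grid may be more appropriate where amplified mutations are plausible.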

ADD REPLYlink written 2.7 years ago by Christian2.4k

One bit that may or may not concern you: CCF is not an accurate term for what this number means, which is kind of an issue in a lot of these papers. What the number really measures is the average number of mutant copies per tumor cell. This usually doesn't matter, but consider the case where a mutation arises first and the chromosome carrying it is then amplified a few times, so that all tumor cells share the mutation in an amplified state. This CCF value will be greater than 1! That is clearly not a fraction...


The code above, as used in the paper mentioned above, only looks at potential "CCF" values between 0 and 1. If you instead relax that restriction and change the line to CCFs=seq(0.01,3,by=0.01), you will see that the maximum can be over 1 in some of these cases.



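# Expected VAF for a given CCF, purity and absolute copy number
f <- function(CCF, purity, absCN){
    CCF * purity / (purity * absCN + 2 * (1 - purity))
}

# 3 tumor cells with 3 alleles each,
# one normal cell, mutant is on two of the
# three chromosomes in each tumor cell
purity=3/4
absCN=3

# Illustrative read counts matching the expected VAF of 6/11
# (assumed; not given in the post)
coverage=110
alt_alleles=60
CCFs=seq(0.01,3,by=0.01)

# The likelihood now peaks at a "CCF" of 2
probs=sapply(CCFs, function(c){
    dbinom(alt_alleles, coverage, f(c, purity, absCN))
})

plot(CCFs, probs)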
ADD REPLYlink modified 2.7 years ago • written 2.7 years ago by John St. John940

The link of the paper of Landau et al is outdated. Could you please provide the title, year, and first author of the paper?

ADD REPLYlink written 13 months ago by dinglizhongman0
It's Landau et al. (2013): Clonal evolution in hematological malignancies and therapeutic implications. For me the link still works.
ADD REPLYlink written 13 months ago by Christian2.4k
5.5 years ago by
Darlingtonia10 wrote:

Yes, generally you have the right idea, though some details change depending on how the sequencing was performed.

Clone libraries are created because each clone should only incorporate a single copy of whatever locus has been amplified. One then sequences some number of those clones. If you're working with a diploid organism then you can use a formula to calculate the probability that both copies of the locus have been sequenced for the number of clones you have picked. These can be extended to organisms with higher ploidy levels.
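A sketch of one such calculation, assuming each picked clone carries any one of the copies independently and with equal probability (an idealization; real libraries can be biased). It may not be the exact formula the question refers to. The function name `p_all_copies` is hypothetical.

```r
# Probability that every one of `ploidy` copies of a locus appears at least
# once among `n_clones` picked clones (inclusion-exclusion).
# Assumes each clone independently carries one copy chosen uniformly at random.
p_all_copies <- function(n_clones, ploidy = 2) {
  j <- 0:ploidy
  sum((-1)^j * choose(ploidy, j) * ((ploidy - j) / ploidy)^n_clones)
}

# For a diploid this reduces to 1 - 0.5^(n_clones - 1).
n <- 1:12
p <- sapply(n, p_all_copies)

# Smallest number of clones giving at least a 99% chance of seeing both copies.
min_clones <- n[which(p >= 0.99)[1]]   # 8

# The same function for a tetraploid with 12 clones.
p_tetraploid <- p_all_copies(12, ploidy = 4)
```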

Somatic mutations could be picked up while sequencing, and their frequency in the output would correlate with how many cells in your sample carried that mutation. Although I don't do much of these analyses myself, I expect it would be very difficult to distinguish a somatic mutation from sequencing error. If you knew the error rate of your sequencing technique, and especially if you had a specific mutation you were looking for, detecting a somatic mutation should be easier. (There is also the problem of distinguishing between error caused by the sequencing machine vs. errors incorporated upstream such as contamination or a mutation while the DNA was in the clone.)

For the type of sequencing I do (Illumina) I would almost always classify a single read out of 25 as an error. As with anything, however, if you were very confident that your protocols and platforms reduced enough error, you may want to take a second look at the outlier.

As for over-representation of one chromosome or allele over another, it does happen and should be considered. However, I don't know much about these biases so can't say how dramatic they can be or which particular biases a specific platform is prone to.

ADD COMMENTlink written 5.5 years ago by Darlingtonia10
5.2 years ago by
Swbarnes21.3k wrote:

I'd say in general, you should not expect one parental chromosome to be favored over another.

Depending on the techniques used, you might get some bias if one allele looks very different from the other (say, because your PCR primers work better on one allele than the other), but in general you should see each at about 50/50. If you get a very skewed ratio, that suggests that you are looking at a false positive.
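To get a feel for how much deviation from 50/50 is still ordinary sampling noise, here is a small sketch using the binomial distribution at 25x coverage. It assumes reads sample the two alleles independently and without bias; the read counts are made up.

```r
coverage <- 25

# Range of alternate-allele read counts that covers roughly 95% of
# unbiased heterozygous sites at this coverage.
typical_range <- qbinom(c(0.025, 0.975), coverage, 0.5)   # 8 to 17 reads

# Two-sided exact binomial test against a 50/50 expectation.
p_balanced <- binom.test(11, coverage, p = 0.5)$p.value   # 11 of 25: unremarkable
p_skewed   <- binom.test(3,  coverage, p = 0.5)$p.value   # 3 of 25: very unlikely for a true het
```

So at 25x, anything from about a third to two thirds of the reads is compatible with a real heterozygous site, while a handful of reads is not.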

With most technologies, a 1/25 mutation would not be distinguishable from noise. If you had 1000x coverage with high-quality Illumina data, you could spot it. With Sanger, never.
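A back-of-the-envelope sketch of why coverage matters here. The per-base error rate of 1% is an assumed round number for illustration, not a property of any particular platform, and the function name is hypothetical.

```r
# Probability of seeing at least `k` non-reference reads at a position purely
# from sequencing error, with independent errors at rate `err` per read.
p_error_reads <- function(k, coverage, err = 0.01) {
  pbinom(k - 1, coverage, err, lower.tail = FALSE)
}

# 1 odd read out of 25: errors alone produce this quite often (about 0.22).
p_low_cov <- p_error_reads(1, 25)

# The same 1/25 frequency at 1000x means 40 odd reads; with about 10 error
# reads expected, 40 or more is vanishingly unlikely from error alone.
p_high_cov <- p_error_reads(40, 1000)
```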

You can't have three alleles in the same organism, unless there's been a genome duplication. You only have two copies of every gene, one from mom and one from dad.

ADD COMMENTlink written 5.2 years ago by Swbarnes21.3k

I think you wanted to say in your last sentence "You can't have three alleles in the same individual (when we're talking human) or strain / breed (when we're talking inbred organisms)".

ADD REPLYlink written 5.2 years ago by Bert Overduin3.5k
