Why is the noise level higher in regions with amplification than regions with deletion in the genome?
1
0
7.5 years ago
Dataman ▴ 350

Hi all,

I am writing a tool in Python that reads the allele fraction data [1] for each SNP (from a DNA-seq experiment on solid tumor samples) and tries to find the change points in the data track. However, I have noticed that the standard deviation (noise level) in amplified regions is much higher than in deleted regions. Why is this the case? It matters because it affects how well the tool detects change points.

The following figures illustrate the idea. Fig. 1 shows a region in chr4 with a deletion and fig. 2 shows a region in chr9 with an amplification. Fig. 3 is a snapshot of how the tool currently works.

fig 1. chr4 with deletion

fig 2. chr9 with amplification

fig 3. detected change points with different window sizes


[1] The allele fraction for each SNP is calculated as: (number of reads supporting the alternative allele) / (total number of reads at that position)
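For reference, the definition in [1] can be sketched in Python roughly as follows. The function name `allele_fraction` and the choice to return `None` for positions with no coverage are assumptions made for illustration, not necessarily how the tool handles it.

```python
def allele_fraction(alt_reads, total_reads):
    """Fraction of reads at a SNP position that support the alternative allele.

    Returns None when there is no coverage (an assumed convention),
    since the fraction is undefined for zero reads.
    """
    if total_reads == 0:
        return None
    return alt_reads / total_reads


# Hypothetical counts: 14 of 30 reads carry the alternative allele.
print(allele_fraction(14, 30))
```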

1

Could it be due to the fact that there are fewer discrete steps on the way down than up? Lose one allele and you have 1+noise, lose the other and you have noise. Gain an allele and you have 3+noise and then the sky is the limit...

I could also imagine that amplifying something to high copy numbers by, e.g., breakage-fusion-bridge cycles is inherently messy and could result in higher overall variance.

0

I was wondering whether this might have something to do with the base (A, T, C, G) read at each SNP position.

3
7.5 years ago

BTW, amplification is misspelled in figure 2.

That variance increases with signal is expected. In effect, you're drawing from a multinomial distribution, where the per-location variance, n*p*(1-p), increases with the probability p of seeing a read map to that location whenever p < 0.5. In this context p is tiny, so the variance always increases with p. In other words, we would expect amplified regions to have increased variance and deleted regions to have decreased variance.

0

Thanks so much for the answer, Devon.

So, if I understood correctly, the alignment of a specific read to a location in the reference genome is a Bernoulli trial: the read either aligns to that location (success) or it does not. If we assume a depth of coverage of 30x, we are drawing from a binomial distribution, because the Bernoulli trial is performed 30 times on average.

The binomial distribution has two parameters, n and p. For a normal diploid sample at 30x depth of coverage, we can assume n = 30.

Now, let's move from a read to a SNP within that read. Since I am working with heterozygous SNPs, the probability of seeing the alternative allele equals the probability of seeing the reference allele, which is 0.5. So the probability of success is always 0.5 (this is the second parameter of the binomial distribution).

Now let's imagine the different possibilities:

Normal diploid (2N): n = 30, p = 0.5 => variance = 30 * 0.5 * 0.5 = 7.5

Hemizygous deletion (1N): If the normal coverage is 30, we would expect to see 15 reads here, since one of the copies has been lost. The probability of success (seeing the alternative allele) remains the same, so:

n = 15, p = 0.5 => variance = 15 * 0.5 * 0.5 = 3.75

Homozygous deletion (0N): n = 0, p = 0.5 => variance = 0

+1 gain (3N): n = 45, p = 0.5 => variance = 45 * 0.5 * 0.5 = 11.25

+2 gain (4N): n = 60, p = 0.5 => variance = 60 * 0.5 * 0.5 = 15, and so on.
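The arithmetic above can be checked with a short Python sketch. It assumes, as in this comment, about 15 reads per copy (30x for a diploid region) and p = 0.5; the names `binomial_variance` and `READS_PER_COPY` are only illustrative.

```python
READS_PER_COPY = 15  # assumed: 30x coverage spread over 2 copies


def binomial_variance(n, p):
    """Variance of a binomial count with n trials and success probability p."""
    return n * p * (1 - p)


# Variance of the alternative-allele read count for copy numbers 0N to 4N.
variances = {}
for copy_number in range(5):
    n = READS_PER_COPY * copy_number
    variances[copy_number] = binomial_variance(n, 0.5)
    print(f"{copy_number}N: n = {n}, variance = {variances[copy_number]}")
```

Under these assumptions the variance grows linearly with copy number, because n scales with the number of copies while p stays fixed.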

1

Well, you get the general idea, though the details are slightly different (this ends up being a multinomial rather than binomial distribution).

It's not that the alignment to a location represents a Bernoulli trial, but rather that you're sampling reads from a finite number of locations (that is, the reads are drawn from a categorical distribution). The probability of observing a read originating from a given location depends on the ploidy of that location, its mappability, and various library-specific biases. In the end, the variance equations for the multinomial and binomial distributions are the same (n*p*(1-p), though this is per outcome in the multinomial case).

BTW, n would be the total number of reads in the library and p is going to be a really small number (<<0.001), since the probabilities over all locations sum to 1 and there will be a LOT of locations :) But anyway, you already showed that you got the gist in your comment; this was just an FYI.
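To make this concrete, here is a minimal sketch of the per-location mean and variance under the multinomial view. It assumes that p for a location is simply proportional to its copy number, ignoring mappability and library biases; the function name, the toy genome of 10,000 locations, and the read total are all made up for illustration.

```python
def per_location_stats(total_reads, copy_numbers):
    """Return (mean, variance) of the read count at each location.

    Assumption: the probability p of a read coming from a location is
    proportional to that location's copy number only.
    """
    total_weight = sum(copy_numbers)
    stats = []
    for copy_number in copy_numbers:
        p = copy_number / total_weight
        mean = total_reads * p
        variance = total_reads * p * (1 - p)  # per-outcome multinomial variance
        stats.append((mean, variance))
    return stats


# Toy genome: mostly diploid, one hemizygous deletion, one 4N amplification.
copy_numbers = [2] * 9998 + [1, 4]
stats = per_location_stats(300_000, copy_numbers)

diploid, deleted, amplified = stats[0], stats[-2], stats[-1]
for label, (mean, variance) in [("2N", diploid), ("1N", deleted), ("4N", amplified)]:
    print(f"{label}: mean = {mean:.3f}, variance = {variance:.3f}")
```

Because p is so small, (1-p) is close to 1 and the variance is nearly equal to the mean count, so under these assumptions the variance scales roughly with copy number.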

0

So, if I understood correctly, n is always the same (constant), and it is the "probability of observing a read originating from a location" that changes with events such as deletion and amplification, and with the other factors you mentioned earlier, such as ploidy and mappability?

BTW, could you please recommend a source where I can learn more about the statistics of problems like this and, more generally, statistics for bioinformatics?

Thanks again!

1

Yup, you got it!

Unfortunately, I can't recommend any specific statistics books or classes. I've heard good things about the data analysis courses offered on Coursera (the people leading them are really excellent, so I'd be surprised if the courses themselves weren't good). There's a similar course that Istvan Albert reviewed here: Reviewing a MOOC: HarvardX - PH525x Data Analysis for Genomics. I expect that at least some components of those would be useful for you.

Of course, if you're associated with a university, then perhaps they have some decent statistics classes (though the ones offered to scientists are usually terrible).