Question: How To Estimate Coverage Using Gamma-Fit?
0
GAO Yang250 wrote:

Hi, guys~ Here is a short introduction of a K-mer counting tool. Now, I got a similar data like this example , and then got the number of k-mer of the peak. But it's said to estimate the k-mer coverage using the gamma-fit. How? Could anybody show me how to get this coverage information, say, just using this banana slug as example? Thank you!

coverage • 3.9k views
written 7.1 years ago by GAO Yang250
5
Arun2.3k wrote:

Suppose you bin the K-mers to find the frequency or no: of times you find k-mers that occur only once, occur twice etc... and so on and you have them in a vector. Then, because the ideal K-mer distribution (the one without initial errors they talk about under Gamma distribution is wrong) looks like a gamma distribution, they go ahead and fit this frequency values using a model that estimates the best possible parameters of gamma distribution.

A Gamma distribution has two parameters `shape` and `scale` and its `mean = shape * scale`. And that is the `average coverage`. So, all we have to do is fit a gamma distribution and obtain the product of the parameters to get the average coverage (or at least that is what they do, assuming your gamma fit is accurate, which may or may not be. You can check this by plotting your k-mer frequency for different k-mers as they do). So, here's how you could obtain the parameters. I had the answer before, but I deleted it because I thought I interpreted your question wrong.

Just to illustrate that the model works, I'll take a vector of 1000 values from a gamma distribution with `shape=2` and `scale=15`. So, its mean is 30. We can then verify how good the gamma fit is to this vector by looking at the estimated parameters. Here's a script in `R` which uses `BB package`

``````# install.packages("BB") # if you already don't have it.
require(BB)

# generate test data to check model efficiency
set.seed(100)
gamma.fn <- function(n, shape=2, scale=15) {
u <- runif(n)
out <- apply( as.matrix(u), 1, function(x) rgamma(1, shape=shape, scale=scale) )
out
}
out <- gamma.fn(1000)

# maximizing function
f = function(shape, scale) {
-sum(log(dgamma(out, shape=shape, scale=scale) ) )
}

# arbitrary start
start0 <- list("shape"=1.2, "scale"=4 )

fcn <- function(x) f( x, x )
res <- spg(par=as.numeric(start0), fn=fcn, lower=c(0, 0 ), upper=c(Inf, Inf ), control=list( maxit=1000 )  )
``````

Doing this, I obtain the parameters in `res\$par` and `shape=1.956524` and `scale=14.750196`. And mean = 28.85912. The fit seems to be very close to that of your input values for which you wanted a model fit. Now you can try this code by using your k-mer frequency in `out` and estimating the parameters and obtaining mean.

I hope its clear. If not, please feel free to comment.

Wow,amazing~ very detailed teaching,I believe it will help many people, not only me ~ Thank U :)

you're welcome :)

Arun, I tried this on my data, I scan my data into one vector (named "out" too), and copy this code from the " f = function(shape, scale) { " line down to the end, But it's error! I know little about R , Could you plz show me how to modify this code for customized data? thanks~

Hi gaouangnow, It would be then helpful if you post your code here.

sure

``````>require(BB)

>out <- scan("data.txt") # 940 items which are the number of K-mer at diff multiplicity

>f = function(shape, scale) { -sum(log(dgamma(out, shape=shape, scale=scale) ) ) }

>start0 <- list("shape"=1.2, "scale"=4 ) # I also tried diff values for these two var

>fcn <- function(x) f( x, x )

>res <- spg(par=as.numeric(start0), fn=fcn, lower=c(0, 0 ), upper=c(Inf, Inf ), control=list( maxit=940) )
``````

error message: error : spg(par = as.numeric(start0), fn = fcn, lower = c(0, 0), upper = c(Inf, : Failure in initial function evaluation!

I guess out contains the frequency of k-mers? I think the problem is that the numbers are too big. It would be nicer if you could scale out by its mean or max or an arbitrary large value like this:

``````out <- scale("my_file.txt")
m   <- mean(out)
out <- out / m
``````

Now fit the model similarly. If things go well, res\$par will contain the fitted shape and scale parameter for the normalized out. I'll try it out from my office and write back. got to run!

Yes, I have too many counts at the frequency=1 position (sequence error or other stuff), maybe I need to filter this out before doing this gamma-fit?

2
Arun2.3k wrote:

Since this is a bit more detailed, I write here as a separate answer.

The problem I think you face: Your frequency values are too large, but your maximizing function, which starts with initial values you set, are too small. That means, the dgamma() function will give just 0's. log(0) is -Infinity and its sum with negative is Inf. So, your model will never converge as it has an error to begin with. Basically, the scale parameter is too large for you and the initial scale value you choose is very small for the model to work. But choosing a HUGE scale value initially, while will converge nicely, will not give accurate results. To overcome this, we normalize your data by dividing it by an arbitrarily huge number (which you can increase if you find that the result does not fit good) and then fit gamma to the normalized variable and then compute the mean and shape and scale from this one. Sounds difficult, but its really easy and there are only 2 changes to the code.

So, how do we proceed?

`First change: Normalize`

``````out    <- scan("myfile.txt")
out.b  <- out          # take a backup
out    <- out/10000    # divide out by an arbitrarily high value
``````

Remember, mean = expectation of a random variable. That is, E(X) = mean, where X is a random variable (which is out here). Also, E(aX) = a * E(x). So, if you find fit a gamma for out.norm, then, prod(res\$par) * 10000 will be the mean of out. And the corresponding shape and scale parameters of your data, which is in out is, shape=shape of out.norm and scale=scale of normalized out * 10000. Also, normalizing by a high value helps estimate the scale parameter nicely (at least with what I tried). Lets code this now (there are not so many changes:

Same maximizing function (written differently for clarity)

``````# maximizing function
f = function(shape, scale) {
val <- dgamma(out, shape=shape, scale=scale)
t <- -sum(log( val ) )
return(t)
}
``````

`Second change: set scale parameter initial value to mean of the normalized out variable`

``````start0 <- list("shape"=1, "scale"=mean(out) )
``````

Also, `maxit` is the maximum number of iterations. It does not depend on the number of points you have. Choose a higher number if convergence is not reached after that many runs you chose initially. Here I choose 5000 iterations to start with. IF it converges, fine, else, I'll try increasing it and running it again.

``````fcn <- function(x) f( x, x )
res <- spg(par=as.numeric(start0), fn=fcn, lower=c(1, 1 ), upper=c(Inf, Inf ), control=list( maxit=5000 )  )
``````

`res` contains the parameters for normalized `out`. Get back the original parameters for out from its parameters.

``````# get back the parameters
shape_out <- res\$par
scale_out <- res\$par * 10000
mean_out <- shape_out * scale_out
check_mean_out <- mean(out.b) # check if check_mean_out and mean_out are relatively similar.
``````

This should work. Sorry for the trouble. Let me know if things still don't work out.

Wow! I filtered my data and used this "updated" code, now it's done! Although my data was not fit well, the code is excellent! I also gained some R programming skills :) Arun, thanks for yr patience & guidance.