Question: Saturation analysis on downstream processed reads, fitting formula in data
0
gravatar for jaime.alvarez.benayas
4.1 years ago by
Spain
jaime.alvarez.benayas20 wrote:

Hello,

I have DNA sequencing data enriched for a region of DNA, after mapping and doing further processing downstream analysis (including de-duplication of reads) I end up with a subset of informative reads.

I have performed sampling (from 10% of the input sequenced reads to 100%), after each sampling I perform the full analysis pipeline and obtain the number of informative reads.

I am trying to find out what is the saturation point after which, no more additional input reads will yield informative reads in the downstream analysis.

I end up with a table such as the following

sample % 10 20 30 ... 100 Informative reads 1234 1421 1700 ... 3000

I have seen this paper where they are trying to do something similar:

http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3612374/#SD3

However, I don't think I can use this method because of how the data is processed. In the case of the paper, the data is directly assigned to one of multiple groups (for example, bins of a certain genome length), the downstream analysis I am performing just differentiates between "informative" and "non-informative" reads which no subgrouping within the informative reads and more complex processing.

When I plot the table described above, I end up with a similar curve as Michaelis-Menten described by a mathematical function :

y = V x / ( K + x )

Where V and K are parameters.

I have been trying to fit curves to the data to extrapolate the saturation point in Python but I can't seem to get it to work:

import numpy as np
import pandas as pd
import lmfit as lm
import matplotlib.pyplot as plt


if __name__ == "__main__":


    y = # Array of data with corresponding informative reads
    x = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]

    # Define an objective function to minimise
    def obj(params, x, y) :
        model = params["V"].value * x / ( params["K"].value + x)
        return model - y

    # Create parameters object
    params = lm.Parameters()
    params.add("V", value = 20, min = 15, max = 25) # looks about right from plt.plot(x, y)
    params.add("K", value = 40, min = 35, max = 55) # based on our guesses for V

    result = lm.minimize(obj, params, args = (x, y) )

    lm.report_fit(params)

    plt.plot(x, y)
    plt.plot(x, result.residual + y)

I am getting an error:

 model = params["V"].value * x / ( params["K"].value + x)
ValueError: operands could not be broadcast together with shapes (200,) (10,)

Thanks

ADD COMMENTlink modified 4.1 years ago • written 4.1 years ago by jaime.alvarez.benayas20

Im not expert in the matter, but I know * is not what I often think it should be (array multiplication). can you tell us the type() of params["V"].value and x ? And maybe also the value of both at the time it crashes.

ADD REPLYlink modified 4.1 years ago • written 4.1 years ago by John12k

Hello jaime.alvarez.benayas!

It appears that your post has been cross-posted to another site: Cross-posted at Seqanswers

This is typically not recommended as it runs the risk of annoying people in both communities.

ADD REPLYlink written 4.1 years ago by RamRS26k
2
gravatar for Brian Bushnell
4.1 years ago by
Walnut Creek, USA
Brian Bushnell17k wrote:

Cross-posted

ADD COMMENTlink written 4.1 years ago by Brian Bushnell17k

Sorry, I redirected the question at Seqanswers

ADD REPLYlink written 4.1 years ago by jaime.alvarez.benayas20
0
gravatar for jaime.alvarez.benayas
4.1 years ago by
Spain
jaime.alvarez.benayas20 wrote:

Thanks for the hint John, checked and it got me in the right tracks to fix the problem:

>>> type(params["V"].value)
<type 'int'>
>>> type(params["K"].value)
<type 'int'>
>>> type(x)
<type 'list'>

Apparently I was using simple arrays as inputs and when operating def obj(params, x, y) it was trying to multiply scalars x arrays: 3 * [0,1,2] gives [0,1,2,0,1,2,0,1,2], not [0,3,6]

by using x = np.array([10, 20, 30, 40, 50, 60, 70, 80, 90, 100]) and the same for y, I get the program running.

However, no matter what values (and ranges), I put for V and K I always seem to get the same result:

[[Variables]]
    V:   280000     (init= 280000)
    K:   90         (init= 90)
[[Correlations]] (unreported correlations are <  0.100)



[[Variables]]
    V:   10         (init= 10)
    K:   90         (init= 90)
[[Correlations]] (unreported correlations are <  0.100)

I will continue investigating this.

Any other ideas for formulas for saturation of reads to fit the curve?

Thanks!

ADD COMMENTlink written 4.1 years ago by jaime.alvarez.benayas20

Yeah bingo, a list with 1 value. That has tripped me up so many times with numpy, although it's usually ndarrays of len 1. Regarding the rest of the problem, i'm afraid I cannot help much with that because it's a little too confusing for me. Perhaps if I could see your intermediate steps or some graphs i'd have a better idea of whats going on, but I think you'll probably have this cracked by Monday evening.

ADD REPLYlink written 4.1 years ago by John12k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1609 users visited in the last hour