Question

Rip-Seq: How Do You Determine What'S Good Coverage?

3

Entering edit mode

15.1 years ago

Jason ▴ 940

Hi I've performed an RNA-immunoprecipitation and then sequenced the results using NGS. We already have the reads mapped (s. cerevisiae), however we have a simple question which we are having difficulty answering, how many reads must be found for a specific transcript for that transcript to be considered present or statistically significant? We want to answer this question because there are some genes which only have a few reads, while others have over 500 and we think there's always the possibility that those 3 reads could have been due to error while it's much more likely the 500 reads are not. So we want to develop a statistical analysis to answer this question, does anyone know if there is already one out there or if you have any ideas as to how I should start?

Let me be clear I'm dividing my analysis into two questions here, whether the transcript has enough coverage to be considered usable and if it does is it highly enriched?

Also, my application of this experiment is not for determining where splice sites or SNPs are located, rather we really just need your most basic coverage to determine which RNAs are present.

Thanks, I appreciate any input

edit (6/3/10)

After the discussion below with Eric it has occurred to me I can explain my question in another way. I think there have been two types of methods discussed here (see the second answer and comments). One where a strict threshold is set i.e. only look at genes with at least 10 tags, and another where the threshold is based on some sort of stat making the threshold, as it relate to raw tags, variable. I guess in your opinion is one better than another, should I use both or neither? The other point I guess I'm getting at is if I see that this statistical method illustrates very few transcripts have enough tags above background, do I conclude I need more sequencing depth/more replicates?

another edit (6/4/10): Sorry, I failed to mention that we have a test sample in which our protein of interest is myc tagged and the RNA in the IP is being pulled down from our protein of interest. And we have a control in which the strain has no myc tag encoded in the genome (wild-type) so we pull down background RNA (with consistency). So I need to compare my test to my control (background) and see if the test has enough reads above background (statistically speaking) to be considered present in addition to being highly IP'd over other transcripts.

rna statistics coverage • 8.6k views

ADD COMMENT • link updated 11.9 years ago by Biostar 20 • written 15.1 years ago by Jason ▴ 940

score 3 · Answer 1 · 2010-06-01

3

Entering edit mode

15.1 years ago

brentp 24k

Not sure how rip-seq will vary, but, if you havent already, check out the cufflinks paper and supplemental info.

They have a section on estimating transcript abundance with details on how they parameterize the statistical model of "the probablility of observing an RNA-seq fragment".

In fact, you may be able to use cufflinks directly

ADD COMMENT • link 15.1 years ago by brentp 24k

0

Entering edit mode

This is an interesting read however I got lost when all the equations were used (I'm trained as a molecular biologist, but I'm dabbling with bioinformatics). If you have a second, could you explain how you think I could implement their methods, keeping in mind I'm more of a wet lab type of guy? Thanks

ADD REPLY • link 15.1 years ago by Jason ▴ 940

0

Entering edit mode

i'm not suggesting you implement their methods! just suggesting that they (who know about these sort of stats) have created cufflinks and described the model and potential short-comings and example use for something very close to the problem you describe. so ... don't try to implement anything, just try to run top-hat and maybe cufflinks on your reads.

ADD REPLY • link 15.1 years ago by brentp 24k

0

Entering edit mode

The only caveat is that I don't think cufflinks is used when you have a background control or input, since you are essentially measuring input (total RNA) alone. Sorry, I didn't mention originally I had a background control that is sequenced, that sorta changes things. I edited my original post.

ADD REPLY • link 15.1 years ago by Jason ▴ 940

score 2 · Answer 2 · 2010-05-31

2

Entering edit mode

15.1 years ago

Eric Normandeau 11k

This may be an overly simplified reply, but it seems to me that the mere fact of finding a transcript means that it is expressed or present, as you say.

I have a hunch that the problem may rather be: "At what level of transcription (eg: 20 transcripts) can we expect to be missing only a certain percent (maybe 5%) of the genes that are really expressed at that level in the cell, given that there is a high level of stochasticity in RNA-Seq data. Am I mistaken?

I do not know of a statistical approach to test this, however, unless you have more information than only a single number (the number of sequences found) for a particular transcript.

If you have data about individuals, maybe you could test the 'individual-based percent of presence' of transcripts of different representations. This would give you an idea of the stochasticity in your data, base upon which you could put a criteria. Something like: The representation (in number of seqs) for which, on average, 95% of your individuals have at least 1 sequence.

Hope this can help somehow!

Cheers

ADD COMMENT • link 15.1 years ago by Eric Normandeau 11k

0

Entering edit mode

Eric I really like the way you reworded my question. Do you think I could calculate FDRs for RPKMs at a 95% confidence and get my answer? The only thing I'm worried about is that because I'm doing a an IP I'm not expecting to always get reads across the entire transcript. Thus, I'm not sure if RPKM may create more false negatives. Thanks

ADD REPLY • link 15.1 years ago by Jason ▴ 940

0

Entering edit mode

@Jason I don't see how you could do FDRs on RPKM data. A FDR presupposes a list of p-values, not of data measures. How many individuals were pooled for this experiment? You may want to arbitrarily accept only genes that have on average a certain number of transcripts (eg: 2) per individual. Depending on the analysis you want to perform later, you may also use a criteria where you eliminate below a threshold that gives you no statistical power to carry your analysis. I believe this is the approach I would prefer. G'luck!

ADD REPLY • link 15.1 years ago by Eric Normandeau 11k

0

Entering edit mode

I have biological replicates of my background control IP (no tag or WT, pretty much whatever binds to beads or the antibody nonspecifically) and replicates of protein of interest sequencing runs. So I would have two RPKM values per gene that I can compare, the test to the control. Is that what you were referring to?

ADD REPLY • link 15.1 years ago by Jason ▴ 940

0

Entering edit mode

I was rather thinking 6 to 20 replicates. Is it crucial to you to avoid using an arbitrary criterion based on the knowledge about the type of analyses you need to perform later? What is the real benefit of having a (just as much arbitrary) statistical criterion to tell you below what number to drop a sequence?

ADD REPLY • link 15.1 years ago by Eric Normandeau 11k

0

Entering edit mode

Lets stick to using around 3 reps, due to expenses. We are uncomfortable using a strictly arbitrary threshold for a few reasons. One, we designed our own RIP which is unlike other protocols, and no one has used the NGS we've used, in concert with RIP, in a publication. Thus we feel it is important to not only look to other publications for a reasonable threshold, but make a call ourselves based on something more than our gut.

ADD REPLY • link 15.1 years ago by Jason ▴ 940

0

Entering edit mode

Secondly, in our first run we didn't really have a ton of tags map to mRNA because of rRNA contamination (we are trying to fix this in a few different ways but in the meantime I'd like to get something from that run). Therefore, there are plethora of genes with very few tags (especially in the background control). Thirdly, we want to be confident we are making the most significant calls possible because the number of different mRNAs which associate is part of our interest, not just which ones are the greatest in number in the IP. If that were the case I'd just select the top 200 or something.

ADD REPLY • link 15.1 years ago by Jason ▴ 940

0

Entering edit mode

For instance, if I have gene1 with 10 tags in the test and 2 tags in the control and I have gene2 with 100 in the test and 20 in the control I should have a method that can say even though it's the same fold enrichment (lets assume same gene length), I should be more confident in the second example. And Or how about if we go decide to do an arbitrary threshold, wouldn't one at least base that upon maybe a standard deviation on a gene by gene basis?

BTW, thanks for having this discussion, I think it will help me prepare for my comps exam haha.

ADD REPLY • link 15.1 years ago by Jason ▴ 940

0

Entering edit mode

@Jason I really get the impression that your question is changing while we are having this discussion. Could you please take some time to reword it exactly as you mean it in the question part up there :) Depending on what you'll write, I may have one or two suggestions, but now I don't know exactly what you are trying to do. Cheers!

ADD REPLY • link 15.1 years ago by Eric Normandeau 11k