Question: What % of rRNA contamination is acceptable?
0
gravatar for John
6 months ago by
John210
United States
John210 wrote:

Hi

I have n=3 samples of neuronal ribosome immuno-precipitated RNA seq data. All the data has more or less same alignment rate like below:

34167279 reads; of these:
  34167279 (100.00%) were unpaired; of these:
    22847962 (66.87%) aligned 0 times
    7904836 (23.14%) aligned exactly 1 time
    3414481 (9.99%) aligned >1 times
33.13% overall alignment rate

I checked those (66.87%) aligned 0 times they are rRNAs. Looks like need to do more special rRNA depletion steps than given in the Takara Kit, as the sample has enriched Ribosome.

I have around 100 differentially expressed genes (FDR<0.05) but they don't make any sense to the experiment. PCA also didn't separate well. Anybody has suggestion about what percentage of rRNA contamination is acceptable? The number of reads aligned is enough to do differential expression?

Thanks for your help!

rna-seq alignment • 502 views
ADD COMMENTlink modified 6 months ago by Istvan Albert ♦♦ 84k • written 6 months ago by John210
1

Anybody has suggestion about what percentage of rRNA contamination is acceptable?

That is going change based on experimental aim. Sounds like in your case the results are not making experimental sense with data you have. So it may be time to think about redoing the experiment making sure rRNA depletion works more efficiently. If you have money available you could more deeply sequence existing libraries (giving up on 66% of reads since they will be rRNA again).

ADD REPLYlink written 6 months ago by genomax85k

How many replicates do you have per sample?

ADD REPLYlink written 6 months ago by Istvan Albert ♦♦ 84k

Three replicates.......................

ADD REPLYlink modified 6 months ago • written 6 months ago by John210

Are you making a pairwise comparison with 3 replicates over two conditions, thus comparing a total of 6 sequencing runs?

When asking a question on how much data ought to be enough the most important detail is the experimental design.

(PS: let me also insert a few passive-aggressive dots to ensure a prompt response ............)

ADD REPLYlink modified 6 months ago • written 6 months ago by Istvan Albert ♦♦ 84k
1

I have control and treatment, 3 samples each. They are not paired!

P.s they’re not passive aggressive dots. Biostar didn’t allow me to comment without certain number of characters. I really appreciate everyone in the forum for helping each others! Thank you for trying to help me!

ADD REPLYlink modified 6 months ago • written 6 months ago by John210
1

Jon : Spaces work too. So no need to use dots.

ADD REPLYlink written 6 months ago by genomax85k

Yes it works! Thanks

ADD REPLYlink written 6 months ago by John210

"paired" refers to control vs. treatment (or rather condition A vs condition B), not to the read data.

+1 for the dots to overcome the length limit.

ADD REPLYlink written 6 months ago by cschu1812.3k
1

The reason we have a limit is that in general (like in the case above) people do not provide sufficient detail in short replies. Adding just one more sentence may help us avoid another round of back-and-forth. Defeating this with dots (or spaces for that matter) is usually counterproductive and should not be encouraged with upvotes.

Though I admit we should do a better job at informing users of why the limit is there - and make them more aware that it is for their benefit.

ADD REPLYlink modified 6 months ago • written 6 months ago by Istvan Albert ♦♦ 84k

Got it thanks. Then I should call it pairwise comparison between two condition with three replicates each.

ADD REPLYlink written 6 months ago by John210
1
gravatar for Istvan Albert
6 months ago by
Istvan Albert ♦♦ 84k
University Park, USA
Istvan Albert ♦♦ 84k wrote:

Align your reads against the genome, then visualize the alignments with IGV.

Investigate how uniformly are your transcripts covered.

7 million reads that align evenly to transcripts ought to work, as long as transcripts are evenly covered.

A properly depleted RNAseq should have less then 10% rRNA in it (as little as 2-3%). Normally the rRNA would be over 80-90%, thus your data already is depleted.

Now what often happens is that one problem feeds on another. It is rare that "bad" data has only one type of problem. Problems come in bunches.

For example, if the transcripts are fragmented incorrectly you may end up with coverage from one end (5' or 3' only). Most RNA-seq methods do not recognize or inform you of these types of problems, they will happily report various low P-values, when in fact the transcript is not properly covered.

Thus my recommendation is to align against the genome and visually investigate the data to identify the reasons the methods produce odd results.

ADD COMMENTlink modified 6 months ago • written 6 months ago by Istvan Albert ♦♦ 84k

Thanks, I looked at them in IGV.

if the transcripts are fragmented incorrectly you may end up with coverage from one end (5' or 3' only).

This is not a problem.

ADD REPLYlink modified 6 months ago • written 6 months ago by John210

take a screenshot of a gene/transcript that comes up differentially expressed and show it here.

show the coverages for both tracks the control and condition, here is an example (host with imgur):

enter image description here

ADD REPLYlink written 6 months ago by Istvan Albert ♦♦ 84k

Here is the screen shot! first three samples are Condition A, Second three samples are condition B. Kat6a is upregulated in Condition B (with log2FC=0.83, padj<0.01 shown by DESEQ2). I don't know why it is aligned to intronic regions. Because the sample is ribosome bound RNA. Thanks

5'

middle

3'

ADD REPLYlink modified 6 months ago • written 6 months ago by John210

I would say that the data shows too many reads in the intronic regions (your second image) - 10 to 20% of the exonic read counts are present.

Also, there is unexpectedly weird fragmentation throughout - coverages drop the zero in the middle of the exon.

ADD REPLYlink written 5 months ago by Istvan Albert ♦♦ 84k

I used featurecounts to count only exons, still the same genes are showing up as DE genes.

Do you think the problem is only in library prep (fragmentation)? or The low number of reads (due to rRNA contamination?

My question is Will I get good coverage If I do deep sequencing with the same library or not? Thanks

ADD REPLYlink modified 5 months ago • written 5 months ago by John210
1

The problem here goes much deeper. Simply put most statistical methods cannot account for the type of errors you observe.

The concept of transcript integrity as described here:

http://rseqc.sourceforge.net/#tin-py

is the closest to what you might be after. What you want is to filter not just for p-values but to select transcripts with high transcript integrity.

To be honest, I think your library is a little suspect. It is quite unexpected to see so many intronic regions have coverages.

ADD REPLYlink written 5 months ago by Istvan Albert ♦♦ 84k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1577 users visited in the last hour