Question: No change in number of reads and quality after removing rRNA contamination using sortmerna
gravatar for Sumit Paliwal
21 months ago by
Sumit Paliwal20 wrote:

Hi Everyone, I have done RNA-seq (75 bp paired-end; 50 million reads per sample) on some samples to assess differential expression. Initial FastQC report showed some sequence duplications issue. Despite that, I did the alignment and imported the resulting BAM files in Seqmonk for further analysis. RNASeq QC report in SeqMonk indicated rRNA contamination. So, I went back and removed the rRNA reads using sortmerna. In some cases, the rRNA read files are huge. However, when I do a FastQC analysis on the files without rRNA reads, I do not see a change in number of reads or sequence duplication levels compared to original raw files. I don't know how is it possible. Suggestions and solution please. Thanks.


ADD COMMENTlink written 21 months ago by Sumit Paliwal20

I don't think rRNA contamination is a problem (unless it's a lot), but that might also depend on what you want to do with the data.

ADD REPLYlink written 21 months ago by WouterDeCoster37k

In my samples it ranges from 5%-25%. I think it is a significant amount.

ADD REPLYlink written 21 months ago by Sumit Paliwal20

How much rRNA contamination? 1%, 5% 10%, 50%? In general, there always is some small percentage of rRNA left, you only need to worry if is too much and decreases the number of non-rRNA reads to less than the minimum recommended for the analysis you want to perform. For example, rules of thumb for differential gene expression and differential transcript expression are 10million and 20million, respectively (I am not 100% about the 20million figure).

The other point is FastQC will almost always point there is duplication in RNAseq samples, because there will be some genes / transcripts which are very expressed.

ADD REPLYlink written 21 months ago by h.mon24k

rRNA contamination in my samples ranges from 5%-25%. With this level of contamination, I should see a difference in number of reads before and after processing files with sortmerna. My concern is that I do not see any difference, although I find that there are a significant number of reads that match with rRNA in sortmerna ouput files.

ADD REPLYlink written 21 months ago by Sumit Paliwal20

You will have to explain yourself better. What are the command lines you are using? If a sample has 25% of rRNA contamination and you use SortMeRNA on it, you should have two outputs, one with 75% of the reads, free of rRNA, and another with 25% of your reads, all rRNA.

ADD REPLYlink written 21 months ago by h.mon24k

Here is the image of the FastQC Basic Statistics prior to removal of rRNA reads: FastQC-before

SeqMonk RNASeq QC report suggesting rRNA contamination:

SeqMonk RNASeq QC

I used the following command to remove rRNA reads:

while read i
        /home/svsgrc/Sumit/RNASeq/sortmerna-2.1b/sortmerna --ref $SORTMERNA_DB --reads /media/windows/Sumit/RawMerged/${i}_merged.fastq --paired_in -a 8 --log --fastx --aligned /media/windows/Sumit/sampleoutput/${i}_rRNA --other /media/windows/Sumit/sampleoutput/${i}_sortmerna   
    done < /media/windows/Sumit/uniquelist.txt

Following is the result section from the log file generated from the above command:


Total reads = 48574678
Total reads passing E-value threshold = 18473367 (38.03%)
Total reads failing E-value threshold = 30101311 (61.97%)
Minimum read length = 76
Maximum read length = 76
Mean read length = 76

By database:

home/svsgrc/Sumit/RNASeq/sortmerna-2.1b/rRNA_databases/silva-bac-16s-id90.fasta 0.18%

/home/svsgrc/Sumit/RNASeq/sortmerna-2.1b/rRNA_databases/silva-bac-23s-id98.fasta 0.18%

/home/svsgrc/Sumit/RNASeq/sortmerna-2.1b/rRNA_databases/silva-arc-16s-id95.fasta 0.03%

/home/svsgrc/Sumit/RNASeq/sortmerna-2.1b/rRNA_databases/silva-arc-23s-id98.fasta 0.01%

/home/svsgrc/Sumit/RNASeq/sortmerna-2.1b/rRNA_databases/silva-euk-18s-id95.fasta 13.60%

/home/svsgrc/Sumit/RNASeq/sortmerna-2.1b/rRNA_databases/silva-euk-28s-id98.fasta 23.96%

/home/svsgrc/Sumit/RNASeq/sortmerna-2.1b/rRNA_databases/rfam-5s-database-id98.fasta 0.01%

/home/svsgrc/Sumit/RNASeq/sortmerna-2.1b/rRNA_databases/rfam-5.8s-database-id98.fasta 0.06%

Following is the FastQC report on the file generated using SortMeRNA for the same sample: FastQC-after

As you can see that there are significant number of reads matching to Eukaryotic 18s and 28s. Still there is no change in reads in the second FastQC statistics. I hope the problem is clear now.

ADD REPLYlink modified 21 months ago • written 21 months ago by Sumit Paliwal20

The problem was clear from the start, how you got to the conclusions was the blurry part.

It seems SortMeRNA is failing to split the files. Did you examine L-C-M6_rRNA-R1.fastq? Is it empty? Which version of SortMeRNA? What is the result if you run it with --paired_out?

I never used SeqMonk, but it seems it is under-reporting contamination: the graphics show less than 25% for (what I presume are) all samples, but SortMeRNA reported 38% for the sample you provided output.

I don't know what is going on, but let me suggest two tools to help you: MultiQC does aggregate reports for lots of tools, it does for FastQC, so it is easy to compare before and after results; and BBDuk is really fast at quantifying and removing contamination, being almost as sensitive as SortMeRNA but at a fraction of the time.

ADD REPLYlink written 21 months ago by h.mon24k
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1087 users visited in the last hour