Question: rRNA decontamination of RNA-Seq reads (tool choice, introns etc)
gravatar for Anand Rao
8 months ago by
Anand Rao310
United States
Anand Rao310 wrote:


I need some help with performing my rRNA decontamination step properly, which is part of pre-processing pipeline for my Illumina RNA-Seq reads, before mapping to the reference genome (a plant species).


Download Rrna Sequences - SILVA database is useful for human research, not so much for plants etc

rRNA in human - RFAM database searches will identify rRNA matches

rRNA remove from RNAseq data - sormeRNA, and BBsplit are comparable tools useful for rRNA decontamination

Cleaning RNA-Seq data from rRNA - BBDuk can also be used

Filtering rRNA from RNAseq data - BBMap is a tool I used, but output mapped (outm file) to rRNA fasta file, did not always correspond to rRNA sequences, not sure why - Eukaryotic SSU rRNA can have introns


Decontaminate RNA-Seq Illumina reads by removing rRNA sequences

note: I do have rRNA FASTA sequences from the genome annotation project (62 in all, and of different lengths) - but I am not sure whether they contain introns or not?

With these links, and my main inferences from those links listed above as background information, here are MY QUESTIONS:

  1. Previous posts allude to how sormeRNA is much slower than BBSuite tools, but that they should both give concordant results. Is there a reason(s) to choose sormeRNA over BBsuite tools?

  2. If not, within BBsuite tools, is there a good reason to choose one over the other two ? (BBmap vs. BBsplit Vs BBduk) Should I check for introns in the rRNA sequences, and if yes, then what's a good method for that?

  3. If any of the 62 rRNA sequences contain introns, then my understanding is I remove the intronic sequences and used the spliced rRNA sequences as the reference file for decontamination, correct?

  4. Are there common reason(s) why my BBmap trial returned outm mapped to the rRNA reference FASTA file where some but not all contained rRNA sequence match, as determined using online BLASTn at NCBI?

note: I can post syntax and more detail if this thread goes in the direction of BBmap, but thought that would distract the reader, so they are not included here yet..

I look forward to answers that cover all my questions 1 - 4. Thanks, in advance, for guidance from forum members.

ADD COMMENTlink modified 5 months ago • written 8 months ago by Anand Rao310

For number 1: I don't think there would be a specific reason to choose one over other.

For number 4: You may need to adjust bbmap alignment parameters. Make alignments more or less stringent. Reduce value of k= to allow for more accurate matches.

OR you could align your data to human rDNA repeat I linked in the post you have above. Get an idea of % reads aligning there. This is to make sure the % is relatively same across samples (should be less than 5% if the libraries are ribodepleted/poly A entiched). In the final counting step ignore rRNA reads (don't count).

ADD REPLYlink modified 8 months ago • written 8 months ago by genomax87k
gravatar for Brian Bushnell
8 months ago by
Walnut Creek, USA
Brian Bushnell17k wrote:

BBDuk is a filter. At JGI we use it to remove reads with 31-mers corresponding to common ribosomal 31-mers, because some downstream tools like Trinity fail when there is an overload of rRNA reads.

Running BBDuk with common ribosomal kmers is good practice, but not ideal; the sensitivity and specificity are high, but not 100%. We do that at JGI because we work with metatranscriptomes for which the ribosomal sequences are not known and often 90% of the reads are ribosomal.

If you are doing RNA-seq on a reference organism, with a complete ribosomal sequence, the best practice IMO is to use BBSplit or Seal to split the reads between genomic and ribosomal, and then just run your analysis on the genomic reads. If you do not have a complete ribosomal sequence, you can still separate them with BBDuk. Mapping to Silva unfortunately takes too long.

I'm not a biologist. From my limited understanding of biology, it seems as though rRNAs are always unspliced. My assumption is that if you see introns in proks or mitos, they are simply misannotated. But don't consider me an authority here; I'm just relating what I see.

ADD COMMENTlink written 8 months ago by Brian Bushnell17k

Hi Brian, Genomax, et al.,

With help from Rfam team, I was able to detect four group II introns, in 3 out of 62 rRNA gene sequences, as matches to RF00029, using Infernal suite's cmscan on the command line, with syntax similar to the example at this Rfam weblink.

So some rRNAs can be spliced (3 out of 62), while most are not spliced (rest 59 of 62), at least in this plant species under consideration.

I then used from BBtools ver 38-72 to obtain kmer sequences of varying k=13, 20, 31, 40, 50 from either

  • just 4 intron sequences
  • 62 UNspliced rRNA sequences
  • SPLICED rRNA sequences (of which only 3 contain introns)

With a simple Perl script, I added unique FASTA headers to these files, and mapped each of those files to the reference genome, using STAR aligner (can share genome indexing and mapping syntax if you wish).

A summary of how many of those mapping go to mRNA encoding vs. ncRNA encoding (rRNA etc) genomic loci is shown in the image below: Introns-r-RNA-kmer-mapping2-Mtrunc-A17-r5-0-ANR-Ref-SUMMARY

Would you concur with my conclusions from these results:

  • k-mers from SPLICED rRNA genes still map to mRNA encoding genic regions of the reference genomes even at k=50 (45), with only marginally higher instances of genic mapping at lower k-mer lengths of 40 (50), 31 58), but somewhat higher at k-mer=20 (99).
  • The shortest read length I've allowed until this step is 20nt. So using k-mer > 20nt will result in some marginal loss of RNA-Seq reads < k-mer length.
  • In silico splicing of rRNA gene sequences (3 / 62) does not appear to signifcantly reduce mapping of k-mers to genic regions in the reference genome at all.
  • A strategy other than increasing k-mer length is required for mapping rRNA derived kmers to just rRNA genes.

Assuming those conclusions are correct, I proceeded with, per your recommendation, using the following generic syntax: in=$IN \
ref=MtrunA17r5.0-ANR-EGN-r1.6.rrna_SPLICED.shIDs.fna \
basename=$BASENAME_%.fq.gz outu=$OUTU 
Xmx24g threads=1 minratio=0.90 minhits=1 maxindel=200000 strictmaxindel=t 
deterministic=t showprogress=100000 k=15
  • I varied minratio with values of 0.90, 0.95 and 0.99
  • I also varied minhits with values of 1, 2, 3,5 and even 10 (this gave empty file)
  • I used either k=15 or bloom filter kmer as bloomk=31

I then converted $OUTM from, with matches to the SPLICED rRNA sequences from FASTQ to FASTA format, using

Use this FASTA file as input to Rfam/Infernal against a very small subset of rRNA CMs to check whether all instances of rRNA reported by are indeed rRNA sequences, using cmscan. These take a while to run (~ 1 day for 125K sequences).

IMPORTANTLY, In one such bbsplit $OUTM file, cmscan found only ~ 88% of reported reads to match rRNA profiles. So it appears is reporting ~12% false positives rRNA matches?

My questions to you are these:

  1. What other parameters can I vary to bring this FPR as close to 0% as possible?
  2. Can kmer value be increased beyond the 8-15 range currently allowed for BBmap and BBsplit?
  3. Should I instead look more carefully into BLOOM filter option(s)?
  4. Should I systematically look into FPR for various 1<=minhits <=kmer or bloomk?
  5. Some other idea(s)?
  6. Essentially, could you please advice how I can remove all rRNA mapping reads, but not other reads, prior to transcriptome assembly? Thanks!


ADD REPLYlink modified 7 months ago • written 7 months ago by Anand Rao310
gravatar for Anand Rao
5 months ago by
Anand Rao310
United States
Anand Rao310 wrote:

After much discussion with the Ram team, specifically Eric Nawrocki, one of the Infernal software suite's authors, the covariance models (CMs) in Rfam for rRNA sequences are built from the maximum possible breadth of taxonomic range. As a result, some regions of the CMs are not as conserved as they could be, had they been built with same sequences, but from a narrower taxonomic range. Due to these local regions of divergence, matches to the rRNA CMs inferred using infernal, sometimes but not always, do not score as high as they would have had the CMs been built using rRNA sequences from closely related species. As a result, Infernal + Rfam CMs miss detecting true rRNAs, resulting in false negatives; whereas BLAST does not suffer from this problem. As a result, across results from these 3 tools - SEAL, BLAST and Infernal, Infernal is the outlier, whereas SEAL & BLAST are highly concordant.

BOTTOMLINE 1 - BBmap & BLAST results match: Infernal + Rfam CMs is NOT an ideal tool to test for rRNA sequences, or to verify SEAL or BBSPLIT results. I consider this issue resolved, as in, BBmap suite results for rRNA removal from Illumina reads matches BLAST results.

BOTTOMLINE 2 - No need to use spliced rRNA reference file: Even though (just) a few of the rRNA sequences in my plant species contain introns (which is not an indicator whether they are retained or spliced out), using spliced versus unspliced rRNA reference sequences did not appear to make a significant differences in BBSPLIT results. So it might be wise to not go through the pain of in silico conversion of UNspliced rRNA sequences to spliced rRNA sequences for the purpose of rRNA de-contamination of Illumina reads. I consider this issue also resolved.

Thank you Brian Bushnell and genomax

ADD COMMENTlink written 5 months ago by Anand Rao310
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1072 users visited in the last hour