rRNA decontamination of RNA-Seq reads (tool choice, introns etc)
Entering edit mode
4.7 years ago
Anand Rao ▴ 640


I need some help with performing my rRNA decontamination step properly, which is part of pre-processing pipeline for my Illumina RNA-Seq reads, before mapping to the reference genome (a plant species).


Download Rrna Sequences - SILVA database is useful for human research, not so much for plants etc

rRNA in human - RFAM database searches will identify rRNA matches

rRNA remove from RNAseq data - sormeRNA, and BBsplit are comparable tools useful for rRNA decontamination

Cleaning RNA-Seq data from rRNA - BBDuk can also be used

Filtering rRNA from RNAseq data - BBMap is a tool I used, but output mapped (outm file) to rRNA fasta file, did not always correspond to rRNA sequences, not sure why

https://www.hindawi.com/journals/dpis/2013/854869/ - Eukaryotic SSU rRNA can have introns


Decontaminate RNA-Seq Illumina reads by removing rRNA sequences

note: I do have rRNA FASTA sequences from the genome annotation project (62 in all, and of different lengths) - but I am not sure whether they contain introns or not?

With these links, and my main inferences from those links listed above as background information, here are MY QUESTIONS:

  1. Previous posts allude to how sormeRNA is much slower than BBSuite tools, but that they should both give concordant results. Is there a reason(s) to choose sormeRNA over BBsuite tools?

  2. If not, within BBsuite tools, is there a good reason to choose one over the other two ? (BBmap vs. BBsplit Vs BBduk) Should I check for introns in the rRNA sequences, and if yes, then what's a good method for that?

  3. If any of the 62 rRNA sequences contain introns, then my understanding is I remove the intronic sequences and used the spliced rRNA sequences as the reference file for decontamination, correct?

  4. Are there common reason(s) why my BBmap trial returned outm mapped to the rRNA reference FASTA file where some but not all contained rRNA sequence match, as determined using online BLASTn at NCBI?

note: I can post syntax and more detail if this thread goes in the direction of BBmap, but thought that would distract the reader, so they are not included here yet..

I look forward to answers that cover all my questions 1 - 4. Thanks, in advance, for guidance from forum members.

rRNA intron RNA-Seq bbsuite sormeRNA • 6.5k views
Entering edit mode

For number 1: I don't think there would be a specific reason to choose one over other.

For number 4: You may need to adjust bbmap alignment parameters. Make alignments more or less stringent. Reduce value of k= to allow for more accurate matches.

OR you could align your data to human rDNA repeat I linked in the post you have above. Get an idea of % reads aligning there. This is to make sure the % is relatively same across samples (should be less than 5% if the libraries are ribodepleted/poly A entiched). In the final counting step ignore rRNA reads (don't count).

Entering edit mode
4.6 years ago

BBDuk is a filter. At JGI we use it to remove reads with 31-mers corresponding to common ribosomal 31-mers, because some downstream tools like Trinity fail when there is an overload of rRNA reads.

Running BBDuk with common ribosomal kmers is good practice, but not ideal; the sensitivity and specificity are high, but not 100%. We do that at JGI because we work with metatranscriptomes for which the ribosomal sequences are not known and often 90% of the reads are ribosomal.

If you are doing RNA-seq on a reference organism, with a complete ribosomal sequence, the best practice IMO is to use BBSplit or Seal to split the reads between genomic and ribosomal, and then just run your analysis on the genomic reads. If you do not have a complete ribosomal sequence, you can still separate them with BBDuk. Mapping to Silva unfortunately takes too long.

I'm not a biologist. From my limited understanding of biology, it seems as though rRNAs are always unspliced. My assumption is that if you see introns in proks or mitos, they are simply misannotated. But don't consider me an authority here; I'm just relating what I see.

Entering edit mode

Hi Brian, Genomax, et al.,

With help from Rfam team, I was able to detect four group II introns, in 3 out of 62 rRNA gene sequences, as matches to RF00029, using Infernal suite's cmscan on the command line, with syntax similar to the example at this Rfam weblink.

So some rRNAs can be spliced (3 out of 62), while most are not spliced (rest 59 of 62), at least in this plant species under consideration.

I then used kmercountexact.sh from BBtools ver 38-72 to obtain kmer sequences of varying k=13, 20, 31, 40, 50 from either

  • just 4 intron sequences
  • 62 UNspliced rRNA sequences
  • SPLICED rRNA sequences (of which only 3 contain introns)

With a simple Perl script, I added unique FASTA headers to these files, and mapped each of those files to the reference genome, using STAR aligner (can share genome indexing and mapping syntax if you wish).

A summary of how many of those mapping go to mRNA encoding vs. ncRNA encoding (rRNA etc) genomic loci is shown in the image below: Introns-r-RNA-kmer-mapping2-Mtrunc-A17-r5-0-ANR-Ref-SUMMARY

Would you concur with my conclusions from these results:

  • k-mers from SPLICED rRNA genes still map to mRNA encoding genic regions of the reference genomes even at k=50 (45), with only marginally higher instances of genic mapping at lower k-mer lengths of 40 (50), 31 58), but somewhat higher at k-mer=20 (99).
  • The shortest read length I've allowed until this step is 20nt. So using k-mer > 20nt will result in some marginal loss of RNA-Seq reads < k-mer length.
  • In silico splicing of rRNA gene sequences (3 / 62) does not appear to signifcantly reduce mapping of k-mers to genic regions in the reference genome at all.
  • A strategy other than increasing k-mer length is required for mapping rRNA derived kmers to just rRNA genes.

Assuming those conclusions are correct, I proceeded with BBsplit.sh, per your recommendation, using the following generic syntax:

bbsplit.sh in=$IN \
ref=MtrunA17r5.0-ANR-EGN-r1.6.rrna_SPLICED.shIDs.fna \
basename=$BASENAME_%.fq.gz outu=$OUTU 
Xmx24g threads=1 minratio=0.90 minhits=1 maxindel=200000 strictmaxindel=t 
deterministic=t showprogress=100000 k=15
  • I varied minratio with values of 0.90, 0.95 and 0.99
  • I also varied minhits with values of 1, 2, 3,5 and even 10 (this gave empty file)
  • I used either k=15 or bloom filter kmer as bloomk=31

I then converted $OUTM from BBsplit.sh, with matches to the SPLICED rRNA sequences from FASTQ to FASTA format, using reformat.sh

Use this FASTA file as input to Rfam/Infernal against a very small subset of rRNA CMs to check whether all instances of rRNA reported by bbsplit.sh are indeed rRNA sequences, using cmscan. These take a while to run (~ 1 day for 125K sequences).

IMPORTANTLY, In one such bbsplit $OUTM file, cmscan found only ~ 88% of reported reads to match rRNA profiles. So it appears bbsplit.sh is reporting ~12% false positives rRNA matches?

My questions to you are these:

  1. What other parameters can I vary to bring this FPR as close to 0% as possible?
  2. Can kmer value be increased beyond the 8-15 range currently allowed for BBmap and BBsplit?
  3. Should I instead look more carefully into BLOOM filter option(s)?
  4. Should I systematically look into FPR for various 1<=minhits <=kmer or bloomk?
  5. Some other idea(s)?
  6. Essentially, could you please advice how I can remove all rRNA mapping reads, but not other reads, prior to transcriptome assembly? Thanks!


Entering edit mode
4.4 years ago
Anand Rao ▴ 640

After much discussion with the Ram team, specifically Eric Nawrocki, one of the Infernal software suite's authors, the covariance models (CMs) in Rfam for rRNA sequences are built from the maximum possible breadth of taxonomic range. As a result, some regions of the CMs are not as conserved as they could be, had they been built with same sequences, but from a narrower taxonomic range. Due to these local regions of divergence, matches to the rRNA CMs inferred using infernal, sometimes but not always, do not score as high as they would have had the CMs been built using rRNA sequences from closely related species. As a result, Infernal + Rfam CMs miss detecting true rRNAs, resulting in false negatives; whereas BLAST does not suffer from this problem. As a result, across results from these 3 tools - SEAL, BLAST and Infernal, Infernal is the outlier, whereas SEAL & BLAST are highly concordant.

BOTTOMLINE 1 - BBmap & BLAST results match: Infernal + Rfam CMs is NOT an ideal tool to test for rRNA sequences, or to verify SEAL or BBSPLIT results. I consider this issue resolved, as in, BBmap suite results for rRNA removal from Illumina reads matches BLAST results.

BOTTOMLINE 2 - No need to use spliced rRNA reference file: Even though (just) a few of the rRNA sequences in my plant species contain introns (which is not an indicator whether they are retained or spliced out), using spliced versus unspliced rRNA reference sequences did not appear to make a significant differences in BBSPLIT results. So it might be wise to not go through the pain of in silico conversion of UNspliced rRNA sequences to spliced rRNA sequences for the purpose of rRNA de-contamination of Illumina reads. I consider this issue also resolved.

Thank you Brian Bushnell and genomax

Entering edit mode
2.2 years ago
Dreamer ▴ 40

Recently, we developed a rRNA reads detection method named RiboDetector (https://github.com/hzi-bifo/RiboDetector), and in the article we benchmarked different computational tools: https://academic.oup.com/nar/advance-article/doi/10.1093/nar/gkac112/6533611. RiboDetector is the most computationally efficient (10+ times faster than SortMeRNA in CPU mode, 30-50 times faster in GPU mode) and most accurate software for rRNA reads removal. It can be used out-of-the-box without any database:

  • GPU mode:

    ribodetector -t 20 \
    -l 100 \
    -i inputs/reads.1.fq.gz inputs/reads.2.fq.gz \
    -m 10 \
    -e rrna \
    --chunk_size 256 \
    -o outputs/reads.nonrrna.1.fq outputs/reads.nonrrna.2.fq
  • CPU mode

    ribodetector_cpu -t 20 \
    -l 100 \
    -i inputs/reads.1.fq.gz inputs/reads.2.fq.gz \
    -e rrna \
    --chunk_size 256 \
    -o outputs/reads.nonrrna.1.fq outputs/reads.nonrrna.2.fq
Entering edit mode

I only saw the discussion of sortmerna in the article, but there was no discussion of the bbsplit bbduk tool.

In my experience, the sequences on the map to rRNA produced by sortmerna and bbsplit are very similar (I extracted a thousand pieces each, with 950 overlapping). For my data, sortmerna and bbsplit give 8% rRNA content, but bbduk thinks my rRNA content is 90%. Such a big difference that I don't know which one to believe.

In fact, without removing rRNA, I used kallisto and sleuth to generate an expression matrix. Using the tmp standardized count, I screened out the genes starting with RP, and divided their count by all the gene counts to get a ratio of 50%. So which one is accurate?

Entering edit mode

As @Brian said in his answer above use the results from bbsplit.

BBDuk uses k-mers and depending on the parameters you are using you may be getting spurious results. If you can post your complete bbduk command I may be able to say something. You are not likely to have 90% rRNA unless your enrichment/depletion (which ever was used) completely failed.


Login before adding your answer.

Traffic: 2091 users visited in the last hour
Help About
Access RSS

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6