Question: Cleaning RNA-Seq data from rRNA
gravatar for debitboro
4.3 years ago by
debitboro180 wrote:


I have RNA-Seq data in fastq format (human breast cancer tissue) from Illumina platform, I want to perform a Differential Expression analysis. I used Fastqc to check my files, and I found that the data is full of rRNA (almost 80%). Is there a way to clean my data from rRNA, and get a file containing only mRNA ?


rna-seq qc rrna cleaning • 12k views
ADD COMMENTlink modified 3.1 years ago by h.mon30k • written 4.3 years ago by debitboro180

Are you sure you really need to remove the rRNA reads? That's normally not needed, even if you have an abundance of them (it just means that you have fewer mRNA sequences than you might have wanted).

ADD REPLYlink written 4.3 years ago by Devon Ryan95k

@Devon: Here is the relevant thread elsewhere (by @debitboro) A: Fastqc for RNA-Seq data (Illumina 1.9)
It is true that cleaning rRNA reads is not needed but there is going to be ~10% reads left in the sample referenced that are usable after ignoring/eliminating rRNA. Not enough for human DE analysis.
@debitboro: If this sample is irreplaceable and no new libraries can be made then you can collect more data by sequencing (at least 2x) the sample again to reach ~10M+ non-rRNA read level.

ADD REPLYlink modified 4.3 years ago • written 4.3 years ago by genomax85k

Ouch, yeah, that's a problem. I guess they never rRNA depleted the samples, too bad. I agree that there's likely nothing that can be done with the files to make them usable.

ADD REPLYlink written 4.3 years ago by Devon Ryan95k
gravatar for h.mon
3.1 years ago by
h.mon30k wrote:

This is an old post, but it surprises me no one mentioned using BBDuk with the file ribokmers.fa (a link to this file, and an explanation of how it has been created, can be found on this post).

BBDuk and SortMeRNA will be largely concordant, but BBDuk will run at a fraction of the time and accepts compressed fastq files as well. My impression is SortMeRNA is slightly more precise, but for cleaning datasets for downstream DGE BBDuk is fine.

Just to reinforce, though: as genomax pointed out, at this level (93%) of rRNA contamination, the best you can do is get back to the bench and redo the experiment, or at least deplete rRNA before doing the library prep again.

You never told us how the libraries were prepared. Did you depleted rRNA with something like RiboZero, or used an Illumina mRNA kit with poly-A capture? Or sequenced total RNA?

ADD COMMENTlink written 3.1 years ago by h.mon30k
gravatar for WouterDeCoster
4.3 years ago by
WouterDeCoster44k wrote:

I would suggest to carry on with the data-analysis (feature counting, e.g. using htseq-count) and then remove all identifiers from rRNA. A list like that could be downloaded from BioMart.

ADD COMMENTlink written 4.3 years ago by WouterDeCoster44k
gravatar for EVR
4.3 years ago by
EVR570 wrote:


You can go to the website and download the rRNA (both large and small subunits) for Human and later map your fastq files using bowtie or bowtie2 to the these rRNA database from silva and later the unmaaped reads from this step can be used to for mRNA analysis.

bowtie2 -x /path/to/rRNA_bowtie2_indexes --un /path/to/unmapped_reads.fastq your_sample.fastq

Later the unmapped reads from this step can be used to for normal mRNA analysis.

let me know if need more details

ADD COMMENTlink modified 4.3 years ago • written 4.3 years ago by EVR570

thank you Tom, I'm going to try this, and I'll come back to you

ADD REPLYlink written 4.3 years ago by debitboro180

You could also use SortMeRNA to achieve this: "SortMeRNA takes as input a file of reads (fasta or fastq format) and one or multiple rRNA database file(s), and sorts apart rRNA and rejected reads into two files specified by the user."

ADD REPLYlink written 4.3 years ago by Daniel Swan13k

Hi Daniel,

I tried to use sortMeRNA, but as a first-of-all step, I tried to download the .fasta files LSU and SSU subunits from SILVA using ARB package, that is not an easy task to construct a reference database for homo-sapiens. I found an example that I was following on github ( The example concerns bacteria. Can I use the same files (reference database files) for human ?

Thanks in advance

ADD REPLYlink written 3.2 years ago by debitboro180

SortMeRNA provides 5.8S, 18S and 28S rRNA databases as well, have you tried those with your samples (I see one specific in the 28S db: HomSa172 FB355642 Eukaryota;Metazoa;Chordata;Craniata;Vertebrata;Euteleostomi;Mammalia;Eutheria;Euarchontoglires;Primates;Haplorrhini;Catarrhini;Hominidae;Homo;)? You can find the taxonomy strings for all included databases here:

Otherwise, one approach to build a homo sapiens only db could be to download the LSU ARB database, export all eukaryotes into xml (using ARB) and parse that to extract homo sapiens rRNA. You would then need to clean it (e.g., make sure rRNA doesn't have partial mRNA segments) for that you could follow some of the steps in the README.txt of folder above & then possibly cluster, though I imagine you could just use them all as references.


ADD REPLYlink written 3.1 years ago by jenya.kopylov0

Thank you for your response jenya.kopylov,

I have tried to follow the instruction from the documentation of sortMeRNA to get rid of rRNAs.

So first of all, I fastqc the original file to check the quality and the content of my reads, I've got the following result (Organism = human). As you can see in the per sequence GC content plot is shifted to the right form the overall GC content of the human genome (~46%), and I've got also the over-represented sequences which correspond to rRNA (I blasted some of them, and I have got the rRNA is the source of those sequences). What I have done is to try sortMeRNA on this file. After running, I got the following result. I think sortMeRNA have performed the task well, but I'm worried about the result since, sortMeRNA remove almost 100000 - 6862 = 93.13% of the sequences (sortMeRNA classified them as rRNA sequences), in addition the resulted GC content still give the same distribution.

Thank you in advance

ADD REPLYlink written 3.1 years ago by debitboro180

Are 93% of the sequence being removed? With that kind of data loss you are going to have small chance of getting useful results with the remaining data.

Sorry to say this but it may be time to consider cutting your losses and going back to the bench to redo the experiment. This time depleting rRNA before making libraries.

ADD REPLYlink modified 3.1 years ago • written 3.1 years ago by genomax85k

Hi Tom,

I tried your suggestion but in the end the result shows that only 4,4K / 50K = 8.8 % of reads are not rRNA. I think that this amount of mRNA is not enough to perform a differential expression analysis.

have a nice day

ADD REPLYlink written 4.3 years ago by debitboro180

What type of experiment you are working with? Ribosome Profiling?

ADD REPLYlink written 4.3 years ago by EVR570

See the original post at top for that info.

ADD REPLYlink written 4.3 years ago by genomax85k
gravatar for Buffo
4.3 years ago by
Buffo1.8k wrote:

you can write a script for discriminate mRNA, each read that contains >= AAAAAAAA could be just mRNA, each mRNA in illumina reads (from eucariotic cells) is expected to have 6 "A`s" at least. On my opinion.

ADD COMMENTlink written 4.3 years ago by Buffo1.8k

Except for the fact that these are not full length mRNA's being sequenced so every read does not have a polyA tail.

ADD REPLYlink modified 4.3 years ago • written 4.3 years ago by genomax85k
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1040 users visited in the last hour