Cleaning RNA-Seq data from rRNA
5
4
Entering edit mode
8.0 years ago
debitboro ▴ 260

Hi,

I have RNA-Seq data in fastq format (human breast cancer tissue) from Illumina platform, I want to perform a Differential Expression analysis. I used Fastqc to check my files, and I found that the data is full of rRNA (almost 80%). Is there a way to clean my data from rRNA, and get a file containing only mRNA ?

Thanks

RNA-Seq rRNA cleaning QC • 22k views
ADD COMMENT
2
Entering edit mode

Are you sure you really need to remove the rRNA reads? That's normally not needed, even if you have an abundance of them (it just means that you have fewer mRNA sequences than you might have wanted).

ADD REPLY
2
Entering edit mode

@Devon: Here is the relevant thread elsewhere (by @debitboro) A: Fastqc for RNA-Seq data (Illumina 1.9)
It is true that cleaning rRNA reads is not needed but there is going to be ~10% reads left in the sample referenced that are usable after ignoring/eliminating rRNA. Not enough for human DE analysis.
@debitboro: If this sample is irreplaceable and no new libraries can be made then you can collect more data by sequencing (at least 2x) the sample again to reach ~10M+ non-rRNA read level.

ADD REPLY
1
Entering edit mode

Ouch, yeah, that's a problem. I guess they never rRNA depleted the samples, too bad. I agree that there's likely nothing that can be done with the files to make them usable.

ADD REPLY
4
Entering edit mode
6.9 years ago
h.mon 35k

This is an old post, but it surprises me no one mentioned using BBDuk with the file ribokmers.fa (a link to this file, and an explanation of how it has been created, can be found on this post).

BBDuk and SortMeRNA will be largely concordant, but BBDuk will run at a fraction of the time and accepts compressed fastq files as well. My impression is SortMeRNA is slightly more precise, but for cleaning datasets for downstream DGE BBDuk is fine.

Just to reinforce, though: as genomax pointed out, at this level (93%) of rRNA contamination, the best you can do is get back to the bench and redo the experiment, or at least deplete rRNA before doing the library prep again.

You never told us how the libraries were prepared. Did you depleted rRNA with something like RiboZero, or used an Illumina mRNA kit with poly-A capture? Or sequenced total RNA?

ADD COMMENT
2
Entering edit mode
23 months ago
Dreamer ▴ 40

There are a number of rRNA reads removal tools available. Recently, we developed a rRNA reads detection software named RiboDetector (https://github.com/hzi-bifo/RiboDetector). Benchmarking: https://academic.oup.com/nar/advance-article/doi/10.1093/nar/gkac112/6533611 shows that RiboDetector is the most computationally efficient and most accurate software for rRNA reads removal. The analysis also suggests that rRNA reads should be removed before alignment to target reference genes/genomes, otherwise they could be mapped to certain coding genes which share partial sequence similarity to rRNAs.

RiboDetector can be used out-of-the-box without any database:

  • GPU mode:

    ribodetector -t 20 \
    -l 100 \
    -i inputs/reads.1.fq.gz inputs/reads.2.fq.gz \
    -m 10 \
    -e rrna \
    --chunk_size 256 \
    -o outputs/reads.nonrrna.1.fq outputs/reads.nonrrna.2.fq
    
  • CPU mode

    ribodetector_cpu -t 20 \
    -l 100 \
    -i inputs/reads.1.fq.gz inputs/reads.2.fq.gz \
    -e rrna \
    --chunk_size 256 \
    -o outputs/reads.nonrrna.1.fq outputs/reads.nonrrna.2.fq
    
ADD COMMENT
1
Entering edit mode
8.0 years ago
EVR ▴ 610

hi,

You can go to the http://www.arb-silva.de/ website and download the rRNA (both large and small subunits) for Human and later map your fastq files using bowtie or bowtie2 to the these rRNA database from silva and later the unmaaped reads from this step can be used to for mRNA analysis.

bowtie2 -x /path/to/rRNA_bowtie2_indexes --un /path/to/unmapped_reads.fastq your_sample.fastq

Later the unmapped reads from this step can be used to for normal mRNA analysis.

let me know if need more details

ADD COMMENT
0
Entering edit mode

thank you Tom, I'm going to try this, and I'll come back to you

ADD REPLY
1
Entering edit mode

You could also use SortMeRNA to achieve this: "SortMeRNA takes as input a file of reads (fasta or fastq format) and one or multiple rRNA database file(s), and sorts apart rRNA and rejected reads into two files specified by the user." http://bioinfo.lifl.fr/RNA/sortmerna/

ADD REPLY
0
Entering edit mode

Hi Daniel,

I tried to use sortMeRNA, but as a first-of-all step, I tried to download the .fasta files LSU and SSU subunits from SILVA using ARB package, that is not an easy task to construct a reference database for homo-sapiens. I found an example that I was following on github (https://github.com/biocore/sortmerna/tree/master/rRNA_databases). The example concerns bacteria. Can I use the same files (reference database files) for human ?

Thanks in advance

ADD REPLY
0
Entering edit mode

SortMeRNA provides 5.8S, 18S and 28S rRNA databases as well, have you tried those with your samples (I see one specific in the 28S db: HomSa172 FB355642 Eukaryota;Metazoa;Chordata;Craniata;Vertebrata;Euteleostomi;Mammalia;Eutheria;Euarchontoglires;Primates;Haplorrhini;Catarrhini;Hominidae;Homo;)? You can find the taxonomy strings for all included databases here: https://github.com/biocore/sortmerna/tree/master/rRNA_databases/silva_ids_acc_tax.tar.gz

Otherwise, one approach to build a homo sapiens only db could be to download the LSU ARB database, export all eukaryotes into xml (using ARB) and parse that to extract homo sapiens rRNA. You would then need to clean it (e.g., make sure rRNA doesn't have partial mRNA segments) for that you could follow some of the steps in the README.txt of folder above & then possibly cluster, though I imagine you could just use them all as references.

Jenya

ADD REPLY
0
Entering edit mode

Thank you for your response jenya.kopylov,

I have tried to follow the instruction from the documentation of sortMeRNA to get rid of rRNAs.

So first of all, I fastqc the original file to check the quality and the content of my reads, I've got the following result (Organism = human). As you can see in the per sequence GC content plot is shifted to the right form the overall GC content of the human genome (~46%), and I've got also the over-represented sequences which correspond to rRNA (I blasted some of them, and I have got the rRNA is the source of those sequences). What I have done is to try sortMeRNA on this file. After running, I got the following result. I think sortMeRNA have performed the task well, but I'm worried about the result since, sortMeRNA remove almost 100000 - 6862 = 93.13% of the sequences (sortMeRNA classified them as rRNA sequences), in addition the resulted GC content still give the same distribution.

Thank you in advance

ADD REPLY
0
Entering edit mode

Are 93% of the sequence being removed? With that kind of data loss you are going to have small chance of getting useful results with the remaining data.

Sorry to say this but it may be time to consider cutting your losses and going back to the bench to redo the experiment. This time depleting rRNA before making libraries.

ADD REPLY
0
Entering edit mode

Hi Tom,

I tried your suggestion but in the end the result shows that only 4,4K / 50K = 8.8 % of reads are not rRNA. I think that this amount of mRNA is not enough to perform a differential expression analysis.

have a nice day

ADD REPLY
0
Entering edit mode

What type of experiment you are working with? Ribosome Profiling?

ADD REPLY
0
Entering edit mode

See the original post at top for that info.

ADD REPLY
0
Entering edit mode
8.0 years ago

I would suggest to carry on with the data-analysis (feature counting, e.g. using htseq-count) and then remove all identifiers from rRNA. A list like that could be downloaded from BioMart.

ADD COMMENT
0
Entering edit mode
8.0 years ago
Buffo ★ 2.4k

you can write a script for discriminate mRNA, each read that contains >= AAAAAAAA could be just mRNA, each mRNA in illumina reads (from eucariotic cells) is expected to have 6 "A`s" at least. On my opinion.

ADD COMMENT
4
Entering edit mode

Except for the fact that these are not full length mRNA's being sequenced so every read does not have a polyA tail.

ADD REPLY

Login before adding your answer.

Traffic: 1807 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6