rRNA Removal In Rna-Seq Data
3
8
Entering edit mode
9.2 years ago
jockbanan ▴ 420

Hello everyone!

I have paired-end human RNA-seq data mapped with Tophat2 and I want to perform differential expression analysis. I know my data have quite high level of ribosomal RNA contamination, so I want to deal with it.

When using cuffidiff, there is this -M/--mask-file option that allows me to support cuffdiff with "filter" .gtf file. I downloaded such file (rRNA + tRNA + mtRNA) from UCSC genome browser and everything worked fine.

Now I want to perform the same analysis using htseq-count + deseq2 and I wander - is there such easy way to deal with rRNA contamination in this case? After lots of googling, I have 3 ideas:

a) Use htseq-count with .gtf file that simply does not contain rRNA genes

b) Take .fastq files with reads, map them to my "filter.gtf" file first, then take only unmapped reads and map them to reference, use resulting .bam for analysis

c) Subtract reads mapping to regions in my "filter.gtf" from my sample .bam file.

a) I don't have such such .gtf file. I use hg19 reference downloaded from here: http://tophat.cbcb.umd.edu/igenomes.shtml which seems to contain rRNA genes.

b) Tried this a while ago and had some problems... anyway, this would be rather long solution, I want to try something easier first.

c) This is my favorite. I used bedtools intersectBed with -v option to subtract reads mapping to regions in "filter.gtf" from my .bam file. But for some read pairs, this removes only one read from the pair and therefore causes htseq-count to raise the famous warning: "Read ... claims to have an aligned mate which could not be found. (is the SAM file properly sorter?)"

So finally, my questions:

1) Do you think c) is the right way of removing rRNA contamination?

2) If so, do you think I should just ignore htseq-count warnings? Or should I try to somehow remove these "orphan" reads without mate? (Because htseq-count treats them as reads with mate not mapped. But when one of the mates is mapped to rRNA region, don't I want to always remove both of them?)

rna-seq deseq • 18k views
0
Entering edit mode

I like doing an initial alignment to just rRNA sequences using bowtie2 and then do downstream analysis with unmapped reads. You can nicely see what percentage of each file aligned to rRNA, for one thing. Maybe this is like b, but I'm not sure. I don't usually bother to do this unless there's a lot of rRNA contamination, though.

0
Entering edit mode

Have you ever looked what difference it makes wrt DEGs when you remove or not remove rRNA reads? We are using Ribozero with a rather large fraction of rRNA reads (~20-25%), so I am wondering if removing rRNA reads improves results.

Note: we are using DESeq2 to detect differentially expressed genes.

5
Entering edit mode
9.2 years ago

Option (a) is the normal way to do this. If you don't count them, they're not there and the library sizes will be adjusted accordingly. Sure, (c) would work, but I imagine it'd be a lot easier to just use GenomicRanges (or even grep) to remove rRNA from the GTF file. Then you don't have an extra step for each aligned sample.

Regarding the warnings, you might quickly check to ensure that the orphaned mates aren't mapping to a gene. I would assume not, but if you remove the mates in this case then the orphaned reads would start getting counted when they really shouldn't.

There's also an option (d), which is to just run everything as normal and then just remove the rRNA lines from the counts files after importing into R. If you just make a "lines to exclude" file once, then you can efficiently remove the problematic counts. This might prove to be the easiest option.

1
Entering edit mode

Thanks for reply! Actually, I've just realized that I can simply use bedtools intersectBed with -v to directly subtract one .gtf file from another.

0
Entering edit mode

Never thought about that way, that'd certainly do the trick as well!

0
Entering edit mode

I am using a script from our lab. My previous colleagues did with this way. But, I a having trouble with the bam file from bedtools when input it into picard.jar. This does not happen if I use the bam file before using bedtools to remove rRNA. So I will try option a. I hope there is no difference between those methods

1
Entering edit mode
9.2 years ago

I personally use option b) when doing this. The reason is that it allows me to use multiply matching reads in subsequent analysis, while making sure that these definitely do not map to rRNA. But the other options you suggest also seem valid to me.

0
Entering edit mode
8 months ago
Dreamer • 0

You could just remove rRNA reads before aligning your reads to your reference genome. There are a number of rRNA reads removal tools available. Recently, we also developed a rRNA reads detection software named RiboDetector (https://github.com/hzi-bifo/RiboDetector). Benchmarking: https://academic.oup.com/nar/advance-article/doi/10.1093/nar/gkac112/6533611 shows that RiboDetector is the most computationally efficient and most accurate software for rRNA reads removal. The analysis also suggests that rRNA reads should be removed before alignment, otherwise they could be mapped to certain coding genes which share partial sequence similarity to rRNAs.

RiboDetector can be used out-of-the-box without any database:

• GPU mode:

ribodetector -t 20 \
-l 100 \
-m 10 \
-e rrna \
--chunk_size 256 \

• CPU mode

ribodetector_cpu -t 20 \
-l 100 \
-e rrna \
--chunk_size 256 \