Question: rRNA Removal In Rna-Seq Data
gravatar for jockbanan
4.0 years ago by
Czech Republic
jockbanan370 wrote:

Hello everyone!

I have paired-end human RNA-seq data mapped with Tophat2 and I want to perform differential expression analysis. I know my data have quite high level of ribosomal RNA contamination, so I want to deal with it.

When using cuffidiff, there is this -M/--mask-file option that allows me to support cuffdiff with "filter" .gtf file. I downloaded such file (rRNA + tRNA + mtRNA) from UCSC genome browser and everything worked fine.

Now I want to perform the same analysis using htseq-count + deseq2 and I wander - is there such easy way to deal with rRNA contamination in this case? After lots of googling, I have 3 ideas:

a) Use htseq-count with .gtf file that simply does not contain rRNA genes

b) Take .fastq files with reads, map them to my "filter.gtf" file first, then take only unmapped reads and map them to reference, use resulting .bam for analysis

c) Subtract reads mapping to regions in my "filter.gtf" from my sample .bam file.

What I've already tried:

a) I don't have such such .gtf file. I use hg19 reference downloaded from here: which seems to contain rRNA genes.

b) Tried this a while ago and had some problems... anyway, this would be rather long solution, I want to try something easier first.

c) This is my favorite. I used bedtools intersectBed with -v option to subtract reads mapping to regions in "filter.gtf" from my .bam file. But for some read pairs, this removes only one read from the pair and therefore causes htseq-count to raise the famous warning: "Read ... claims to have an aligned mate which could not be found. (is the SAM file properly sorter?)"

So finally, my questions:

1) Do you think c) is the right way of removing rRNA contamination?

2) If so, do you think I should just ignore htseq-count warnings? Or should I try to somehow remove these "orphan" reads without mate? (Because htseq-count treats them as reads with mate not mapped. But when one of the mates is mapped to rRNA region, don't I want to always remove both of them?)

Thanks for your ideas!

rna-seq deseq • 12k views
ADD COMMENTlink modified 6 weeks ago by h.mon9.6k • written 4.0 years ago by jockbanan370

I like doing an initial alignment to just rRNA sequences using bowtie2 and then do downstream analysis with unmapped reads. You can nicely see what percentage of each file aligned to rRNA, for one thing. Maybe this is like b, but I'm not sure. I don't usually bother to do this unless there's a lot of rRNA contamination, though.

ADD REPLYlink written 4.0 years ago by Madelaine Gogol4.8k

Have you ever looked what difference it makes wrt DEGs when you remove or not remove rRNA reads? We are using Ribozero with a rather large fraction of rRNA reads (~20-25%), so I am wondering if removing rRNA reads improves results.

Note: we are using DESeq2 to detect differentially expressed genes.

ADD REPLYlink written 3.4 years ago by Christian2.6k
gravatar for Devon Ryan
4.0 years ago by
Devon Ryan73k
Freiburg, Germany
Devon Ryan73k wrote:

Option (a) is the normal way to do this. If you don't count them, they're not there and the library sizes will be adjusted accordingly. Sure, (c) would work, but I imagine it'd be a lot easier to just use GenomicRanges (or even grep) to remove rRNA from the GTF file. Then you don't have an extra step for each aligned sample.

Regarding the warnings, you might quickly check to ensure that the orphaned mates aren't mapping to a gene. I would assume not, but if you remove the mates in this case then the orphaned reads would start getting counted when they really shouldn't.

There's also an option (d), which is to just run everything as normal and then just remove the rRNA lines from the counts files after importing into R. If you just make a "lines to exclude" file once, then you can efficiently remove the problematic counts. This might prove to be the easiest option.

ADD COMMENTlink modified 4.0 years ago • written 4.0 years ago by Devon Ryan73k

Thanks for reply! Actually, I've just realized that I can simply use bedtools intersectBed with -v to directly subtract one .gtf file from another.

ADD REPLYlink written 4.0 years ago by jockbanan370

Never thought about that way, that'd certainly do the trick as well!

ADD REPLYlink written 4.0 years ago by Devon Ryan73k
gravatar for Jelena Aleksic
4.0 years ago by
Cambridge, UK
Jelena Aleksic890 wrote:

I personally use option b) when doing this. The reason is that it allows me to use multiply matching reads in subsequent analysis, while making sure that these definitely do not map to rRNA. But the other options you suggest also seem valid to me.

ADD COMMENTlink written 4.0 years ago by Jelena Aleksic890
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1339 users visited in the last hour