I have paired-end human RNA-seq data mapped with Tophat2 and I want to perform differential expression analysis. I know my data have quite high level of ribosomal RNA contamination, so I want to deal with it.
When using cuffidiff, there is this -M/--mask-file option that allows me to support cuffdiff with "filter" .gtf file. I downloaded such file (rRNA + tRNA + mtRNA) from UCSC genome browser and everything worked fine.
Now I want to perform the same analysis using htseq-count + deseq2 and I wander - is there such easy way to deal with rRNA contamination in this case? After lots of googling, I have 3 ideas:
a) Use htseq-count with .gtf file that simply does not contain rRNA genes
b) Take .fastq files with reads, map them to my "filter.gtf" file first, then take only unmapped reads and map them to reference, use resulting .bam for analysis
c) Subtract reads mapping to regions in my "filter.gtf" from my sample .bam file.
What I've already tried:
a) I don't have such such .gtf file. I use hg19 reference downloaded from here: http://tophat.cbcb.umd.edu/igenomes.shtml which seems to contain rRNA genes.
b) Tried this a while ago and had some problems... anyway, this would be rather long solution, I want to try something easier first.
c) This is my favorite. I used bedtools intersectBed with -v option to subtract reads mapping to regions in "filter.gtf" from my .bam file. But for some read pairs, this removes only one read from the pair and therefore causes htseq-count to raise the famous warning: "Read ... claims to have an aligned mate which could not be found. (is the SAM file properly sorter?)"
So finally, my questions:
1) Do you think c) is the right way of removing rRNA contamination?
2) If so, do you think I should just ignore htseq-count warnings? Or should I try to somehow remove these "orphan" reads without mate? (Because htseq-count treats them as reads with mate not mapped. But when one of the mates is mapped to rRNA region, don't I want to always remove both of them?)
Thanks for your ideas!