filtering rRNA from fastq files or mapping results
2
0
Entering edit mode
6.9 years ago

Hi everybody,

I am new to RNA-Seq. I am working on a RNA-seq data set from the mouse genome. I have ran some quality testing using the fastqc and fastx software. I have found out that the data set is full of rRNA reads.

I would like to filter there reads out, but I am not sure how to go forward with this. I have also read this post, but it didn't help much.

As I have done the mapping against the ensembl GRCM38 version, I guess it is better to work with the ensembl file. But how to do the filtering?

Is there a way to filter the rRNA reads before the mapping?

I know I can convert the gtf files into GenomicRanges object. BUT will that help me to remove the reads mapped to this coordinates after the mapping. if so how?

The last option, as reads in the above mentioned post is to take the list of rRNA ensembl IDs and just filter the list of results from featureCounts or HTseq-count to remove those rRNAs from it.

Is this option better than the other two?

thanks in advance for any information

Tomas

rna-seq gtf rRNA • 15k views
9
Entering edit mode
6.9 years ago

Ensembl puts rRNA in column 2 of its GTF files if the line describes an rRNA, so just grep -wv rRNA Mus_musculus.GRCm38.78.gtf to get rid of them. You'll map them, but not count them. Mapping is fast, you don't need to bother removing them ahead of time.

BTW, you actually don't need to bother removing these at all in most cases. Any decent normalization method (e.g., that used be DESeq2 or edgeR) won't be messed up by differences in rRNA amount.

2
Entering edit mode

Yeah. Filtering some reads in advance of sophisticated normalization could skew the experiment. You want to get rid of rRNA before sequencing, then hand all read data to the analysis software unmolested.

0
Entering edit mode

Yes I know that. The sequencing facility did run rRNA depletion twice, but there still were a lot of rRNA residues in the fastq files I have. So now I need to figure out, what to do with it.

Thanks Devon, I will try it without removing them, but than just filtering them from the list of read counts.

0
Entering edit mode

I think in ribosomal profiling, it might be useful to remove the rRNA before mapping. Ribo-seq data might still contain about 20-30% of total RNA as rRNA. Mapping this together with rest of non rRNA reads might screw the downstream quantification.

0
Entering edit mode

My zebrafish data also has ribosome RNA contamination and my boss mandatorily asked me to use cufflinks...As the cufflinks manual says""Tells Cufflinks to ignore all reads that could have come from transcripts in this GTF file. We recommend including any annotated rRNA, mitochondrial transcripts other abundant transcripts you wish to ignore in your analysis in this file. Due to variable efficiency of mRNA enrichment methods and rRNA depletion kits, masking these transcripts often improves the overall robustness of transcript abundance estimates." I think -M is the right way? Then I tried to get the complete rRNA/tRNA GTF files, at first I though rmsk from UCSC table browser is enough but I found this post in UCSC google groups: " the rmsk table contains coordinates for various repeat families. Some of these repeat families are derived from specific RNA families, such as rRNA or tRNA. These differ from the actual rRNA and tRNAs, the coordinates for which are found in different tracks. " So I am confused, how should I get the right rRNA/tRNA GTF files to use in cufflinks?

0
Entering edit mode

For cufflinks you'll want the repeatmasker track from UCSC (just extract the rRNA and tRNA sites).

0
Entering edit mode

Dear Devon, Karl et al.

I came across an issue with rRNA contamination in my RNA-seq data (poly-A sequencing), and I found several posts about this issue. I observed contradictory or not very straightforward answers regarding what I can do next. For example, here you're recommending to not remove ribosomal RNAs from the mapping as this would introduce a bias in the analysis, whereas I am under the impression that others recommend removing it from the analysis (please see here). I tried removing rRNA from my reads using SortMERNA (using all eukaryotic and prokaryotic databases in this program), but the GC distribution per sample did not change. But in one sample, instead of 3 overrepresented sequences mapping to a ribosomal protein (rpS29, each appearing 10-15% of the reads), I get only one mapping to the same protein with similar level of representation in the RNA-seq data (~10-15% of the reads)). The GC distribution plot did not change.

Is there an objective way to assess if this rRNA contamination will be a problem for my differential expression analysis?

1
Entering edit mode
6.9 years ago
Charles Plessy ★ 2.8k

You can remove rRNA reads with the TagDust 2 program and the human ribosomal DNA complete repeating unit (U13369.1). It might be good to also remove the mitochondrial rRNA sequences (MT-RNR1 and MT-RNR2).

Edited: I got the genome wrong, see below for the mouse reference sequences.

0
Entering edit mode

why do I need to remove the human rRNA ( or for that reason the mitochondrial sequences) if I'm working with the mouse genome? Do I expect to find the human rRNA in my fastq files?

0
Entering edit mode

You don't.

0
Entering edit mode

Oops sorry, for mouse the equivalent sequences are BK000964, mt-Rnr1, and mt-Rnr2.

0
Entering edit mode

Are these the complete rRNA from the mouse?

Which means, if I first map my fastq files to this sequence, I will be able to remove most of my rRNA reads by taking only the unmapped reads in the next step?

Do I need to allow multiple mapping in this step? I guess there might be some multiplications of rRNA reads in the fastq files, if I already know from the fastqc report, that there reads are over-represented.

1
Entering edit mode

The mouse genome version mm10 does not seem to contain the rDNA regions where the rRNA is produced by the RNA polymerase I (assembling these regions this is actually challenging). But these regions or the rRNAs themselves have been the template for repeated elements that are scattered on the chromosomes (the "RNA" part in the RepeatMasker track in the UCSC genome browser). Therefore, when aliging transcriptome reads on a genome, the reads truely originating from rRNA molecules will be matched in a wrong location, regardless of multimapping. As noted in other answers, this is not a problem with some tools for RNA-seq analysis. However, if you look at your data in a less annotation-centric way, it may become confusing and it can be better to remove these reads before alignment.

The removal step with TagDust 2 that I suggested above matches the reads against a short list of user-defined reference sequences, not against a genome, therefore multimapping is not an issue in that step. As a first aproximate (since there are multiple copies in the genome and one can not exclude that some copies have some slight differences), you can assume that the BK000964 sequence represents all rRNAs.