Hi,
I have some total RNA-seq data that contain a large fraction of rRNA reads (up to 60%). I know this is not ideal, but it was somewhat unavoidable given the tissue/cells the RNA was derived from. I was wondering what the best way of dealing with these reads is.
I have processed these data in a few different ways (either STAR followed by featureCounts/Salmon, or Salmon alone in mapping/selective-alignment mode) using the GENCODE M22 transcripts FASTA file (and/or the primary GTF/genome file). So far I have not removed the rRNA reads, and because the GENCODE transcripts file contains rRNA sequences, these reads have been counted.
I had a few questions:
- Will keeping the rRNA reads in affect differential expression analysis using DESeq2 (or edgeR etc.)? I suspect it won't, but I just want to check.
- Keeping the rRNA reads in would significantly affect the TPM values. Would this be an issue, especially if I intend to compare TPM values with other RNA-seq data? (Again, I know this is not ideal, but I believe it may be the best way to do some of the things I want to do, e.g. compare homologous genes between this dataset and a human RNA-seq dataset.)
- If removing the rRNA reads is the better option, when should I remove them: before quantification/featureCounts (by removing them from the transcripts FASTA file or the GTF file), or afterwards by just removing those rows of counts? If I do it afterwards, is there an easy way to recalculate the TPM provided by Salmon/after tximport?
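For the "afterwards" route, what I had in mind is simply dropping the rRNA rows and rescaling the remaining TPM values so that they sum to one million again. Here is a minimal sketch in Python of what I mean. The function name, feature IDs and numbers are all made up for illustration, and I am not sure this is the recommended approach:

```python
def renormalize_tpm(tpm, drop_ids):
    """Drop the given features and rescale the remaining TPMs to sum to 1e6."""
    kept = {k: v for k, v in tpm.items() if k not in drop_ids}
    total = sum(kept.values())
    if total <= 0:
        raise ValueError("no expression left after filtering")
    return {k: v / total * 1e6 for k, v in kept.items()}


# Hypothetical toy sample: one rRNA feature takes up 60% of the TPM.
tpm = {"rRNA_1": 600_000.0, "geneA": 300_000.0, "geneB": 100_000.0}
rescaled = renormalize_tpm(tpm, {"rRNA_1"})
```

In this toy case geneA goes from 300,000 to 750,000 TPM, which shows how strongly the rRNA fraction changes the values even though the ratio between geneA and geneB stays the same.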
Apologies if this is not the right place to ask this question. I notice there are a couple of similar questions, and opinions seem to vary: some suggest just keeping the rRNA reads, while others say that rRNA sequences aren't typically in the transcript file, so it's not an issue. Unless I'm mistaken, though, they do seem to be in the GENCODE transcripts FASTA file, so I was wondering what to do in this specific case, and also what to do about the TPM values.
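To make the "before quantification" option concrete, this is roughly how I would filter the transcripts FASTA file. It is only a sketch and rests on two assumptions of mine: that the GENCODE headers are pipe-delimited with the transcript biotype as the 8th field, and that `rRNA`, `Mt_rRNA` and `rRNA_pseudogene` are the biotypes to drop. The records below are invented, not real GENCODE entries:

```python
# Assumption: biotypes treated as rRNA; adjust to taste.
RRNA_TYPES = {"rRNA", "Mt_rRNA", "rRNA_pseudogene"}


def is_rrna_header(header):
    """True if a pipe-delimited FASTA header carries an rRNA biotype."""
    # Assumption: the transcript biotype is the 8th pipe-delimited field.
    fields = header.strip().lstrip(">").rstrip("|").split("|")
    return len(fields) >= 8 and fields[7] in RRNA_TYPES


def drop_rrna(fasta_lines):
    """Yield FASTA lines, skipping every record whose header is rRNA."""
    keep = True
    for line in fasta_lines:
        if line.startswith(">"):
            keep = not is_rrna_header(line)
        if keep:
            yield line


# Hypothetical records in the assumed header layout.
records = [
    ">TX1|GENE1|-|-|Abc-201|Abc|1200|protein_coding|\n",
    "ACGTACGT\n",
    ">TX2|GENE2|-|-|Rn-201|Rn|150|rRNA|\n",
    "GGGGCCCC\n",
    ">TX3|GENE3|-|-|mtR-201|mtR|950|Mt_rRNA|\n",
    "TTTTAAAA\n",
]
filtered = list(drop_rrna(records))
```

Here only the protein-coding record is kept. On a real file, the same generator could be fed an open file handle and the output written to a new FASTA file before building the index.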
Thanks so much for the help.