Question

I would like to exclude rRNA in RSEM and quantify other genes.

0

Entering edit mode

3 months ago

Apprentice ▴ 160

I have fastq files obtained by total RNA-seq of mouses. I mapped these fastq files with STAR using GRCm38 as a reference and quantified it with RSEM. The results showed that the library kit should have removed the rRNA, but the many read counts of rRNA was still observed. Is there any way to remove the rRNA or quantify only the mRNA when quantifying by RSEM? Please let me know. My annotation file is Mus_musculus.GRCm38.102.gtf.

STAR RNA-seq RSEM • 1.0k views

ADD COMMENT • link updated 3 months ago by dsull ★ 5.8k • written 3 months ago by Apprentice ▴ 160

score 1 · Answer 1 · 2024-01-22

1

Entering edit mode

3 months ago

bassanio ▴ 60

try using sortMeRNA

ADD COMMENT • link 3 months ago by bassanio ▴ 60

0

Entering edit mode

SortMeRNA is a program tool for filtering, mapping and OTU-picking NGS reads in metatranscriptomic and metagenomic data. The core algorithm is based on approximate seeds and allows for fast and sensitive analyses of nucleotide sequences. The main application of SortMeRNA is filtering ribosomal RNA from metatranscriptomic data. Additional applications include OTU-picking and taxonomy assignation available through QIIME v1.9+

This is not metatranscriptomic data. Do you know for sure that sortMeRNA can be used here?

ADD REPLY • link 3 months ago by Ram 43k

2

Entering edit mode

It should work. There are papers that have used it to remove rRNA from bulk RNA-seq data.

ADD REPLY • link 3 months ago by Dunois ★ 2.5k

0

Entering edit mode

Yes this is the standard method. check the standard pipeline in nfcore

ADD REPLY • link 3 months ago by bassanio ▴ 60

score 1 · Answer 2 · 2024-01-22

If you want to quantify only certain RNA species (e.g. exclude rRNAs), you can just modify the GTF accordingly.

For example, if you only want to quantify protein coding genes, long non-coding RNAs, and antisense transcripts, you could do the following:

wget https://ftp.ensembl.org/pub/release-110/gtf/mus_musculus/Mus_musculus.GRCm39.110.gtf.gz

gtf_file="new_mouse_gtf_file.gtf"

zcat Mus_musculus.GRCm39.110.gtf.gz|grep "#!" > $gtf_file
zcat Mus_musculus.GRCm39.110.gtf.gz|grep -i "gene_biotype \"protein_coding\"" >> $gtf_file
zcat Mus_musculus.GRCm39.110.gtf.gz|grep -i "gene_biotype \"lncRNA\"" >> $gtf_file
zcat Mus_musculus.GRCm39.110.gtf.gz|grep -i "gene_biotype \"lincRNA\"" >> $gtf_file
zcat Mus_musculus.GRCm39.110.gtf.gz|grep -i "gene_biotype \"antisense\"" >> $gtf_file

score 0 · Answer 3 · 2024-01-22

0

Entering edit mode

3 months ago

jv ★ 1.8k

No method will remove 100% of rRNAs before library construction. What percent of your reads mapped to rRNA loci?

Why do you want to remove rRNA genes from RSEM analysis? The convention is to drop rRNA genes from the expected counts matrix before performing downstream differential expression analysis. Is there something else you are trying to do with the data?

ADD COMMENT • link 3 months ago by jv ★ 1.8k

0

Entering edit mode

Thank you for comment. Of the mapped reads, rRNA reads occupied 10%-50% of total reads. The RSEM automatically calculates the TPM after the read count. As a result, the TPM is obtained considering the number of rRNA reads. I would like to calculate TPM without including the number of rRNA reads.

ADD REPLY • link 3 months ago by Apprentice ▴ 160

2

Entering edit mode

As mentioned by multiple people, I would personally throw everything (including the rRNAs) into RSEM because of the way RSEM handles multimapping. If a read maps well to an rRNA and something else, the EM algorithm can do probablistic assignment. You can just choose not to look at them later (RNAseq is about relative abundance anyway).

If you toss out rRNA reads, you run the risk of reads that multimap well to an rRNA and to another gene being tossed out due to some shared sequence. While probably won't make a difference in your downstream analysis, why go through all the trouble of trying to remove rRNA?

TPMs are basically constructed so that the sum of TPMs across all genes equals 10^6. If you want, you can get your TPMs including rRNAs, and then toss out the rRNA TPMs, and rescale everything so the sum is 10^6. I'm guessing you have multiple samples (e.g. sample A has 10% rRNA, sample B has 60% rRNA). Well, if you're comparing between samples, TPM is not really a good way to compare between samples anyway.

ADD REPLY • link 3 months ago by dsull ★ 5.8k

0

Entering edit mode

Thank you for your comment. I will be using edgeR for comparison between groups based on raw count data. TPM data will be used for clustering analysis. As you suggested, it seemed like a good idea to discard the rRNA TPM and recalculate the TPM using the remaining transcript data. I would like to give it a try.

ADD REPLY • link 3 months ago by Apprentice ▴ 160

0

Entering edit mode

For this you need to remove rRNA from Fastq files even before alignment to make sure you are not skewing the data in any other ways

ADD REPLY • link 3 months ago by bassanio ▴ 60

1

Entering edit mode

Check this post: rRNA contaminants RNASeq

ADD REPLY • link 3 months ago by Meisam ▴ 230

0

Entering edit mode

What sort of analysis are you planning to do that you need to use TPM?

ADD REPLY • link 3 months ago by jv ★ 1.8k

0

Entering edit mode

Thanks for your comment, I would like to do a cluster analysis using TPM.

ADD REPLY • link 3 months ago by Apprentice ▴ 160

0

Entering edit mode

I still encourage using a non-TPM metric for your cluster analysis. You can use edgeR's TMMs for that.

ADD REPLY • link 3 months ago by dsull ★ 5.8k