Question

How to normalize long-read RNA-seq data for comparison with short-reads

2

14 months ago

Bernardo ▴ 20

I am working on a project comparing RNAseq quantification results between Illumina short reads and Nanopore long reads, and I have a couple of questions about comparing the results from these two technologies. More specifically, I need help figuring out how to normalize the data for comparisons within samples and between samples. So far I have come up with the following plan:

1. Using CPM to compare gene/transcript expression within each sample sequenced with Nanopore. For example, checking whether gene.X transcripts are more abundant than gene.Y transcripts within sample_1 sequenced with Nanopore. Using CPM instead of TPM for Nanopore seems like a good option since our Nanopore runs do not have transcript length bias. Does this sound like a good strategy?
2. Using TPM to compare gene/transcript expression within each sample sequenced with Illumina. For example, checking whether gene.X transcripts are more abundant than gene.Y transcripts within sample_1 sequenced with Illumina. Using TPM instead of CPM for Illumina seems like a good option since Illumina has transcript length bias (a single long transcript will have more counts than a single short transcript). Does this sound like a good strategy?
3. Here is where I am having trouble coming up with a good normalization strategy: comparing gene/transcript expression between the same sample sequenced with Illumina and with Nanopore, e.g., computing a Spearman correlation between gene expression in sample_1 sequenced with Illumina and sample_1 sequenced with Nanopore. I am not sure what would work here since Illumina has transcript length bias and Nanopore does not. Do you have any suggestions?
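The plan above can be sketched in a few lines of plain Python: CPM for the Nanopore counts, TPM for the Illumina counts, then a Spearman correlation across shared genes. This is only a minimal sketch, not a recommendation. The counts, lengths, and gene names are made-up toy values, and the helper functions (`cpm`, `tpm`, `average_ranks`, `spearman`) are hypothetical names written for this illustration rather than taken from any quantification tool.

```python
from typing import Dict, List


def cpm(counts: Dict[str, float]) -> Dict[str, float]:
    """Counts per million: scale by library size only."""
    total = sum(counts.values())
    return {g: c / total * 1e6 for g, c in counts.items()}


def tpm(counts: Dict[str, float], lengths: Dict[str, float]) -> Dict[str, float]:
    """Transcripts per million: divide by length first, then scale to 1e6."""
    rate = {g: counts[g] / lengths[g] for g in counts}
    total = sum(rate.values())
    return {g: r / total * 1e6 for g, r in rate.items()}


def average_ranks(values: List[float]) -> List[float]:
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(x: List[float], y: List[float]) -> float:
    """Spearman rho computed as the Pearson correlation of the ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


# Hypothetical toy data for one sample sequenced on both platforms.
lengths = {"gene.X": 1000, "gene.Y": 4000, "gene.Z": 2000}
illumina_counts = {"gene.X": 100, "gene.Y": 400, "gene.Z": 600}
nanopore_counts = {"gene.X": 12, "gene.Y": 10, "gene.Z": 30}

illumina_tpm = tpm(illumina_counts, lengths)  # item 2 of the plan
nanopore_cpm = cpm(nanopore_counts)           # item 1 of the plan

# Item 3: rank-based comparison across the genes both platforms share.
genes = sorted(lengths)
rho = spearman([illumina_tpm[g] for g in genes], [nanopore_cpm[g] for g in genes])
print(f"Spearman rho (Illumina TPM vs Nanopore CPM): {rho:.2f}")
```

Because Spearman works on ranks, the per-sample scaling (library size and the 1e6 factor) has no effect on the result. The division by length in TPM does change the ranking, so that step is where the choice of unit actually matters for item 3.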

Any help here will be greatly appreciated.

Best, Bernardo

RNAseq long-reads normalization short-reads • 1.0k views

updated 17 days ago by rpolicastro 13k • written 14 months ago by Bernardo ▴ 20

0

Hi, I was wondering about something similar. I have both Illumina (short) and Nanopore (long) reads. What sort of normalisation did you end up using? I have bambu output for Nanopore and salmon output for Illumina. I think each applies some sort of normalisation, but what about within each sample and between samples? I am having trouble coming up with something that makes sense for my analysis other than correcting for library size.

Any input is greatly appreciated!

17 days ago by newuser2024 • 0

0

The author of Salmon just preprinted oarfish, a formalized adaptation of Salmon to Nanopore.

https://www.biorxiv.org/content/10.1101/2024.02.28.582591v1

It should make comparison easier since they both follow similar principles for transcript abundance estimation.

17 days ago by rpolicastro 13k

score 3 · Accepted Answer · 2023-01-26

I have done some work in this area, and it definitely has a lot of challenges. The biggest difference is that short-read RNAseq measurements are calculated with transcript length taken into account (normalized against it), whereas long-read tools do not normalize to transcript length. If you are using direct RNA sequencing, the reads are the actual transcripts, which follows from how the chemistry works. I believe there is a paper that attempts this:

https://academic.oup.com/nar/article/50/4/e19/6439677

They use an EM-based method, similar to kallisto or salmon for Illumina. I will report back on my work later when I have more time.

I worked with the great rpolicastro at the time. He must have some great thoughts.
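To make the length point in this answer concrete, here is a small worked example under a deliberately idealized assumption: short-read fragment counts scale with molecules times transcript length, and long reads give exactly one full-length read per molecule. Real libraries deviate from both assumptions (truncated reads, degradation, coverage biases), and all numbers and names below are hypothetical.

```python
from typing import Dict


def per_million(values: Dict[str, float]) -> Dict[str, float]:
    """Rescale values so they sum to one million."""
    total = sum(values.values())
    return {name: v / total * 1e6 for name, v in values.items()}


# Hypothetical sample: equal numbers of molecules, different lengths.
molecules = {"gene.X": 50, "gene.Y": 50}
lengths = {"gene.X": 1000, "gene.Y": 4000}
fragment_length = 200  # assumed fragment size, for illustration only

# Idealized short reads: fragments per molecule scale with length.
short_counts = {g: molecules[g] * lengths[g] // fragment_length for g in molecules}
# Idealized long reads: one full-length read per molecule.
long_counts = dict(molecules)

short_cpm = per_million(short_counts)
short_tpm = per_million({g: short_counts[g] / lengths[g] for g in short_counts})
long_cpm = per_million(long_counts)

# Columns: gene, short-read CPM, short-read TPM, long-read CPM
for g in molecules:
    print(g, round(short_cpm[g]), round(short_tpm[g]), round(long_cpm[g]))
```

In this toy setting the short-read CPM favours the longer gene.Y, while short-read TPM and long-read CPM both report the two genes as equally abundant. That agreement holds only as far as the two idealized assumptions do.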