Question

How to normalize long-read RNA-seq data for comparison with short-reads

3

21 months ago

Bernardo ▴ 30

I am working on a project comparing RNA-seq quantification results between Illumina short reads and Nanopore long reads, and I have a couple of questions. More specifically, I need help figuring out how to normalize the data for comparisons within samples and between samples. So far I have come up with the following plan:

1. Using CPM to compare gene/transcript expression within each sample sequenced with Nanopore. For example, checking whether gene.X transcripts are more abundant than gene.Y transcripts within sample_1 sequenced with Nanopore. Using CPM instead of TPM for Nanopore seems like a good option since our Nanopore runs do not have transcript length bias. Does this sound like a good strategy?
2. Using TPM to compare gene/transcript expression within each sample sequenced with Illumina. For example, checking whether gene.X transcripts are more abundant than gene.Y transcripts within sample_1 sequenced with Illumina. Using TPM instead of CPM for Illumina seems like a good option since Illumina has transcript length bias (a single long transcript will have more counts than a single short transcript). Does this sound like a good strategy?
3. Comparing gene/transcript expression between the same sample sequenced with Illumina and with Nanopore, e.g., computing a Spearman correlation between gene expression in sample_1 sequenced with Illumina and sample_1 sequenced with Nanopore. This is where I am having trouble coming up with a good normalization strategy. I am not sure what would work here since Illumina has transcript length bias and Nanopore does not. Do you have any suggestions?
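
To make the plan above concrete, here is a minimal Python sketch of CPM, TPM, and a Spearman correlation across the genes shared by the two platforms. The gene names, counts, and lengths are invented toy values, and `cpm`, `tpm`, `average_ranks`, and `spearman` are hypothetical helper names, not functions from any quantification tool. It also assumes plain transcript length in kb rather than an effective length.

```python
def cpm(counts):
    """Counts per million: scale each count by the library size only."""
    total = sum(counts.values())
    return {g: c / total * 1e6 for g, c in counts.items()}


def tpm(counts, lengths_kb):
    """Transcripts per million: divide by length first, then rescale to 1e6."""
    rate = {g: counts[g] / lengths_kb[g] for g in counts}
    total = sum(rate.values())
    return {g: r / total * 1e6 for g, r in rate.items()}


def average_ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman correlation computed as the Pearson correlation of the ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)


# Toy data, invented for illustration only.
lengths_kb = {"geneA": 1.0, "geneB": 2.0, "geneC": 4.0, "geneD": 0.5}
illumina_counts = {"geneA": 100, "geneB": 400, "geneC": 400, "geneD": 30}
nanopore_counts = {"geneA": 50, "geneB": 90, "geneC": 40, "geneD": 30}

illumina_tpm = tpm(illumina_counts, lengths_kb)  # length-corrected
nanopore_cpm = cpm(nanopore_counts)              # depth-corrected only

# Restrict the comparison to genes quantified on both platforms.
genes = sorted(set(illumina_tpm) & set(nanopore_cpm))
rho = spearman([illumina_tpm[g] for g in genes],
               [nanopore_cpm[g] for g in genes])
print(f"Spearman rho across {len(genes)} shared genes: {rho:.3f}")
```

Because Spearman uses only ranks, the per-sample library-size scaling in CPM or TPM cannot change the result; only the length correction can reorder genes within a sample. In this sketch, then, the real choice for the cross-platform correlation is whether the Illumina values are length-normalized.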

Any help here will be greatly appreciated.

Best, Bernardo

RNAseq long-reads normalization short-reads • 1.8k views

updated 3 months ago by GenoMax 147k • written 21 months ago by Bernardo ▴ 30

0

Hi, I was wondering about something similar. I have both Illumina (short) and nanopore (long) reads. What sort of normalisation did you end up using? I have bambu output for nanopore and salmon for Illumina. I think each has some sort of normalisation but what about within each sample and between samples? I have trouble coming up with something that makes sense for my analysis other than correcting for library size?

Any input is greatly appreciated!

7 months ago by newuser2024 • 0

1

The author of Salmon just preprinted oarfish, a formalized adaptation of Salmon to Nanopore.

https://www.biorxiv.org/content/10.1101/2024.02.28.582591v1

It should make comparison easier since they both follow similar principles for transcript abundance estimation.

7 months ago by rpolicastro 13k

0

3 months ago

callumjcparr ▴ 90

TL;DR

Would salmon TPM for short reads and oarfish counts with some sequencing-depth normalisation help when comparing short-read with long-read data?

In more verbose terms, am I right in thinking that:

Salmon

Gives NumReads, an estimated count that distributes uniquely and ambiguously mapping reads among transcripts according to its model, but that is not normalized for sequencing depth
NumReads is then converted to TPM, which normalizes for sequencing depth and for the length bias of short reads
TPM should be used for short reads in any downstream analysis
salmon can be used for long reads, especially with the --ont model, but it would be more correct to use the NumReads count rather than TPM
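
As a rough check of the NumReads-to-TPM relationship described above, the sketch below recomputes TPM from the `NumReads` and `EffectiveLength` columns of a salmon `quant.sf` table and compares it with a depth-only rescaling of `NumReads`. The table is a toy stand-in with invented values, and `read_quant`, `tpm_from_numreads`, and `reads_per_million` are hypothetical helper names; it assumes the usual `quant.sf` column layout (`Name`, `Length`, `EffectiveLength`, `TPM`, `NumReads`).

```python
import csv
import io

# Toy stand-in for a salmon quant.sf file (values invented for illustration).
QUANT_SF = (
    "Name\tLength\tEffectiveLength\tTPM\tNumReads\n"
    "tx1\t1000\t800.0\t100000.0\t80.0\n"
    "tx2\t2000\t1800.0\t200000.0\t360.0\n"
    "tx3\t500\t300.0\t700000.0\t210.0\n"
)


def read_quant(text):
    """Parse tab-separated quant.sf-style text into a list of row dicts."""
    return list(csv.DictReader(io.StringIO(text), delimiter="\t"))


def tpm_from_numreads(rows):
    """TPM: reads per unit of effective length, rescaled to sum to 1e6."""
    rate = {r["Name"]: float(r["NumReads"]) / float(r["EffectiveLength"])
            for r in rows}
    total = sum(rate.values())
    return {name: x / total * 1e6 for name, x in rate.items()}


def reads_per_million(rows):
    """Depth-only normalization: NumReads rescaled to sum to 1e6."""
    total = sum(float(r["NumReads"]) for r in rows)
    return {r["Name"]: float(r["NumReads"]) / total * 1e6 for r in rows}


rows = read_quant(QUANT_SF)
recomputed = tpm_from_numreads(rows)
depth_only = reads_per_million(rows)

for r in rows:
    name = r["Name"]
    print(f"{name}\tTPM(file)={float(r['TPM']):.1f}"
          f"\tTPM(recomputed)={recomputed[name]:.1f}"
          f"\tper-million(depth only)={depth_only[name]:.1f}")
```

The depth-only version is the kind of value one might use for long reads if each read is taken to represent one transcript molecule; whether that assumption holds for a given library is a separate question.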

oarfish

Is designed for long reads only and distributes ambiguously mapping reads among transcripts in a way similar to salmon. As it only gives estimated read counts (the equivalent of salmon's NumReads), we should use those counts for downstream analysis.
Do we need a further normalization to account for sequencing depth so that we can then compare across samples and/or platforms?

I am also trying to compare the sensitivity of transcript detection between some short-read data and long-read data from a similar source of RNA.
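
One simple way to frame the detection comparison is as set overlap: call a transcript "detected" when its estimated count reaches a cutoff, then compare the detected sets between platforms. The sketch below assumes this framing; the cutoff of 5 reads, the toy counts, and the helper name `detected` are all hypothetical choices for illustration, not a recommended standard.

```python
def detected(abundance, min_value=1.0):
    """Return the set of transcripts whose estimate is at least min_value."""
    return {tx for tx, value in abundance.items() if value >= min_value}


# Toy estimated read counts per transcript (invented for illustration).
short_read = {"tx1": 120.0, "tx2": 0.0, "tx3": 8.0, "tx4": 3.0}
long_read = {"tx1": 40.0, "tx2": 6.0, "tx3": 0.0, "tx4": 9.0}

CUTOFF = 5.0  # arbitrary threshold, an assumption of this sketch

short_set = detected(short_read, CUTOFF)
long_set = detected(long_read, CUTOFF)

both = short_set & long_set
short_only = short_set - long_set
long_only = long_set - short_set
jaccard = len(both) / len(short_set | long_set)

print("detected by both:", sorted(both))
print("short-read only:", sorted(short_only))
print("long-read only:", sorted(long_only))
print(f"Jaccard index: {jaccard:.2f}")
```

Detection at a fixed count cutoff depends on sequencing depth, so the libraries would probably need comparable depth (or the deeper one subsampled) before the overlap says much about sensitivity.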

3 months ago by callumjcparr ▴ 90

0

Please post this as a new question since you are asking about specific programs for people to weigh in on. Then come back here to delete this answer.

3 months ago by GenoMax 147k

score 3 · Accepted Answer · 2023-01-26

I did some work in this area before, and it definitely has a lot of challenges. The biggest difference is that short-read RNA-seq measurements are calculated taking transcript length into account (normalized against it), whereas long-read tools do not normalize to transcript length. If you are using direct RNA-seq, the reads are the actual transcripts, because of the science behind the chemistry. I believe there is a paper that attempts this:

https://academic.oup.com/nar/article/50/4/e19/6439677

They use an EM-based method, similar to kallisto or salmon for Illumina. I will report back on my work later when I have more time.

I worked with the great rpolicastro at the time. He must have some great thoughts.