Question

Normalizing RNA-seq read counts within a single sample - gut check

3.2 years ago

plberry ▴ 30

I'm doing a pilot analysis on some public RNA-seq datasets that include mRNA-seq, Ribo-seq, and RNC-seq for the same samples. The analysis is part of a larger project to look at ways to predict protein structure/function/class by examining translational dynamics and seeing what they have in common.

I've got my reads all mapped to the transcriptome and was starting my usual normalization for differential expression analysis when I started wondering whether that is really appropriate. Would the FPKM output by STAR's genecounts mode be adequate, since I am not really going to be doing DE analysis or comparing read counts between samples, but rather comparing different transcripts within samples?

Would the fact that we will eventually be comparing the translational dynamics (TE, TI, EV, and more) between organisms, but for homologous proteins, influence this?

My intuition says that FPKM is sufficient, since I will be comparing ratios instead of raw counts between organisms/samples.

RNA-Seq • 704 views

score 2 · Accepted Answer · 2021-02-01

3.2 years ago

rpolicastro 13k

If you want to quantify read counts over transcripts instead of genes, you should use a program such as Salmon rather than exon-counting software like featureCounts or the STAR quantification mode. You can use the Salmon TPM values for your downstream analysis, since FPKM is no longer recommended.
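As a side note on the within-sample question: within a single sample, TPM is just FPKM rescaled so that the values sum to one million, so ratios between transcripts in the same sample are the same under either unit. A minimal sketch of that rescaling follows; the function name and the toy values are made up for illustration and are not output from any real tool.

```python
import math


def fpkm_to_tpm(fpkm):
    """Rescale per-transcript FPKM values from ONE sample to TPM.

    `fpkm` maps transcript ID -> FPKM. The IDs and values used below are
    hypothetical and exist only to illustrate the arithmetic.
    """
    total = sum(fpkm.values())
    if total <= 0:
        raise ValueError("FPKM values must sum to a positive number")
    # TPM_i = FPKM_i / sum(FPKM) * 1e6
    return {tx: value / total * 1e6 for tx, value in fpkm.items()}


# Toy single-sample example (invented numbers).
fpkm = {"tx_a": 10.0, "tx_b": 30.0, "tx_c": 60.0}
tpm = fpkm_to_tpm(fpkm)

# The scaling factor is shared by every transcript in the sample,
# so within-sample ratios are unchanged.
ratio_fpkm = fpkm["tx_b"] / fpkm["tx_a"]
ratio_tpm = tpm["tx_b"] / tpm["tx_a"]
assert math.isclose(ratio_fpkm, ratio_tpm)
```

The difference shows up between samples: the scaling factor is sample-specific, so TPM values always sum to the same total per sample while FPKM values do not.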

The reason for this is that transcripts of a gene share most of their sequence, so it is difficult to assign reads uniquely to individual transcripts. This uncertainty has to be estimated for a proper downstream (differential) analysis. Salmon handles this by producing Gibbs or bootstrap replicates, which some downstream differential analysis frameworks such as swish can make use of. I'm not sure what your exact analysis is going to be with a single sample, though. What is TE/TI/EV, by the way?

3.2 years ago by ATpoint 81k

Apologies, "TI" should be "TR".

TE is translational efficiency, which I learned as RPKM (sub in the updated normalization scheme) from your Ribo-seq data divided by RPKM from your mRNA-seq data. It should show, as a ratio, how many ribosomes are actually bound to a transcript. So if you have a highly transcribed gene but it has a terrible RBS or a leaky Kozak sequence, it should have a low TE.

TR, or translation initiation efficiency, is the same but with RPKM from your RNC-seq data in the numerator. This should show the ratio of transcripts where translation initiation actually occurred.

EVI is elongation velocity index and uses all three: (RPKM from RNC-seq)^2 / (RPKM from Ribo-seq × RPKM from mRNA-seq). If you have a low EVI, you probably have a lot of ribosome pausing, and that opens the door for more co-co translational folding. (For example, if the interfaces of a homodimer are close to the N-terminus, two ribosomes travelling on the same mRNA will have the interfaces for the two peptide chains trailing along. If one ribosome hits a hairpin, or some other factor causes it to pause, the second ribosome might bump into it, bringing the interface regions of the two peptide chains together. When translation starts again, the rest of each subunit is produced, and the homodimer already has a head start on assembly.)
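To make the three definitions concrete, here is a minimal sketch that computes them per transcript from normalized abundances. It assumes you already have one abundance value per transcript for each library type (RPKM, TPM, or whatever scheme you settle on); the function name, the dictionary layout, and the numbers are all hypothetical.

```python
import math


def translation_metrics(mrna, ribo, rnc):
    """Compute TE, TR and EVI for transcripts present in all three libraries.

    Each argument maps transcript ID -> normalized abundance for one library
    type (mRNA-seq, Ribo-seq, RNC-seq). Transcripts with a non-positive
    mRNA-seq or Ribo-seq value are skipped to avoid division by zero; that
    filtering rule is an assumption of this sketch, not part of the
    definitions.
    """
    metrics = {}
    for tx in mrna.keys() & ribo.keys() & rnc.keys():
        m, r, n = mrna[tx], ribo[tx], rnc[tx]
        if m <= 0 or r <= 0:
            continue
        metrics[tx] = {
            "TE": r / m,            # Ribo-seq / mRNA-seq
            "TR": n / m,            # RNC-seq / mRNA-seq
            "EVI": n ** 2 / (r * m),  # RNC-seq^2 / (Ribo-seq x mRNA-seq)
        }
    return metrics


# Invented abundances for two hypothetical transcripts.
mrna = {"tx1": 100.0, "tx2": 0.0}
ribo = {"tx1": 50.0, "tx2": 5.0}
rnc = {"tx1": 80.0, "tx2": 1.0}

result = translation_metrics(mrna, ribo, rnc)
assert math.isclose(result["tx1"]["EVI"], 1.28)
assert "tx2" not in result  # skipped: zero mRNA-seq abundance
```

Written this way, EVI works out to TR^2 / TE, so any per-sample scaling constant shared by the three libraries cancels only if all three are normalized with the same scheme.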

The downstream analysis is (probably) going to involve building a model of translation dynamics, including looking at RNA secondary structure, motifs in sequences flanking where ribosomes are bound, codon choice, which codons the A- and P- sites are bound to, how close to consensus the Kozak sequence is, etc. The impetus for this is that the bench folks I work with have noticed some interesting similarities in protein phase-transition behavior from seemingly unrelated proteins in widely different classes and so we also want to see if there's a common thread between the proteins that exhibit this behavior to guide further investigation.

3.2 years ago by plberry ▴ 30

Reading through the Salmon documentation, it seems like I can run my already transcriptome-mapped reads through it. Since I am going to be using the actual sequences of the ribosome footprints from the Ribo-seq analysis for some molecular-level dynamics analysis, will that avoid potential problems with reads being mapped differently by Salmon in mapping mode vs. STAR? If it makes a difference, after filtering out contamination and ncRNA, I ran STAR to output only uniquely mapped reads.

3.2 years ago by plberry ▴ 30

It's fine to run Salmon in alignment mode if you need the reads aligned as well. The counts will be more accurate in mapping-based mode, but alignment mode will still be an upgrade over exon counting.

3.2 years ago by rpolicastro 13k