I'm doing a pilot analysis on some public RNA-seq datasets that include mRNA-seq, Ribo-seq, and RNC-seq for the same samples. The analysis is part of a larger project to look at ways to predict protein structure/function/class by examining translational dynamics and seeing what they have in common.
I've got my reads all mapped to the transcriptome and was starting my usual normalization for differential expression analysis when I started wondering whether that is really appropriate. Would FPKM computed from the counts that STAR's GeneCounts mode outputs be adequate, since I am not really going to be doing DE analysis (comparing read counts between samples) but comparing different transcripts within samples?
Would the fact that we will eventually be comparing the translational dynamics (TE, TI, EV, and more) of homologous proteins between organisms influence this?
My intuition says that FPKM is sufficient, since I will be comparing ratios instead of raw counts between organisms/samples.
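To make the within-sample point concrete, here is a minimal sketch with toy counts and lengths (the function names and numbers are made up for illustration, not taken from any pipeline). It only shows that FPKM and TPM differ by a single constant factor inside one sample, so comparisons between transcripts of the same sample come out the same either way. It says nothing about read-assignment uncertainty between isoforms.

```python
import math

def fpkm(counts, lengths):
    """Fragments per kilobase of transcript per million mapped fragments."""
    total = sum(counts)
    return [c * 1e9 / (l * total) for c, l in zip(counts, lengths)]

def tpm(counts, lengths):
    """Transcripts per million: length-normalize first, then scale to 1e6."""
    rates = [c / l for c, l in zip(counts, lengths)]
    scale = sum(rates)
    return [r / scale * 1e6 for r in rates]

# Toy data for three transcripts in ONE sample (hypothetical values).
counts = [100, 400, 500]
lengths = [1000, 2000, 500]  # transcript lengths in bases

f = fpkm(counts, lengths)
t = tpm(counts, lengths)

# Within a sample, TPM / FPKM is the same constant for every transcript.
ratios = [ti / fi for ti, fi in zip(t, f)]
same_scale = all(math.isclose(r, ratios[0]) for r in ratios)
print(same_scale)
```

Between samples or assays that constant differs, so a ratio such as Ribo-seq over mRNA-seq carries an extra per-sample scale factor shared by every transcript in that sample.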
The reason for this is that transcripts of a gene share most of their sequence, so it is actually difficult to assign reads uniquely to transcripts. This uncertainty has to be estimated for a proper downstream (differential) analysis. Salmon addresses this by producing Gibbs or bootstrap replicates, which some downstream differential analysis frameworks such as swish can make use of. I'm not sure what your exact analysis is going to be with a single sample, though. What are TE/TI/EV, by the way?
Apologies, "TI" should be "TR".
TE is translational efficiency, which I learned as RPKM (sub in the updated normalization scheme) from your Ribo-seq data divided by RPKM from your mRNA-seq data. It should reflect how many ribosomes are actually bound per transcript. So if you have a highly transcribed gene but it's got a terrible RBS or a leaky Kozak sequence, it should have a low TE.
TR, or translation initiation efficiency, is the same, but with RPKM from your RNC-seq data in the numerator. This should show the fraction of transcripts on which translation initiation actually occurred.
EVI is elongation velocity index and uses all three: (RPKM from RNC-seq)^2 / (RPKM from Ribo-seq x RPKM from mRNA-seq). If you have a low EVI, you probably have a lot of ribosome pausing, and that opens the door for more co-co translational folding. (For example, if the interfaces of a homodimer are close to the N-terminus, two ribosomes travelling on the same mRNA will have the interfaces for the two peptide chains trailing along. If one ribosome hits a hairpin, or some other factor causes it to pause, the second ribosome might bump into it, bringing the interface regions of the two peptide chains together. When translation starts again, the rest of each subunit will be produced, and the homodimer already has a head start on assembly.)
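As a rough sketch, the three ratios above could be computed per transcript like this. The function name and the `min_rpkm` cutoff are my own assumptions for illustration (the cutoff just avoids dividing by near-zero values), and the input numbers are made up.

```python
def translation_metrics(mrna_rpkm, ribo_rpkm, rnc_rpkm, min_rpkm=1.0):
    """Return TE, TR and EVI for one transcript.

    Returns None when any input falls below min_rpkm, an arbitrary
    illustrative cutoff that keeps the ratios from exploding.
    """
    if min(mrna_rpkm, ribo_rpkm, rnc_rpkm) < min_rpkm:
        return None
    te = ribo_rpkm / mrna_rpkm                      # Ribo-seq / mRNA-seq
    tr = rnc_rpkm / mrna_rpkm                       # RNC-seq / mRNA-seq
    evi = rnc_rpkm ** 2 / (ribo_rpkm * mrna_rpkm)   # RNC^2 / (Ribo x mRNA)
    return {"TE": te, "TR": tr, "EVI": evi}

# Made-up values for a single transcript.
example = translation_metrics(mrna_rpkm=50.0, ribo_rpkm=25.0, rnc_rpkm=100.0)
print(example)  # {'TE': 0.5, 'TR': 2.0, 'EVI': 8.0}
```

Note that with these definitions EVI is just TR^2 / TE, so it adds no information beyond the other two ratios; it only repackages them.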
The downstream analysis is (probably) going to involve building a model of translation dynamics, including looking at RNA secondary structure, motifs in sequences flanking where ribosomes are bound, codon choice, which codons occupy the A- and P-sites, how close to consensus the Kozak sequence is, etc. The impetus for this is that the bench folks I work with have noticed some interesting similarities in protein phase-transition behavior among seemingly unrelated proteins in widely different classes, so we also want to see if there's a common thread between the proteins that exhibit this behavior to guide further investigation.
Reading through the Salmon documentation, it seems like I can run my already transcriptome-mapped reads through it. Since I am going to be using the actual sequences of the ribosome footprints from the Ribo-seq analysis for some molecular-level dynamics analysis, will that avoid potential problems with reads being mapped differently by Salmon in mapping mode than by STAR? If it makes a difference, after filtering out contamination and ncRNA, I ran STAR so that it only output uniquely mapped reads.
It's fine to run Salmon in alignment mode if you need the reads aligned as well. The counts will be more accurate in mapping-based mode, but alignment mode will still be an upgrade over exon counting.