Hi, I'm new to this, so I'm a bit confused and would appreciate your help. I'm using Galaxy to analyze my RNA-seq data. So far I've used: HISAT2 (based on the genome) --> htseq-count --> DESeq2. It is my understanding that DESeq2 normalizes by library size, is that correct? But not by CDS length? I wish to compare gene expression both across different samples and within the same sample. So are both of these normalizations (library size and CDS length) needed? If so, what tool available in Galaxy can I use? What about Degust? It offers edgeR and voom/limma methods. Do these normalize the results the way I need? Thanks a lot in advance!
I think you need to step back for a moment and think about what each measurement is good for.
In general, you want to use counts when you are looking at the expression of one gene in multiple conditions and TPM when you are comparing or ranking the expression of different genes.
You start at the most basic level, with the count of reads aligning to genes. The count is a quantification of expression, but it is also a partial measurement of variance, in that it is a good indicator of how much counting noise there is (i.e. the difference in counting uncertainty between 1 and 2 is greater than the difference between 100 and 200). However, if you sequence one library twice as deeply as another, the raw counts are no longer comparable. So you have to normalize by library size.
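To make the two points above concrete, here is a minimal sketch in plain Python. The gene names and counts are made up, and the scaling shown is simple counts-per-million, used here only as an illustration; it is not the size-factor method that DESeq2 itself applies. The noise function assumes Poisson-like counting noise, where the relative uncertainty is roughly 1/sqrt(count).

```python
import math

def relative_counting_noise(count):
    """Approximate relative uncertainty of a count, assuming Poisson noise."""
    return 1.0 / math.sqrt(count)

def counts_per_million(counts):
    """Scale each gene's count by the library's total read count."""
    total = sum(counts.values())
    return {gene: c * 1e6 / total for gene, c in counts.items()}

# Hypothetical data: library B is the same sample sequenced twice as deeply.
lib_a = {"geneX": 100, "geneY": 300, "geneZ": 600}
lib_b = {"geneX": 200, "geneY": 600, "geneZ": 1200}

cpm_a = counts_per_million(lib_a)
cpm_b = counts_per_million(lib_b)

# Raw counts differ two-fold, but the scaled values agree.
print(lib_a["geneX"], lib_b["geneX"])
print(cpm_a["geneX"], cpm_b["geneX"])

# Low counts are proportionally much noisier than high counts.
for n in (1, 2, 100, 200):
    print(n, round(relative_counting_noise(n), 3))
```

The scaled values only answer "did sequencing depth cause this difference?"; the raw counts are still what tells you how much to trust each number.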
DESeq2 uses (mostly) counts normalized by library size. It intentionally doesn't normalize by length, because it wants to preserve something approximating the absolute number of reads that was used to quantify the gene's expression. This is OK because when you are only looking at one gene across conditions, the denominator (the length of the gene) is the same in every sample and cancels out, so you don't have to worry about it.
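The cancellation is easy to check with a couple of lines of arithmetic. This is just a sketch with invented numbers (a hypothetical 2,000-base gene and made-up library-size-normalized counts), not anything DESeq2 computes internally.

```python
import math

def per_length(count, gene_length):
    """Length-normalized expression: reads per base of the gene."""
    return count / gene_length

# Hypothetical values for one gene measured in two conditions.
gene_length = 2000          # same gene, so the same length in both samples
control_count = 150         # library-size-normalized count, control
treated_count = 600         # library-size-normalized count, treated

fold_change_counts = treated_count / control_count
fold_change_per_length = (
    per_length(treated_count, gene_length) / per_length(control_count, gene_length)
)

# Dividing both samples by the same length leaves the ratio unchanged.
print(fold_change_counts, fold_change_per_length)
print(math.isclose(fold_change_counts, fold_change_per_length))
```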
However, you can't use raw counts to get the overall abundance of an RNA in a cell, which is sometimes what you want. Then you want to use TPM, which normalizes by length. A 100-base transcript with 100 reads has more copies than a 10,000-base transcript with 100 reads. So if I were deciding on a gene target based on expression and wanted something that had limited off-target effects, I would use TPM to see if there are 1 or 2 transcripts in a cell (I guess OK) vs 100s of transcripts (probably bad). I am not sure that's completely right in real life, but it's sort of theoretically right.
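Here is the 100-base vs 10,000-base example worked through as a rough sketch. It uses the textbook TPM arithmetic (reads per base, rescaled so the sample sums to one million) on a toy two-transcript sample; the transcript names are made up, and real tools such as RSEM do more than this (for example, they use effective lengths and handle multi-mapping reads).

```python
import math

def tpm(counts, lengths):
    """Simplified TPM: reads per base, rescaled to sum to one million."""
    rate = {t: counts[t] / lengths[t] for t in counts}
    scale = sum(rate.values())
    return {t: r / scale * 1e6 for t, r in rate.items()}

# Hypothetical sample with two transcripts and identical read counts.
counts = {"short_tx": 100, "long_tx": 100}
lengths = {"short_tx": 100, "long_tx": 10_000}

values = tpm(counts, lengths)
for name, value in values.items():
    print(name, round(value, 1))

# Same read count, but the short transcript is far more abundant per copy.
print(round(values["short_tx"] / values["long_tx"], 1))
```

Because TPM values in a sample always sum to the same total, they are suited to ranking genes within that sample, which is the use case described above.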
For read counts I usually use something like bedtools, restricted to uniquely mapped reads. I don't know if there's a tool for that. For TPM I usually use RSEM, but there are lots of good solutions for both.
I have an old blog post here: http://michelebusby.tumblr.com/post/26913184737/thinking-about-designing-rna-seq-experiments-to that goes into some of these ideas in more depth.