I am working on the analysis of an RNA-seq experiment, with the end goal of using DESeq2 to generate a list of differentially expressed genes. We have 4 biological replicates for each of 4 conditions (differing genotypes in mice). The data are 50 bp single-end Illumina reads from total RNA (the samples had low input and needed an UltraLow kit, hence total RNA rather than, say, poly(A)-selected or RiboZero-treated RNA).
I have run Salmon on the FASTQ files, using a mouse GRCm38 v3 cDNA all FASTA file to build the Salmon index, and now have transcript-level read counts for every data set. At first glance we have compared these to a dataset from a similar prior experiment (I don't know much about the analysis done for that prior experiment), and the top expressed genes by count show the anticipated overlap (literally just looking by eye), so things superficially seem to be heading in the right direction.
Here is the meat of the question: the manual for DESeq(2) is quite clear that raw counts are required as input, and it describes these as the read counts mapped to gene i in sample j.
Salmon (and other similar tools such as Sailfish and Kallisto) count at the transcript level. It would seem trivial to sum to the gene level, and I am having trouble convincing myself that this would be an error. For example, presumably any read with mapping quality sufficient to be called as mapping to transcript 1 of gene X will show up as a count there, whereas another read mapping to transcript 2 of gene X will show up under that transcript, and neither will ever be double counted. Thus, simply summing transcript counts to the gene level should be straightforward and should not introduce error.
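To make the summation I have in mind concrete, here is a minimal sketch in Python. The transcript IDs, the counts, the `tx2gene` mapping, and the function name `sum_to_gene_level` are all made up for illustration; this is not taken from Salmon or DESeq2. One toy value is deliberately fractional, because Salmon's `NumReads` column holds estimated counts that need not be integers.

```python
from collections import defaultdict


def sum_to_gene_level(transcript_counts, tx2gene):
    """Sum per-transcript counts into per-gene totals.

    transcript_counts: dict of transcript ID -> count (hypothetical toy input)
    tx2gene: dict of transcript ID -> gene ID (hypothetical mapping)
    """
    gene_counts = defaultdict(float)
    for transcript_id, count in transcript_counts.items():
        # Each transcript contributes to exactly one gene, so no count
        # is added twice.
        gene_counts[tx2gene[transcript_id]] += count
    return dict(gene_counts)


# Toy data, invented for illustration only.
transcript_counts = {"geneX_tx1": 120.0, "geneX_tx2": 30.5, "geneY_tx1": 8.0}
tx2gene = {"geneX_tx1": "geneX", "geneX_tx2": "geneX", "geneY_tx1": "geneY"}

gene_counts = sum_to_gene_level(transcript_counts, tx2gene)
print(gene_counts)
```

The open question is whether totals like these can be handed to DESeq2 as if they were raw gene-level counts.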
However, I wonder if I am glossing over something (or am simply naive about this) that would call the validity of this approach into question. The Salmon developers feel this is a valid, even superior, approach and state exactly this, but I find other discussions that seem to question the general approach of summing transcript-level counts to the gene level. I note that Kallisto and Salmon appear to operate using approximately similar approaches, so discussions relevant to one are probably relevant to the other.
Thoughts and discussion would be very helpful to me. I am happy to start over using a different mapping approach and have used Bowtie, TopHat, Cufflinks, etc., but I very much like the simplicity of mapping only to transcripts and the speed with which this can be accomplished.
Thanks in advance for illumination and insight. I may (almost certainly) have glossed over or misrepresented a few things along the way in forming this question as this is a relatively new field to me.