This is not a trivial question. I am facing some problems with the analysis of differentially expressed (DE) genes in mouse.
Most DE analysis methods must face two challenges: first, accounting for the existence and differential expression of multiple isoforms per gene, and second, the variability observed across biological replicates. These two need to be handled together. Regarding the first, collapsing (summarizing) the raw isoform counts to the gene level notably affects the output of my analysis.
My DE analyses give different outputs depending on the program or package I use. I also need to consider, for the purposes of this thread, whether I should generate pseudo-aligned counts or mapped BAM files.
Most packages address one of these challenges but not the other. The plethora of R packages and/or standalone programs available for this purpose is large, and I would like your input in this discussion to help me and others make the best choice for this type of analysis.
To give some context, I started with kallisto and DESeq2, only to find that after importing the data with the tximport package, setting txOut = FALSE or TRUE gives two completely different outputs: one where no DE genes are found at all, and the other giving plausible but inaccurate DE results.
What do you recommend?
It is unclear to me what the question is. Do you want gene-level or transcript-level analysis? From what I know, gene-level analysis does not notably benefit from taking into account the uncertainty of mapping reads to transcripts of the same gene, since all reads get summarized to the gene level anyway. In contrast, for transcript-level analysis the uncertainty needs to be taken into account, e.g. via inferential replicates, which both kallisto and salmon can return. This is discussed e.g. in this paper from Mike Love. That said, DESeq2 itself is built for gene-level, not transcript-level, analysis, as it does not make use of inferential replicates. Hence
txOut=TRUE/FALSE makes a big difference, both because dispersion estimation is probably different and because the multiple-testing burden definitely is.
tximport compensates for that when summarizing to the gene level; in fact, that is the whole point of the method, so
txOut=FALSE should be fine for gene-level analysis.
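For reference, a gene-level import along these lines might look as follows. This is a minimal sketch, not the poster's exact code: the file paths, sample table, and tx2gene mapping are placeholders you would build for your own data.

```r
library(tximport)
library(DESeq2)

# One kallisto output directory per sample. If you later want
# inferential replicates for transcript-level tools, run kallisto
# with bootstraps, e.g. `kallisto quant -i index.idx -b 100 ...`
samples <- data.frame(
  sample    = c("ctrl1", "ctrl2", "treat1", "treat2"),  # placeholder names
  condition = factor(c("ctrl", "ctrl", "treat", "treat"))
)
files <- file.path("kallisto_out", samples$sample, "abundance.h5")
names(files) <- samples$sample

# tx2gene: a two-column data.frame mapping transcript IDs to gene IDs,
# built e.g. from the annotation GTF you quantified against
txi <- tximport(files, type = "kallisto", tx2gene = tx2gene,
                txOut = FALSE)  # FALSE = summarize counts to gene level

dds <- DESeqDataSetFromTximport(txi, colData = samples,
                                design = ~ condition)
dds <- DESeq(dds)
res <- results(dds)
```

With txOut = FALSE, tximport also carries an offset for average transcript length per gene and sample, which DESeqDataSetFromTximport uses automatically.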
This is the actual problem. If you are using a FASTA file containing isoforms as the mapping reference, your analysis should address this. I believe you have no choice but to run a transcript-level analysis, or else your analysis would be compromised.
Why do I say this? This is what happened to me when using the txOut option. I ended up with no DE at all (txOut = FALSE: p-adj >= 0.18 in all cases), or with highly significant DE when individual isoforms are analyzed with no summarization to the gene level (txOut = TRUE: p-adj values <= 10E-6 and log2FoldChanges exceeding 2E20).
And from reading many papers I noticed that, in addition to the uncertainty generated by the existence of isoforms, we also need to deal with the variability across biological replicates.
There are packages, like BitSeq, that seem to model variability at the transcript level but not at the gene level. Others do not account for the variability across biological replicates.
So the actual question is:
What package or program would you use when you must map your reads to a FASTA file (such as the human or mouse transcriptome) that contains many transcript isoforms, and when what you want is a gene-level analysis that also accounts for the variability across biological replicates?
That's not unexpected, which is why these two options exist. In addition to the paper mentioned by ATPoint, the "swimming downstream" manuscript goes into the conceptual and practical differences of transcript/isoform quantification vs. gene quantification.
What do you mean by "rational, but inaccurate DE expression"? How do you gauge both the rationality and the accuracy of your results?
The DESeq2 vignette offers a pretty comprehensive run-down of the entire process, starting with the import of the read quantifications, and lists the packages that are needed in addition to DESeq2 itself.
By "rational" I mean that after running a GO enrichment, you end up with a reasonable set of transcripts (not genes) that could be related to the phenotype you observe. I add the "inaccurate" adjective because these results are only obtained when you analyze transcripts individually, not at the gene level.
I believe Michael Love and colleagues created the fishpond package, with its swish method, to deal with cases like mine, in which you have a plethora of isoform transcripts in the FASTA file you used for the mapping.
In other words, DESeq2 might not be appropriate when transcript isoforms are present in your reference FASTA file.
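For what it's worth, swish does use the inferential replicates while still reporting gene-level results. Below is a rough sketch following the fishpond vignette workflow, not a tested pipeline for this dataset: it assumes salmon quantifications run with Gibbs samples (e.g. `--numGibbsSamples 20`), and a `coldata` data.frame with `names`, `files`, and `condition` columns (all placeholders here); kallisto bootstrap output can be imported similarly via tximport-style options.

```r
library(tximeta)
library(fishpond)

# coldata: data.frame with columns `names`, `files` (paths to the
# quantification files) and `condition`. tximeta imports the counts
# together with the inferential replicates.
se  <- tximeta(coldata)
gse <- summarizeToGene(se)   # gene level; inferential replicates carried along

gse <- scaleInfReps(gse)     # scale inferential replicates across samples
gse <- labelKeep(gse)        # flag features with enough counts to test
set.seed(1)                  # swish uses permutations
gse <- swish(gse, x = "condition")  # nonparametric gene-level DE test
```

This keeps the quantification uncertainty in the model (via the replicates) while the test itself is run across biological replicates, which seems to be exactly the combination you are asking for.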
Sounds like your experiment is not behaving the way you expected it to. If everything checks out on the experimental side then perhaps you should see if results provide an alternate explanation/a new hypothesis. There is also the painful possibility that the experiment failed in some way (not completely but perhaps partially).