I have a question about the StringTie --merge function. I have been working with multiple RNA-seq datasets, all from the same species but from different tissue types and generated with different sequencing specifications. At first, I analyzed each dataset separately, using StringTie --merge to generate a separate GTF file for each dataset. However, I would like to be able to identify transcripts and genes that are differentially expressed across multiple datasets, even if they aren't annotated in the reference genome. So I ran --merge on all the samples from all of the datasets and then used this merged GTF to estimate transcript abundances for each dataset. Predictably, the gene and transcript count CSV files are longer (for example, they include genes/transcripts that were only found in one of the several datasets). Even after I removed genes that had zero reads for all samples in a given dataset, there were still far more genes in the count files produced from this larger merge (39,575 genes versus 29,847 genes in one dataset). My differential expression results using these new count files are similar to the previous analyses but not identical: the identified genes only partly overlap.
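To make the comparison concrete, here is a minimal sketch of the filtering and overlap check described above. It assumes the count files are CSVs with a `gene_id` column followed by one column per sample; the sample names, gene IDs, and counts below are made up for illustration and are not my real data.

```python
import csv
import io


def expressed_genes(count_csv_text, sample_names):
    """Return the gene IDs with at least one read in any of the given samples."""
    reader = csv.DictReader(io.StringIO(count_csv_text))
    kept = set()
    for row in reader:
        if any(int(row[sample]) > 0 for sample in sample_names):
            kept.add(row["gene_id"])
    return kept


# Hypothetical counts from the per-dataset merge (one dataset only).
per_dataset_csv = """gene_id,liver_1,liver_2
GENE_A,10,12
GENE_B,0,3
GENE_C,0,0
"""

# Hypothetical counts from the merge across all datasets.
merged_csv = """gene_id,liver_1,liver_2,brain_1
GENE_A,9,12,4
GENE_B,0,3,0
GENE_D,2,0,7
GENE_E,0,0,5
"""

liver_samples = ["liver_1", "liver_2"]

# Drop genes with zero reads in every sample of this dataset, then compare.
old_genes = expressed_genes(per_dataset_csv, liver_samples)
new_genes = expressed_genes(merged_csv, liver_samples)

shared = old_genes & new_genes    # present in both analyses
only_new = new_genes - old_genes  # gained from the larger merge
only_old = old_genes - new_genes  # lost in the larger merge

print(len(old_genes), len(new_genes), sorted(only_new))
```

One caveat with a comparison like this: as far as I understand, the IDs StringTie assigns to novel loci (the `MSTRG.*` names) are generated per merge run, so matching by ID is probably only reliable for genes that carry reference annotation names.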
Can someone with more experience with StringTie --merge enlighten me a bit about what might be happening here? Was this an unreasonable approach? Is there something in how it handles zeros or low-abundance transcripts/genes that could account for this? There are additional options to consider for this function, but I'm not sure how they would change my results. At this point, I am inclined to just stick with a separate GTF file for each dataset, but I feel like I would be losing potentially interesting information. Any thoughts/suggestions would be appreciated!