I am trying the hisat2 -> stringtie -> ballgown workflow. I have 46 samples from 2 tissues. I run stringtie 3 times as described in http://www.nature.com/nprot/journal/v11/n9/full/nprot.2016.095.html, approximately like this:
stringtie sample1.bam -G genes_ensembl_GRCh37_20150717.gtf -o sample1_run1.gtf #for each sample
stringtie --merge -G genes_ensembl_GRCh37_20150717.gtf -o merged.gtf listofrun1gtfs.txt
stringtie sample1.bam -G merged.gtf -e -B ballgowndirsample1 -o sample1_run3.gtf
However when I, for one gene of particular interest, compared the isoform expressions after the stringtie run to the aligned reads from the bam file in IGV I saw a problem: A commonly called isoform consisting of 27 exons did not have any expression for the last 6 exons, while the first 21 exons had a coverage between 100 and 500. This pattern was clear for all samples of that tissue and different from the pattern of the other tissue.
After some troubleshooting I saw that the shorter potentially novel 21-exon isoform was called in the first run of stringtie, however the merge step removed it. And I think the reason was because it was a substring of the larger 27 exon isoform. Figure 2 in the paper indicate that this might be the intended behavior of the merging. I also tested merging only one gtf file and indeed the shorter isoform disapeared while the 27 exons isoform was still in.
So, have I misunderstood some parameter settings in stringtie? Or is it biologically impossible for one isoform to be a substring of another longer isoform? Or is this a limitation with stringtie?