I am analysing a study of the parasitic flatworm Schistosoma mansoni that looks at different developmental stages, although I have seen the problem described below with the same pipeline in other species.
In an attempt to generate a deeper transcriptome, I made a GTF for each biological replicate (20 in total) using:
stringtie -p 8 -G ../SmanAnnos.gtf -o FA_B1.gtf FA_B1.bam
I then merged them all together using:
stringtie --merge -p 8 -G ../SmanAnnos.gtf -o stringtie_merged.gtf mergelist.txt
As expected, in addition to the original Smp_* annotations, the merge has produced MSTRG entries. When I BLAST queries against this new transcriptome, a query often matches both an Smp_* transcript and an MSTRG transcript with almost identical scores. When I BLAST these near-duplicates back against the genome, both map to the same genomic region.
Is there a step I am missing that goes through the merged GTF and removes MSTRGs that overlap Smp_* annotations? I am worried that these duplicates will distort the FPKM estimates and the subsequent DE calculations.
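
To make the question concrete, the kind of check I have in mind could be sketched roughly as below. This is only a minimal sketch and not part of StringTie: the function names (read_transcripts, overlapping_pairs) are made up for illustration, and it assumes that reference transcripts keep a transcript_id starting with "Smp_" in the merged GTF while novel ones start with "MSTRG". It only compares transcript spans, not exon structure, so it reports candidates to inspect rather than confirmed duplicates.

```python
import re
from collections import defaultdict


def read_transcripts(gtf_lines):
    """Yield (seqname, strand, start, end, transcript_id) for each 'transcript' row."""
    for line in gtf_lines:
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 9 or cols[2] != "transcript":
            continue
        # Attributes look like: gene_id "MSTRG.1"; transcript_id "MSTRG.1.1";
        attrs = dict(re.findall(r'(\S+) "([^"]*)"', cols[8]))
        yield cols[0], cols[6], int(cols[3]), int(cols[4]), attrs.get("transcript_id", "")


def overlapping_pairs(gtf_lines, ref_prefix="Smp_", novel_prefix="MSTRG"):
    """Return (novel_id, reference_id) pairs whose spans overlap on the same sequence.

    Assumption: transcripts are told apart by transcript_id prefix.
    A strand of '.' (unstranded) is treated as compatible with either strand.
    """
    ref_by_seq = defaultdict(list)
    novel = []
    for seq, strand, start, end, tx_id in read_transcripts(gtf_lines):
        if tx_id.startswith(ref_prefix):
            ref_by_seq[seq].append((strand, start, end, tx_id))
        elif tx_id.startswith(novel_prefix):
            novel.append((seq, strand, start, end, tx_id))

    pairs = []
    for seq, strand, start, end, tx_id in novel:
        for r_strand, r_start, r_end, r_id in ref_by_seq.get(seq, []):
            same_strand = strand == r_strand or "." in (strand, r_strand)
            # GTF coordinates are 1-based and inclusive on both ends.
            if same_strand and start <= r_end and r_start <= end:
                pairs.append((tx_id, r_id))
    return pairs

# Example use (file name taken from the merge command above):
#   with open("stringtie_merged.gtf") as fh:
#       for novel_id, ref_id in overlapping_pairs(fh):
#           print(novel_id, ref_id, sep="\t")
```

The nested loop is quadratic per sequence, which should be acceptable for a single merged annotation; an interval tree or a sort-and-sweep would be the obvious replacement if it turns out to be slow.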