I am working with RNA-seq data from a non-model Drosophila species, mapping reads with HISAT2 and reconstructing transcripts with StringTie.
Initially I did the assembly without the GATK SplitNCigarReads (split 'N' cigar) step. The StringTie output then had multiple exons per transcript in the .gtf file, as follows:
transcript 18495 19529 1000 - . gene_id "STRG.7"; transcript_id "STRG.7.1"; cov "3.623000"; FPKM "4.150527"; TPM "4.448586";
exon 18495 18666 1000 - . gene_id "STRG.7"; transcript_id "STRG.7.1"; exon_number "1"; cov "1.837597";
exon 18740 18908 1000 - . gene_id "STRG.7"; transcript_id "STRG.7.1"; exon_number "2"; cov "5.701381";
exon 18996 19229 1000 - . gene_id "STRG.7"; transcript_id "STRG.7.1"; exon_number "3"; cov "3.265812";
exon 19278 19529 1000 - . gene_id "STRG.7"; transcript_id "STRG.7.1"; exon_number "4"; cov "3.779449";
Wanting to make sure my analysis was robust, I ran GATK SplitNCigarReads on the alignment files and reran StringTie to estimate transcript abundance.
After this, only one exon was reported per transcript, like so:
transcript 6866 7438 1000 + . gene_id "STRG.1"; transcript_id "STRG.1.1"; cov "12.038646"; FPKM "2.925254"; TPM "2.443733";
exon 6866 7438 1000 + . gene_id "STRG.1"; transcript_id "STRG.1.1"; exon_number "1"; cov "12.038646";
transcript 12315 12592 1000 + . gene_id "STRG.2"; transcript_id "STRG.2.1"; cov "2.542066"; FPKM "0.617693"; TPM "0.516016";
exon 12315 12592 1000 + . gene_id "STRG.2"; transcript_id "STRG.2.1"; exon_number "1"; cov "2.542066";
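To compare the two outputs more systematically than by eye, the exon count per transcript can be tallied from each .gtf file. The following is a minimal sketch, not part of the original workflow: the function name `exons_per_transcript` is hypothetical, and it assumes each record lists `gene_id` as its first attribute (as in the excerpts above) so that the fixed columns can be separated from the attributes.

```python
import re
from collections import Counter

# Matches the transcript_id attribute as written in the StringTie records above.
TRANSCRIPT_ID = re.compile(r'transcript_id "([^"]+)"')


def exons_per_transcript(lines):
    """Count exon records per transcript_id in GTF-style lines.

    Assumption: gene_id is the first attribute on every record, so the
    text before it contains only the fixed columns (feature, start, ...).
    """
    counts = Counter()
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and header comments
        match = TRANSCRIPT_ID.search(line)
        if match is None:
            continue
        fixed_columns = line.partition("gene_id")[0].split()
        if "exon" in fixed_columns:
            counts[match.group(1)] += 1
    return counts


def summarize(counts):
    """Return (number of transcripts, number with exactly one exon)."""
    single = sum(1 for n in counts.values() if n == 1)
    return len(counts), single
```

Running `summarize(exons_per_transcript(open(path)))` on each .gtf file (where `path` is a placeholder for the file name) would show whether every transcript became single-exon after the extra step, or only some of them.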
I think SplitNCigarReads is an appropriate part of the workflow, but intuitively, having only one exon per transcript in the output file seems wrong. I also do not understand how SplitNCigarReads would cause this difference.
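One check that might help narrow this down: SplitNCigarReads splits reads at the N operations in their CIGAR strings, and those N operations are how a spliced alignment spans an intron. Counting how many alignments still carry an N before and after the step would show whether the splice information survives. This is a rough sketch under assumptions: the function name `count_spliced` is hypothetical, and the input is assumed to be SAM text lines (for example, the output of `samtools view` on each BAM file).

```python
import re

# One CIGAR operation: a length followed by an operation code.
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def count_spliced(sam_lines):
    """Return (alignments with a CIGAR, alignments containing an N operation).

    Assumes tab-separated SAM records with the CIGAR string in column 6.
    """
    total = 0
    spliced = 0
    for line in sam_lines:
        if line.startswith("@"):
            continue  # skip SAM header lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 11:
            continue  # not a complete SAM record
        cigar = fields[5]
        if cigar == "*":
            continue  # no alignment reported for this read
        total += 1
        if any(op == "N" for _, op in CIGAR_OP.findall(cigar)):
            spliced += 1
    return total, spliced
```

If the second number drops to zero for the processed alignments while it is non-zero for the original HISAT2 alignments, that would be consistent with the single-exon output, though it would not by itself prove the cause.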
I'm pretty confused. If anybody has any insight into this issue, or more experience with what StringTie output should ideally look like, that would be much appreciated.
Thanks!