I am following the HISAT2, StringTie and DESeq2 pipeline. I obtained gene counts with HTSeq by counting the aligned reads against the StringTie merged GTF. I aligned the reads with HISAT2 without a reference GFF file (to avoid biasing alignments towards annotated splice junctions). When running StringTie, I did use the reference GFF file.
Looking at the results, I see that StringTie has assigned the transcripts of two adjacent genes to the same MSTRG ID.
Genome GFF file from NCBI:
NC_037550.1 Gnomon lnc_RNA 107569536 107577599 . - . ID=rna-XR_003110259.1;Parent=gene-LOC112585719;Dbxref=GeneID:112585719,Genbank:XR_003110259.1;Name=XR_003110259.1;gbkey=ncRNA;
NC_037550.1 Gnomon gene 107580147 107581316 . - . ID=gene-C6H1orf122;Dbxref=GeneID:102404938;Name=C6H1orf122;gbkey=Gene;gene=C6H1orf122;gene_biotype=protein_coding
StringTie GTF:
NC_037550.1 StringTie transcript 107569504 107581424 1000 - . gene_id "MSTRG.10147"; transcript_id "MSTRG.10147.1";
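To double-check that the single StringTie transcript really spans both annotated loci, here is a minimal Python sketch that uses only the coordinates from the records above. The function names (`overlaps`, `overlapping_ids`) are my own, made up for illustration, and the check assumes the usual 1-based, inclusive GFF/GTF coordinates.

```python
# Coordinates copied from the three records above (1-based, inclusive).
# Tuple layout: (feature ID, sequence, strand, start, end)
reference_features = [
    ("rna-XR_003110259.1", "NC_037550.1", "-", 107569536, 107577599),
    ("gene-C6H1orf122", "NC_037550.1", "-", 107580147, 107581316),
]
assembled = ("MSTRG.10147.1", "NC_037550.1", "-", 107569504, 107581424)


def overlaps(a, b):
    """True if two features share a sequence, a strand and at least one base."""
    _, a_seq, a_strand, a_start, a_end = a
    _, b_seq, b_strand, b_start, b_end = b
    return (
        a_seq == b_seq
        and a_strand == b_strand
        and a_start <= b_end
        and b_start <= a_end
    )


def overlapping_ids(query, features):
    """IDs of all features that overlap the query feature."""
    return [f[0] for f in features if overlaps(query, f)]


hits = overlapping_ids(assembled, reference_features)

# Number of unannotated bases between the two reference features.
gap = reference_features[1][3] - reference_features[0][4] - 1

print(hits)
print(gap)
```

Both reference features come back as hits, so MSTRG.10147.1 covers the lncRNA, the protein-coding gene and the intergenic gap between them.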
1. Why does StringTie assign the same ID to adjacent loci? Is it because I did not use the reference annotation while aligning?
2. How should I quantify expression when StringTie assigns the reads of two adjacent loci to the same ID?