I have assembled six transcriptomes with hisat2/stringtie. I merged the resulting gtf files into a single merged gtf.
For many transcripts in the merged gtf a transcript_id starting with 'MSTRG' was created. My understanding is that the MSTRG ids are assigned to transcripts in the merged gtf which may have different ids in the individual unmerged gtfs. This could be because there is no reference id for this transcript.
I am now interested in looking at the expression levels (TPM) of certain transcripts in the individual unmerged samples. However, I am having trouble matching the MSTRG id in the merged gtf to the corresponding ids in the unmerged gtfs.
I attempted to solve this problem using bedtools intersect to find where features in the merged gtf overlap features in one of the unmerged gtfs. This allows me to map each MSTRG id to the unmerged id(s) it overlaps.
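To make the mapping step concrete, here is a minimal Python sketch of the overlap logic as I understand it. This is not how bedtools is implemented, only the same idea on a toy scale. The chromosome name "chr1" and the function names are placeholders I made up for illustration, coordinates are treated as closed intervals, and strand is ignored.

```python
from collections import defaultdict


def overlaps(a_start, a_end, b_start, b_end):
    """Return True if two closed intervals share at least one position."""
    return a_start <= b_end and b_start <= a_end


def map_merged_to_unmerged(merged, unmerged):
    """Map each merged id to the list of unmerged ids it overlaps.

    Both inputs are lists of (chrom, start, end, gene_id) tuples.
    Strand is ignored in this sketch.
    """
    mapping = defaultdict(list)
    for m_chrom, m_start, m_end, m_id in merged:
        for u_chrom, u_start, u_end, u_id in unmerged:
            if m_chrom == u_chrom and overlaps(m_start, m_end, u_start, u_end):
                mapping[m_id].append(u_id)
    return dict(mapping)


# "chr1" is a hypothetical chromosome name; the coordinates and ids
# are the ones from my simplified example.
merged = [("chr1", 19883, 26517, "MSTRG.34")]
unmerged = [
    ("chr1", 22942, 24454, "25"),
    ("chr1", 19883, 22800, "26"),
    ("chr1", 24624, 26412, "27"),
]

mapping = map_merged_to_unmerged(merged, unmerged)
print(mapping)
```

Any unmerged feature that touches the merged interval is reported, so one merged id can map to several unmerged ids.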
However, I now have a new problem: in some cases a single MSTRG id is assigned to multiple unmerged ids. See below for a simplified example:
22942 24454 gene_id "25" TPM "3" 19883 26517 gene_id "MSTRG.34"
19883 22800 gene_id "26" TPM "5" 19883 26517 gene_id "MSTRG.34"
24624 26412 gene_id "27" TPM "5" 19883 26517 gene_id "MSTRG.34"
My questions are:
- Why has stringtie merged these multiple transcripts into a single transcript in the merged gtf?
- Can I treat the TPM values as referring to a single transcript (i.e. the MSTRG id), and if so, what is the best way to do this?
  - Get the mean TPM per MSTRG gene_id
  - Sum the TPM values per MSTRG gene_id
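For reference, the two options I am considering can be sketched in Python as follows. This only shows the arithmetic on my simplified example and does not settle which option (if either) is appropriate. The tuple layout and the function name `summarise_tpm` are my own assumptions for illustration.

```python
from collections import defaultdict
from statistics import mean

# (unmerged gene_id, TPM, merged gene_id), taken from the simplified example.
rows = [
    ("25", 3.0, "MSTRG.34"),
    ("26", 5.0, "MSTRG.34"),
    ("27", 5.0, "MSTRG.34"),
]


def summarise_tpm(rows):
    """Group TPM values by merged id and report their sum, mean and count."""
    by_merged = defaultdict(list)
    for _unmerged_id, tpm, merged_id in rows:
        by_merged[merged_id].append(tpm)
    return {
        merged_id: {"sum": sum(values), "mean": mean(values), "n": len(values)}
        for merged_id, values in by_merged.items()
    }


summary = summarise_tpm(rows)
for merged_id, stats in summary.items():
    print(merged_id, stats)
```

For MSTRG.34 the sum is 13 and the mean is about 4.33, so the two options give clearly different values, and the mean depends on how many unmerged ids the locus was split into.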