Understanding the GTF produced by StringTie
5 months ago
antonioggsousa ★ 2.0k

Hi,

I'm collaborating on a project in which the transcriptome of Arabidopsis thaliana was assembled with StringTie. Since one aim of the project is to look into long non-coding RNA (lncRNA) transcripts, only a subset of the assembled transcripts was kept.

I need to retrieve the correspondence between transcript_id and gene_id to use in some downstream analyses. I was given the GTF file (for the lncRNAs) and parsed it with some R code to retrieve the transcript_id to gene_id correspondence.

When I started to look into the correspondence between transcript_id and gene_id, I found three types of IDs:
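For illustration only, here is a minimal Python sketch of this kind of parsing (the R code actually used is not shown in this post). It assumes a standard tab-separated GTF whose ninth column holds `key "value";` attributes; the function names and the example rows, including their coordinates, are hypothetical.

```python
import re

# Matches key "value" pairs in the 9th GTF column, e.g. gene_id "MSTRG.236";
ATTR_RE = re.compile(r'(\S+) "([^"]*)"')


def parse_attributes(attr_field):
    """Return the GTF attribute column as a dict of key -> value."""
    return dict(ATTR_RE.findall(attr_field))


def transcript_to_gene(gtf_lines):
    """Map transcript_id -> gene_id from every row carrying both attributes."""
    mapping = {}
    for line in gtf_lines:
        # Skip comments, blank lines, and rows without an attribute column.
        if line.startswith("#") or not line.strip():
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9:
            continue
        attrs = parse_attributes(fields[8])
        if "transcript_id" in attrs and "gene_id" in attrs:
            mapping[attrs["transcript_id"]] = attrs["gene_id"]
    return mapping


# Hypothetical rows: the IDs come from this post, the coordinates are made up.
example = [
    'Chr1\tStringTie\ttranscript\t100\t900\t.\t+\t.\t'
    'gene_id "MSTRG.236"; transcript_id "AT1G04163.1"; ref_gene_id "AT1G04163";',
    'Chr1\tStringTie\texon\t100\t900\t.\t+\t.\t'
    'gene_id "MSTRG.236"; transcript_id "AT1G04163.1"; exon_number "1";',
]
print(transcript_to_gene(example))
```

With a real file, the list would be replaced by an open file handle, since the function only iterates over lines.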

  • 1) transcript_id as it appears in the genome annotation of A. thaliana, e.g., AT1G04003.1, mapped against the respective gene_id, i.e., AT1G04003. If I understood correctly this means that the assembled transcript predicted by StringTie exists in the current annotation of A. thaliana.

  • 2) novel transcript that does not appear in the current genome annotation of A. thaliana, e.g., with the transcript_id TCONS_00000010, mapped against a novel gene, i.e., with gene_id XLOC_000005. If I understood correctly, this means that the assembled transcript predicted by StringTie does not exist in the current annotation of A. thaliana (and, by extension, neither does the gene).

  • 3) transcript_id as it appears in the genome annotation of A. thaliana, e.g., AT1G04163.1, mapped against the gene_id MSTRG.236 assigned by StringTie instead of the expected gene AT1G04163. For these cases, the GTF also contains an attribute named ref_gene_id that holds the A. thaliana gene identifier, AT1G04163.
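The three cases above can be told apart programmatically. The following is only a sketch: the ID patterns are assumptions inferred from the examples in this post (AGI-style identifiers, TCONS_/XLOC_ prefixes, MSTRG. prefix), and the function name and category labels are hypothetical.

```python
import re

# Assumed ID patterns, based only on the examples given in this post.
AGI_GENE = re.compile(r"^AT[1-5CM]G\d{5}$")                # e.g. AT1G04003
AGI_TRANSCRIPT = re.compile(r"^(AT[1-5CM]G\d{5})\.\d+$")   # e.g. AT1G04003.1


def classify(transcript_id, gene_id):
    """Assign a transcript_id/gene_id pair to one of the three cases."""
    match = AGI_TRANSCRIPT.match(transcript_id)
    if match and AGI_GENE.match(gene_id):
        # Case 1: annotated transcript under its own annotated gene.
        return "annotated" if match.group(1) == gene_id else "other"
    if transcript_id.startswith("TCONS_") and gene_id.startswith("XLOC_"):
        # Case 2: novel transcript under a novel locus.
        return "novel"
    if match and gene_id.startswith("MSTRG."):
        # Case 3: annotated transcript under a StringTie-assigned gene_id.
        return "annotated_transcript_mstrg_gene"
    return "other"


pairs = [
    ("AT1G04003.1", "AT1G04003"),
    ("TCONS_00000010", "XLOC_000005"),
    ("AT1G04163.1", "MSTRG.236"),
]
for transcript_id, gene_id in pairs:
    print(transcript_id, gene_id, classify(transcript_id, gene_id))
```

Counting how many pairs fall into each category (and into "other") is a quick way to check that no fourth kind of ID is hiding in the file.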

My problem is understanding the third case, where StringTie keeps the MSTRG... identifier in gene_id and puts the annotated gene in ref_gene_id. From this post on the StringTie GitHub repo, I think I can substitute the MSTRG... gene_id with the ref_gene_id, but since I don't fully understand these notations I'm not sure. Could it mean that StringTie assembled a novel isoform at this locus and therefore assigned a new gene_id (MSTRG...), while some transcripts of that locus match known A. thaliana transcripts and so keep their annotated transcript_id, e.g., AT1G04163.1 (corresponding to ref_gene_id AT1G04163)? In other words, is the gene treated as new/novel overall because of the novel isoform, and therefore labelled MSTRG... by StringTie?

I read the StringTie paper and checked the manual, but I was not able to find a clear answer to this question.
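Before substituting anything, one check that might help is to see whether each MSTRG... gene_id is associated with exactly one ref_gene_id. The sketch below is an assumption about how such a check could look, not a confirmed StringTie recommendation: it takes attribute dictionaries (one per transcript row), and the substitution rule, the function names, and the GENE_A/GENE_B/MSTRG.999 placeholders are all hypothetical.

```python
from collections import defaultdict


def ref_genes_per_locus(records):
    """Group the ref_gene_id values seen under each gene_id."""
    loci = defaultdict(set)
    for attrs in records:
        if "ref_gene_id" in attrs:
            loci[attrs["gene_id"]].add(attrs["ref_gene_id"])
    return dict(loci)


def substitute_gene_id(attrs, loci):
    """Use ref_gene_id only when its locus maps to a single reference gene.

    This rule is an assumption made for illustration; ambiguous loci keep
    their original gene_id so they can be inspected by hand.
    """
    refs = loci.get(attrs["gene_id"], set())
    if "ref_gene_id" in attrs and len(refs) == 1:
        return attrs["ref_gene_id"]
    return attrs["gene_id"]


# First record uses IDs from this post; the rest are hypothetical placeholders.
records = [
    {"gene_id": "MSTRG.236", "transcript_id": "AT1G04163.1", "ref_gene_id": "AT1G04163"},
    {"gene_id": "MSTRG.999", "transcript_id": "GENE_A.1", "ref_gene_id": "GENE_A"},
    {"gene_id": "MSTRG.999", "transcript_id": "GENE_B.1", "ref_gene_id": "GENE_B"},
]
loci = ref_genes_per_locus(records)
for attrs in records:
    print(attrs["transcript_id"], substitute_gene_id(attrs, loci))
```

If every MSTRG... locus turns out to carry a single ref_gene_id, the substitution is at least unambiguous; loci with several reference genes would need a separate decision.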

Thank you in advance for any help or suggestion,

António

RNA-Seq Assembly • 216 views