Stringtie transcripts to Gene Name
3.3 years ago

Hello everyone, I'm working on an RNA-seq data set obtained from a bacterium, and I have followed the pipeline described for HISAT2, StringTie and Ballgown. My problem is how to convert the transcript IDs generated by StringTie (STRG0001 or MSTRG000) to actual gene names. The subsequent differential expression analysis also reports its results with MSTRG or STRG as gene names. I tried to parse the GFF file and match these transcripts to gene IDs, but I have observed that every GFF file is different and the same script doesn't work on other files. I would be really thankful if you could help me out, because without proper gene names my analysis is incomplete. Is there a way to map gene names or symbols to the transcripts? I feel I'm missing a step, or is there a method to be followed? I would greatly appreciate the help.

rna-seq R assembly • 1.7k views

If you use a reference annotation file (-G parameter) at the time of transcript assembly with StringTie, you should see the names from your reference GTF in the StringTie output. Only novel genes/transcript variants end up with the STRG identifiers.


Thanks for the reply. I do use the command with -G, but what I get are the reference IDs, like transcript_id "rna-XM_029034609.1"; gene_id "gene-CJI97_002588", and for novel transcripts I get these StringTie IDs.

I inspected the GFF file: it has the gene locus and product, but no gene symbol or gene ID. That's probably why I'm getting transcripts named as above. Is this a problem that everyone faces, or is it just me? I have not yet come across any other question like this.
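Since every GFF file carries a different set of attributes, one quick check is to tally which column 9 keys appear for each feature type before writing any parser. Below is a minimal Python sketch of that idea. The function name `attribute_keys_by_feature` is made up for illustration, and the toy lines (sequence name `chr1`, coordinates, the `product` value) are invented; only the identifiers are taken from this thread.

```python
from collections import Counter, defaultdict

def attribute_keys_by_feature(gff_lines):
    """Count which column-9 attribute keys occur for each feature type."""
    keys = defaultdict(Counter)
    for line in gff_lines:
        # Skip directives, comments and blank lines.
        if line.startswith("#") or not line.strip():
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9:
            continue
        feature, attrs = cols[2], cols[8]
        for field in attrs.split(";"):
            if "=" in field:
                keys[feature][field.split("=", 1)[0].strip()] += 1
    return keys

# Toy GFF3 lines (invented coordinates); replace with open("your.gff") in practice.
example = [
    "##gff-version 3",
    "chr1\tRefSeq\tgene\t1\t900\t.\t+\t.\tID=gene-CJI97_002588;Dbxref=GeneID:40027734;locus_tag=CJI97_002588",
    "chr1\tRefSeq\tmRNA\t1\t900\t.\t+\t.\tID=rna-XM_029034609.1;Parent=gene-CJI97_002588;Dbxref=GeneID:40027734;product=hypothetical protein",
]

for feature, counter in attribute_keys_by_feature(example).items():
    print(feature, sorted(counter))
```

If keys such as `gene` or `Name` never show up for the `gene` features, the file simply has no gene symbols, and `locus_tag` or `Dbxref` is the best identifier available.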

3.3 years ago
vkkodali_ncbi ★ 3.7k

If I understand correctly, you don't see GeneID:40027734 (which corresponds to the gene CJI97_002588) in the output of StringTie. RefSeq GFF3 files include several attributes in column 9, not all of which are copied over by StringTie. Using the RefSeq GFF3 as a starting point, you can build a mapping table with the GeneID, RefSeq transcript accession, RefSeq protein accession and locus tag, in the format:

40027734    XM_029034609.1    XP_028889851.1    CJI97_002588

Then, in a post-processing step, add the relevant identifiers to column 9 of the StringTie output. Is this what you are looking for?
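The two steps could be sketched in Python roughly as follows. This is an illustration under assumptions, not a tested pipeline: it assumes a eukaryote-style RefSeq GFF3 where each `gene` line precedes its `mRNA` and `CDS` children, that `CDS` features carry a `protein_id` attribute, and that the StringTie GTF names reference transcripts in `transcript_id` (or `reference_id`). The function names and the appended attribute names `db_xref` and `locus_tag` are my own choices, and the toy coordinates are invented.

```python
import re

def parse_gff3_attrs(field):
    """Turn 'ID=x;Parent=y' into {'ID': 'x', 'Parent': 'y'}."""
    return dict(kv.split("=", 1) for kv in field.strip().split(";") if "=" in kv)

def gene_id_from_dbxref(attrs):
    """Pull the number out of 'Dbxref=GeneID:123,...'; empty string if absent."""
    for ref in attrs.get("Dbxref", "").split(","):
        if ref.startswith("GeneID:"):
            return ref.split(":", 1)[1]
    return ""

def build_mapping(gff_lines):
    """Map transcript feature ID (e.g. 'rna-XM_...') to its identifiers."""
    genes = {}    # gene feature ID -> (GeneID, locus_tag)
    mapping = {}  # transcript feature ID -> record
    for line in gff_lines:
        cols = line.rstrip("\n").split("\t")
        if line.startswith("#") or len(cols) != 9:
            continue
        attrs = parse_gff3_attrs(cols[8])
        parent = attrs.get("Parent")
        if cols[2] == "gene" and "ID" in attrs:
            genes[attrs["ID"]] = (gene_id_from_dbxref(attrs), attrs.get("locus_tag", ""))
        elif cols[2] == "CDS":
            # Attach the protein accession to the parent transcript, if known.
            if parent in mapping:
                mapping[parent]["protein"] = attrs.get("protein_id", "")
        elif parent in genes and "ID" in attrs:
            tx_id = attrs["ID"]
            gene_id, locus_tag = genes[parent]
            mapping[tx_id] = {
                "gene_id": gene_id_from_dbxref(attrs) or gene_id,
                "transcript": tx_id[4:] if tx_id.startswith("rna-") else tx_id,
                "protein": "",
                "locus_tag": locus_tag,
            }
    return mapping

TX_RE = re.compile(r'(?:transcript_id|reference_id) "([^"]+)"')

def annotate_gtf_line(line, mapping):
    """Append GeneID and locus tag to column 9 when the transcript is known."""
    cols = line.rstrip("\n").split("\t")
    if line.startswith("#") or len(cols) != 9:
        return line
    for tx_id in TX_RE.findall(cols[8]):
        rec = mapping.get(tx_id)
        if rec:
            attrs = cols[8].rstrip()
            if not attrs.endswith(";"):
                attrs += ";"
            cols[8] = '{} db_xref "GeneID:{}"; locus_tag "{}";'.format(
                attrs, rec["gene_id"], rec["locus_tag"])
            break
    return "\t".join(cols) + "\n"

# Toy input using the identifiers from this thread (coordinates invented).
gff = [
    "##gff-version 3",
    "chr1\tRefSeq\tgene\t1\t900\t.\t+\t.\tID=gene-CJI97_002588;Dbxref=GeneID:40027734;locus_tag=CJI97_002588",
    "chr1\tRefSeq\tmRNA\t1\t900\t.\t+\t.\tID=rna-XM_029034609.1;Parent=gene-CJI97_002588;Dbxref=GeneID:40027734",
    "chr1\tRefSeq\tCDS\t1\t900\t.\t+\t0\tID=cds-XP_028889851.1;Parent=rna-XM_029034609.1;protein_id=XP_028889851.1",
]
mapping = build_mapping(gff)
for rec in mapping.values():
    print("\t".join([rec["gene_id"], rec["transcript"], rec["protein"], rec["locus_tag"]]))

gtf_line = ('chr1\tStringTie\ttranscript\t1\t900\t1000\t+\t.\t'
            'gene_id "gene-CJI97_002588"; transcript_id "rna-XM_029034609.1";\n')
print(annotate_gtf_line(gtf_line, mapping), end="")
```

Novel MSTRG/STRG transcripts have no entry in the table, so their lines pass through unchanged. For a bacterial RefSeq GFF3, where `CDS` features usually hang directly off the gene, the parent lookup would need to be adapted.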


Yes, that's what I'm looking for: something to use as a gene ID. Thanks for your suggestion; I will try to work this out and get back.

Thanks a lot, I really appreciate the suggestion.
