Question: Stringtie transcripts to Gene Name
shrivadeepak (United States) wrote 6 weeks ago:

Hello everyone. I'm working on an RNA-seq data set obtained from a bacterium, and I have followed the pipeline described for HISAT2, StringTie and Ballgown. My problem is how to convert the transcript IDs generated by StringTie (STRG0001 or MSTRG000) to actual gene names. The subsequent differential expression analysis also reports its results with MSTRG or STRG identifiers as gene names. I tried to parse the GFF file and match these transcripts to gene IDs, but I have observed that every GFF file is different and the same script doesn't work on other files. I would be really thankful if you could help me out, because without proper gene names my analysis is incomplete. Is there a way to map gene names or symbols to the transcripts? I feel I'm missing some step; is there a method I should follow? I would greatly appreciate the help.

assembly rna-seq R • 91 views

If you use a reference annotation file (-G parameter) at the time of transcript assembly with StringTie, you should see the names from your reference GTF in the StringTie output. Only novel genes/transcript variants end up with the STRG identifiers.

written 6 weeks ago by vkkodali

Thanks for the reply. I do use the command with -G, but what I get are the reference IDs, such as transcript_id "rna-XM_029034609.1"; gene_id "gene-CJI97_002588", and for novel transcripts I get these StringTie IDs.

I inspected the GFF file: it has the gene locus and product, but no gene symbol or gene ID. That is probably why I'm getting transcripts as above. Is this a problem that everyone faces, or is it just me? I have not yet come across any other question like this.
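Since every GFF file seems to carry different attributes, one way to see what a given file actually offers is to list the attribute keys per feature type before writing any matching script. This is only a minimal sketch: the function name `attribute_keys_by_feature` is made up for illustration, and the example record is hypothetical (invented coordinates, identifiers taken from this thread).

```python
from collections import defaultdict


def attribute_keys_by_feature(gff_lines):
    """Return {feature type: set of column-9 attribute keys} for GFF3 lines."""
    keys = defaultdict(set)
    for line in gff_lines:
        # Skip comments, directives and blank lines
        if line.startswith("#") or not line.strip():
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9:
            continue
        # GFF3 attributes are key=value pairs separated by semicolons
        for field in cols[8].split(";"):
            if "=" in field:
                keys[cols[2]].add(field.split("=", 1)[0])
    return dict(keys)


# Hypothetical record: coordinates are invented, identifiers come from this thread
example = [
    "##gff-version 3\n",
    "chr1\tRefSeq\tgene\t1\t900\t.\t+\t.\t"
    "ID=gene-CJI97_002588;Dbxref=GeneID:40027734;locus_tag=CJI97_002588\n",
]

for feature, found in attribute_keys_by_feature(example).items():
    print(feature, sorted(found))
```

For a real file, passing an open file handle instead of the `example` list should work the same way, since the function only iterates over lines.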

written 6 weeks ago by shrivadeepak
vkkodali (United States) wrote 6 weeks ago:

If I understand correctly, you don't see GeneID:40027734 (which corresponds to the gene CJI97_002588) in the output of StringTie. RefSeq GFF3 files include several attributes in column 9, not all of which are copied over by StringTie. Using the RefSeq GFF3 as a starting point, you can build a mapping table with the RefSeq transcript accession, RefSeq protein accession and GeneID, in the format:

40027734    XM_029034609.1    XP_028889851.1    CJI97_002588
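A minimal sketch of how such a table might be built is given here. It assumes the GFF3 follows the usual RefSeq layout: transcript rows carry a `transcript_id` attribute and a `Dbxref` entry of the form `GeneID:...`, and CDS rows carry `protein_id` and point to their transcript through `Parent`. Files from other sources may name these attributes differently. The function names and the example records (with invented coordinates) are illustrative only.

```python
def parse_attrs(col9):
    """Split a GFF3 column-9 string into a dict of key -> value."""
    return dict(f.split("=", 1) for f in col9.strip().split(";") if "=" in f)


def gene_id_from_dbxref(dbxref):
    """Pull the number out of a 'GeneID:NNN' entry in a Dbxref list."""
    for ref in dbxref.split(","):
        if ref.startswith("GeneID:"):
            return ref.split(":", 1)[1]
    return ""


def build_mapping(gff_lines):
    """Return rows of (GeneID, transcript accession, protein accession, locus tag)."""
    transcripts = {}  # transcript feature ID -> (GeneID, transcript acc, locus tag)
    proteins = {}     # transcript feature ID -> protein accession
    for line in gff_lines:
        if line.startswith("#"):
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9:
            continue
        attrs = parse_attrs(cols[8])
        # Assumption: exon and CDS rows may repeat transcript_id, so skip them here
        if cols[2] not in ("exon", "CDS") and "transcript_id" in attrs and "ID" in attrs:
            transcripts[attrs["ID"]] = (
                gene_id_from_dbxref(attrs.get("Dbxref", "")),
                attrs["transcript_id"],
                attrs.get("locus_tag", ""),
            )
        elif cols[2] == "CDS" and "protein_id" in attrs:
            # Keep the first protein accession seen for each parent transcript
            proteins.setdefault(attrs.get("Parent", ""), attrs["protein_id"])
    return [
        (gene_id, tx_acc, proteins.get(feature_id, ""), locus)
        for feature_id, (gene_id, tx_acc, locus) in transcripts.items()
    ]


# Hypothetical records: coordinates are invented, identifiers come from this thread
example = [
    "chr1\tRefSeq\tgene\t1\t900\t.\t+\t.\t"
    "ID=gene-CJI97_002588;Dbxref=GeneID:40027734;locus_tag=CJI97_002588\n",
    "chr1\tRefSeq\tmRNA\t1\t900\t.\t+\t.\t"
    "ID=rna-XM_029034609.1;Parent=gene-CJI97_002588;Dbxref=GeneID:40027734;"
    "locus_tag=CJI97_002588;transcript_id=XM_029034609.1\n",
    "chr1\tRefSeq\tCDS\t1\t900\t.\t+\t0\t"
    "ID=cds-XP_028889851.1;Parent=rna-XM_029034609.1;Dbxref=GeneID:40027734;"
    "locus_tag=CJI97_002588;protein_id=XP_028889851.1\n",
]

for row in build_mapping(example):
    print("\t".join(row))
```

Joining through `Parent` rather than relying on a single row type keeps the sketch tolerant of transcripts that have no CDS; those simply get an empty protein column.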

Then, in a post-processing step, add the relevant identifiers to column 9 of the StringTie output. Is this what you are looking for?
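The post-processing step could look roughly like the sketch below. It assumes the StringTie GTF carries `transcript_id "rna-<accession>"` as in the example quoted earlier in this thread, and that the mapping is held in a dict keyed by transcript accession. The appended attribute names (`GeneID`, `protein_id`, `locus_tag`) and the function name are choices made for illustration, not something StringTie defines.

```python
import re

# Mapping keyed by RefSeq transcript accession; the single entry is the
# example row from this thread: (GeneID, protein accession, locus tag)
mapping = {"XM_029034609.1": ("40027734", "XP_028889851.1", "CJI97_002588")}

TRANSCRIPT_ID = re.compile(r'transcript_id "([^"]+)"')


def annotate_gtf_line(line, mapping):
    """Append identifiers to column 9 when the transcript is in the mapping."""
    if line.startswith("#"):
        return line
    cols = line.rstrip("\n").split("\t")
    if len(cols) != 9:
        return line
    match = TRANSCRIPT_ID.search(cols[8])
    if not match:
        return line
    accession = match.group(1)
    # Assumption: reference transcripts carry the "rna-" prefix from the GFF3 ID
    if accession.startswith("rna-"):
        accession = accession[4:]
    if accession not in mapping:
        return line  # novel STRG/MSTRG transcripts are left untouched
    gene_id, protein, locus = mapping[accession]
    col9 = cols[8].rstrip()
    if not col9.endswith(";"):
        col9 += ";"
    cols[8] = f'{col9} GeneID "{gene_id}"; protein_id "{protein}"; locus_tag "{locus}";'
    return "\t".join(cols) + "\n"


# Hypothetical StringTie records: coordinates and scores are invented
ref_line = ('chr1\tStringTie\ttranscript\t1\t900\t1000\t+\t.\t'
            'gene_id "gene-CJI97_002588"; transcript_id "rna-XM_029034609.1";\n')
novel_line = ('chr1\tStringTie\ttranscript\t2000\t2500\t1000\t+\t.\t'
              'gene_id "MSTRG.1"; transcript_id "MSTRG.1.1";\n')

print(annotate_gtf_line(ref_line, mapping), end="")
print(annotate_gtf_line(novel_line, mapping), end="")
```

Lines for novel transcripts pass through unchanged, so downstream tools still see valid GTF, and only reference-matched transcripts gain the extra identifiers.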


Yes, that's what I'm looking for: something to use as a gene ID. Thanks for your suggestion. I will try to work this out and get back.

Thanks a lot, I really appreciate the suggestion.

written 6 weeks ago by shrivadeepak