Editing and Adding to a GTF file
6 days ago

Hi everyone,

I am relatively new to bioinformatics. Currently, I’m working on Whole Genome Bisulfite Sequencing (WGBS) analysis for our target organism, which is a non-model fish species. As part of my research, I am also performing genome annotation.

I’ve completed the structural annotation using BRAKER3 and have obtained the corresponding .gtf, .aa, and .codingseq files. Now, using BLASTp, InterProScan, and EggNOG, I’ve identified gene names corresponding to the "gene_id" entries in my GTF file.

My question is: Is it possible to modify the GTF file so that, instead of arbitrary gene IDs, it displays the corresponding gene names? This would make it easier for our team to browse the genome and immediately recognize gene identities.

If this is feasible, should I use the results from BLASTp or EggNOG to make the replacements? Also, what tools or software would you recommend for editing the GTF file accordingly?

I’ve included an example below:

GTF:

JASCQY010000001.1   gmst    gene    712854  716064  .   +   .   g26
JASCQY010000001.1   gmst    transcript  712854  716064  .   +   .   g26.t1
JASCQY010000001.1   gmst    start_codon 712854  712856  227.542526  +   0   transcript_id "g26.t1"; gene_id "g26";
JASCQY010000001.1   gmst    CDS 712854  713246  227.542526  +   0   transcript_id "g26.t1"; gene_id "g26";
JASCQY010000001.1   gmst    exon    712854  713246  227.542526  +   0   transcript_id "g26.t1"; gene_id "g26";
JASCQY010000001.1   gmst    intron  713247  714043  227.542526  +   0   transcript_id "g26.t1"; gene_id "g26";
JASCQY010000001.1   gmst    CDS 714044  714202  227.542526  +   0   transcript_id "g26.t1"; gene_id "g26";
JASCQY010000001.1   gmst    exon    714044  714202  227.542526  +   0   transcript_id "g26.t1"; gene_id "g26";
JASCQY010000001.1   gmst    intron  714203  714407  227.542526  +   0   transcript_id "g26.t1"; gene_id "g26";
JASCQY010000001.1   gmst    CDS 714408  714606  227.542526  +   0   transcript_id "g26.t1"; gene_id "g26";
JASCQY010000001.1   gmst    exon    714408  714606  227.542526  +   0   transcript_id "g26.t1"; gene_id "g26";
JASCQY010000001.1   gmst    intron  714607  714770  227.542526  +   0   transcript_id "g26.t1"; gene_id "g26";
JASCQY010000001.1   gmst    CDS 714771  714866  227.542526  +   2   transcript_id "g26.t1"; gene_id "g26";
JASCQY010000001.1   gmst    exon    714771  714866  227.542526  +   2   transcript_id "g26.t1"; gene_id "g26";
JASCQY010000001.1   gmst    intron  714867  715165  227.542526  +   0   transcript_id "g26.t1"; gene_id "g26";
JASCQY010000001.1   gmst    CDS 715166  716064  227.542526  +   2   transcript_id "g26.t1"; gene_id "g26";
JASCQY010000001.1   gmst    exon    715166  716064  227.542526  +   2   transcript_id "g26.t1"; gene_id "g26";
JASCQY010000001.1   gmst    stop_codon  716062  716064  227.542526  +   0   transcript_id "g26.t1"; gene_id "g26";

Functional annotation table:

BLASTp and InterProScan:

Tags    SeqName Description Length  #Hits   e-Value sim mean    #GO GO IDs  GO Names    Enzyme Codes    Enzyme Names    InterPro IDs    InterPro GO IDs InterPro GO Names
true    [INTERPRO, BLASTED, MAPPED, ANNOTATED]  g26.t1  granulin a isoform X2   559 2   8.64E-4 92.86   1   C:GO:0005576    C:extracellular region          no IPS match    no IPS match    no IPS match

EggNOG:

Type    Query ID    Gene Name   EggNOG Description  E-Value Bit-Score   Best Tax-Level  EC Codes    #GO GOs GO Names    KEGG KO KEGG Pathway
KOG,ENOG    g26.t1  GRN Granulin    0.075   45.8    Chiroptera      21.0    P:GO:0007566; P:GO:0010469; P:GO:0032355; P:GO:1900006; P:GO:0007618; P:GO:0060179; P:GO:0009725; P:GO:0060999; P:GO:0043312; C:GO:0005783; F:GO:0008083; P:GO:0045666; C:GO:0035578; P:GO:0061351; P:GO:0001835; P:GO:0048488; C:GO:0005768; C:GO:0005615; P:GO:0050769; P:GO:0035988; P:GO:0050679    P:positive regulation of neuron differentiation; C:endoplasmic reticulum; P:positive regulation of dendritic spine development; P:response to estradiol; P:positive regulation of epithelial cell proliferation; P:positive regulation of dendrite development; C:extracellular space; C:endosome; P:positive regulation of neurogenesis; P:neural precursor cell proliferation; P:chondrocyte proliferation; P:regulation of signaling receptor activity; P:male mating behavior; P:embryo implantation; P:neutrophil degranulation; P:response to hormone; P:synaptic vesicle endocytosis; P:mating; C:azurophil granule lumen; P:blastocyst hatching; F:growth factor activity   
Annotation Gene-ID GTF • 508 views

Loosely related: it is better to work with the GFF3 format, since GTF is derived from GFF2, which is deprecated. BRAKER's *.gff output should be in GFF3 format.

Hi!

Yes! I have looked into GFF3, but I was not able to pass the --gff option to BRAKER when I ran it last month. Right now, I am redoing my annotation so that I can get GFF3 output!

Thank you so much!

6 days ago
Juke34 9.2k

To add functional information (a name or a function), do not use the attributes that structure the file (gene_id/transcript_id for GTF and ID/Parent for GFF). Use an attribute like gene_name for the name, product to describe the function, and Dbxref for functional identifiers from databases.

Have a look at AGAT: you can use the manage functional annotation script (agat_sp_manage_functional_annotation.pl) to add information from BLAST and InterProScan, or the "add from tsv" script (agat_sq_add_attributes_from_tsv.pl) for any type of function.
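
For anyone who wants to see the idea without AGAT, here is a minimal Python sketch of the same principle: gene_id and transcript_id stay untouched, and a gene_name attribute is appended to every line that carries a gene_id. The function name add_gene_names and the names_by_gene mapping are hypothetical, made up for illustration; the mapping is assumed to have been built beforehand from the EggNOG or BLASTp table. Lines whose ninth column holds only a bare ID (like the gene and transcript lines in the question) are passed through unchanged.

```python
import re

# Matches the gene_id attribute in column 9 of a GTF line.
GENE_ID_RE = re.compile(r'gene_id "([^"]+)"')


def add_gene_names(lines, names_by_gene):
    """Yield GTF lines with gene_name appended where the gene_id has a known name.

    gene_id and transcript_id are never modified, so the structure of the
    file stays intact.
    """
    for line in lines:
        line = line.rstrip("\n")
        fields = line.split("\t")
        # Pass through comments, blank lines, and anything without 9 columns.
        if line.startswith("#") or len(fields) != 9:
            yield line
            continue
        match = GENE_ID_RE.search(fields[8])
        name = names_by_gene.get(match.group(1)) if match else None
        if name and "gene_name" not in fields[8]:
            fields[8] = fields[8].rstrip() + f' gene_name "{name}";'
        yield "\t".join(fields)


# Hypothetical mapping, e.g. derived from the "Gene Name" column of the EggNOG table.
names_by_gene = {"g26": "GRN"}

example = [
    "\t".join(["JASCQY010000001.1", "gmst", "gene", "712854", "716064",
               ".", "+", ".", "g26"]),
    "\t".join(["JASCQY010000001.1", "gmst", "CDS", "712854", "713246",
               "227.542526", "+", "0", 'transcript_id "g26.t1"; gene_id "g26";']),
]

annotated = list(add_gene_names(example, names_by_gene))
for row in annotated:
    print(row)
```

With the example above, the CDS line gains gene_name "GRN"; while the gene line is left as it was. For a real annotation, a dedicated tool such as AGAT is likely the safer route, since it also handles the feature hierarchy.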

Hi!

Thank you for this! I am currently looking at AGAT and will definitely explore it. I am reading through agat_sp_manage_functional_annotation.pl and agat_sq_add_attributes_from_tsv.pl.

Thank you!
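
A table for the TSV-based script could be prepared roughly as follows. This is only a sketch under several assumptions: that the script expects a tab-separated file with a header row whose first column is ID and whose other columns are the attribute names to add (please check the script's help output before relying on this), that BRAKER transcript IDs are the gene ID plus a ".t<N>" suffix, and that the EggNOG export has the "Query ID" and "Gene Name" columns shown in the question. The helper names gene_name_table and write_tsv are hypothetical.

```python
import csv
import io
import re


def gene_name_table(eggnog_rows, id_column="Query ID", name_column="Gene Name"):
    """Map gene IDs to gene names from EggNOG rows keyed by transcript ID."""
    names = {}
    for row in eggnog_rows:
        name = row.get(name_column, "").strip()
        if not name or name == "-":
            continue  # skip transcripts without an assigned gene name
        # Assumption: transcript IDs look like "g26.t1" and the gene ID is "g26".
        gene_id = re.sub(r"\.t\d+$", "", row[id_column])
        names.setdefault(gene_id, name)  # keep the first name seen per gene
    return names


def write_tsv(names, handle):
    """Write an ID / gene_name table, one gene per row, sorted by gene ID."""
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    writer.writerow(["ID", "gene_name"])  # assumed header layout
    for gene_id, name in sorted(names.items()):
        writer.writerow([gene_id, name])


# In practice the rows would come from csv.DictReader(open(path), delimiter="\t").
rows = [
    {"Query ID": "g26.t1", "Gene Name": "GRN"},
    {"Query ID": "g26.t2", "Gene Name": "GRN"},  # hypothetical second isoform
    {"Query ID": "g27.t1", "Gene Name": ""},     # hypothetical gene without a name
]

names = gene_name_table(rows)
buffer = io.StringIO()
write_tsv(names, buffer)
print(buffer.getvalue(), end="")
```

Keeping only the first name per gene is a simplification; if isoforms of one gene receive different names, that conflict would need a manual decision.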

5 days ago
gernophil ▴ 120

I'm not sure whether this works for your use case, but you could try PyRanges: read_gtf() and to_gtf() seem helpful here. I have never used the latter, but the former is really good.
