How to convert the GTF output of TSEBRA to a GFF3 file as input for EVM?
12 months ago
BioinfoBee • 0

Hello,

Curious if anyone has experience using the TSEBRA GTF output in EVM. The GTF file generated by TSEBRA gives an error when converted to GFF3 format for use as EVM input. I tried augustus_GTF_to_EVM_GFF3.pl, but it doesn't seem to work. Kindly suggest!

Regards,
B

gtf braker2 gff3 tsebra • 1.7k views

You should mention what the error is; it is hard to comment otherwise. It is curious that the code errored out: GTF is a simple format, closely related to GFF, so conversion is usually trivial. Perhaps the GTF file is not quite right?


Istvan Albert Thanks! I received the following error while validating the GFF3 file converted with "augustus_GTF_to_EVM_GFF3.pl", using the EVM utility "gff3_gene_prediction_file_validator.pl":

Error, feature: Chr1-g54932 is described multiple times with different data values:
$VAR1 = {
          'feature_ID' => 'Chr7S-g54932',
          'feat_type' => 'gene',
          'parent_ID' => undef,
          'orient' => '+',
          'rend' => '252617335',
          'lend' => '252612276',
          'contig' => 'Chr7S'

Look at your GTF and try to see what is going on, for example:

cat myfile.gtf | grep Chr7S-g54932

It sounds like the feature ID needs to be unique. Perhaps you need to remove some lines from the file. It is conceivable that both the gene and the transcript have the same ID.
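
The same check can be run over the whole file instead of one ID at a time. Below is a minimal sketch, assuming tab-separated lines with GFF3-style `key=value;` attributes in column 9. The function name `duplicate_ids` and the sample lines are made up for illustration and are not taken from the real file.

```python
from collections import Counter

def duplicate_ids(lines):
    """Return {ID: count} for every ID= value that occurs on more than one line."""
    counts = Counter()
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue  # skip comments and blank lines
        columns = line.rstrip("\n").split("\t")
        if len(columns) < 9:
            continue  # not a feature line
        for field in columns[8].split(";"):
            key, _, value = field.strip().partition("=")
            if key == "ID":
                counts[value] += 1
                break  # count each line once, even if ID= is repeated
    return {feature_id: n for feature_id, n in counts.items() if n > 1}

# Hypothetical sample: the exon reuses the ID of its mRNA
sample = [
    "Chr5S\tAugustus\tmRNA\t100\t500\t.\t-\t.\tID=model.t2;Parent=gene.t2",
    "Chr5S\tAugustus\texon\t400\t500\t.\t-\t.\tID=model.t2;Parent=model.t2",
    "Chr5S\tAugustus\tCDS\t400\t500\t.\t-\t.\tID=cds.model.t2;Parent=model.t2",
]
print(duplicate_ids(sample))
```

To run it on a real file, pass an open file handle instead of the list. Keep in mind that GFF3 allows the segments of one multi-exon CDS to share an ID, so repeated CDS IDs alone are not necessarily an error.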


It seems the transcript ID and gene ID are described multiple times, but with the same value each time. Any suggestions on fixing this?

Regards, B

Chr5S   Augustus    gene    139406822   139409543   .   -   .   ID=gene.anno1.jg743953.t2;gene_id=g_174935;transcript_id=anno1.jg743953.t2;Name=Augustus%20prediction
Chr5S   Augustus    mRNA    139406822   139409543   .   -   .   ID=model.anno1.jg743953.t2;gene_id=g_174935;transcript_id=anno1.jg743953.t2;Parent=gene.anno1.jg743953.t2;gene_id=g_174935;transcript_id=anno1.jg743953.t2;Name=Augustus%20prediction
Chr5S   Augustus    exon    139409460   139409543   .   -   .   ID=model.anno1.jg743953.t2;gene_id=g_174935;transcript_id=anno1.jg743953.t2.exon1;Parent=model.anno1.jg743953.t2;gene_id=g_174935;transcript_id=anno1.jg743953.t2
Chr5S   Augustus    CDS 139409460   139409543   .   -   .   ID=cds.model.anno1.jg743953.t2;gene_id=g_174935;transcript_id=anno1.jg743953.t2;Parent=model.anno1.jg743953.t2;gene_id=g_174935;transcript_id=anno1.jg743953.t2
Chr5S   Augustus    exon    139408993   139409375   .   -   .   ID=model.anno1.jg743953.t2;gene_id=g_174935;transcript_id=anno1.jg743953.t2.exon2;Parent=model.anno1.jg743953.t2;gene_id=g_174935;transcript_id=anno1.jg743953.t2
Chr5S   Augustus    CDS 139408993   139409375   .   -   .   ID=cds.model.anno1.jg743953.t2;gene_id=g_174935;transcript_id=anno1.jg743953.t2;Parent=model.anno1.jg743953.t2;gene_id=g_174935;transcript_id=anno1.jg743953.t2
Chr5S   Augustus    exon    139408833   139408903   .   -   .   ID=model.anno1.jg743953.t2;gene_id=g_174935;transcript_id=anno1.jg743953.t2.exon3;Parent=model.anno1.jg743953.t2;gene_id=g_174935;transcript_id=anno1.jg743953.t2
Chr5S   Augustus    CDS 139408833   139408903   .   -   .   ID=cds.model.anno1.jg743953.t2;gene_id=g_174935;transcript_id=anno1.jg743953.t2;Parent=model.anno1.jg743953.t2;gene_id=g_174935;transcript_id=anno1.jg743953.t2
Chr5S   Augustus    exon    139408387   139408763   .   -   .   ID=model.anno1.jg743953.t2;gene_id=g_174935;transcript_id=anno1.jg743953.t2.exon4;Parent=model.anno1.jg743953.t2;gene_id=g_174935;transcript_id=anno1.jg743953.t2
Chr5S   Augustus    CDS 139408387   139408763   .   -   .   ID=cds.model.anno1.jg743953.t2;gene_id=g_174935;transcript_id=anno1.jg743953.t2;Parent=model.anno1.jg743953.t2;gene_id=g_174935;transcript_id=anno1.jg743953.t2
Chr5S   Augustus    exon    139407731   139407845   .   -   .   ID=model.anno1.jg743953.t2;gene_id=g_174935;transcript_id=anno1.jg743953.t2.exon5;Parent=model.anno1.jg743953.t2;gene_id=g_174935;transcript_id=anno1.jg743953.t2
Chr5S   Augustus    CDS 139407731   139407845   .   -   .   ID=cds.model.anno1.jg743953.t2;gene_id=g_174935;transcript_id=anno1.jg743953.t2;Parent=model.anno1.jg743953.t2;gene_id=g_174935;transcript_id=anno1.jg743953.t2
Chr5S   Augustus    exon    139407390   139407656   .   -   .   ID=model.anno1.jg743953.t2;gene_id=g_174935;transcript_id=anno1.jg743953.t2.exon6;Parent=model.anno1.jg743953.t2;gene_id=g_174935;transcript_id=anno1.jg743953.t2
Chr5S   Augustus    CDS 139407390   139407656   .   -   .   ID=cds.model.anno1.jg743953.t2;gene_id=g_174935;transcript_id=anno1.jg743953.t2;Parent=model.anno1.jg743953.t2;gene_id=g_174935;transcript_id=anno1.jg743953.t2
Chr5S   Augustus    exon    139406822   139407306   .   -   .   ID=model.anno1.jg743953.t2;gene_id=g_174935;transcript_id=anno1.jg743953.t2.exon7;Parent=model.anno1.jg743953.t2;gene_id=g_174935;transcript_id=anno1.jg743953.t2
Chr5S   Augustus    CDS 139406822   139407306   .   -   .   ID=cds.model.anno1.jg743953.t2;gene_id=g_174935;transcript_id=anno1.jg743953.t2;Parent=model.anno1.jg743953.t2;gene_id=g_174935;transcript_id=anno1.jg743953.t2

I believe the problem is that everything has the same ID. What a disaster, though alas not atypical of bioinformatics.

You would need to write code that parses the GTF and changes the IDs. I think ChatGPT could do it with ease, so let me give it a go. I think the result is a good start: tinker with it until it works. Even if it does not work right away, most of it should be fine, and you can run your own ChatGPT session and ask it for refinements.

Prompt: I have a GTF file where every element has the same ID, I need a Python program that renames the IDs but also keeps the exon parent ids correct

Response:

You can achieve this by using Python and the pandas library to read the GTF file, rename the IDs, and then save the modified GTF file. Here's a script that should do the trick:

import pandas as pd

def read_gtf(file_path):
    gtf_columns = ['seqname', 'source', 'feature', 'start', 'end', 'score', 'strand', 'frame', 'attributes']
    gtf_data = pd.read_csv(file_path, sep='\t', comment='#', header=None, names=gtf_columns)
    return gtf_data

def rename_ids(gtf_data):
    # Assuming the ID comes first, like this: ID=element1;Parent=element1_parent
    for index, row in gtf_data.iterrows():
        id_field, separator, rest = row['attributes'].partition(';')
        if not id_field.startswith('ID='):
            continue

        # Genes and mRNAs keep their IDs so that the Parent values
        # of their children still point at an existing feature
        if row['feature'] in ('gene', 'mRNA'):
            continue

        # Make the ID of every other feature (exon, CDS, ...) unique by
        # appending the feature type and row number; Parent is left untouched
        new_id = f"{id_field}.{row['feature']}{index + 1}"
        gtf_data.at[index, 'attributes'] = new_id + separator + rest

    return gtf_data

def write_gtf(gtf_data, output_path):
    gtf_data.to_csv(output_path, sep='\t', header=False, index=False)

# Replace these paths with your actual file paths
input_gtf_file = "path/to/your/input.gtf"
output_gtf_file = "path/to/your/output.gtf"

gtf_data = read_gtf(input_gtf_file)
renamed_gtf_data = rename_ids(gtf_data)
write_gtf(renamed_gtf_data, output_gtf_file)

print("Finished renaming the IDs in the GTF file.")

This script assumes that the ID format is like this: ID=element1;Parent=element1_parent. If your GTF file has a different format, you will need to modify the script accordingly. Make sure to replace path/to/your/input.gtf and path/to/your/output.gtf with the actual file paths.


Thanks Istvan Albert for suggesting the script! The idea of using a Python script (with help from ChatGPT) seems interesting; I will give it a try as well. Also, FYI, I solved this issue by simply reformatting the GFF3 to remove the redundant information. EVM seems to expect only the parent_ID information in the GFF3 file.

12 months ago
Buffo ★ 2.4k

It looks like a mixture of GTF and GFF3 formats. As far as I remember, TSEBRA is intended for transcript assembly refinement (StringTie GTF as input?). Your script expects the parent_ID, but it gets a gene_id instead. Your annotation carries redundant attributes in the wrong format; remove the gene_id and transcript_id.
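
Removing those two attributes could be sketched as follows. This assumes tab-separated lines with `key=value;` attributes in column 9, as in the lines posted above. The helper name `strip_attributes` is made up for illustration, and whether the result passes the EVM validator depends on the rest of the file.

```python
def strip_attributes(line, drop=("gene_id", "transcript_id")):
    """Remove the named key=value pairs from column 9 of one GFF3 line."""
    if line.startswith("#") or not line.strip():
        return line  # comments and blank lines pass through unchanged
    columns = line.rstrip("\n").split("\t")
    if len(columns) < 9:
        return line  # not a feature line
    kept = [
        field
        for field in columns[8].split(";")
        if field.strip() and field.strip().partition("=")[0] not in drop
    ]
    columns[8] = ";".join(kept)
    return "\t".join(columns) + "\n"

# The mRNA line from the thread, rebuilt with explicit tabs
line = (
    "Chr5S\tAugustus\tmRNA\t139406822\t139409543\t.\t-\t.\t"
    "ID=model.anno1.jg743953.t2;gene_id=g_174935;transcript_id=anno1.jg743953.t2;"
    "Parent=gene.anno1.jg743953.t2;gene_id=g_174935;transcript_id=anno1.jg743953.t2;"
    "Name=Augustus%20prediction\n"
)
print(strip_attributes(line), end="")
```

Applied line by line to the whole file, this leaves only ID, Parent and Name in the attribute column. It does not rename anything, so IDs that were duplicated before remain duplicated.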


This is a good point that I missed.

The file is not in GTF but already in GFF format; it is not even a mixture.

GTF attributes are written as key "value"; whereas GFF attributes are written as key=value;
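
A quick way to tell which of the two syntaxes a file uses is to test the attribute column against both patterns. The sketch below is only a heuristic: the name `attribute_style` and both regular expressions are assumptions made for illustration. It splits naively on `;`, so a quoted GTF value containing a semicolon, or an unquoted numeric GTF value, comes back as "unknown".

```python
import re

# key "value"   (GTF)
GTF_PAIR = re.compile(r'^\s*\S+\s+"[^"]*"\s*$')
# key=value     (GFF3)
GFF3_PAIR = re.compile(r'^\s*[^\s=]+=.*$')

def attribute_style(attributes):
    """Guess whether a column-9 string uses GTF or GFF3 attribute syntax."""
    fields = [f for f in attributes.split(";") if f.strip()]
    if fields and all(GTF_PAIR.match(f) for f in fields):
        return "gtf"
    if fields and all(GFF3_PAIR.match(f) for f in fields):
        return "gff3"
    return "unknown"

print(attribute_style('gene_id "g_174935"; transcript_id "anno1.jg743953.t2";'))
print(attribute_style("ID=model.anno1.jg743953.t2;Parent=gene.anno1.jg743953.t2"))
```

The first call is a hand-written GTF-style string built from the IDs in this thread; the second is the GFF3 style that the posted lines actually use.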


Buffo Thanks for the suggestions! It worked! Indeed, it was due to the redundant attributes, and I solved it by removing gene_id and transcript_id.

Regarding "As far as I remember TSEBRA is intended for transcript assembly refinement (stringtie gtf as input?)": TSEBRA helps to combine multiple BRAKER outputs.

12 months ago
Juke34 8.5k

It can also be fixed using AGAT:

agat_convert_sp_gxf2gxf.pl --gff tsebra.gff -o tsebra_fix.gff
