chimeraTE GTF format error
19 days ago

I am trying to run chimeraTE mode 1 using the T2T reference genome and its corresponding GTF annotation file, but I always get the same error:

ERROR: Bad GTF format: GTF does not contain coordinates of genes! The 3rd column must contain "gene" Exiting.... 

I have tried both keeping only the features labelled "gene" and assigning that feature type to every feature, but the error still appears.

I would greatly appreciate your feedback.

chimeraTE • 7.0k views

It always helps to add (at least) a few lines of the files you are working with, that way we can better spot potential problems.


Sorry, I was having problems uploading the code lines. In this case, I tried to run the software with this GTF: https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/009/914/755/GCF_009914755.1_T2T-CHM13v2.0/GCF_009914755.1_T2T-CHM13v2.0_genomic.gtf.gz. I would like to post the head of the file, but it is very hard to read in this text box.

I ran chimeraTE mode 1 with this GTF as the --gene argument, and the error I got is the following:

Checking gene and TE annotations
GTF GENE
GCF_009914755.1_T2T-CHM13v2.0_genomic.gtf contains:
ERROR: Bad GTF format
GCF_009914755.1_T2T-CHM13v2.0_genomic.gtf does not contain coordinates of genes! The 3rd column must contain "gene" Exiting...

The third column does contain "gene" in this file:

#gtf-version 2.2
#!genome-build T2T-CHM13v2.0
#!genome-build-accession NCBI_Assembly:GCF_009914755.1
#!annotation-date 08/01/2025
#!annotation-source NCBI RefSeq GCF_009914755.1-RS_2025_08
NC_060925.1     BestRefSeq      **gene**    7506    138480  .       -       .       gene_id "LOC127239154"; transcript_id ""; db_xref "GeneID:127239154"; description "uncharacterized LOC127239154"; gbkey "Gene"; gene "LOC127239154"; gene_biotype "lncRNA"; partial "true"; 

Yes, which is why I do not understand what is actually failing.


Perhaps the program only wants lines with the gene feature type. You could give the following a try, which keeps only the header lines and the lines that have "gene" in column 3 of the GTF.

zcat GCF_009914755.1_T2T-CHM13v2.0_genomic.gtf.gz | awk '$0 ~ /^#/ || $3 == "gene"' > genes_with_header.gtf
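
If awk is not at hand, the same filter can be sketched in Python. This is only a rough equivalent: the function names are made up for illustration, and unlike awk's default whitespace splitting it splits strictly on tabs, so a file that is not tab-delimited will come out with no gene lines at all.

```python
import gzip


def keep_gene_lines(lines):
    """Yield header lines ('#...') and records whose 3rd tab-separated column is 'gene'."""
    for line in lines:
        if line.startswith("#"):
            yield line
            continue
        fields = line.rstrip("\r\n").split("\t")
        if len(fields) >= 3 and fields[2] == "gene":
            yield line


def filter_gtf(in_path, out_path):
    """Write the filtered GTF; reads gzip-compressed input when the name ends in .gz."""
    opener = gzip.open if in_path.endswith(".gz") else open
    with opener(in_path, "rt") as src, open(out_path, "w") as dst:
        dst.writelines(keep_gene_lines(src))
```

For example, `filter_gtf("GCF_009914755.1_T2T-CHM13v2.0_genomic.gtf.gz", "genes_with_header.gtf")` would write the same kind of file as the awk command.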

I have just tried it, but it returns the same error again.


Perhaps a long shot, but your GTF file is tab-delimited, right? (And you did not open it in any Windows/DOS software, for editing for instance?)
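
A quick way to check both points at once is a small script along these lines. It is a sketch, not part of chimeraTE: `inspect_gtf` is a made-up helper that counts Windows-style line endings, records that do not split into exactly 9 tab-separated columns, and the feature types found in column 3.

```python
from collections import Counter


def inspect_gtf(lines):
    """Summarise line endings, column counts and column-3 feature types of a GTF."""
    report = {"crlf": 0, "not_9_columns": 0, "features": Counter()}
    for line in lines:
        # Skip header/comment lines and empty lines.
        if line.startswith("#") or not line.strip():
            continue
        if line.endswith("\r\n"):
            report["crlf"] += 1
        fields = line.rstrip("\r\n").split("\t")
        if len(fields) != 9:
            report["not_9_columns"] += 1
            continue
        report["features"][fields[2]] += 1
    return report
```

When reading from disk, open the file with `open(path, newline="")`; otherwise Python translates `\r\n` to `\n` on reading and the CRLF count will always be zero.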


I checked it and it is tab-delimited. I have also removed all the rows that did not contain 9 columns. However, the error persists.


OK, that's one thing ruled out already.

What do you mean by "all the rows that did not contain 9 columns"? Shouldn't they all have 9 columns?

Can you also post the exact command line you are trying to run?


A correctly formatted GTF should contain rows with 9 columns (start, end, strand, attributes, etc.). The command I am trying to run is the following, where the "te" argument is the previously mentioned GTF that is causing the error:

python3 chimTE_mode1.py 
  --genome      Genome in fasta
  --input       Paired-end files and their respective group/replicate
  --project     Directory name with output data
  --te          GTF file containing TE information
  --gene        GTF file containing gene information
  --strand      Define the strandness direction of the RNA-seq. Two options: "rf-stranded" OR "fwd-stranded"

[I'm replying at this level as otherwise we'll run out of space soon ;-) ]

OK. If you can, do add the exact file names you are using on the command line; be as exact as possible, as if you were typing it into your terminal.

Another idea: can you run the 'default' dataset that comes with the tool, to test that it works?

Also, but more difficult: try removing all entries that do not have CDS features assigned (e.g. the first one is a lncRNA and thus has no CDS lines; perhaps the tool is stumbling over that kind of 'gene' ...)
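
Removing the genes without CDS lines could be sketched roughly as follows. This assumes every record carries a `gene_id "..."` attribute in column 9, as in the RefSeq line quoted earlier in the thread; the function names are made up, and the whole file is held in memory because it needs two passes.

```python
import re

GENE_ID = re.compile(r'gene_id "([^"]*)"')


def gene_id_of(line):
    """Return the gene_id of a 9-column GTF record, or None for anything else."""
    fields = line.rstrip("\r\n").split("\t")
    if line.startswith("#") or len(fields) != 9:
        return None
    match = GENE_ID.search(fields[8])
    return match.group(1) if match else None


def keep_genes_with_cds(lines):
    """Keep header lines plus every record of a gene that has at least one CDS line."""
    lines = list(lines)
    coding = set()
    # First pass: collect the gene_ids that own a CDS record.
    for line in lines:
        fields = line.split("\t")
        if not line.startswith("#") and len(fields) == 9 and fields[2] == "CDS":
            coding.add(gene_id_of(line))
    coding.discard(None)
    # Second pass: keep headers and records belonging to those genes.
    return [l for l in lines if l.startswith("#") or gene_id_of(l) in coding]
```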

What you can also consider is running your GTF file through a tool such as AGAT, to double check (and perhaps correct) the structure of the GTF file


I have just tried running the code with the GTF included in the example dataset, and I also ran the code again with my GTF file. This time I used the latest version of the software, and the error I got was the following:

print(str(f"ERROR: Bad GTF format\n{args.gene.name} does not contain coordinates of {feat}s! The 3rd column must contain \"{feat}\"\tExiting..."))
                                        ^^^^^^^^^^^^^^
AttributeError: 'str' object has no attribute 'name'
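
Reading the traceback: the failing line is the one that prints the "Bad GTF format" message, so the GTF check had already failed and the script then crashed while reporting it, because `args.gene` is a plain string and strings have no `.name` attribute. The sketch below reproduces that message. How chimeraTE actually declares `--gene` is an assumption here (only the line in the traceback is known), and `display_name` is a made-up helper showing one way such a message can cope with either a path string or an open file.

```python
import argparse


def parse_args(argv):
    parser = argparse.ArgumentParser()
    # Assumption: --gene is declared without type=..., so argparse stores a plain str.
    parser.add_argument("--gene")
    return parser.parse_args(argv)


def display_name(value):
    """Return value.name for file-like objects, or the value itself for a path string."""
    return getattr(value, "name", value)


args = parse_args(["--gene", "genes.gtf"])
try:
    args.gene.name  # same attribute access as in the traceback
except AttributeError as err:
    message = str(err)

print(message)
print(display_name(args.gene))
```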

Did the analysis work with the included GTF file?

Perhaps it needs a 'name' tag in the last column?

Did you run it through AGAT? What was the result of that?


The analysis did not work with the included GTF file. I also ran AGAT, but I could not draw any conclusions because the output only showed the number of RNAs per gene and the maximum and minimum RNA lengths.


Hmm, if it also doesn't work with the included GTF, I would contact the developers of the tool and ask them for input.

For AGAT: it has many different sub-commands, so make sure you run the correct one ...
