Question: BEDOPS gtf2bed conversion error with Ensembl GTF
2
gravatar for bioinformatics.cancer
22 months ago by
United States
bioinformatics.cancer170 wrote:

Hi, I am trying to convert a Canine gene annotation (GTF) file downloaded from Ensembl to BED file using the gtf2bed tool within the BEDOPS application. Using this command gives an error:

$ gtf2bed < Canis_familiaris.CanFam3.1.85_noheader.gtf > Canis_familiaris.CanFam3.1.85_noheader.bed

Error: Potentially missing gene or transcript ID from GTF attributes (malformed GTF at line [1]?)

I checked the first few lines of the GTF file and it seems to match up with the required format:

$ head Canis_familiaris.CanFam3.1.85_noheader.gtf

X ensembl gene 1575 5716 . + . gene_id "ENSCAFG00000010935"; gene_version "3"; gene_source "ensembl"; gene_biotype "protein_coding";

X ensembl transcript 1575 5716 . + . gene_id "ENSCAFG00000010935"; gene_version "3"; transcript_id "ENSCAFT00000017396"; transcript_version "3"; gene_source "ensembl"; gene_biotype "protein_coding"; transcript_source "ensembl"; transcript_biotype "protein_coding";

...

I looked at the source code on github for this tool and can see that is check for gene or transcript id and if not present gives this error. But the gene_id is present here in the first line, so not sure how it is reaching the error condition.

I would appreciate any help with troubleshooting this error.

Thank you, - Pankaj

gtf2bed bedops ensembl gtf • 2.9k views
ADD COMMENTlink modified 22 months ago by lairdm20 • written 22 months ago by bioinformatics.cancer170
9
gravatar for Alex Reynolds
22 months ago by
Alex Reynolds24k
Seattle, WA USA
Alex Reynolds24k wrote:

I added more stringent GTF format validation to BEDOPS v2.4.20.

The error suggests that the first line is missing the transcript_id field. It has a gene_id field, as you note, but no transcript_id field. The GTF 2.2 specification indicates that this field is mandatory, though its value can be an empty string.

There are a couple solutions:

  1. Use an older version of gtf2bed that doesn't apply this validation check (e.g., 2.4.19 or earlier)
  2. Or, modify the GTF and add a placeholder field where none exists

I suggest the second solution. You could do the following:

$ awk '{ if ($0 ~ "transcript_id") print $0; else print $0" transcript_id \"\";"; }' input.gtf | gtf2bed - > output.bed

This adds transcript_id ""; to lines in the GTF that do not contain that field, and leaves other lines unchanged.

The GTF that comes out of this awk statement is more valid, enough to get through the conversion step, and so it can be piped to gtf2bed to get BED as output.

ADD COMMENTlink modified 22 months ago • written 22 months ago by Alex Reynolds24k
2

This method did not work for me. gtf2bed only outputs the gtf untouched .

ADD REPLYlink written 20 months ago by tiago211287990

I don't have enough information to debug, but one suggestion is that you put a tee statement in between awk and gtf2bed so that you can examine what comes out of awk: https://en.wikipedia.org/wiki/Tee_(command)

ADD REPLYlink written 20 months ago by Alex Reynolds24k
1

The first solution (awk) worked. Thanks!

ADD REPLYlink modified 22 months ago • written 22 months ago by bioinformatics.cancer170
2
gravatar for lairdm
22 months ago by
lairdm20
EBI, Cambridge, UK
lairdm20 wrote:

Good morning,

We've actually developed a tool at Ensembl to address issues like this quirk of GTF, File Chameleon [1], new in Ensembl 85.

Unfortunately a transcript_id is required even in gene records, as Alex says you can leave this blank, or the other solution I've commonly seen is to simply pad the record duplicating the gene_id in to the transcript_id.

File Chameleon will take any of the Ensembl flat files on our FTP site (GTF, GFF3, Fasta only so far) and transcribe it to correct these quirks, or other adjustments such as remapping to UCSC style chromosome names. If there's additional tweaking of our files that would be useful as part of retrieving them, we'd also love to hear suggestions. The tool is also available for offline use as well [2].

[1] http://www.ensembl.org/Homo_sapiens/Tools/FileChameleon?db=core

[2] https://github.com/FAANG/faang-format-transcriber

ADD COMMENTlink written 22 months ago by lairdm20

I've gotten a few reports about the GTF formatting error. Do you think Ensembl will adjust what it releases to meet spec, or was this done for some other reason?

ADD REPLYlink written 7 months ago by Alex Reynolds24k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1060 users visited in the last hour