Question: BEDOPS gtf2bed conversion error with Ensembl GTF
2
2.1 years ago by
United States
bioinformatics.cancer180 wrote:

Hi, I am trying to convert a canine gene annotation (GTF) file downloaded from Ensembl to a BED file using the gtf2bed tool from BEDOPS. The following command gives an error:

$ gtf2bed < Canis_familiaris.CanFam3.1.85_noheader.gtf > Canis_familiaris.CanFam3.1.85_noheader.bed

Error: Potentially missing gene or transcript ID from GTF attributes (malformed GTF at line [1]?)

I checked the first few lines of the GTF file and it seems to match up with the required format:

$ head Canis_familiaris.CanFam3.1.85_noheader.gtf

X ensembl gene 1575 5716 . + . gene_id "ENSCAFG00000010935"; gene_version "3"; gene_source "ensembl"; gene_biotype "protein_coding";

X ensembl transcript 1575 5716 . + . gene_id "ENSCAFG00000010935"; gene_version "3"; transcript_id "ENSCAFT00000017396"; transcript_version "3"; gene_source "ensembl"; gene_biotype "protein_coding"; transcript_source "ensembl"; transcript_biotype "protein_coding";

...

I looked at the source code for this tool on GitHub and can see that it checks for a gene or transcript ID and gives this error if one is not present. But the gene_id is present here in the first line, so I am not sure how it is reaching the error condition.

I would appreciate any help with troubleshooting this error.

Thank you, - Pankaj

gtf2bed bedops ensembl gtf • 3.5k views
ADD COMMENTlink modified 2.1 years ago by lairdm20 • written 2.1 years ago by bioinformatics.cancer180
9
2.1 years ago by
Alex Reynolds25k
Seattle, WA USA
Alex Reynolds25k wrote:

I added more stringent GTF format validation to BEDOPS v2.4.20.

The error suggests that the first line is missing the transcript_id field. It has a gene_id field, as you note, but no transcript_id field. The GTF 2.2 specification indicates that this field is mandatory, though its value can be an empty string.
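To see which records trigger the check, one option is to count the lines that lack a transcript_id attribute, grouped by feature type. This is only a rough sketch: the file name sample_check.gtf and the two shortened records are made up for illustration (modelled on the lines quoted in the question), and it assumes tab-separated GTF with the attributes in column 9.

```shell
# Build a two-line sample modelled on the Ensembl records quoted in the question.
# The file name is hypothetical; fields are tab-separated, as in real GTF.
sample=sample_check.gtf
printf 'X\tensembl\tgene\t1575\t5716\t.\t+\t.\tgene_id "ENSCAFG00000010935"; gene_version "3";\n' > "$sample"
printf 'X\tensembl\ttranscript\t1575\t5716\t.\t+\t.\tgene_id "ENSCAFG00000010935"; gene_version "3"; transcript_id "ENSCAFT00000017396";\n' >> "$sample"

# Count records without a transcript_id attribute, grouped by feature type (column 3).
# Comment lines starting with '#' are skipped.
missing_by_feature=$(awk -F'\t' '!/^#/ && $9 !~ /transcript_id/ { n[$3]++ } END { for (f in n) print f, n[f] }' "$sample")
echo "$missing_by_feature"
```

On an Ensembl GTF like the one in the question, I would expect only the gene records to show up in this count, since the transcript line quoted there already carries a transcript_id.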

There are a couple of solutions:

  1. Use an older version of gtf2bed that doesn't apply this validation check (e.g., 2.4.19 or earlier)
  2. Or, modify the GTF and add a placeholder field where none exists

I suggest the second solution. You could do the following:

$ awk '{ if ($0 ~ "transcript_id") print $0; else print $0" transcript_id \"\";"; }' input.gtf | gtf2bed - > output.bed

This adds transcript_id ""; to lines in the GTF that do not contain that field, and leaves other lines unchanged.

The GTF that comes out of this awk statement is valid enough to pass the validation check, so it can be piped to gtf2bed to get BED as output.
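If the GTF still has its header, a small variation on the awk one-liner above can pass comment lines through untouched and then confirm that nothing is left unpadded. This is a sketch under assumptions: the file names sample_pad.gtf and padded.gtf and the sample records are illustrative only, and the header line is just an example of a `#`-prefixed comment.

```shell
# Illustrative input: one comment line, one gene record, one transcript record.
in=sample_pad.gtf
printf '#!genome-build CanFam3.1\n' > "$in"
printf 'X\tensembl\tgene\t1575\t5716\t.\t+\t.\tgene_id "ENSCAFG00000010935"; gene_version "3";\n' >> "$in"
printf 'X\tensembl\ttranscript\t1575\t5716\t.\t+\t.\tgene_id "ENSCAFG00000010935"; transcript_id "ENSCAFT00000017396";\n' >> "$in"

# Same padding idea as the one-liner: lines that already have transcript_id
# (and comment lines) are printed as-is; everything else gets an empty placeholder.
awk '/^#/ || /transcript_id/ { print; next } { print $0 " transcript_id \"\";" }' "$in" > padded.gtf

# Sanity check: count non-comment records that still lack transcript_id.
remaining=$(awk '!/^#/ && !/transcript_id/ { n++ } END { print n + 0 }' padded.gtf)
echo "records still missing transcript_id: $remaining"
```

Once the count is zero, padded.gtf can be given to gtf2bed on standard input in the same way as in the question.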

ADD COMMENTlink modified 2.1 years ago • written 2.1 years ago by Alex Reynolds25k
2

This method did not work for me. gtf2bed only outputs the GTF untouched.

ADD REPLYlink written 23 months ago by tiago2112871.0k

I don't have enough information to debug, but one suggestion is that you put a tee statement in between awk and gtf2bed so that you can examine what comes out of awk: https://en.wikipedia.org/wiki/Tee_(command)
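Roughly, the tee suggestion could look like the sketch below. The file names (sample_tee.gtf, awk_output.gtf, output.bed) are placeholders, and the fallback to cat is only there so the pipeline still runs on a machine without BEDOPS; it performs no conversion.

```shell
# Placeholder input with a single gene record that lacks transcript_id.
in=sample_tee.gtf
printf 'X\tensembl\tgene\t1575\t5716\t.\t+\t.\tgene_id "ENSCAFG00000010935"; gene_version "3";\n' > "$in"

# Use gtf2bed when it is installed; otherwise fall back to cat (no conversion happens then).
if command -v gtf2bed >/dev/null 2>&1; then converter=gtf2bed; else converter=cat; fi

# tee writes a copy of the awk output to awk_output.gtf and passes the same stream downstream.
awk '{ if ($0 ~ "transcript_id") print $0; else print $0" transcript_id \"\";"; }' "$in" \
  | tee awk_output.gtf \
  | "$converter" > output.bed

# Inspect what awk actually produced before it reached the converter.
head -n 3 awk_output.gtf
```

If awk_output.gtf looks right but output.bed is still GTF, the problem is more likely in how gtf2bed is being invoked than in the awk step.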

ADD REPLYlink written 23 months ago by Alex Reynolds25k
1

The awk solution worked. Thanks!

ADD REPLYlink modified 2.1 years ago • written 2.1 years ago by bioinformatics.cancer180
2
2.1 years ago by
lairdm20
EBI, Cambridge, UK
lairdm20 wrote:

Good morning,

We've actually developed a tool at Ensembl, File Chameleon [1] (new in Ensembl 85), to address issues like this quirk of GTF.

Unfortunately, a transcript_id is required even in gene records. As Alex says, you can leave it blank. The other solution I've commonly seen is to pad the record by duplicating the gene_id into the transcript_id.
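For the second approach, a minimal awk sketch of copying gene_id into transcript_id might look like this. It is not what File Chameleon does internally; the file names sample_copy.gtf and copied.gtf and the sample records are illustrative, and it assumes the gene_id value is double-quoted as in Ensembl GTF.

```shell
# Illustrative input: a gene record without transcript_id and a transcript record with one.
in=sample_copy.gtf
printf 'X\tensembl\tgene\t1575\t5716\t.\t+\t.\tgene_id "ENSCAFG00000010935"; gene_version "3";\n' > "$in"
printf 'X\tensembl\ttranscript\t1575\t5716\t.\t+\t.\tgene_id "ENSCAFG00000010935"; transcript_id "ENSCAFT00000017396";\n' >> "$in"

awk '
  /^#/ || /transcript_id/ { print; next }       # comments and complete records pass through
  match($0, /gene_id "[^"]*"/) {                 # locate the gene_id attribute
    id = substr($0, RSTART + 8, RLENGTH - 8)     # keep the quoted value, including its quotes
    print $0 " transcript_id " id ";"            # append it as the transcript_id
    next
  }
  { print }                                      # no gene_id found: leave the line unchanged
' "$in" > copied.gtf

cat copied.gtf
```

Whether an empty string or a copied gene_id is preferable depends on what consumes the output downstream; the copied ID keeps the resulting records non-empty in that field.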

File Chameleon will take any of the Ensembl flat files on our FTP site (GTF, GFF3, and FASTA only so far) and transcribe it to correct these quirks, or make other adjustments such as remapping to UCSC-style chromosome names. If there's additional tweaking of our files that would be useful as part of retrieving them, we'd also love to hear suggestions. The tool is also available for offline use [2].

[1] http://www.ensembl.org/Homo_sapiens/Tools/FileChameleon?db=core

[2] https://github.com/FAANG/faang-format-transcriber

ADD COMMENTlink written 2.1 years ago by lairdm20

I've gotten a few reports about the GTF formatting error. Do you think Ensembl will adjust what it releases to meet spec, or was this done for some other reason?

ADD REPLYlink written 10 months ago by Alex Reynolds25k