Question: GTF2BED does not provide meaningful output?
chahat_u (United States) wrote, 22 months ago:

Hi guys,

I have been trying to use gtf2bed to convert a GTF file to BED format, but to no avail. Running the following command:

$ gtf2bed < GRCh38p5_copy.gtf > foo1.bed

gives this error:

Error: Potentially missing gene or transcript ID from GTF attributes (malformed GTF at line [1]?)

I checked the first few lines of the GTF (after removing the comment lines). They are:

chr1 HAVANA gene 11869 14409 . + . gene_id "ENSG00000223972.5"; gene_type "transcribed_unprocessed_pseudogene"; gene_status "KNOWN"; gene_name "DDX11L1"; level 2; havana_gene "OTTHUMG00000000961.2";
chr1 HAVANA transcript 11869 14409 . + . gene_id "ENSG00000223972.5"; transcript_id "ENST00000456328.2"; gene_type "transcribed_unprocessed_pseudogene"; gene_status "KNOWN"; gene_name "DDX11L1"; transcript_type "processed_transcript"; transcript_status "KNOWN"; transcript_name "DDX11L1-002"; level 2; tag "basic"; transcript_support_level "1"; havana_gene "OTTHUMG00000000961.2"; havana_transcript "OTTHUMT00000362751.1";

I also tried running the tool on the example GTF file from their own website, but the output BED file it gives me is empty.

Any ideas what could be going wrong?

rna-seq bed gtf • 1.0k views
Alex Reynolds (Seattle, WA USA) wrote, 22 months ago:

Gencode and Ensembl GTF output has a problem: some records (gene records, for example) lack the obligatory transcript_id attribute. One solution is to add a dummy attribute, e.g.:

$ wget -qO- \
    | gunzip -c - \
    | awk '{ if ($0 ~ "transcript_id") print $0; else print $0" transcript_id \"\";"; }' - \
    | convert2bed --input=gtf - \
    > output.bed
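
To see what that awk step does before running it on a full annotation, here is a minimal sketch on two abbreviated records modeled on the sample in the question. The function name add_dummy_transcript_id and the shortened attribute lists are made up for illustration, and the records are assumed to be tab-delimited:

```shell
# Hypothetical wrapper around the awk one-liner above; the name is illustrative only.
add_dummy_transcript_id() {
    awk '{ if ($0 ~ "transcript_id") print $0; else print $0" transcript_id \"\";"; }' "$@"
}

# Two abbreviated, tab-delimited records: a gene line without transcript_id
# and a transcript line that already has one.
sample=$(printf 'chr1\tHAVANA\tgene\t11869\t14409\t.\t+\t.\tgene_id "ENSG00000223972.5";\nchr1\tHAVANA\ttranscript\t11869\t14409\t.\t+\t.\tgene_id "ENSG00000223972.5"; transcript_id "ENST00000456328.2";\n')

# With no file arguments the function reads standard input.
patched=$(printf '%s\n' "$sample" | add_dummy_transcript_id)
printf '%s\n' "$patched"
```

Only the gene line should change: it gains an empty transcript_id "" attribute at the end, while the transcript line passes through untouched.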

Another option that doesn't muck with the data is to grab the GFF3 version of the annotation, where one is available, and convert that instead, e.g.:

$ wget -qO- \
    | gunzip -c - \
    | convert2bed --input=gff - \
    > output.bed

Some groups do their own thing. There are similar parsing problems caused by deviations from the spec in the annotations published by the Arabidopsis consortium. Oh well! I seem to have more luck with getting GFF3 that follows spec, so I'd look in that direction, maybe.


Hi Alex,

But as I mentioned in my question (the 2 lines I copied from the GTF), the transcript_id is present in the GTF.

Also, what could be the reason that even the example GTF file provided on the website (foo.gtf) generates an empty BED file?

(reply written 22 months ago by chahat_u)

Take another look at your sample file. Not sure what's up with the demo file (I'll look into it) but your sample input does not meet spec.
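
One way to take that second look is to scan the file for records that are missing an ID attribute. This is only a sketch: check_gtf is a hypothetical helper, not part of gtf2bed or convert2bed, and it assumes the usual nine tab-delimited GTF columns with the attributes in column 9:

```shell
# Hypothetical helper; prints "line<TAB>feature<TAB>problem" for each suspect record.
check_gtf() {
    awk -F'\t' '
        /^#/ { next }    # skip comment lines
        NF != 9 { print NR "\t?\texpected 9 tab-delimited fields, found " NF; next }
        $9 !~ /gene_id "/       { print NR "\t" $3 "\tmissing gene_id" }
        $9 !~ /transcript_id "/ { print NR "\t" $3 "\tmissing transcript_id" }
    ' "$@"
}

# Two abbreviated records modeled on the sample in the question.
sample=$(printf 'chr1\tHAVANA\tgene\t11869\t14409\t.\t+\t.\tgene_id "ENSG00000223972.5";\nchr1\tHAVANA\ttranscript\t11869\t14409\t.\t+\t.\tgene_id "ENSG00000223972.5"; transcript_id "ENST00000456328.2";\n')

report=$(printf '%s\n' "$sample" | check_gtf)
printf '%s\n' "$report"
# prints: 1<TAB>gene<TAB>missing transcript_id
```

On a real file you would call it as check_gtf GRCh38p5_copy.gtf. A report of wrong field counts would suggest the columns are separated by spaces rather than tabs.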

(reply written 22 months ago by Alex Reynolds)