gtf2bed not converting to bed
13 months ago
Agamemnon ▴ 80

Hello,

I am trying to convert a GTF file to a BED file.

gtf2bed < Homo_sapiens.GRCh38.109.gene.gtf >  Homo_sapiens.GRCh38.109.gene.bed

I get the following error:

Warning: If your Wiggle data is a significant portion of available system memory, use the --max-mem and --sort-tmpdir options, or use --do-not-sort to disable post-conversion sorting. See --help for more information.
Warning: Potentially missing gene or transcript ID from GTF attributes (malformed GTF at line [1]?)

My original GTF file looks like this: [image]

However, my BED file looks like this: [image]

I was expecting something along these lines:

7   127588344   127588498   ENST00000000233 0   +
11  64305577    64305736    ENST00000000442 0   +
11  64307167    64307179    ENST00000000442 0   +
12  2794952     2795139     ENST00000001008 0   +
2   37231665    37231705    ENST00000002125 0   +

I am having the same issue with the original Ensembl 109 GTF file: the same warnings and no clean BED file.
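
One way to investigate the second warning is to count the feature lines whose attribute column has no `transcript_id`. The sketch below rests on an assumption about the cause, namely that gene-level records in a GTF usually carry a `gene_id` but no `transcript_id`. The function name `count_missing_transcript_id`, the sample records, and the IDs `G1` and `T1` are made up for illustration.

```shell
# Hypothetical two-record GTF: a gene line (no transcript_id) and a transcript line.
sample_gtf=$(mktemp)
printf '%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\n' \
  1 havana gene 11869 14409 . + . 'gene_id "G1"; gene_name "X";' \
  1 havana transcript 11869 14409 . + . 'gene_id "G1"; transcript_id "T1";' \
  > "$sample_gtf"

# Count non-comment lines whose 9th (attributes) column lacks a transcript_id.
count_missing_transcript_id() {
  awk -F'\t' '!/^#/ && $9 !~ /transcript_id "/ { n++ } END { print n + 0 }' "$1"
}

count_missing_transcript_id "$sample_gtf"
```

If the count for a real file is greater than zero, those records are candidates for what the warning is complaining about.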

bed gtf • 2.1k views

The output from the tool you are using is correct; it is BED12 format. Just use cut to limit the output to the first 6 columns if you don't need the rest.


Hmm, be careful: it is actually not BED12! The format created by gff2bed from BEDOPS is its own particular format and does not follow the BED specification. Up to the 6th column it can be considered BED, but beyond that column it becomes something original.


Although I gave the example of a gene gtf, what I actually want is the transcript ID.


Even if no official specification exists for BED, my view was to use the definition given by whoever created the format. From what I know, that was UCSC in the early 2000s (1998?). So for me, what BEDOPS is doing is misleading. At least they have nice documentation about how they do their own "bed".
As they say in the BEDOPS publication, BEDOPS supports a relaxed variation of the BED specification. They could have created their own format; perhaps that would have been less misleading for users.


Um, there is a specification for BED: https://github.com/samtools/hts-specs/blob/master/BEDv1.pdf

And the BEDOPS documentation does refer to UCSC columns. I know this not only because it is directly from the link I provided above, but because I was the person who wrote said documentation.


Thank you for the link, this is really nice. I don't know if there is a specific way to make a specification official (written in a publication, a big group advertising it as for GTF2.2, a consortium managing it as for GFF3, or held under the umbrella of a tool that stands out in the field, as you show for BED), but as long as it is well described, findable, etc. (FAIR), it is good (for me). Thank you for this work. I like the way flexibility is given in the format via BEDn+m!

To come back to BEDOPS, and following your recent specification (from what I see it is from ~2021), gtf2bed gives a BED6+5 format as output.

P.S.:
Note that BED12 is different from BED6+6 :)


I wrote this documentation a long time ago. Whatever, you're wrong and being unpleasant about it. Have a nice day.


Sorry, I did not mean to be unpleasant, really. I will read your spec more carefully.

13 months ago

Seems to me that the OP's main issue is that they are getting the gene ID, rather than the transcript ID, as the name field.

You can tell BEDOPS to use the transcript_id instead of the gene_id with --attribute-key=transcript_id. Then follow @AlexReynolds's answer to cut the first 6 columns.
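
Putting the two suggestions together could look like the sketch below. The `--attribute-key=transcript_id` option and the `cut -f1-6` step come from the answers in this thread. The function name `gtf_to_transcript_bed6` and the file names in the usage comment are placeholders, and the option may not be present in every BEDOPS release, so it is worth checking `gtf2bed --help` on your installation first.

```shell
# Sketch: name BED records by transcript_id, then keep only the first six columns.
# Assumes BEDOPS gtf2bed is on PATH and accepts --attribute-key (per the answer above).
gtf_to_transcript_bed6() {
  gtf2bed --attribute-key=transcript_id < "$1" | cut -f1-6
}

# Usage (file names are placeholders):
#   gtf_to_transcript_bed6 input.gtf > output.bed
```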

13 months ago

Just pipe to cut if you want six columns, e.g.:

$ gtf2bed < foo.gtf | cut -f1-6 > foo.c1t6.bed

You don't need an alternative toolkit like AGAT just to get six columns of output. That's silly. Come on, now.
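
To confirm that the trimmed file really has six columns per record, a small check along these lines may help. It is only a sketch: the function name `count_non_bed6_lines` and the sample record are made up for illustration.

```shell
# Count lines that do not have exactly six tab-separated fields.
count_non_bed6_lines() {
  awk -F'\t' 'NF != 6 { n++ } END { print n + 0 }' "$1"
}

# Made-up ten-column record, trimmed with cut in the same way as the command above.
sample_bed=$(mktemp)
printf 'chr7\t100\t200\tT1\t.\t+\tsrc\texon\t.\tattrs\n' | cut -f1-6 > "$sample_bed"

count_non_bed6_lines "$sample_bed"
```

A result of 0 means every line has six fields; any other number points to lines worth inspecting.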
