gtf2bed not converting to bed
2
0
12 days ago
Agamemnon ▴ 60

Hello,

I am trying to convert a GTF file to a BED file.

gtf2bed < Homo_sapiens.GRCh38.109.gene.gtf >  Homo_sapiens.GRCh38.109.gene.bed


I get the following warnings:

Warning: If your Wiggle data is a significant portion of available system memory, use the --max-mem and --sort-tmpdir options, or use --do-not-sort to disable post-conversion sorting. See --help for more information.
Warning: Potentially missing gene or transcript ID from GTF attributes (malformed GTF at line [1]?)


My original gtf file looks like this:

However, my BED file looks like this:

I was expecting something along these lines -

7   127588344   127588498   ENST00000000233   0   +
11  64305577    64305736    ENST00000000442   0   +
11  64307167    64307179    ENST00000000442   0   +
12  2794952     2795139     ENST00000001008   0   +
2   37231665    37231705    ENST00000002125   0   +


I am having the same issue with the original Ensembl 109 GTF file: the same warnings and no clean BED file.

bed gtf • 722 views
1

Can you give AGAT a try: https://agat.readthedocs.io/en/latest/gff_to_bed.html

0

The examples from the tool you use are correct; it is BED12 format. Just use cut to limit the output to the first 6 columns if you don't need the rest.

0

Hmm, be careful: actually, no, it is not BED12 format! The format created by gff2bed from BEDOPS is a rather fanciful format of its own. It does not follow the BED specification. Up to the 6th column it can be considered BED, but beyond that column it becomes something original.

0

Although I gave the example of a gene gtf, what I actually want is the transcript ID.

0

Even if no official specification exists for BED, my view was to use the definition from whoever created the format. From what I know, that was UCSC in the early 2000s (1998?). So to me, what BEDOPS is doing is misleading. At least they have nice documentation about how they do their own "BED".
As they say in the BEDOPS publication, BEDOPS supports a relaxed variation of the BED specification. They could have created their own format; perhaps that would have been less misleading for users.

1

Um, there is a specification for BED: https://github.com/samtools/hts-specs/blob/master/BEDv1.pdf

And the BEDOPS documentation does refer to UCSC columns. I know this not only because it is directly from the link I provided above, but because I was the person who wrote said documentation.

0

Thank you for the link, this is really nice. I don't know if there is a specific way to make a specification official (written in a publication, advertised by a big group as for GTF2.2, managed by a consortium as for GFF3, or held under the umbrella of a tool that stands out in the field, as you show for BED), but as long as it is well described, findable, etc. (FAIR), it is good (for me). Thank you for this work. I like the way flexibility is built into the format via BEDn+m!

To come back to BEDOPS, and following your recent specification (from what I see, it is from ~2021), gtf2bed gives BED6+5 format as output.

P.S.:
We have to keep in mind that a BED12 is different from a BED6+6 :)

0

I wrote this documentation a long time ago. Whatever, you're wrong and being unpleasant about it. Have a nice day.

0

Sorry, I did not mean to be unpleasant, really. I will read your spec more carefully.

2
11 days ago

Seems to me that the OP's main issue is that they are getting the gene ID, rather than the transcript ID, as the name field.

You can tell BEDOPS to use the transcript_id instead of the gene_id by using --attribute-key=transcript_id. Then follow @AlexReynolds's answer to cut the first 6 columns.
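
Before switching the name field, it may be worth checking that the records in the input actually carry a transcript_id attribute. The sketch below is only an illustration on a made-up three-line GTF (the file name toy.gtf and the IDs G1 and T1 are hypothetical). It counts, per feature type, how many records have a transcript_id. Gene rows in Ensembl GTFs carry no transcript_id, so a gene-only file would have nothing to put in the name field; that may be related to the "missing gene or transcript ID" warning in the question, though the thread does not confirm it.

```shell
# Build a tiny, made-up GTF (tab-separated, 9 columns) in a temp directory.
tmpdir=$(mktemp -d)
printf '1\ttoy\tgene\t100\t500\t.\t+\t.\tgene_id "G1";\n'                           >  "$tmpdir/toy.gtf"
printf '1\ttoy\ttranscript\t100\t500\t.\t+\t.\tgene_id "G1"; transcript_id "T1";\n' >> "$tmpdir/toy.gtf"
printf '1\ttoy\texon\t100\t200\t.\t+\t.\tgene_id "G1"; transcript_id "T1";\n'       >> "$tmpdir/toy.gtf"

# For each feature type (column 3), count records with and without a
# transcript_id key in the attributes column (column 9). Header lines
# starting with '#' are skipped.
summary=$(awk -F'\t' '
    !/^#/ {
        has = ($9 ~ /transcript_id "/) ? "with" : "without"
        n[$3 " " has]++
    }
    END { for (k in n) print k, n[k] }
' "$tmpdir/toy.gtf" | sort)

printf '%s\n' "$summary"
rm -rf "$tmpdir"
```

On the toy input, the gene row is reported as "without" and the transcript and exon rows as "with". If your own file shows only gene rows, converting the full (non gene-only) GTF is probably the better starting point for transcript-level BED.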

2
11 days ago

Just pipe to cut if you want six columns, e.g.:

$ gtf2bed < foo.gtf | cut -f1-6 > foo.c1t6.bed


You don't need an alternative toolkit like AGAT just to get six columns of output. That's silly. Come on, now.
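
To see what the cut step does in isolation, here is a small sketch on a made-up, tab-separated row. The name TOY_TX1 and the trailing extra1 to extra3 columns are placeholders, not real gtf2bed output. cut -f1-6 keeps the first six tab-delimited fields and drops everything after them.

```shell
# One made-up row: six BED-style columns followed by three placeholder columns.
row=$(printf 'chr7\t127588344\t127588498\tTOY_TX1\t0\t+\textra1\textra2\textra3')

# cut splits on tabs by default; -f1-6 keeps fields 1 through 6.
bed6=$(printf '%s\n' "$row" | cut -f1-6)

printf '%s\n' "$bed6"

# Confirm the number of remaining tab-separated fields.
printf '%s\n' "$bed6" | awk -F'\t' '{ print NF }'
```

The same field-count check with awk can be run on a real output file to confirm that every line has six columns after the cut.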