Convert .Gff3 File To 12-Column .Bed File
5
2
8.2 years ago
LRStar ▴ 190

Hello:

I would like to convert a .gff3 file to a 12-column .bed file, as described under "BED Format" at this link (http://genome.ucsc.edu/FAQ/FAQformat.html#format1).

I have thus far used Galaxy from Penn State, but it outputs a 6-column .bed file.

Any advice is greatly appreciated! Thank you...

methylation next-gen bioinformatics • 17k views
0

Hello @t_pod,

Did you find a solution to your problem?

We are also using gff3ToGenePred to convert a gff3 file to a genePred file, but unfortunately we are getting output for only 2 chromosomes instead of 15.

Any help is appreciated.

7
2.4 years ago

The UCSC Genome Browser hosts conversion utilities that you can run from your command line to accomplish the gff3 to BED12 conversion. Note that the utilities are OS-specific and need to be given permission to execute with "chmod +x utilityName".

Here's an example of how I did a conversion using the following steps:
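A minimal sketch of one such route, assuming the two UCSC utilities gff3ToGenePred and genePredToBed have been downloaded and are on the PATH. The file names are placeholders, and this is not necessarily the exact sequence of steps referred to above:

```python
import shutil
import subprocess


def ucsc_gff3_to_bed12_commands(gff3_path, genepred_path, bed_path):
    """Return the two command lines, in order, as argument lists."""
    return [
        ["gff3ToGenePred", gff3_path, genepred_path],  # GFF3 -> genePred
        ["genePredToBed", genepred_path, bed_path],    # genePred -> BED12
    ]


def run_conversion(gff3_path, genepred_path, bed_path):
    """Run both steps, stopping at the first one that fails."""
    for cmd in ucsc_gff3_to_bed12_commands(gff3_path, genepred_path, bed_path):
        if shutil.which(cmd[0]) is None:
            raise FileNotFoundError(f"{cmd[0]} not found on PATH")
        subprocess.run(cmd, check=True)


# Placeholder file names, for illustration only.
commands = ucsc_gff3_to_bed12_commands(
    "annotation.gff3", "annotation.genePred", "annotation.bed"
)
```

Calling run_conversion with real paths would execute both utilities; building the commands first makes it easy to print and check them before running anything.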

Disclaimer that I work for the UCSC Genome Browser. :)

1

This is the answer that saved the day for me. Thank you!

0

I spent days trying to figure out how to do this, then I followed your advice and it took me 5 minutes! I can't thank you enough!

0

Thank you so much!! Lifesaver

2
8.2 years ago

You might take a look at the BEDOPS gff2bed conversion script, to see if it gets you closer. (It doesn't make BED12, but is there enough information in a GFF3 file to get there? You'd need to add color metadata on your own, for instance.)

In any case, this script tries not to throw anything away, except for headers that other BEDOPS tools do not process. If you want to preserve them as BED elements, add the --keep-header option.

You can also apply cut or awk to the output of gff2bed in order to filter, rearrange or add non-GFF/color columns, if needed.
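As a rough illustration of that kind of post-processing, here is a small Python sketch (standing in for cut or awk) that keeps only the exon rows of gff2bed-style output and appends thickStart, thickEnd and a constant colour. The function name and the default colour are made up for the example, and the score column is passed through unchanged:

```python
def exon_rows_with_color(lines, rgb="255,0,0"):
    """Keep exon rows from gff2bed-style output and emit 9-column BED lines."""
    out = []
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        # gff2bed output: chrom, start, end, id, score, strand, source, type, phase, attributes
        if len(fields) < 10 or fields[7] != "exon":
            continue
        chrom, start, end, name, score, strand = fields[:6]
        # thickStart/thickEnd repeat start/end; itemRgb is a constant colour.
        out.append("\t".join([chrom, start, end, name, score, strand, start, end, rgb]))
    return out


sample = [
    "PdomScaf0001\t14\t100\t3\t-0.094\t-\tmaker\texon\t.\tParent=2;ID=3",
    "PdomScaf0001\t14\t100\t4\t.\t-\tmaker\tCDS\t2\tParent=2;ID=4",
]
rows = exon_rows_with_color(sample)
```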

0

Thanks for the comment, Alex. I did try to do that, but am still uncertain about how to create certain columns of the .bed file. I think I can create Columns 1, 2, 3, 5, 6, 7, 8: Bed(Col1) = Gff3(Col1). Bed(Col2) = Gff3(Col4). Bed(Col3) = Gff3(Col5). Bed(Col5) = Gff3(Col6). Bed(Col6) = Gff3(Col7). Bed(Col7) = Bed(Col2) [at least in online examples, it seemed repeated]. Bed(Col8) = Bed(Col3) [at least in online examples, it seemed repeated].
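The mapping above can be sketched in Python as follows, with one adjustment: GFF3 coordinates are 1-based and inclusive while the BED start is 0-based, so Bed(Col2) is Gff3(Col4) minus 1 (which is why gff2bed reports 14 for a feature starting at 15). Taking the name from the Name attribute, falling back to ID, is an assumption made for this example, and the function names are hypothetical:

```python
def parse_attributes(attr_field):
    """Split a GFF3 attribute column ('key=value;key=value') into a dict."""
    pairs = (item.split("=", 1) for item in attr_field.split(";") if "=" in item)
    return {key: value for key, value in pairs}


def gff3_to_bed8(line):
    """Map one tab-separated GFF3 feature line to the first eight BED columns."""
    chrom, _source, _type, start, end, score, strand, _phase, attrs = (
        line.rstrip("\n").split("\t")
    )
    attributes = parse_attributes(attrs)
    name = attributes.get("Name", attributes.get("ID", "."))
    bed_start = str(int(start) - 1)  # GFF3 is 1-based inclusive; BED start is 0-based
    # thickStart/thickEnd (columns 7 and 8) simply repeat start/end here.
    return [chrom, bed_start, end, name, score, strand, bed_start, end]


mrna = "PdomScaf0001\tmaker\tmRNA\t15\t1963\t.\t-\t.\tName=PdomMRNA00025.1;Parent=1;ID=2"
bed8 = gff3_to_bed8(mrna)
```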

0

But then, I am left with Columns 4, 9, 10, 11, 12. For Col 4, I don't know how to get the "name" from the Gff3 file. For Col 9, can I just put them all as "255,0,0", or else where can I get that from the Gff3 file? For Col 10, I am not sure what the "blockCount" is. At first, I thought there could only be 0 or 1 exons on the line, and so this column could either be "0" or "1". But that would not make sense for Col 11 and Col 12, because these require comma-separated lists with one element per block counted in blockCount (as if Col 10 has the potential to be >1).

0

I particularly do not see examples of these last columns anywhere online. It seems most .bed files are <12 columns, but I am using a software called methylKit, which requires all 12 columns.

0

Just in case, here is a head of my .gff3 file:

 PdomScaf0001 maker gene 15 1963 . - . Name=PdomGene00025;ID=1;Dbxref=MAKER:maker-PdomScaf0001-snap-gene-0.274
 PdomScaf0001 maker mRNA 15 1963 . - . Name=PdomMRNA00025.1;Parent=1;ID=2;_QI=216%7C0%7C0.2%7C0.6%7C0.5%7C0.6%7C5%7C0%7C98;_eAED=0.43;_AED=0.43;Dbxref=MAKER:maker-PdomScaf0001-snap-gene-0.274-mRNA-1
 PdomScaf0001 maker exon 15 100 -0.094 - . Parent=2;ID=3
 PdomScaf0001 maker CDS 15 100 . - 2 Parent=2;ID=4
 PdomScaf0001 maker exon 223 300 21.8 - . Parent=2;ID=5
 PdomScaf0001 maker CDS 223 300 . - 2 Parent=2;ID=6
 PdomScaf0001 maker exon 717 765 22.4 - . Parent=2;ID=7

0

And a head of my .bed file, created using gff2bed (same as you suggested):

 PdomScaf0001 14 100 3 -0.094 - maker exon . Parent=2;ID=3
 PdomScaf0001 14 100 4 . - maker CDS 2 Parent=2;ID=4
 PdomScaf0001 14 1963 1 . - maker gene . Name=PdomGene00025;ID=1;Dbxref=MAKER:maker-PdomScaf0001-snap-gene-0.274
 PdomScaf0001 14 1963 2 . - maker mRNA . Name=PdomMRNA00025.1;Parent=1;ID=2;_QI=216%7C0%7C0.2%7C0.6%7C0.5%7C0.6%7C5%7C0%7C98;_eAED=0.43;_AED=0.43;Dbxref=MAKER:maker-PdomScaf0001-snap-gene-0.274-mRNA-1
 PdomScaf0001 222 300 5 21.8 - maker exon . Parent=2;ID=5
 PdomScaf0001 222 300 6 . - maker CDS 2 Parent=2;ID=6
 PdomScaf0001 716 765 7 22.4 - maker exon . Parent=2;ID=7
 PdomScaf0001 716 765 8 . - maker CDS 0 Parent=2;ID=8
 PdomScaf0001 906 947 9 4.85 - maker exon . Parent=2;ID=9
 PdomScaf0001 906 947 10 . - maker CDS 2 Parent=2;ID=10

0

As a side note, the only other thing I noticed is that the score column (Column 6) of my .gff3 file does not always seem to be a number between 0 and 1000, which is the consensus of what I see online. In fact, 1844 of the 183,748 lines of the .gff3 file have negative values in the score column. I downloaded it from a Genome Browser at my school. Not sure if this might be a problem?
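For context, the 0 to 1000 range comes from the BED score column; GFF3 scores are free-form floating-point values, so negative numbers are not unusual there. If a downstream tool insists on BED-style scores, one possible convention (an assumption for this sketch, not something any of the tools mentioned here is known to do) is to round and clamp the value, treating a missing score as 0:

```python
def bed_score(gff_score):
    """Turn a GFF3 score string into an integer BED score in 0..1000."""
    if gff_score == ".":
        return 0  # no score given in the GFF3 file
    value = round(float(gff_score))
    return max(0, min(1000, value))  # clamp into the BED range


scores = [bed_score(s) for s in [".", "-0.094", "21.8", "4.85", "2500"]]
```

Whether clamping is appropriate depends on what the scores mean in the source annotation; setting every score to 0 is another common fallback.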

2
8.2 years ago

Hello,

There are no tools directly on the public Galaxy site to transform a GFF3 dataset into a BED12 dataset. However, the Tool Shed has a repository called 'fml_gff3togtf' that includes a tool for this purpose, for use in a local install. The description is a bit bothersome in that it includes a slightly incorrect datatype description, so be sure to test out the results. (the word "wiggle" has no place in this statement: "gff3_to_bed_converter.py: This tool converts gene transcript annotation from GFF3 format to UCSC wiggle 12 column BED format.") http://getgalaxy.org http://usegalaxy.org/toolshed

It might be helpful to let you know what a BED12 file represents:

A BED12 file describes the complete, often spliced, alignment of a sequence to a reference genome. This does not include minor base variation; it is a macro alignment. You can think of each of the blocks as being "exons", although there is no magic here - if the sequence or genome had quality problems, or significant variation (large insertion or deletion), that could cause the alignment to fragment as well. Here is the data description: http://wiki.galaxyproject.org/Learn/Datatypes#Bed

To see examples at UCSC (genome.ucsc.edu), look at any EST or mRNA track, which will have this as the primary table format. All gene tracks can also be in BED12 format, or in a related one, genePred: http://genome.ucsc.edu/FAQ/FAQformat.html#format9

Either way, you can convert the data, then load it into the public Galaxy (usegalaxy.org) and proceed with your analysis. BEDTools works well with BED12 files. There is definitely information loss when attempting to transform BED6 -> BED12, as the global alignment is lost. And adjusting attributes such as score or name is often a matter of preference, so you can alter these however you want, as long as the attribute formatting rules for the columns are followed.
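To make the block columns concrete, here is a small Python sketch that builds one BED12 line from a transcript's exons. It assumes the exons are given as 1-based inclusive (start, end) pairs, as in GFF3, and do not overlap; the function name, the score of 0 and the colour of "0" are placeholders. Only the first three exons of the transcript shown earlier in the thread are used, so the end coordinate is 765 rather than that of the full mRNA:

```python
def bed12_from_exons(chrom, name, strand, exons, rgb="0"):
    """Build one BED12 line from 1-based inclusive exon (start, end) pairs."""
    exons = sorted(exons)
    chrom_start = exons[0][0] - 1                     # convert to 0-based
    chrom_end = max(end for _, end in exons)
    block_sizes = [end - start + 1 for start, end in exons]
    block_starts = [start - 1 - chrom_start for start, _ in exons]
    fields = [
        chrom, chrom_start, chrom_end, name, 0, strand,
        chrom_start, chrom_end,                       # thickStart, thickEnd
        rgb,                                          # itemRgb
        len(exons),                                   # blockCount
        ",".join(map(str, block_sizes)),              # blockSizes
        ",".join(map(str, block_starts)),             # blockStarts, relative to chromStart
    ]
    return "\t".join(map(str, fields))


line = bed12_from_exons(
    "PdomScaf0001", "PdomMRNA00025.1", "-",
    [(15, 100), (223, 300), (717, 765)],
)
```

Here blockCount is 3, blockSizes is 86,78,49 and blockStarts is 0,208,702: one entry per exon, with the last start plus the last size equal to chromEnd minus chromStart.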

Hopefully this helps,

Jen, Galaxy team

2
20 months ago
Juke34 ★ 6.5k

Here you can find a list of tools for this conversion (AGAT, BEDOPS, PASA, Kent utils) and examples of the results they provide.
I would definitely recommend the AGAT script agat_convert_sp_gff2bed.pl ^^

0

Does the AGAT script convert gff3 into a 12-column bed file? I need a 12-column bed file that can be used as input for methylKit.

0

Yes, click the first link I provided and you will see how it looks. I cannot promise it will be exactly how you would like it to be (e.g. the RGB value in column 9).

0

Thank you Juke-34, PASA worked in our case.

0

Thanks a lot!

1
4.5 years ago
t_pod ▴ 30

Hi, did you find a solution to this problem?

I am currently using methylKit as well and have run into a similar issue. As the annotation file was not available on UCSC, I downloaded the gff3 file from NCBI and converted it to a BED12 file via Galaxy.

However, I am finding many discrepancies in my converted BED12 file when I compare it to a "correct" BED12 file from UCSC:

the 5th and 9th columns are always "0" (that might not be an issue),

in total, 50% of the lines have only the first 6 columns filled, and they are just named "CDS" in the 4th column without further specification (should I discard all of them?),

and sometimes column 11 contains n+1 items, where the additional item is "000" and n = block count (= column 10). I am expecting columns 11 and 12 to be equal in length.
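One way to find the affected lines is a small consistency check, sketched here in Python. It reports lines with fewer than 12 columns, block lists whose length differs from blockCount, and violations of the two layout rules in the UCSC BED description (the first blockStart is 0, and the last blockStart plus the last blockSize equals chromEnd minus chromStart). The function name is hypothetical, and a trailing comma in the lists is tolerated:

```python
def check_bed12_line(line):
    """Return a list of problems found in one BED line (empty if none)."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 12:
        return [f"only {len(fields)} columns"]
    start, end, block_count = int(fields[1]), int(fields[2]), int(fields[9])
    # Tolerate the trailing comma that some writers add to the lists.
    sizes = [int(x) for x in fields[10].rstrip(",").split(",")]
    starts = [int(x) for x in fields[11].rstrip(",").split(",")]
    problems = []
    if len(sizes) != block_count:
        problems.append(f"blockSizes has {len(sizes)} items, blockCount is {block_count}")
    if len(starts) != block_count:
        problems.append(f"blockStarts has {len(starts)} items, blockCount is {block_count}")
    if starts[0] != 0:
        problems.append("first blockStart is not 0")
    if len(sizes) == len(starts) and starts[-1] + sizes[-1] != end - start:
        problems.append("last block does not end at chromEnd")
    return problems


good = "chr1\t14\t765\ttx\t0\t-\t14\t765\t0\t3\t86,78,49\t0,208,702"
short = "chr1\t14\t100\tCDS\t0\t-"
extra = "chr1\t14\t765\ttx\t0\t-\t14\t765\t0\t3\t86,78,49,000\t0,208,702"
results = [check_bed12_line(x) for x in (good, short, extra)]
```

Running this over the converted file and over the reference file from UCSC would show which of the discrepancies are real format violations and which are only cosmetic.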

Any hints?