Convert .Gff3 File To 12-Column .Bed File

11.7 years ago

LRStar ▴ 200

Hello:

I would like to convert .gff3 file to 12-column .bed file, as in this link under "BED Format" (http://genome.ucsc.edu/FAQ/FAQformat.html#format1).

I have thus far used Galaxy from Penn State, but it outputs a 6-column .bed file.

Any advice is greatly appreciated!

Thank you

methylation next-gen • 24k views

ADD COMMENT • link updated 2.4 years ago by Ram 45k • written 11.7 years ago by LRStar ▴ 200

Hello @t_pod,

Did you get any solution to your problem?

We are also using gff3ToGenePred to convert gff3 to a genePred file, but unfortunately we are getting output for only 2 chromosomes instead of 15.

Any help is appreciated.

ADD REPLY • link 5.2 years ago by Tm ★ 1.1k

5.8 years ago

danielschmelter ▴ 200

The UCSC Genome Browser hosts conversion utilities that you can run from your command line to accomplish the gff3 to BED12 conversion. Note that the utilities are OS-specific and must be given permission to execute with "chmod +x utilityName".

Here's an example of how I did a conversion using the following steps:

wget ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_31/gencode.v31.basic.annotation.gff3.gz
gunzip gencode.v31.basic.annotation.gff3.gz
./gff3ToGenePred gencode.v31.basic.annotation.gff3 gencode.v31.basic.genePred
./genePredToBed gencode.v31.basic.genePred gencode.v31.basic.bed

Disclaimer that I work for the UCSC Genome Browser. :)

ADD COMMENT • link 5.8 years ago by danielschmelter ▴ 200

This is the answer that saved the day for me. Thank you!

ADD REPLY • link 5.8 years ago by crcarroll ▴ 90

I spent days trying to figure out how to do this, then I followed your advice and it took me 5 minutes! I can't thank you enough!

ADD REPLY • link 5.1 years ago by Clt48 • 0

Thank you so much!! Lifesaver

ADD REPLY • link 4.4 years ago by kskuchin • 0

5.2 years ago

Juke34 9.2k

Here you can find a list of tools for this conversion (AGAT, BEDOPS, PASA, Kent utils) and examples of the results they provide.
I would definitely recommend AGAT's script agat_convert_sp_gff2bed.pl ^^

ADD COMMENT • link 5.2 years ago by Juke34 9.2k

Does the AGAT script convert gff3 into a 12-column bed? I want the conversion to produce a 12-column bed that can be used as input for methylKit.

ADD REPLY • link 5.2 years ago by Tm ★ 1.1k

Yes, click the first link I provided and you will see how it looks. I cannot promise it will be exactly how you would like it to be (e.g., the RGB value in column 9).

ADD REPLY • link 5.2 years ago by Juke34 9.2k

Thank you Juke-34, PASA worked in our case.

ADD REPLY • link 5.2 years ago by Tm ★ 1.1k

Thanks a lot!

ADD REPLY • link 4.7 years ago by ohhhhhhhhhhhj • 0

11.7 years ago

Alex Reynolds 36k

You might take a look at the BEDOPS gff2bed conversion script, to see if it gets you closer. (It doesn't make BED12, but is there enough information in a GFF3 file to get there? You'd need to add color metadata on your own, for instance.)

In any case, this script tries not to throw anything away, except for headers that other BEDOPS tools do not process. If you want to preserve them as BED elements, add the --keep-header option.

You can also apply cut or awk to the output of gff2bed in order to filter, rearrange or add non-GFF/color columns, if needed.
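As a rough illustration of that post-processing step, here is a small Python sketch that does what one might otherwise do with cut or awk. It assumes the 10-column layout seen in the gff2bed output posted elsewhere in this thread (chrom, start, end, ID, score, strand, source, type, phase, attributes). The function name gff2bed_to_bed9 and the constant color string are hypothetical choices for the example, not part of BEDOPS.

```python
def gff2bed_to_bed9(lines, feature_type="exon", item_rgb="255,0,0"):
    """Keep rows of one feature type and emit 9-column BED rows.

    Assumes tab-separated gff2bed-style input with the columns:
    chrom, start, end, id, score, strand, source, type, phase, attributes.
    """
    out = []
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines and header/comment lines
        fields = line.split("\t")
        if len(fields) < 10 or fields[7] != feature_type:
            continue  # not the feature type we want
        chrom, start, end, name, score, strand = fields[:6]
        # thickStart/thickEnd repeat start/end; itemRgb is a fixed color.
        out.append("\t".join([chrom, start, end, name, score, strand,
                              start, end, item_rgb]))
    return out


# Two rows copied from the gff2bed output in this thread (tab-separated).
rows = [
    "PdomScaf0001\t14\t100\t3\t-0.094\t-\tmaker\texon\t.\tParent=2;ID=3",
    "PdomScaf0001\t14\t100\t4\t.\t-\tmaker\tCDS\t2\tParent=2;ID=4",
]
for row in gff2bed_to_bed9(rows):
    print(row)
```

Only the exon row survives the filter here. The score is passed through unchanged, so it may still need to be adjusted if a downstream tool expects the 0 to 1000 range.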

ADD COMMENT • link 11.4 years ago by Alex Reynolds 36k

Thanks for the comment, Alex. I did try to do that, but I am still uncertain how to create certain columns of the .bed file. I think I can create Columns 1, 2, 3, 5, 6, 7, 8: Bed(Col1) = Gff3(Col1). Bed(Col2) = Gff3(Col4). Bed(Col3) = Gff3(Col5). Bed(Col5) = Gff3(Col6). Bed(Col6) = Gff3(Col7). Bed(Col7) = Bed(Col2) [at least in online examples, it seemed repeated]. Bed(Col8) = Bed(Col3) [at least in online examples, it seemed repeated].

ADD REPLY • link 11.7 years ago by LRStar ▴ 200

But then, I am left with Columns 4, 9, 10, 11, 12. For Col 4, I don't know how to get the "name" from the Gff3 file. For Col 9, can I just set them all to "255,0,0", or else where can I get that from the Gff3 file? For Col 10, I am not sure what the "blockCount" is. At first, I thought there could only be 0 or 1 exons on the line, and so this column could either be "0" or "1". But that would not make sense for Col 11 and Col 12, because these require comma-separated lists with one element per block in the blockCount (as if Col 10 has the potential to be >1).
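One way to read columns 10 to 12 is that a single BED12 line describes a whole transcript with one block per exon, so blockCount can be greater than 1. Below is a minimal Python sketch of that idea under some loud assumptions: each exon has exactly one Parent, the name comes from the mRNA's Name attribute, thickStart/thickEnd simply repeat the transcript bounds (CDS records are ignored), the score is set to 0, and the color is a constant. The function names and the sample records are made up for illustration; this is not how Galaxy, BEDOPS or the UCSC utilities do it. Note that BED starts are 0-based, so the GFF3 start is decremented by one.

```python
from collections import defaultdict


def parse_attributes(column9):
    """Split 'key=value;key=value' into a dict."""
    return dict(item.split("=", 1) for item in column9.split(";") if "=" in item)


def gff3_to_bed12(lines, item_rgb="255,0,0"):
    """Build one BED12 row per mRNA from its child exon records."""
    names = {}                 # mRNA ID -> (chrom, name, strand)
    exons = defaultdict(list)  # parent mRNA ID -> [(start0, end), ...]
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        chrom, ftype, strand = fields[0], fields[2], fields[6]
        attrs = parse_attributes(fields[8])
        if ftype == "mRNA":
            names[attrs["ID"]] = (chrom, attrs.get("Name", attrs["ID"]), strand)
        elif ftype == "exon":
            # GFF3 is 1-based inclusive; BED is 0-based half-open.
            exons[attrs["Parent"]].append((int(fields[3]) - 1, int(fields[4])))
    rows = []
    for tid, (chrom, name, strand) in names.items():
        blocks = sorted(exons.get(tid, []))
        if not blocks:
            continue  # no exons found for this transcript
        tx_start, tx_end = blocks[0][0], blocks[-1][1]
        sizes = ",".join(str(e - s) for s, e in blocks)        # column 11
        starts = ",".join(str(s - tx_start) for s, _ in blocks)  # column 12
        rows.append("\t".join(map(str, [
            chrom, tx_start, tx_end, name, 0, strand,
            tx_start, tx_end, item_rgb, len(blocks), sizes, starts])))
    return rows


# Hypothetical two-exon transcript, loosely modelled on MAKER output.
gff = [
    "Scaf1\tmaker\tmRNA\t15\t300\t.\t-\t.\tName=tx1;Parent=1;ID=2",
    "Scaf1\tmaker\texon\t15\t100\t.\t-\t.\tParent=2;ID=3",
    "Scaf1\tmaker\texon\t223\t300\t.\t-\t.\tParent=2;ID=5",
]
print(gff3_to_bed12(gff)[0])
```

For the sample above, column 10 is 2, column 11 is "86,78" (the exon lengths) and column 12 is "0,208" (the exon starts relative to the transcript start).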

ADD REPLY • link 11.7 years ago by LRStar ▴ 200

I particularly do not see examples of these last columns anywhere online. It seems most .bed files are <12 columns, but I am using a software called methylKit, which requires all 12 columns.

ADD REPLY • link 11.7 years ago by LRStar ▴ 200

Just in case, here is a head of my .gff3 file:

PdomScaf0001    maker    gene    15    1963    .    -    .    Name=PdomGene00025;ID=1;Dbxref=MAKER:maker-PdomScaf0001-snap-gene-0.274
PdomScaf0001    maker    mRNA    15    1963    .    -    .    Name=PdomMRNA00025.1;Parent=1;ID=2;_QI=216%7C0%7C0.2%7C0.6%7C0.5%7C0.6%7C5%7C0%7C98;_eAED=0.43;_AED=0.43;Dbxref=MAKER:maker-PdomScaf0001-snap-gene-0.274-mRNA-1
PdomScaf0001    maker    exon    15    100    -0.094    -    .    Parent=2;ID=3
PdomScaf0001    maker    CDS    15    100    .    -    2    Parent=2;ID=4
PdomScaf0001    maker    exon    223    300    21.8    -    .    Parent=2;ID=5
PdomScaf0001    maker    CDS    223    300    .    -    2    Parent=2;ID=6
PdomScaf0001    maker    exon    717    765    22.4    -    .    Parent=2;ID=7

ADD REPLY • link 11.7 years ago by LRStar ▴ 200

And a head of my .bed file, created using gff2bed (same as you suggested):

PdomScaf0001    14    100    3    -0.094    -    maker    exon    .    Parent=2;ID=3
PdomScaf0001    14    100    4    .    -    maker    CDS    2    Parent=2;ID=4
PdomScaf0001    14    1963    1    .    -    maker    gene    .    Name=PdomGene00025;ID=1;Dbxref=MAKER:maker-PdomScaf0001-snap-gene-0.274
PdomScaf0001    14    1963    2    .    -    maker    mRNA    .    Name=PdomMRNA00025.1;Parent=1;ID=2;_QI=216%7C0%7C0.2%7C0.6%7C0.5%7C0.6%7C5%7C0%7C98;_eAED=0.43;_AED=0.43;Dbxref=MAKER:maker-PdomScaf0001-snap-gene-0.274-mRNA-1
PdomScaf0001    222    300    5    21.8    -    maker    exon    .    Parent=2;ID=5
PdomScaf0001    222    300    6    .    -    maker    CDS    2    Parent=2;ID=6
PdomScaf0001    716    765    7    22.4    -    maker    exon    .    Parent=2;ID=7
PdomScaf0001    716    765    8    .    -    maker    CDS    0    Parent=2;ID=8
PdomScaf0001    906    947    9    4.85    -    maker    exon    .    Parent=2;ID=9
PdomScaf0001    906    947    10    .    -    maker    CDS    2    Parent=2;ID=10

ADD REPLY • link 11.7 years ago by LRStar ▴ 200

As a side note, the only other thing I noticed is that the score column (Column 6) of my .gff3 file does not always seem to be a number between 0 and 1000, which is the consensus of what I see online. In fact, 1,844 of 183,748 lines of the .gff3 file have negative values in the score column. I downloaded it from a Genome Browser at my school. I am not sure whether this might be a problem.
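If a downstream tool does insist on the 0 to 1000 range that the UCSC BED description gives for the score column, one possible workaround is to round and clamp the GFF3 score when writing the BED file. The sketch below is only that: the function name bed_score is hypothetical, and treating "." as 0 and clamping out-of-range values are assumptions, not something the GFF3 or BED formats prescribe. Whether methylKit actually looks at this column is not something this thread establishes.

```python
def bed_score(gff_score):
    """Map a GFF3 score string to an integer BED score in [0, 1000]."""
    if gff_score == ".":
        return 0  # GFF3 uses "." when no score is given (assumed -> 0)
    value = round(float(gff_score))
    # Clamp anything negative to 0 and anything above 1000 to 1000.
    return max(0, min(1000, value))


for raw in ["-0.094", "21.8", ".", "1500"]:
    print(raw, "->", bed_score(raw))
```

Clamping throws away the original values, so it is worth keeping the unmodified GFF3 file around if the scores matter for anything else.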

ADD REPLY • link 11.7 years ago by LRStar ▴ 200

11.7 years ago

Jennifer Hillman Jackson ▴ 400

Hello,

There are no tools directly on the public Galaxy site to transform a GFF3 dataset into a BED12 dataset. However, the Tool Shed has a repository called 'fml_gff3togtf' that includes a tool for this purpose, for use in a local install. The description is a bit bothersome in that it includes a slightly incorrect datatype description, so be sure to test out the results. (the word "wiggle" has no place in this statement: "gff3_to_bed_converter.py: This tool converts gene transcript annotation from GFF3 format to UCSC wiggle 12 column BED format.") http://getgalaxy.org http://usegalaxy.org/toolshed

It might be helpful to let you know what a BED12 file represents:

A BED12 file describes the complete, often spliced, alignment of a sequence to a reference genome. This does not include minor base variation; it is a macro alignment. You can think of each of the blocks as being "exons", although there is no magic here: if the sequence or genome had quality problems, or significant variation (a large insertion or deletion), that could cause the alignment to fragment as well. Here is the data description: http://wiki.galaxyproject.org/Learn/Datatypes#Bed
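To make the block idea concrete, here is a short Python sketch that turns the last three columns of a BED12 line back into absolute block coordinates. The function name bed12_blocks and the sample line are made up for illustration; the only thing assumed about the format is the standard BED12 column order (blockCount, blockSizes, blockStarts in columns 10 to 12, with blockStarts relative to chromStart).

```python
def bed12_blocks(line):
    """Return absolute (start, end) pairs for each block of a BED12 row."""
    fields = line.rstrip("\n").split("\t")
    chrom_start = int(fields[1])
    block_count = int(fields[9])
    # These lists are often written with a trailing comma, so drop empties.
    sizes = [int(x) for x in fields[10].split(",") if x]
    starts = [int(x) for x in fields[11].split(",") if x]
    if not (len(sizes) == len(starts) == block_count):
        raise ValueError("blockCount does not match blockSizes/blockStarts")
    # blockStarts are offsets from chromStart, not genome coordinates.
    return [(chrom_start + s, chrom_start + s + n)
            for s, n in zip(starts, sizes)]


# Hypothetical three-block record on the plus strand.
row = "chr1\t1000\t5000\ttx1\t0\t+\t1200\t4800\t0\t3\t300,500,400\t0,2000,3600"
print(bed12_blocks(row))
# [(1000, 1300), (3000, 3500), (4600, 5000)]
```

The first block starts at chromStart and the last block ends at chromEnd, which is what lets a single line stand for the whole spliced alignment.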

To see examples at UCSC (genome.ucsc.edu), any EST or mRNA track will have this as the primary table format. All gene tracks can also be in BED12 format, or in a related one, genePred: http://genome.ucsc.edu/FAQ/FAQformat.html#format9

UCSC also has command-line utilities to convert between the formats. Pre-compiled versions are here: http://hgdownload.cse.ucsc.edu/downloads.html#source_downloads

Either way, you can convert the data, then load it into the public Galaxy (usegalaxy.org) and proceed with your analysis. BEDTools works well with BED12 files. There is definitely information loss when attempting to transform BED6 -> BED12, as the global alignment is lost. Adjusting attributes such as score or name is often a matter of preference, so you can alter these however you want, as long as the attribute formatting rules for the columns are followed.

Hopefully this helps,

Jen, Galaxy team

ADD COMMENT • link 11.7 years ago by Jennifer Hillman Jackson ▴ 400

8.0 years ago

t_pod ▴ 30

Hi, did you find a solution to this problem?

I am also currently using methylKit and I've got a similar issue to yours. As the annotation file was not available on UCSC, I downloaded the gff3 file from NCBI and converted it to a BED12 file via Galaxy.

However, I am finding many discrepancies in my converted BED12 file when I compare it to a "correct" BED12 file from UCSC:

the 5th and 9th columns are always "0" (that might not be an issue),

in total, 50% of lines have only the first 6 columns filled and are just named "CDS" in the 4th column without further specification (should I discard all of them?),

and sometimes column 11 displays n+1 items, where the additional item is "000" and n = block count (= column 10). I am expecting columns 11 and 12 to have the same number of items.

Any hints?
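One way to get a handle on discrepancies like these is to check each line against the basic BED12 consistency rules before passing the file on. The Python sketch below is a hedged example of such a check, not an official validator: the function name check_bed12 and the sample lines are hypothetical, and it only tests the column count, whether columns 11 and 12 each have blockCount items, whether the first block starts at 0, and whether the last block ends at chromEnd.

```python
def check_bed12(line):
    """Return a list of problems found in one BED12 row (empty if none)."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 12:
        return [f"expected 12 columns, found {len(fields)}"]
    problems = []
    start, end, count = int(fields[1]), int(fields[2]), int(fields[9])
    # Ignore the empty item left by a trailing comma.
    sizes = [int(x) for x in fields[10].split(",") if x]
    starts = [int(x) for x in fields[11].split(",") if x]
    if len(sizes) != count:
        problems.append(f"blockSizes has {len(sizes)} items, blockCount is {count}")
    if len(starts) != count:
        problems.append(f"blockStarts has {len(starts)} items, blockCount is {count}")
    if starts and starts[0] != 0:
        problems.append("first blockStart is not 0")
    if sizes and len(sizes) == len(starts):
        if start + starts[-1] + sizes[-1] != end:
            problems.append("last block does not end at chromEnd")
    return problems


# Hypothetical rows: one consistent, one with an extra "000" in column 11.
good = "chr1\t1000\t5000\ttx1\t0\t+\t1200\t4800\t0\t3\t300,500,400\t0,2000,3600"
bad = "chr1\t1000\t5000\ttx1\t0\t+\t1200\t4800\t0\t3\t300,500,400,000\t0,2000,3600"
print(check_bed12(good))
print(check_bed12(bad))
```

Lines that come back with "expected 12 columns, found 6" would correspond to the bare "CDS" rows described above, so the same function can be used to count or filter them before deciding what to do with them.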

ADD COMMENT • link 8.0 years ago by t_pod ▴ 30
