Question: Convert .Gff3 File To 12-Column .Bed File
5.5 years ago by
LRStar10
United States
LRStar10 wrote:

Hello:

I would like to convert .gff3 file to 12-column .bed file, as in this link under "BED Format" (http://genome.ucsc.edu/FAQ/FAQformat.html#format1).

I have thus far used Galaxy from Penn State, but it outputs a 6-column .bed file.

Any advice is greatly appreciated! Thank you...

ADD COMMENTlink modified 22 months ago by t_pod10 • written 5.5 years ago by LRStar10
5.5 years ago by
Alex Reynolds28k
Seattle, WA USA
Alex Reynolds28k wrote:

You might take a look at the BEDOPS gff2bed conversion script, to see if it gets you closer. (It doesn't make BED12, but is there enough information in a GFF3 file to get there? You'd need to add color metadata on your own, for instance.)

In any case, this script tries not to throw anything away, except for headers that other BEDOPS tools do not process. If you want to preserve them as BED elements, add the --keep-header option.

You can also apply cut or awk to the output of gff2bed in order to filter, rearrange or add non-GFF/color columns, if needed.

ADD COMMENTlink modified 5.3 years ago • written 5.5 years ago by Alex Reynolds28k
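
As a rough illustration of the post-processing suggested above, here is a minimal Python sketch that pulls the Name attribute out of a gff2bed output line and puts it in column 4. It assumes the 10-column, tab-separated layout seen in the gff2bed output posted in this thread (attributes in the last column); the helper names parse_attributes and name_from_gff2bed_line are made up for this example and are not part of BEDOPS.

```python
from urllib.parse import unquote


def parse_attributes(field):
    """Split a GFF3 attribute string ("key=value;key=value") into a dict."""
    attrs = {}
    for item in field.strip().split(";"):
        if "=" in item:
            key, value = item.split("=", 1)
            attrs[key] = unquote(value)  # GFF3 percent-encodes reserved characters
    return attrs


def name_from_gff2bed_line(line):
    """Return the line with column 4 replaced by the Name attribute, if present.

    Assumption: 10 tab-separated columns with the GFF3 attributes in column 10.
    """
    cols = line.rstrip("\n").split("\t")
    attrs = parse_attributes(cols[9])
    cols[3] = attrs.get("Name", cols[3])  # keep the existing ID when there is no Name
    return "\t".join(cols)


# Hypothetical input modelled on the mRNA line in this thread (attributes shortened).
example = "\t".join([
    "PdomScaf0001", "14", "1963", "2", ".", "-", "maker", "mRNA", ".",
    "Name=PdomMRNA00025.1;Parent=1;ID=2",
])
print(name_from_gff2bed_line(example))
```

Lines without a Name attribute (the exon and CDS lines here) keep their numeric ID in column 4.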

Thanks for the comment, Alex. I did try that, but I am still uncertain how to create certain columns of the .bed file. I think I can create Columns 1, 2, 3, 5, 6, 7, 8: Bed(Col1) = Gff3(Col1). Bed(Col2) = Gff3(Col4) - 1 [BED starts are 0-based]. Bed(Col3) = Gff3(Col5). Bed(Col5) = Gff3(Col6). Bed(Col6) = Gff3(Col7). Bed(Col7) = Bed(Col2) [at least in online examples, it seemed repeated]. Bed(Col8) = Bed(Col3) [at least in online examples, it seemed repeated].

ADD REPLYlink written 5.5 years ago by LRStar10

But then, I am left with Columns 4, 9, 10, 11, 12. For Col 4, I don't know how to get the "name" from the Gff3 file. For Col 9, can I just set them all to "255,0,0", or is there somewhere in the Gff3 file I can get that from? For Col 10, I am not sure what the "blockCount" is. At first, I thought there could only be 0 or 1 exons on a line, so this column could only be "0" or "1". But that would not make sense for Col 11 and Col 12, because these require comma-separated lists with one element per block (as if Col 10 has the potential to be >1).

ADD REPLYlink written 5.5 years ago by LRStar10
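
For what it's worth, the block columns can be derived from the exon lines of one transcript. The sketch below is only an illustration under stated assumptions: exons are given as 1-based inclusive (start, end) pairs as in GFF3, they do not overlap, thickStart/thickEnd are simply set to the transcript span, and the color is a placeholder. The function name bed12_from_exons is hypothetical. It uses only the first three exons shown in the GFF3 head posted in this thread, so the end coordinate here is 765 rather than the full transcript end.

```python
def bed12_from_exons(chrom, name, strand, exons, score=0, rgb="0,0,0"):
    """Build one BED12 line from 1-based inclusive GFF3 exon (start, end) pairs."""
    exons = sorted(exons)
    chrom_start = exons[0][0] - 1   # BED starts are 0-based
    chrom_end = exons[-1][1]        # BED ends are exclusive, so the GFF3 end is unchanged
    block_sizes = [end - start + 1 for start, end in exons]
    # blockStarts are offsets relative to chromStart, not genomic coordinates
    block_starts = [start - 1 - chrom_start for start, _ in exons]
    fields = [
        chrom, chrom_start, chrom_end, name, score, strand,
        chrom_start, chrom_end,     # thickStart/thickEnd: assumed equal to the span
        rgb,
        len(exons),                 # blockCount
        ",".join(map(str, block_sizes)),
        ",".join(map(str, block_starts)),
    ]
    return "\t".join(map(str, fields))


line = bed12_from_exons(
    "PdomScaf0001", "PdomMRNA00025.1", "-",
    [(15, 100), (223, 300), (717, 765)],
)
print(line)
```

So blockCount is the number of exons in the transcript, column 11 lists their lengths, and column 12 lists their offsets from column 2.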

In particular, I do not see examples of these last columns anywhere online. It seems most .bed files have fewer than 12 columns, but I am using a package called methylKit, which requires all 12 columns.

ADD REPLYlink written 5.5 years ago by LRStar10

Just in case, here is a head of my .gff3 file:

PdomScaf0001 maker gene 15 1963 . - . Name=PdomGene00025;ID=1;Dbxref=MAKER:maker-PdomScaf0001-snap-gene-0.274

PdomScaf0001 maker mRNA 15 1963 . - . Name=PdomMRNA00025.1;Parent=1;ID=2;_QI=216%7C0%7C0.2%7C0.6%7C0.5%7C0.6%7C5%7C0%7C98;_eAED=0.43;_AED=0.43;Dbxref=MAKER:maker-PdomScaf0001-snap-gene-0.274-mRNA-1

PdomScaf0001 maker exon 15 100 -0.094 - . Parent=2;ID=3

PdomScaf0001 maker CDS 15 100 . - 2 Parent=2;ID=4

PdomScaf0001 maker exon 223 300 21.8 - . Parent=2;ID=5

PdomScaf0001 maker CDS 223 300 . - 2 Parent=2;ID=6

PdomScaf0001 maker exon 717 765 22.4 - . Parent=2;ID=7

ADD REPLYlink modified 5.5 years ago • written 5.5 years ago by LRStar10

And a head of my .bed file, created using gff2bed (same as you suggested):

PdomScaf0001 14 100 3 -0.094 - maker exon . Parent=2;ID=3

PdomScaf0001 14 100 4 . - maker CDS 2 Parent=2;ID=4

PdomScaf0001 14 1963 1 . - maker gene . Name=PdomGene00025;ID=1;Dbxref=MAKER:maker-PdomScaf0001-snap-gene-0.274

PdomScaf0001 14 1963 2 . - maker mRNA . Name=PdomMRNA00025.1;Parent=1;ID=2;_QI=216%7C0%7C0.2%7C0.6%7C0.5%7C0.6%7C5%7C0%7C98;_eAED=0.43;_AED=0.43;Dbxref=MAKER:maker-PdomScaf0001-snap-gene-0.274-mRNA-1

PdomScaf0001 222 300 5 21.8 - maker exon . Parent=2;ID=5

PdomScaf0001 222 300 6 . - maker CDS 2 Parent=2;ID=6

PdomScaf0001 716 765 7 22.4 - maker exon . Parent=2;ID=7

PdomScaf0001 716 765 8 . - maker CDS 0 Parent=2;ID=8

PdomScaf0001 906 947 9 4.85 - maker exon . Parent=2;ID=9

PdomScaf0001 906 947 10 . - maker CDS 2 Parent=2;ID=10

ADD REPLYlink modified 5.5 years ago • written 5.5 years ago by LRStar10

As a side note, the only other thing I noticed is that the score column (Column 6) of my .gff3 file does not always contain a number between 0 and 1000, which is the range I see described online for .bed scores. In fact, 1844 of the 183,748 lines of the .gff3 file have negative values in the score column. I downloaded it from a Genome Browser at my school. Not sure if this might be a problem?

ADD REPLYlink written 5.5 years ago by LRStar10
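
One possible way to deal with out-of-range scores is to round and clamp them into 0..1000 when writing the BED file. This is only a sketch of that one option: clamping discards information, whether downstream tools care about the score at all is not established in this thread, and the function name bed_score is made up for the example.

```python
def bed_score(gff_score):
    """Map a GFF3 score field to an integer in the BED range 0..1000.

    Assumption: a missing score (".") becomes 0, and values outside the
    range are clamped rather than rescaled.
    """
    if gff_score == ".":
        return 0
    return max(0, min(1000, round(float(gff_score))))


for raw in ["-0.094", "21.8", ".", "1500"]:
    print(raw, bed_score(raw))
```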
5.5 years ago by
Bay Area, CA
Jennifer Hillman Jackson370 wrote:

Hello,

There are no tools directly on the public Galaxy site to transform a GFF3 dataset into a BED12 dataset. However, the Tool Shed has a repository called 'fml_gff3togtf' that includes a tool for this purpose, for use in a local install. The tool's description includes a slightly incorrect datatype name, so be sure to test the results (the word "wiggle" has no place in this statement: "gff3_to_bed_converter.py: This tool converts gene transcript annotation from GFF3 format to UCSC wiggle 12 column BED format."). See http://getgalaxy.org and http://usegalaxy.org/toolshed

It might be helpful to let you know what a BED12 file represents:

A BED12 file describes the complete, often spliced, alignment of a sequence to a reference genome. This does not include minor base variation; it is a macro alignment. You can think of each of the blocks as being "exons", although there is no magic here: if the sequence or genome had quality problems, or significant variation (a large insertion or deletion), that could cause the alignment to fragment as well. Here is the data description: http://wiki.galaxyproject.org/Learn/Datatypes#Bed
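
To make the "blocks as exons" idea concrete, here is a small Python sketch that expands one BED12 line back into its individual block intervals. The function name bed12_blocks is hypothetical, and the example line is not from a real dataset: it was put together from the first three exons in the GFF3 head posted earlier in this thread.

```python
def bed12_blocks(line):
    """Return (chrom, start, end) for every block of a tab-separated BED12 line."""
    cols = line.rstrip("\n").split("\t")
    chrom, chrom_start = cols[0], int(cols[1])
    # "if x" skips the empty item left by an optional trailing comma
    sizes = [int(x) for x in cols[10].split(",") if x]
    starts = [int(x) for x in cols[11].split(",") if x]
    # blockStarts are relative to chromStart, so add it back
    return [(chrom, chrom_start + s, chrom_start + s + size)
            for s, size in zip(starts, sizes)]


example = "\t".join([
    "PdomScaf0001", "14", "765", "PdomMRNA00025.1", "0", "-",
    "14", "765", "0,0,0", "3", "86,78,49", "0,208,702",
])
for block in bed12_blocks(example):
    print(block)
```

Each printed interval is 0-based and half-open, like any other BED interval.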

To see examples, at UCSC (genome.ucsc.edu), any EST or mRNA track will have this as the primary table format. All gene tracks can also be in BED12 format, or in a related one, genePred: http://genome.ucsc.edu/FAQ/FAQformat.html#format9

UCSC also has command-line utilities to convert between the formats. Pre-compiled versions are here: http://hgdownload.cse.ucsc.edu/downloads.html#source_downloads

Either way, you can convert the data, then load it into the public Galaxy (usegalaxy.org) and proceed with your analysis. BEDTools works well with BED12 files. There is definitely information loss when attempting to transform BED6 -> BED12, as the global alignment is lost. And adjusting attributes such as score or name is often a matter of preference, so you can alter these however you want, as long as the attribute formatting rules for the columns are followed.

Hopefully this helps,

Jen, Galaxy team

ADD COMMENTlink written 5.5 years ago by Jennifer Hillman Jackson370
22 months ago by
t_pod10
t_pod10 wrote:

Hi, did you find a solution to this?

I am also currently using methylKit and have run into a similar issue. As the annotation file was not available on UCSC, I downloaded the gff3 file from NCBI and converted it to a BED12 file via Galaxy.

However, I am finding many discrepancies in my converted BED12 file when I compare it to a "correct" BED12 file from UCSC:

the 5th and 9th columns are always "0" (that might not be an issue),

in total, 50% of the lines have only the first 6 columns filled, and they are just named "CDS" in the 4th column without further specification (should I discard all of them?),

and sometimes column 11 displays n+1 items, where the additional item is "000" and n = block count (= column 10). I am expecting columns 11 and 12 to have the same number of items.

Any hints?

ADD COMMENTlink written 22 months ago by t_pod10
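
A quick consistency check along these lines can be scripted. The sketch below is not a full BED validator; it only flags the symptoms described above (fewer than 12 columns, or block lists whose length disagrees with blockCount), plus two basic layout rules from the UCSC format description (the first block starts at offset 0 and the last block ends at chromEnd). The function name check_bed12 is hypothetical.

```python
def check_bed12(line):
    """Return a list of problems found in the block columns of one BED12 line."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) < 12:
        return [f"only {len(cols)} columns"]
    problems = []
    start, end, count = int(cols[1]), int(cols[2]), int(cols[9])
    # "if x" tolerates the optional trailing comma in the list columns
    sizes = [int(x) for x in cols[10].split(",") if x]
    starts = [int(x) for x in cols[11].split(",") if x]
    if len(sizes) != count:
        problems.append(f"blockSizes has {len(sizes)} items, blockCount is {count}")
    if len(starts) != count:
        problems.append(f"blockStarts has {len(starts)} items, blockCount is {count}")
    if starts and starts[0] != 0:
        problems.append("first blockStart is not 0")
    if sizes and len(sizes) == len(starts) and starts[-1] + sizes[-1] != end - start:
        problems.append("last block does not end at chromEnd")
    return problems


good = "\t".join(["chr1", "14", "765", "tx1", "0", "-", "14", "765", "0",
                  "3", "86,78,49", "0,208,702"])
bad = "\t".join(["chr1", "14", "765", "tx1", "0", "-", "14", "765", "0",
                 "3", "86,78,49,000", "0,208,702"])
print(check_bed12(good))
print(check_bed12(bad))
```

Running it over every line of the converted file should show how many records are affected and whether the 6-column "CDS" lines need to be filtered out before use.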