Question

Any Good Bed/Wig Parsers In Python ?

11

Entering edit mode

13.1 years ago

Ingineer ▴ 120

I am probably going to be the N-th bioinformatician to write the N+1th BED parser so I was wondering, before I start, what good options exist for the python language ? The idea is not just a script that will convert BED to some other format, but a parser that would transform the interval information into an abstract structure consumable by the program.

In my dreams, I see a library like this:

# This super library should be created
import genefiles
# The file has no .bed extension but the library is
# smart enough to guess and recognize what it is.
mydata = genefiles.read("/home/bob/genes")

# Now we have a variable with all the info, on which we can iterate
mydata.type
>>> 'qualitative'
mydata.chromosomes
>>> 'chr1', 'chr2', 'chr3', ...
len(mydata.chr1)
>>> 5671
mydata.chr1[1]
>>> {'start': 100, 'stop': 250, 'score': 15.6, 'strand': '+', 'name': 'NS12'}
mydata.chr1[2]
>>> {'start': 400, 'stop': 500, 'score': 0.7, 'strand': '-', 'name': 'NS45'}

# Nothing stops us from writing a list comprehension now !
smallgenes = [g['name'] for g in mydata.chr1 if g['end']-g['start'] < 100]
>>> ['NS88', 'NS76', 'NS112']

Same would work for loading WIG/GFF etc. files. What do you guys use ?

python parsing bed wiggle • 15k views

ADD COMMENT • link updated 7.9 years ago by Biostar 20 • written 13.1 years ago by Ingineer ▴ 120

1

Entering edit mode

Reddit cross-post discussion: http://www.reddit.com/r/bioinformatics/comments/gizhg/any_good_bedwig_parsers_in_python/

ADD REPLY • link 13.1 years ago by Brad Chapman 9.7k

Ram · Answer 1 · 2011-04-05

14

Entering edit mode

13.1 years ago

Brad Chapman 9.7k

bx-python handles a number of these formats, including BigWig and BigBed which are nice ways to store your interval and score data compressed on disk but still have random access:

Most libraries will be iterative parsers to help be memory conscious on large files. BigWig and BigBed get the closest to your dream API without requiring the entire file be loaded in memory.

ADD COMMENT • link updated 4.6 years ago by Ram 43k • written 13.1 years ago by Brad Chapman 9.7k

2

Entering edit mode

Awesome, thanks for these. I submitted a patch to bx-python to handle the non-tab cases: https://bitbucket.org/james_taylor/bx-python/issue/27/allow-backup-to-space-separated-files-in There was already a flag to allow for your problematic strand case, so if you do: GenomicIntervalReader(file, fix_strand=True, allow_spaces=True) with this patch all of your files will pass. The 14_fields example still passes (instead of your expected fail), but in my opinion this behavior is more correct since people abuse the format by adding extra columns.

ADD REPLY • link updated 4.6 years ago by Ram 43k • written 13.1 years ago by Brad Chapman 9.7k

1

Entering edit mode

Integrated the tests into the gMiner project and added other tests, if you want to use them further or apply them to some other library you can find the latest version here: https://github.com/bbcf/gMiner/tree/master/Extras/tracks/qual/bed/should_pass

ADD REPLY • link 13.0 years ago by Ingineer ▴ 120

0

Entering edit mode

My ideal BED or WIG parser would of course be iterative, yes.

I looked at the BED parser code you linked to. It seams to be extremely unrobust and incompatible with the UCSC definition of the format. http://genome.ucsc.edu/FAQ/FAQformat.html#format1

I probably won't be taking this one.

ADD REPLY • link updated 4.6 years ago by Ram 43k • written 13.1 years ago by Ingineer ▴ 120

0

Entering edit mode

What I mean by robust is: What if the separator is a space instead of a tab ? What if there is a comment at the end of a line ? What if the BED has only three fields ? Do you crash, raise an error, or proceed none the less ?

ADD REPLY • link 13.1 years ago by Ingineer ▴ 120

0

Entering edit mode

By my reading, the bx-python handles all those cases. It splits on whitespace. It only uses the initial fields so extra comments at the end of a line would be ignored. It handles 3, 4 and 5 field BED. The only addition I could suggest is capturing strand as well. If you find cases that the parser fails on, why not write up test cases and bring them up to the author to help improve the library?

ADD REPLY • link 13.1 years ago by Brad Chapman 9.7k

0

Entering edit mode

Challenge accepted. Try my script here: http://freesv.ch/bed_tester.tar.gz Results are better than I thought. But one must realize that many tests pass just because the information is discarded by the parser.

ADD REPLY • link 13.1 years ago by Ingineer ▴ 120

0

Entering edit mode

Integrated the tests into the gMiner project and added other tests, if you want to use them further or apply them to some other library you can find the latest version here: https://github.com/bbcf/gMiner/tree/master/Extras/tests/bed_files

ADD REPLY • link 13.1 years ago by Ingineer ▴ 120

score 8 · Answer 2 · 2011-04-05

8

Entering edit mode

13.1 years ago

brentp 24k

There is also a Bed parser in BedTools which you can access from the python/cython wrapper bedtools-python

There's another python wrapper pybedtools that's mostly for calling the bedtools executables, but it does have a feature module (that could be stand-alone) for Bed, GFF, GTF, and other feature "parsing".

And, there's a Bed module here that puts the file into a numpy array and allows you to do things like your list comprehension above

   b = Bed('some.bed')
   small_genes = b[b['end'] - b['start'] < 100]

ADD COMMENT • link 13.1 years ago by brentp 24k

0

Entering edit mode

Interesting tools ! The bedtools-python seams to work well but is written in c++ unfortunately, making it hard for me to modify or integrate. The pybedtools module really seams like just a wrapper. Finally, rhe numpy array is more interesting, but is not robust or iterative.

ADD REPLY • link 13.1 years ago by Ingineer ▴ 120

0

Entering edit mode

I'd say that pybedtools is more than "just a wrapper" if you try it out and find it insufficient, you could report to the authors. Likewise, if you report how you find the numpy array versionto be less robust, perhaps it could be improved.

ADD REPLY • link 13.1 years ago by brentp 24k

score 7 · Answer 3 · 2011-08-11

Shameless plug: pybedtools now does almost everything you asked (with the exception of the chromosome attribute access), and it's in Cython for speed.

For example, say we have this BED file:

$ cat a.bed
chr1    1    100    feature1    0    +
chr1    100    200    feature2    0    +
chr1    150    500    feature3    0    -
chr1    900    950    feature4    0    +

And this GFF file:

$ cat d.gff
chr1    fake    gene    50    300    .    +    .    ID=gene1
chr1    fake    mRNA    50    300    .    +    .    ID=mRNA1;Parent=gene1;
chr1    fake    CDS    75    150    .    +    .    ID=CDS1;Parent=mRNA1;
chr1    fake    CDS    200    275    .    +    .    ID=CDS2;Parent=mRNA1;
chr1    fake    rRNA    1200    1275    .    +    .    ID=rRNA1;

Then:

>>> from pybedtools import BedTool

pybedtools has auto-detection of BED/GFF/VCF based on position of string and integer fields:

>>> a = BedTool('a.bed')
>>> d = BedTool('d.gff')

>>> a.file_type
'bed'

>>> d.file_type
'gff'

BedTools are iterable, and you get an Interval for each feature. So you can do list comprehension:

>>> [i.name, i.strand) for i in a if len(i) < 150]
[('feature1', '+'), ('feature2', '+'), ('feature4', '+')]

or indexing:

>>> feature = d[0]

You have GFF attribute access (lazily parsed):

>>> feature['ID']
'gene1'

And arbitrary access of fields:

>>> d[1].fields[1]
'mRNA'

...not to mention access to all of BEDTools:

>>> not_in_d = a.intersect(d, v=True)
>>> print not_in_d
chr1    900    950    feature4    0    +

Ram · Answer 4 · 2011-04-05

5

Entering edit mode

13.1 years ago

Michael Schubert ★ 7.1k

At least a GFF parser is available with the newer versions of BioPython. For code examples see their wiki.

There is a question about converting BED to GFF and some tools for other conversions have also been suggested here by Brad.

ADD COMMENT • link updated 4.6 years ago by Ram 43k • written 13.1 years ago by Michael Schubert ★ 7.1k

1

Entering edit mode

Michael, the GFF parser isn't yet rolled into Biopython; the code you linked to is actually older stuff that hasn't been maintained and we'll hope to replace with the new parser. The GFF code is available from: https://github.com/chapmanb/bcbb/tree/master/gff and works as described in the wiki. Thanks for mentioning it.

ADD REPLY • link 13.1 years ago by Brad Chapman 9.7k

1

Entering edit mode

Just tried the GFF parser, works nicely for me!

ADD REPLY • link 12.6 years ago by Weronika ▴ 300

0

Entering edit mode

Thanks for pointing that out, I haven't worked with it myself so far.

ADD REPLY • link 13.1 years ago by Michael Schubert ★ 7.1k

0

Entering edit mode

Probably because Brad put it there in the meantime ;-)

ADD REPLY • link 12.6 years ago by Michael Schubert ★ 7.1k

score 3 · Answer 5 · 2012-02-13

3

Entering edit mode

12.2 years ago

Xapple ▴ 30

I just finished work on a python package that can read most of the genomic file formats such as BED, WIG, GFF, BedGraph in a simple and standard way.

It's free and open-source. It will read many different kinds of formats in the same simple syntax:

import track
with track.load('tracks/rp_genes.bed') as rp:
    data = rp.read('chr3')

You install it by typing:

$ sudo easy_install track

Particular attention was brought to the documentation:

http://xapple.github.com/track/

Please tell me what you think about it !

ADD COMMENT • link 12.2 years ago by Xapple ▴ 30

1

Entering edit mode

Good documentation, and easy to install. Does it support interval operations like the intersection of two BEDs? Don't see that in the documentation

ADD REPLY • link 11.8 years ago by pufferfish ▴ 120

0

Entering edit mode

@pufferfish Yes it can do that: http://xapple.github.com/track/content/manipulate.html#track.manipulate.overlap

ADD REPLY • link 11.5 years ago by xapple ▴ 230

score 2 · Answer 6 · 2011-04-05

2

Entering edit mode

13.1 years ago

Andreas ★ 2.5k

Maybe the following helps

BED Classes provided by Taos in this python-forum thread
BED module discussion on stackoverflow

Never tried them myself though.

Andreas

ADD COMMENT • link 13.1 years ago by Andreas ★ 2.5k

0

Entering edit mode

Yeah, that's one more quick and dirty BED parser that is not very robust.

ADD REPLY • link 13.1 years ago by Ingineer ▴ 120