convert csv file to bed file
2.8 years ago

I am trying to convert a CSV file to BED. What is the best command to do this?


See the first two sentences of swbarnes2's comment here.

"'csv' only means that commas are delimiters. No one will know what columns have what information in your csv."

The second paragraph in Alex Reynolds' answer on that same Biostars post outlines the process:

"The idea is that you convert the CSV file to UCSC BED (probably convert it to a tab-delimited file and use awk to print out specific columns in BED-field order)"
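
The approach in that quote can be sketched in Python with only the standard library. This is an illustration, not code from either linked post: the function name csv_to_bed_lines and the default column positions (chromosome, start, end in columns 0, 1, 2) are assumptions, and the coordinates are copied through unchanged.

```python
import csv
import io


def csv_to_bed_lines(csv_text, chrom_col=0, start_col=1, end_col=2):
    """Return tab-delimited BED3 lines built from comma-delimited text.

    The column indices are assumptions; adjust them to match your file.
    Coordinates are passed through unchanged (no 0-/1-based adjustment).
    """
    reader = csv.reader(io.StringIO(csv_text))
    lines = []
    for row in reader:
        if not row:  # skip blank lines
            continue
        # Emit the fields in BED order: chrom, start, end.
        lines.append("\t".join([row[chrom_col], row[start_col], row[end_col]]))
    return lines


example = "chr1,100,200\nchr2,400,500\nchr3,100,200\n"
bed_lines = csv_to_bed_lines(example)
print("\n".join(bed_lines))
```

If the CSV has a header row or stores the columns in a different order, the indices passed to the function would need to change accordingly.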

2.8 years ago
sbstevenlee ▴ 480

Below is a Python API solution using the pybed submodule I wrote.

Assume you have a CSV file named example.csv:

$ cat example.csv
chr1,100,200
chr2,400,500
chr3,100,200

Run the following in Python after installing the fuc package, which contains the pybed submodule:

>>> import pandas as pd
>>> from fuc import pybed
>>> df = pd.read_csv('example.csv', header=None)
>>> df.columns = ['Chromosome', 'Start', 'End']
>>> bf = pybed.BedFrame.from_frame(meta=[], data=df)
>>> bf.to_file('example.bed')

Check the resulting BED:

$ cat example.bed
chr1    100 200
chr2    400 500
chr3    100 200

Of course, you could have simply replaced each , with \t directly in the original CSV file, but the pybed submodule also checks for errors that could arise during file format conversion.
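
To give a rough idea of what such checks might look like when converting by hand, here is a minimal sketch. It does not reproduce what pybed actually checks; validate_bed3 is a hypothetical helper, and the rules it applies (at least three columns, integer coordinates, start >= 0, start <= end) are simply my reading of the basic BED requirements.

```python
def validate_bed3(fields):
    """Return a list of problems found in one [chrom, start, end] record.

    Hypothetical helper for illustration; an empty list means no problem
    was detected by these basic checks.
    """
    if len(fields) < 3:
        return ["expected at least 3 columns, got %d" % len(fields)]
    problems = []
    chrom, start, end = fields[0], fields[1], fields[2]
    if not chrom:
        problems.append("empty chromosome name")
    try:
        start_i, end_i = int(start), int(end)
    except ValueError:
        problems.append("start and end must be integers")
        return problems
    if start_i < 0:
        problems.append("start must be >= 0")
    if start_i > end_i:
        problems.append("start must not exceed end")
    return problems


print(validate_bed3(["chr1", "100", "200"]))
print(validate_bed3(["chr1", "200", "100"]))
```

A plain comma-to-tab replacement would let the second record through silently, which is the kind of problem a validating step is meant to catch.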


BED is zero-based and the end coordinate is non-inclusive.


Could you please elaborate on how your comment is relevant to my answer?


In general, numbers (coordinates in this case) in generic formats (CSV, TSV, TXT, XLS(X), tables) are 1-based unless declared otherwise, whereas BED is zero-based. The numbering is identical in your example input and output, rather than being shifted to zero-based indexing.


csv isn't really a format at all, it's just a way of delimiting a text file, so I'm not sure how it could be zero or one based?


By that logic, no delimited file is a format at all. Many people store numbers in CSV files, not just text. 1-based numbering is the generic convention, whereas 0-based numbering is a special case in the general representation of numbers. Since the OP didn't mention the numbering method used in the CSV, the assumption would be 1-based, not 0-based. In addition, example numbering usually is 1-based rather than 0-based, and that is what I assumed.


Thanks for the explanation. I see what you mean now; it depends entirely on whether the OP's data is 0-based or 1-based to begin with. Thanks to your comment, the OP will now know the risk: the pybed submodule won't add or subtract an offset. If needed, the OP should apply the offset before constructing a pybed.BedFrame object. For example, to add 1 to every Start position:

>>> df = pd.read_csv('example.csv', header=None)
>>> df.columns = ['Chromosome', 'Start', 'End']
>>> df.Start = df.Start + 1
>>> bf = pybed.BedFrame.from_frame(meta=[], data=df)

My understanding is that it's the other way round:

1 based:

chr1,100,200

0 based:

chr1 99 200
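
That direction of conversion can be sketched in Python, assuming the CSV coordinates are 1-based with an inclusive end (an assumption about the input that the OP would need to confirm). one_based_to_bed is a hypothetical helper name, not part of pybed.

```python
def one_based_to_bed(chrom, start, end):
    """Convert a 1-based, end-inclusive interval to BED coordinates
    (0-based start, end-exclusive).

    Assumes the input really is 1-based and end-inclusive.
    """
    if start < 1 or end < start:
        raise ValueError("expected 1 <= start <= end")
    # Only the start shifts: the inclusive 1-based end is numerically
    # the same as the exclusive 0-based end.
    return chrom, start - 1, end


record = one_based_to_bed("chr1", 100, 200)
print("\t".join(str(x) for x in record))
```

The interval covers 101 bases in both representations: 200 - 100 + 1 in the 1-based form and 200 - 99 in BED.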