Question

Cluster Intervals By Gene Id In A Bed File

2

11.5 years ago

Rad ▴ 810

Hello everyone,

I have a BED file with 5 columns, where the 4th is a unique ID and the 5th is a gene ID.

I have been trying to use bedtools to cluster this BED file by gene ID and output a single line per gene with the range (chr, start, end) of the region. Basically, I want to cluster intervals. Example:

chr1 10  1000 ID1 GeneID1
chr1 20  1300 ID2 GeneID1
chr1 1400  1600 ID3 GeneID1

I'm trying to get output like:

chr1 10 1600 GeneID1

Can anyone tell me whether bedtools is the best way to do this, or whether it is possible with awk alone? Any ideas?

Thank you

bed • 3.2k views

ADD COMMENT • link updated 11.5 years ago by Pierre Lindenbaum 161k • written 11.5 years ago by Rad ▴ 810

score 5 · Answer 1 · 2012-10-11

5

11.5 years ago

Pierre Lindenbaum 161k

Using awk and sqlite:

~$ echo -e "chr1\t10\t1000\tID1\tGeneID1\nchr1\t20\t1300\tID2\tGeneID1\nchr1\t1400\t1600\tID3\tGeneID1" |\
awk 'BEGIN{printf("create table t(chrom text,start int,end int, name text);\n");} {printf("insert into t(chrom,start,end,name) values(\047%s\047,%s,%s,\047%s\047);\n",$1,$2,$3,$5);} END {printf("select chrom,min(start),max(end),name from t group by chrom,name;\n");}' |\
sqlite3 tmp.sqlite

chr1|10|1600|GeneID1

ADD COMMENT • link 11.5 years ago by Pierre Lindenbaum 161k
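
On the question of whether awk alone is enough: the grouping in the answer above is just a min(start) and max(end) per chrom and gene, so it can probably be sketched without SQLite as well. This is a minimal sketch, not the answer author's method. The function name cluster_by_gene is hypothetical, and it assumes whitespace-separated input with the gene ID in column 5. Like the SQL query, it reports one span per gene and does not split a gene whose intervals are far apart.

```shell
# Hypothetical helper: collapse BED-like rows (chrom start end id gene)
# into one line per chrom+gene holding the smallest start and largest end.
cluster_by_gene() {
  awk 'BEGIN { OFS = "\t" }
  {
    key = $1 SUBSEP $5
    if (!(key in lo)) {
      # first row seen for this chrom+gene: remember it and its input order
      lo[key] = $2; hi[key] = $3; order[++n] = key
    } else {
      # "+ 0" forces numeric rather than string comparison
      if ($2 + 0 < lo[key] + 0) lo[key] = $2
      if ($3 + 0 > hi[key] + 0) hi[key] = $3
    }
  }
  END {
    for (i = 1; i <= n; i++) {
      split(order[i], part, SUBSEP)
      print part[1], lo[order[i]], hi[order[i]], part[2]
    }
  }' "$@"
}

printf 'chr1\t10\t1000\tID1\tGeneID1\nchr1\t20\t1300\tID2\tGeneID1\nchr1\t1400\t1600\tID3\tGeneID1\n' | cluster_by_gene
```

For the three example rows this prints chr1, 10, 1600 and GeneID1 separated by tabs. Genes are reported in the order they first appear in the input.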

0

Awesome. I added "drop table if exists t;" before the table creation so that we can use it within a script. Thanks, man!

ADD REPLY • link 11.5 years ago by Rad ▴ 810
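
The "drop table if exists" tweak described in the reply above might look roughly like this inside a script. It is only a sketch: the function name bed_to_gene_spans and the optional second argument for the database path are hypothetical, while the table name t and the default file tmp.sqlite follow the answer.

```shell
# Hypothetical wrapper: $1 is the BED-like input file, $2 an optional database path.
# Dropping the table first lets the script be rerun against the same database file.
# \047 is the octal escape for a single quote, so text values become SQL string literals.
bed_to_gene_spans() {
  awk 'BEGIN {
         print "drop table if exists t;"
         print "create table t(chrom text, start int, end int, name text);"
       }
       { printf("insert into t(chrom,start,end,name) values(\047%s\047,%d,%d,\047%s\047);\n", $1, $2, $3, $5) }
       END {
         print "select chrom, min(start), max(end), name from t group by chrom, name;"
       }' "$1" | sqlite3 "${2:-tmp.sqlite}"
}
# usage: bed_to_gene_spans intervals.bed
```

Without the drop, a second run against the same tmp.sqlite would stop at "create table" because t already exists.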

0

You can also add indexes on chrom and name if you have a large input.

ADD REPLY • link 11.5 years ago by Pierre Lindenbaum 161k
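
One way the index suggestion above could be applied, assuming the table t from the answer has already been loaded into the database. The function name index_and_query and the index name t_chrom_name are hypothetical. Building the index after the inserts rather than before is a common choice for bulk loads, and whether it helps noticeably depends on the size of the input.

```shell
# Hypothetical helper: $1 is an optional database path (defaults to tmp.sqlite).
# Adds a composite index on the two grouping columns, then reruns the grouped query.
index_and_query() {
  sqlite3 "${1:-tmp.sqlite}" <<'SQL'
create index if not exists t_chrom_name on t(chrom, name);
select chrom, min(start), max(end), name from t group by chrom, name;
SQL
}
# usage: index_and_query tmp.sqlite
```

The "if not exists" clause keeps the call safe to repeat on a database that was indexed earlier.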

0

I just could not believe this awesomeness.

ADD REPLY • link 8.2 years ago by GouthamAtla 12k