Cluster Intervals By Gene Id In A Bed File
11.5 years ago
Rad ▴ 810

Hello everyone,

I have a BED file with 5 columns, where the 4th column is a unique ID and the 5th is a gene ID.

I have been trying to use bedtools to cluster this BED file by gene ID and output a single line per gene giving the full range (chr, start, end) of the region. In other words, I want to collapse all intervals of a gene into one. Example:

chr1 10 1000 ID1 GeneID1
chr1 20 1300 ID2 GeneID1
chr1 1400 1600 ID3 GeneID1

I'm trying to get output like this:

chr1 10 1600 GeneID1

Can anyone tell me whether bedtools is the best way to do this, or whether it can be done with awk alone? Any ideas?

Thank you

11.5 years ago

Using awk and sqlite:

~$ echo -e "chr1\t10\t1000\tID1\tGeneID1\nchr1\t20\t1300\tID2\tGeneID1\nchr1\t1400\t1600\tID3\tGeneID1" |\
awk 'BEGIN{printf("create table t(chrom text,start int,end int, name text);\n");} {printf("insert into t(chrom,start,end,name) values(\047%s\047,%s,%s,\047%s\047);\n",$1,$2,$3,$5);} END {printf("select chrom,min(start),max(end),name from t group by chrom,name;\n");}' |\
sqlite3 tmp.sqlite

chr1|10|1600|GeneID1
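
On the "just awk" part of the question: a minimal sketch, assuming whitespace-separated columns in the order chrom, start, end, ID, gene ID, and assuming the whole span from the smallest start to the largest end is wanted even when the intervals leave gaps (as in the example above). The function name cluster_by_gene is made up for illustration.

```shell
# cluster_by_gene: hypothetical helper name. Reads a 5-column BED-like
# file (or stdin) and prints one tab-separated line per (chrom, gene).
cluster_by_gene() {
  awk 'BEGIN { OFS = "\t" }
  {
    key = $1 SUBSEP $5                       # group by chromosome + gene ID
    if (!(key in min) || $2 + 0 < min[key]) min[key] = $2 + 0
    if (!(key in max) || $3 + 0 > max[key]) max[key] = $3 + 0
  }
  END {
    for (key in min) {
      split(key, part, SUBSEP)               # part[1] = chrom, part[2] = gene
      print part[1], min[key], max[key], part[2]
    }
  }' "$@" | sort -k1,1 -k2,2n                # awk array order is unspecified, so sort
}

printf 'chr1\t10\t1000\tID1\tGeneID1\nchr1\t20\t1300\tID2\tGeneID1\nchr1\t1400\t1600\tID3\tGeneID1\n' | cluster_by_gene
```

For the three example lines this prints a single tab-separated line: chr1, 10, 1600, GeneID1. Everything is held in memory per (chrom, gene) pair, which is usually fine for gene-level annotations.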

Awesome! I added "drop table if exists t;" before the table creation so that it can be rerun from within a script. Thanks, man!
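
For anyone who wants that as a reusable piece, here is a rough sketch of the idea. The function name bed_to_sql is made up, and it assumes the same 5-column, whitespace-separated input as in the question. String values are written with single quotes (\047 in awk), since double-quoted string literals are not accepted by every sqlite3 build.

```shell
# bed_to_sql: hypothetical helper name. Emits SQL for a 5-column BED-like
# file (or stdin); the leading "drop table" makes reruns on the same db safe.
bed_to_sql() {
  awk 'BEGIN {
    print "drop table if exists t;"
    print "create table t(chrom text, start int, end int, name text);"
  }
  {
    # \047 is a single quote, used for SQL string literals
    printf("insert into t(chrom,start,end,name) values(\047%s\047,%s,%s,\047%s\047);\n", $1, $2, $3, $5)
  }
  END {
    print "select chrom, min(start), max(end), name from t group by chrom, name;"
  }' "$@"
}
```

Usage would be something like: bed_to_sql input.bed | sqlite3 tmp.sqlite (input.bed is a placeholder file name). Running it twice against the same tmp.sqlite should no longer stop on an already existing table.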

You can also add an index on chrom and name if you have a large input.
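
A small sketch of what that could look like, assuming the database already holds the table t from the answer. The function name query_with_index and the index name t_chrom_name are made up for illustration, and whether the index actually speeds up the GROUP BY depends on the data and on the query planner.

```shell
# query_with_index: hypothetical helper name. Expects the path to a SQLite
# database that already contains table t(chrom, start, end, name).
query_with_index() {
  sqlite3 "$1" <<'SQL'
-- created after the rows are loaded; "if not exists" keeps reruns harmless
create index if not exists t_chrom_name on t(chrom, name);
select chrom, min(start), max(end), name from t group by chrom, name;
SQL
}
```

Usage would be something like: query_with_index tmp.sqlite. Creating the index once all rows are inserted is generally cheaper than maintaining it during the inserts.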

I just could not believe this awesomeness.
