Question: Cluster Intervals By Gene Id In A Bed File
7.3 years ago, Rad wrote:

Hello everyone,

I have a BED file with 5 columns, where the 4th is a unique ID and the 5th is a gene ID.

I was trying to use bedtools to cluster this BED file by gene ID and output a single line per gene giving the overall range (chr, start, end) covered by its intervals. Basically, I want to collapse all intervals that share a gene ID. For example:

chr1 10  1000 ID1 GeneID1
chr1 20  1300 ID2 GeneID1
chr1 1400  1600 ID3 GeneID1

I'm trying to get output like:

chr1 10 1600 GeneID1

Can anyone tell me whether bedtools is the best way to do this, or whether it can be done with awk alone? Any ideas?

Thank you

7.3 years ago, Pierre Lindenbaum (France/Nantes/Institut du Thorax - INSERM UMR1087) wrote:

Using awk to generate the SQL and sqlite3 to do the grouping:

~$ echo -e "chr1\t10\t1000\tID1\tGeneID1\nchr1\t20\t1300\tID2\tGeneID1\nchr1\t1400\t1600\tID3\tGeneID1" |\
awk 'BEGIN{printf("create table t(chrom text,start int,end int, name text);\n");} {printf("insert into t(chrom,start,end,name) values(\047%s\047,%s,%s,\047%s\047);\n",$1,$2,$3,$5);} END {printf("select chrom,min(start),max(end),name from t group by chrom,name;\n");}' |\
sqlite3 tmp.sqlite
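
Since the question also asks about awk alone, here is a minimal sketch of the same grouping without a database. It assumes whitespace-separated columns with the gene ID in column 5, and it groups by chromosome and gene ID, like the SQL above. The function name merge_by_gene is made up for illustration; it is not part of bedtools or any other tool.

```shell
# Hypothetical helper: one output line per (chrom, gene ID) pair,
# holding the smallest start and the largest end seen for that pair.
merge_by_gene() {
  awk '
    {
      key = $1 "\t" $5
      if (!(key in lo)) {            # first interval for this chrom/gene
        lo[key] = $2; hi[key] = $3
        order[++n] = key             # remember first-seen order for output
      }
      if ($2 + 0 < lo[key] + 0) lo[key] = $2   # "+ 0" forces numeric comparison
      if ($3 + 0 > hi[key] + 0) hi[key] = $3
    }
    END {
      for (i = 1; i <= n; i++) {
        split(order[i], k, "\t")
        printf "%s\t%s\t%s\t%s\n", k[1], lo[order[i]], hi[order[i]], k[2]
      }
    }
  ' "$@"
}

# Example with the three intervals from the question (tab-separated output):
printf 'chr1\t10\t1000\tID1\tGeneID1\nchr1\t20\t1300\tID2\tGeneID1\nchr1\t1400\t1600\tID3\tGeneID1\n' | merge_by_gene
# prints: chr1	10	1600	GeneID1
```

Everything is held in memory, keyed by chromosome and gene ID, so the input does not need to be sorted. Lines come out in the order each gene first appears.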


Awesome! I added "drop table if exists t;" before the table creation so that it can be rerun from within a script. Thanks, man!

written 7.3 years ago by Rad

You can also add an index on chrom and name if you have a large input.
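
A rough sketch of how the two suggestions from these comments (dropping the table first, and indexing chrom and name) might fit into a reusable script. The function name cluster_by_gene, the index name t_chrom_name, and the optional database-path argument are assumptions for illustration. The transaction around the inserts and the `.mode tabs` line are additions that are not in the answer above. Whether the index actually speeds up the query depends on the input size; this has not been benchmarked.

```shell
# Hypothetical wrapper. Usage: cluster_by_gene input.bed [db.sqlite]
# Assumes whitespace-separated input with the gene ID in column 5.
cluster_by_gene() {
  bed=$1
  db=${2:-tmp.sqlite}   # same default database file as in the answer
  awk '
    BEGIN {
      print "drop table if exists t;"      # makes the script safe to rerun
      print "create table t(chrom text, start int, end int, name text);"
      print "begin;"                       # one transaction for all inserts
    }
    {
      # \047 is a single quote, so the values are ordinary SQL string literals
      printf("insert into t(chrom,start,end,name) values(\047%s\047,%d,%d,\047%s\047);\n", $1, $2, $3, $5)
    }
    END {
      print "commit;"
      print "create index t_chrom_name on t(chrom, name);"
      print ".mode tabs"                   # tab-separated output instead of |
      print "select chrom, min(start), max(end), name from t group by chrom, name;"
    }
  ' "$bed" | sqlite3 "$db"
}
```

Called as `cluster_by_gene genes.bed` (genes.bed being a placeholder file name), this should print one tab-separated line per chromosome and gene ID. The index is created after the inserts so that it is built once rather than updated row by row.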

written 7.3 years ago by Pierre Lindenbaum

I just could not believe this awesomeness.

written 3.9 years ago by geek_y