Question: how to split "blocks" in one entry of bed file into different entries
0
gravatar for scottzijiezhang
2.6 years ago by
scottzijiezhang0 wrote:

Hi all, I have a peak calling results stored in bed file. Some of the peaks are stored in one entry as different blocks. I want to split those blocks each into a separate entry using the information provided in blockCount, blockSizes, blockStarts.

For example, Input:

chr   chromStart      chromEnd        name    score   strand  thickStart      thickEnd        itemRgb blockCount      blockSizes      blockStarts
chr1    943253  943774  SAMD11  2e-59   +       943253  943774  0       2       124,77, 0,444
chr1    944351  944581  SAMD11  5e-35   +       944351  944581  0       1       230,    0

output:

chr1    943253    943377    SAMD11    2e-59    +    943253    943377    0    1    124,    0
chr1    943697    943774    SAMD11    2e-59    +    943697    943774    0    1    77,    0
chr1    944351  944581  SAMD11  5e-35   +       944351  944581  0       1       230,    0

Note, in the example, the first entry contains two peaks (blocks) and the second contains only one. In my data, some of my entries contain two or multiple (more than two) peaks (blocks) while others contain only one.

Can this be done by some tools or codes? Any help is appreciated!

Thanks

Scott

peak bed format • 885 views
ADD COMMENTlink modified 2.6 years ago by Alex Reynolds28k • written 2.6 years ago by scottzijiezhang0

This looks like genepred format, I could find something to convert it to gtf which may be a step in the right direction.

ADD REPLYlink written 2.6 years ago by WouterDeCoster39k

It's bed12 format, for what it's worth.

ADD REPLYlink written 2.6 years ago by Devon Ryan90k
2
gravatar for Alex Reynolds
2.6 years ago by
Alex Reynolds28k
Seattle, WA USA
Alex Reynolds28k wrote:

This should get you about 98% of the way there:

$ awk 'BEGIN{ OFS="\t" } NR>1{ blockCount=$10; blockSizesStr=$11; blockStartsStr=$12; blockSizesCount=split(blockSizesStr,blockSizes,","); blockStartsCount=split(blockStartsStr,blockStarts,","); for(i=1;i<=blockCount;i++) { $2+=blockStarts[i]; $3=$2+blockSizes[i]; print $0; } }' intervals.bed12
chr1    943253  943377  SAMD11  2e-59   +   943253  943774  0   2   124,77, 0,444
chr1    943697  943774  SAMD11  2e-59   +   943253  943774  0   2   124,77, 0,444
chr1    944351  944581  SAMD11  5e-35   +   944351  944581  0   1   230,    0

You can fiddle with the print statements, as needed.

ADD COMMENTlink modified 2.6 years ago • written 2.6 years ago by Alex Reynolds28k

Thank you so much Alex! your code helps a lot! With just a minor change as below:

awk 'BEGIN{ OFS="\t" } NR>1{ originalStart=$2; newStart=$2; blockCount=$10; blockSizesStr=$11; blockStartsStr=$12; blockSizesCount=split(blockSizesStr,blockSizes,","); blockStartsCount=split(blockStartsStr,blockStarts,","); for(i=1;i<=blockCount;i++) { $2=$7+blockStarts[i]; $3=$2+blockSizes[i]; $10=1; $11=blockSizes[i]; $12=0; print $0; } }' con_peak.bed > tmp.bed

I was able to get a bed file with every entry as individual peaks.

ADD REPLYlink written 2.6 years ago by scottzijiezhang0
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1519 users visited in the last hour