how to split "blocks" in one entry of bed file into different entries
1
0
Entering edit mode
4.6 years ago

Hi all, I have a peak calling results stored in bed file. Some of the peaks are stored in one entry as different blocks. I want to split those blocks each into a separate entry using the information provided in blockCount, blockSizes, blockStarts.

For example, Input:

chr   chromStart      chromEnd        name    score   strand  thickStart      thickEnd        itemRgb blockCount      blockSizes      blockStarts
chr1    943253  943774  SAMD11  2e-59   +       943253  943774  0       2       124,77, 0,444
chr1    944351  944581  SAMD11  5e-35   +       944351  944581  0       1       230,    0

output:

chr1    943253    943377    SAMD11    2e-59    +    943253    943377    0    1    124,    0
chr1    943697    943774    SAMD11    2e-59    +    943697    943774    0    1    77,    0
chr1    944351  944581  SAMD11  5e-35   +       944351  944581  0       1       230,    0

Note, in the example, the first entry contains two peaks (blocks) and the second contains only one. In my data, some of my entries contain two or multiple (more than two) peaks (blocks) while others contain only one.

Can this be done by some tools or codes? Any help is appreciated!

Thanks

Scott

bed format peak • 1.6k views
0
Entering edit mode

This looks like genepred format, I could find something to convert it to gtf which may be a step in the right direction.

0
Entering edit mode

It's bed12 format, for what it's worth.

2
Entering edit mode
4.6 years ago

This should get you about 98% of the way there:

$awk 'BEGIN{ OFS="\t" } NR>1{ blockCount=$10; blockSizesStr=$11; blockStartsStr=$12; blockSizesCount=split(blockSizesStr,blockSizes,","); blockStartsCount=split(blockStartsStr,blockStarts,","); for(i=1;i<=blockCount;i++) { $2+=blockStarts[i];$3=$2+blockSizes[i]; print$0; } }' intervals.bed12
chr1    943253  943377  SAMD11  2e-59   +   943253  943774  0   2   124,77, 0,444
chr1    943697  943774  SAMD11  2e-59   +   943253  943774  0   2   124,77, 0,444
chr1    944351  944581  SAMD11  5e-35   +   944351  944581  0   1   230,    0

You can fiddle with the print statements, as needed.

0
Entering edit mode

Thank you so much Alex! your code helps a lot! With just a minor change as below:

awk 'BEGIN{ OFS="\t" } NR>1{ originalStart=$2; newStart=$2; blockCount=$10; blockSizesStr=$11; blockStartsStr=$12; blockSizesCount=split(blockSizesStr,blockSizes,","); blockStartsCount=split(blockStartsStr,blockStarts,","); for(i=1;i<=blockCount;i++) {$2=$7+blockStarts[i];$3=$2+blockSizes[i];$10=1; $11=blockSizes[i];$12=0; print \$0; } }' con_peak.bed > tmp.bed

I was able to get a bed file with every entry as individual peaks.