Question: How to extend variable length intervals to the same final length?
2.5 years ago by
Ian5.5k
University of Manchester, UK
Ian5.5k wrote:

I have a large set of footprint intervals that range from 11 to 25 bp. For the purpose of motif discovery I would like to extend all intervals to the same length, for example 50 bp. Intervals should be extended equally on both sides. I would usually use 'bedtools slop' to add a fixed number of bases to each interval, but this would not appear to work when the intervals have variable lengths and need to reach a common final length.

It would be great if anyone could advise me how to use bedtools, or something else. I have a nagging feeling I am missing something obvious, so apologies in advance!

modified 2.5 years ago by Alex Reynolds28k • written 2.5 years ago by Ian5.5k
2.5 years ago by
Alex Reynolds28k
Seattle, WA USA
Alex Reynolds28k wrote:

Here's a way that I think should extend both ends of BED elements to the desired target length:

```
$ TARGET_LENGTH=50
$ awk -vF=${TARGET_LENGTH} 'BEGIN{ OFS="\t"; }{ len=$3-$2; diff=F-len; flank=int(diff/2); upflank=downflank=flank; if (diff%2==1) { downflank++; }; print $1, $2-upflank, $3+downflank; }' in.bed | sort-bed - > out.bed
```

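When the difference between the target length and an element's length is odd (that is, when exactly one of the two is odd), the two flanks cannot be equal and must differ by one base. The command above gives the extra base to the downstream flank. Sounds like this is not a problem.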

You might adjust the logic to randomly pick which of `upflank` or `downflank` to decrement or increment in this case, so that you don't impart a bias from this adjustment (esp. if original elements are stranded, like footprints that will ultimately be mapped to TF binding sites or other stranded elements), e.g.:

```
$ TARGET_LENGTH=50
$ awk -vF=${TARGET_LENGTH} 'BEGIN{ OFS="\t"; }{ len=$3-$2; diff=F-len; flank=int(diff/2); upflank=downflank=flank; if (diff%2==1) { if (rand() >= 0.5) { downflank++; } else { upflank++; } }; print $1, $2-upflank, $3+downflank; }' in.bed | sort-bed - > out.bed
```
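Since the point is that every element ends up at the same length, a quick sanity check on the output may be worthwhile. This is only a sketch: `check_bed_lengths` is a hypothetical helper name, and it assumes a plain BED file with the start in column 2 and the end in column 3.

```shell
# Hypothetical helper: succeed (exit 0) only if every interval on stdin
# has length end - start equal to the target given as the first argument.
check_bed_lengths() {
    awk -v F="$1" '($3 - $2) != F { bad++ } END { exit (bad > 0) }'
}
```

Something like `check_bed_lengths 50 < out.bed && echo "all intervals are 50 bp"` would then confirm the result.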

Thank you for your answer! I was going to ask how it handles odd lengths. It is OK if one side has an extra base, as long as the final length is the same.

Thanks for the addition. After discussing this with a colleague this morning it was pointed out that finding the mid-point of each region and then extending out works equally well. I knew I had missed something!
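For anyone landing here later, the mid-point idea could be sketched roughly as below. This is an illustration, not necessarily what my colleague had in mind: `resize_bed` is a made-up function name, the midpoint is rounded down, only the first three BED columns are kept, and an interval that would start below zero is shifted to start at 0 so its length stays fixed. The chromosome end is not checked, since that would need a chromosome sizes file.

```shell
# Hypothetical helper: resize every BED interval to a fixed length
# centred on its midpoint.
# Usage: resize_bed TARGET_LENGTH < in.bed > out.bed
resize_bed() {
    awk -v F="$1" 'BEGIN { OFS = "\t" }
    {
        mid = int(($2 + $3) / 2)    # midpoint of the interval, rounded down
        start = mid - int(F / 2)    # step half the target length upstream
        if (start < 0) start = 0    # shift instead of running off the chromosome start
        print $1, start, start + F  # end is always start + target length
    }'
}

# Small demonstration on three made-up intervals
printf 'chr1\t100\t111\nchr1\t200\t225\nchr1\t5\t20\n' | resize_bed 50
```

On the three made-up intervals this prints `chr1 80 130`, `chr1 187 237` and `chr1 0 50` (tab-separated), so each output interval is exactly 50 bp. Piping the result through `sort-bed` as in the answer above would restore sort order if the shifting changes it.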