Tool to calculate extreme most positions in a bed file for a given window
9.9 years ago

Hi, I want to identify the most extreme positions in a BED file relative to a specific position within a window.

E.g. if the base position is chrX 154029186 154029187, and the following are the overlapping positions from another BED file within the specified window, then the tool should output

chrX    154029165    154029172


and

chrX    154028990    154028999


since they are the most extreme positions in the window:

chrX     154029186     154029187     chrX     154029165     154029172
chrX     154029186     154029187     chrX     154028981     154028992
chrX     154029186     154029187     chrX     154028991     154029002
chrX     154029186     154029187     chrX     154028981     154028990
chrX     154029186     154029187     chrX     154028991     154029000
chrX     154029186     154029187     chrX     154028982     154028991
chrX     154029186     154029187     chrX     154028990     154028999

bed • 2.3k views

If I understood correctly, you have two BED files (e.g. "A.bed" and "B.bed") and you want to print only those coordinates of A.bed that do not overlap with B.bed.

If so, install bedtools and run the following (the -v option reports only the entries in A that have no overlap in B):

intersectBed -v -a A.bed -b B.bed > output.bed


Your best bet is to either script something with pybedtools or with R (in GenomicRanges). If the BED files are large, the former is probably more efficient.


Assuming your BED files are sorted correctly, you can pipe your intersections into

awk '{if (NR == 1) extreme1 = $0} END {print extreme1"\n"$0}'


but there should be a better way to solve this.

9.9 years ago

I think that for a simple min-max search (it seems you want to find the row with the smallest start and the one with the largest end), a one-liner in awk would work well:

awk 'NR == 1 || $2 < min { min = $2; min_row = $0 } NR == 1 || $3 > max { max = $3; max_row = $0 } END { print min_row; print max_row }' data.bed


but it could be that I misunderstood what you want.

Advice: when you create an example, keep it simple. For instance, use short, readable numbers such as 100 and 200 rather than large values that are difficult to parse and compare.
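To illustrate that advice, here is a minimal sketch of the same min-max search run on a toy input with small coordinates. The function name extreme_rows and the input rows are made up for illustration, and the sketch assumes a plain 3-column BED layout (start in column 2, end in column 3):

```shell
# Hypothetical helper: read BED rows on stdin and print the row with the
# smallest start (column 2) followed by the row with the largest end (column 3).
extreme_rows() {
    awk '
        # "+ 0" forces a numeric value so later comparisons are numeric
        NR == 1 || $2 < min { min = $2 + 0; min_row = $0 }
        NR == 1 || $3 > max { max = $3 + 0; max_row = $0 }
        END { if (NR > 0) { print min_row; print max_row } }
    '
}

# Toy input with short, readable coordinates (invented for this example)
printf 'chr1\t300\t400\nchr1\t100\t200\nchr1\t500\t900\nchr1\t150\t250\n' | extreme_rows
```

With this input, the 100-200 row is reported as the smallest start and the 500-900 row as the largest end. If your rows come from an intersection with six columns, as in the question, the column numbers would need to be adjusted accordingly.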

9.9 years ago

I don't really understand the format of your dataset. However, an alternative is to calculate the distances between target and query elements with awk and write that value to an additional column. Use GNU sort to sort that column and then take the head or tail, depending on whether the minimum or maximum value is needed.
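The distance approach described above could be sketched roughly as follows. This assumes the six-column layout shown in the question (target interval in columns 2-3, query interval in columns 5-6) and defines the distance as the gap between the two intervals, which is one possible choice rather than the only one. The function name add_distance and the toy rows are hypothetical:

```shell
# Hypothetical helper: append the gap between the target interval (columns 2-3)
# and the query interval (columns 5-6) as a 7th, tab-separated column.
add_distance() {
    awk 'BEGIN { OFS = "\t" }
    {
        if ($6 <= $2)      d = $2 - $6   # query ends before the target starts
        else if ($5 >= $3) d = $5 - $3   # query starts after the target ends
        else               d = 0         # intervals overlap
        $1 = $1                          # rebuild the record using OFS
        print $0, d
    }'
}

# Toy rows (invented): sort numerically on column 7, then take the last row
# for the farthest query; use "head -n 1" instead for the nearest one.
printf 'chrX 1000 1001 chrX 900 950\nchrX 1000 1001 chrX 700 720\nchrX 1000 1001 chrX 980 990\n' |
    add_distance | sort -t "$(printf '\t')" -k7,7n | tail -n 1
```

Here the gaps are 50, 280 and 10, so the last row after sorting is the 700-720 query. Writing the distance as an extra column keeps the original rows intact, so the result can be passed on to other BED tools after dropping column 7.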