Question

How To Find The Closest Distance From Bed Files Between Genes And Repeats That Are Upstream

0

Entering edit mode

10.3 years ago

roll ▴ 350

How can I use the closestBed from bedtools to find the closest locations between two bed files. The important bit here is that i want them to be upstream and in correct oriantation.

When I use the -s option, it does not report anything (everything is -1).

Then I checked the -D a option. It is returning some results but not sure if it is the right thing.

The other thing to mention is that my genes bed file (lets call is gene.bed) is organized as

chr1 123 234 +
chr1 456 789 -

rather than end position being smaller to indicate the negative strand.

Whereas my repeats.bed file are organized as

chr1 239 456
chr3 456 987

Does bedtools get confused with this?

Which options should i use if i want to find the distance to nearest repeat that is upstream and in the correct orientation?

bedtools distance bed • 8.2k views

ADD COMMENT • link updated 10.3 years ago by Alex Reynolds 35k • written 10.3 years ago by roll ▴ 350

score 3 · Answer 1 · 2014-01-07

3

Entering edit mode

10.3 years ago

Devon Ryan 104k

> cat foo.bed
chr1    123    234    .    .    +
chr1    456    789    .    .    -

> cat bar.bed
chr1    239    456
chr1    800    900
chr3    456    987

> closestBed -D a -id -a foo.bed -b bar.bed 
chr1    456    789    .    .    -    chr1    800    900    -12

It helps if you properly format the BED file. You may also consider the -io option. You can trivially convert your "BED" file into a real BED file with awk '{printf("%s\t%s\t%s\t.\t.\t%s\n", $1, $2, $3, $4)}' incorrectly_formatted.bed > correctly_formatted.bed.

ADD COMMENT • link 10.3 years ago by Devon Ryan 104k

0

Entering edit mode

thanks, But what does -s or -S does? Shall i ignore it?

ADD REPLY • link 10.3 years ago by roll ▴ 350

0

Entering edit mode

-s and -S are about overlaps. If you want something upstream, then it shouldn't be overlapping (unless you want to count an overlap that extends upstream as a hit, which is also possible). For that, though, you would need strand information for everything in bar.bed (this, of course, wouldn't make any sense for repeats, which have no strand).

ADD REPLY • link 10.3 years ago by Devon Ryan 104k

score 3 · Answer 2 · 2014-01-07

If you're amenable to alternatives, you could use BEDOPS closest-features on strand-separated genes.

For instance, sort the genes and repeats, if not sorted:

$ sort-bed genes.bed > sorted-genes.bed
$ sort-bed repeats.bed > sorted-repeats.bed

Then separate your genes by strand with awk (or your tool of choice):

$ awk '($4=="+")' sorted-genes.bed > sorted-genes-forward.bed
$ awk '($4=="-")' sorted-genes.bed > sorted-genes-reverse.bed

Then run closest-features --dist on each gene set, against the repeats. The --dist operand reports the distances between the gene and the nearest upstream and downstream repeats.

In the forward strand case, we use awk to print out the first and second fields separated by the pipe character (|). The first field is the gene. The second field is the repeat element upstream from the forward-strand gene element:

$ closest-features --dist sorted-genes-forward.bed sorted-repeats.bed \
    | awk 'BEGIN {FS='|'} {print $1"|"$2}' - \
    > genes-repeats-forward-with-distances.bed

In the reverse strand case, we use awk to print out the first and third fields separated by the pipe character. The first field is the gene. The third field is the repeat element upstream from the reverse-strand gene element:

$ closest-features --dist sorted-genes-reverse.bed sorted-repeats.bed \
    | awk 'BEGIN {FS='|'} {print $1"|"$3}' - \
    > genes-repeats-reverse-with-distances.bed

If you want to put results back together at the end, just use BEDOPS bedops with the --everything operand to do a multiset union:

$ bedops --everything genes-repeats-*-with-distances.bed > final-answer.bed

Like this example shows, one nice thing about BEDOPS bedops is that it can do operations on many input files at once, not just two BED files.