How To Find The Closest Distance From Bed Files Between Genes And Repeats That Are Upstream
2
0
Entering edit mode
8.5 years ago
roll ▴ 330

How can I use the closestBed from bedtools to find the closest locations between two bed files. The important bit here is that i want them to be upstream and in correct oriantation.

When I use the -s option, it does not report anything (everything is -1).

Then I checked the -D a option. It is returning some results but not sure if it is the right thing.

The other thing to mention is that my genes bed file (lets call is gene.bed) is organized as

chr1 123 234 +
chr1 456 789 -


rather than end position being smaller to indicate the negative strand.

Whereas my repeats.bed file are organized as

chr1 239 456
chr3 456 987


Does bedtools get confused with this?

Which options should i use if i want to find the distance to nearest repeat that is upstream and in the correct orientation?

bedtools distance bed • 6.6k views
3
Entering edit mode
8.5 years ago
> cat foo.bed
chr1    123    234    .    .    +
chr1    456    789    .    .    -

> cat bar.bed
chr1    239    456
chr1    800    900
chr3    456    987

> closestBed -D a -id -a foo.bed -b bar.bed
chr1    456    789    .    .    -    chr1    800    900    -12


It helps if you properly format the BED file. You may also consider the -io option. You can trivially convert your "BED" file into a real BED file with awk '{printf("%s\t%s\t%s\t.\t.\t%s\n", $1,$2, $3,$4)}' incorrectly_formatted.bed > correctly_formatted.bed.

0
Entering edit mode

thanks, But what does -s or -S does? Shall i ignore it?

0
Entering edit mode

-s and -S are about overlaps. If you want something upstream, then it shouldn't be overlapping (unless you want to count an overlap that extends upstream as a hit, which is also possible). For that, though, you would need strand information for everything in bar.bed (this, of course, wouldn't make any sense for repeats, which have no strand).

3
Entering edit mode
8.5 years ago

If you're amenable to alternatives, you could use BEDOPS closest-features on strand-separated genes.

For instance, sort the genes and repeats, if not sorted:

$sort-bed genes.bed > sorted-genes.bed$ sort-bed repeats.bed > sorted-repeats.bed


Then separate your genes by strand with awk (or your tool of choice):

$awk '($4=="+")' sorted-genes.bed > sorted-genes-forward.bed
$awk '($4=="-")' sorted-genes.bed > sorted-genes-reverse.bed


Then run closest-features --dist on each gene set, against the repeats. The --dist operand reports the distances between the gene and the nearest upstream and downstream repeats.

In the forward strand case, we use awk to print out the first and second fields separated by the pipe character (|). The first field is the gene. The second field is the repeat element upstream from the forward-strand gene element:

$closest-features --dist sorted-genes-forward.bed sorted-repeats.bed \ | awk 'BEGIN {FS='|'} {print$1"|"$2}' - \ > genes-repeats-forward-with-distances.bed  In the reverse strand case, we use awk to print out the first and third fields separated by the pipe character. The first field is the gene. The third field is the repeat element upstream from the reverse-strand gene element: $ closest-features --dist sorted-genes-reverse.bed sorted-repeats.bed \
| awk 'BEGIN {FS='|'} {print $1"|"$3}' - \
> genes-repeats-reverse-with-distances.bed


If you want to put results back together at the end, just use BEDOPS bedops with the --everything operand to do a multiset union:

\$ bedops --everything genes-repeats-*-with-distances.bed > final-answer.bed


Like this example shows, one nice thing about BEDOPS bedops is that it can do operations on many input files at once, not just two BED files.