N Closest Genes To A Given Location
4
1
Entering edit mode
10.0 years ago
dfernan ▴ 710

Hi,

This is basically an extension of the following question already asked in biostar (http://biostars.org/post/show/53561/python-finding-gene-closest-to-a-given-location/).

Let us say I have a list of genomic regions (as a bed file), and also a list of genes (as a bed file). For each genomic region I want to find the 5 (or N to be general) closest genes. How would I try to do that? Any suggestions?

Thanks!

intersect bedtools • 8.0k views
2
Entering edit mode
10.0 years ago

The BEDOPS suite includes the closest-features tool, which finds the nearest feature ("gene" from a set of "genes") from a reference input (individual "region"). Overlaps are allowed. Both inputs are UCSC BED-formatted and lexicographically sorted.

As well as file input, the closest-features command accepts UNIX standard input. In other words, you can keep passing results off to the next search. I'll explain below how this makes it easy to find N closest genes.

First, the command:

$closest-features Regions.bed Genes.bed  will return the closest two genes in Genes.bed for each element in Regions.bed. The format of this output is, generally (see --help for options): [ region element #1 ] | [ upstream element ] | [ downstream element ] [ region element #2 ] | [ upstream element ] | [ downstream element ] ... [ region element #n ] | [ upstream element ] | [ downstream element ]  Note the use of the | delimiter to split the elements. We can make use of this when passing results to standard UNIX utilities. The following command yields the first closest gene downstream of the region element, using awk to split the result by the | delimiter and print out the third item (downstream element): $ closest-features Regions.bed Genes.bed \
| awk -F'|' '{print $3}' -  For sake of demonstration, I'm sticking to downstream items. Note that we can complicate things here if we want an upstream element, by either printing the second item, or using the --dist option with closest-features. This option returns distance in bases, along with the nearest "hits" up- and downstream. By passing those distances to the awk statement, a simple conditional on the distance value could print either the second item (upstream hit) or third item (downstream hit). Note that the output from this command — the first nearest gene — is in BED format, because it is an element in Genes.bed! This is really useful, because this means we can pass this back into closest-features as standard input, and it will try to find the nearest hit to that result. Because closest-features takes in reference regions in standard input, these searches can be chained together on the command line. This BEDOPS utility follows the GNU convention of using the hyphen to denote standard input. In other words, to search for the second closest gene upstream, we pass the results of the first search as a reference region for closest-features to search, using a standard UNIX pipe: $ closest-features Regions.bed Genes.bed \
| awk -F'|' '{print $3}' - \ | closest-features - Genes.bed \ | awk -F'|' '{print$3}' -


And we find the third closest feature, by passing that result to another search:

$closest-features Regions.bed Genes.bed \ | awk -F'|' '{print$3}' - \
| closest-features - Genes.bed \
| awk -F'|' '{print $3}' - | closest-features - Genes.bed \ | awk -F'|' '{print$3}' -


And so on and so on, piping N-1 times for N closest features (in this case, features that are downstream of the input region element).

The output of each Nth iteration of this command is a BED-formatted genomic coordinate that represents the Nth-closest element ("gene") to the region-of-interest.

Let's use these chained commands to build files containing the first through fifth closest features (from Genes.bed) to elements in Regions.bed:

$closest-features Regions.bed Genes.bed | awk -F'|' {print$3} - > 1.bed
$closest-features Regions.bed Genes.bed | awk -F'|' {print$3} - | closest-features - Genes.bed | awk -F'|' {print $3} - > 2.bed$ closest-features ... > 3.bed
$closest-features ... > 4.bed$ closest-features ... > 5.bed


(You could write a script to construct and execute these chained commands, in order to save time.)

Once you have the individual results, it's a simple matter of concatenating the region element with each result. We can use a simple UNIX paste command to do this, e.g.:

$paste -d'|' Regions.bed 1.bed 2.bed 3.bed 4.bed 5.bed > Results.bed  Each line of Results.bed should look something like: [ region-1 ] | [ first hit ] | [ second hit ] | ... | [ fifth hit ] [ region-2 ] | [ first hit ] | [ second hit ] | ... | [ fifth hit ] ... [ region-n ] | [ first hit ] | [ second hit ] | ... | [ fifth hit ]  Note that the inputs to closest-features should be lexicographically-sorted. BEDOPS provides a tool called sort-bed which does this. You only need to sort once, if unsorted: $ sort-bed Regions.bed > Regions.sorted.bed
$sort-bed Genes.bed > Genes.sorted.bed$ closest-features Regions.sorted.bed Genes.sorted.bed
...

0
Entering edit mode

Hi Alex,

I am in trouble with closest-features command of bedops. Could you please share a sample "Regions.bed" and "Genes.bed" files as reference? I have Chromosome, Start and End information and need to find out up-and-down stream genes. Thanks a lot for your input.

Bahar

0
Entering edit mode

0
Entering edit mode

I've gone through the documentation but could not help!

I have a Genes list .bed file as follows:

chr1    4058    4397    orf19.6115    1000    +
chr1    4408    4720    orf19.6114    1000    -
chr1    8596    8908    orf19.6113    1000    -
chr1    10717    11485    orf19.6112    1000    +
chr1    11647    11995    orf19.6110    1000    -


and Regions .bed file as follows:

chr1    24871    25386
chr1    32516    33701
chr1    39376    41131
chr1    42501    42796
chr1    43346    43531


Command I use is:

closest-features --closest coh1.bed A21Genes1.bed


and no result...

Can you please tell me where the error is?

Thanks a lot Alex.

0
Entering edit mode

I can try to replicate any problems here, but first I need a reproducible example. Can you post your files somewhere for me to download, with their filenames matching the command you are using, and please post the exact command you are using with these files?

0
Entering edit mode

Hi Alex,

I could fix the issue and get my result.

Thanks a lot...

1
Entering edit mode
10.0 years ago

You would use the exact same techniques as listed in the post that you link to.

That being said in general I find the concept of closest to be fuzzily defined. Almost always what people actually mean is to find the closest within a certain limit or window. Also forcing the number of genes to a predefined number is also likely to introduce artifacts. It is really rare that a observation could be attributed to exactly 5 but not 4 or 6 genes/snps etc.

The best way is to think about a range and count genes within that.

Thus instead of closestBed you would use say windowBed. In the case of interval trees you query for a range.

1
Entering edit mode
10.0 years ago
grosscol ▴ 40

You could basically use the approach given by Pierre in the post you linked to, but modify it slightly. First I would make sure the reference bed file is sorted by chromosome, start position, and end position. Perhaps using "sort -k 1,1 -k 2,2n -k3,3n >>ref.sorted.bed" after capturing the header information into the new file first.

#get intersections from input.bed to ref.bed (or single closest)
closestBed -t first -a input.bed -b ref.sorted.bed > interesects.txt


That should at least give you a listing of where to start for each. Next, you might iterate over every line of intersects and match the line in your reference, then cut out X above and X below.

# iterate over intersects.txt and find matching line in ref.sorted.bed
for line in intersects.txt
do
arr=$(echo$line | tr "\t" "\n")
# need to escape the + or - at field 9
patt=${arr[4]}$'\t'${arr[5]}$'\t'${arr[6]}$'\t'${arr[7]}$'\t'${arr[8]}$'\t''\'${arr[9]} # will only give you the matching line. Need to substitute multi-line matching here. grep$patt ref.sorted.bed
done


Off the top of my head I can't sort out how to match multiple lines with grep. Perl or Python can yield more elegant solutions. It might be worth it

The bedtools documentation specifies:

$cat A.bed chr1 100 101 rs1234$ cat B.bed
chr1  0  1000  geneA 100  +

\$ closestBed –a A.bed –b B.bed  –t first
chr1  100  101  rs1234  chr1  0  1000  geneA 100  +

1
Entering edit mode
10.0 years ago
grosscol ▴ 40

Thinking more about it, python would be a lot less painful. Something like this perhaps.

#Read sorted reference bed
infile = open("ref.sorted.bed","r")
infile.close()

infile = open("intersects.txt","r")
infile.close()

#get index of match in ref.
for ln in intersects:
spl=ln.split('\t')
index = refbed.find( spl[7] )


Then take the text surrounding index up and down as many lines as you need. This isn't going to work for large files, but might give you some ideas or someplace to start.