Question: Annotate Regions In Bed File With Nearest Downstream Gene
gravatar for jxchong
5.5 years ago by
Postdoc at the University of Washington
jxchong160 wrote:

I have a bed file with predicted TSS/promoter regions (by chromHMM and similar methods) and I would like to annotate each region from the bed file with the gene(s) it is likely to act upon.

I could use closestBed (bedtools) to get the closest genes purely by genomic distance, but it seems like I ought to be able to do better than that -- if a particular TSS region is downstream of the closest gene A, then we actually want to move on and find the closest gene B where TSS is upstream of B. The TSS regions are not stranded (the prediction simply gives coordinates).

Here are some examples, with the > representing direction of transcription and = representing relative genomic distance between regions:

======== >geneA> == TSS ==== >geneB-> ========
Desired result: TSS annotated with geneB

======== <geneA< ===== TSS == >geneB-> ========
Desired result: TSS annotated with geneB

Is there a way to do this with bedtools/other software without rolling my own script? Am I over-thinking this?

bed promoter • 5.7k views
ADD COMMENTlink modified 4.5 years ago by Biostar ♦♦ 20 • written 5.5 years ago by jxchong160
gravatar for Istvan Albert
5.5 years ago by
Istvan Albert ♦♦ 78k
University Park, USA
Istvan Albert ♦♦ 78k wrote:

IMO you are solving the wrong problem.

Your problem is not what bedtools does but how to reconcile potentially conflicting results. But the solution for that is not to reimplement the intersect operation - that would be just doing a lot of work that you don't need to do. Just intersect the upstream regions for both, produce an output file then write a script that process this output. With that the heavy lifting has already been done and you can concentrate your efforts in sorting out which TSS to keep.

ADD COMMENTlink modified 5.5 years ago • written 5.5 years ago by Istvan Albert ♦♦ 78k

Thanks, but no, that isn't what I want to do -- I don't want to detect the TSS for a given gene, I want to annotate all TSS with their nearest gene (because these are computational predictions, sometimes there are multiple separated TSS regions that are all clearly for the same gene. Additionally alternatively spliced transcripts for the same gene can have different TSS.)

Edit: Actually, upon re-reading your answer, I agree that a complete solution doesn't exist. I will use bedops to get halfway there and then write my own script to pick correctly between the nearest upstream/downstream genes.

ADD REPLYlink modified 5.5 years ago • written 5.5 years ago by jxchong160
gravatar for Alex Reynolds
5.5 years ago by
Alex Reynolds26k
Seattle, WA USA
Alex Reynolds26k wrote:

Looking at your conditions, it sounds like you always want the closest downstream element, regardless of what's upstream or how near or far away that upstream element is.

By default, the BEDOPS closest-features application will report the full characteristics of both the nearest leftmost ("upstream") and rightmost ("downstream") elements.

You can therefore feed the output from this application to awk (or another interpreted language) to just report the TSS element and whatever nearest gene is downstream from it.

As a for instance, first sort your input files:

$ sort-bed unsorted.TSS.bed > TSS.bed
$ sort-bed unsorted.genes.bed > genes.bed

Then run the two sorted BED files through closest-features, piping the results to awk with the correct field separator to pick the downstream gene:

$ closest-features TSS.bed genes.bed \
    | awk FS="|" '{ \
        tssElement = $1; \
        upstreamGene = $2; \
        downstreamGene = $3; \  
        print tssElement"|"downstreamGene; \
    }' - \
    > answer.bed

The file answer.bed will be a sorted BED file that contains results in the following format:

[ tss-1 ] | [ nearest-downstream-gene-to-tss-1 ]
[ tss-2 ] | [ nearest-downstream-gene-to-tss-2 ]
[ tss-n ] | [ nearest-downstream-gene-to-tss-n ]

If you have a different condition for picking a nearest element, you could set up those rules in the awk block above. (Feel free to modify your question if you had something else in mind.)

To test features on the basis of strand, you could modify the awk block as follows:

$ closest-features TSS.bed genes.bed \
    | awk FS="|" '{ \
        tssRegion = $1; \
        upstreamGene = $2; \
        downstreamGene = $3; \  
        split(tssRegion, tssElements, "\t"); \
        split(upstreamGene, upstreamGeneElements, "\t"); \
        split(downstreamGene, downstreamGeneElements, "\t"); \
        tssStart = tssElements[1]; \
        tssStop = tssElements[2]; \
        upstreamGeneStart = upstreamGeneElements[1]; \
        upstreamGeneStop = upstreamGeneElements[2]; \
        upstreamGeneStrand = upstreamGeneElements[4]; \
        downstreamGeneStart = downstreamGeneElements[1]; \
        downstreamGeneStop = downstreamGeneElements[2]; \
        downstreamGeneStrand = downstreamGeneElements[4]; \
        trueUpstreamGeneDistance = 0; \
        trueDownstreamGeneDistance = 0; \
        if (upstreamGeneStrand == "+") { \
            trueUpstreamGeneDistance = tssStart - upstreamGeneStart; \
        } \
        else { \
            trueUpstreamGeneDistance = tssStart - upstreamGeneStop; \
        } \
        if (downstreamGeneStrand == "+") { \
            trueDownstreamGeneDistance = downstreamGeneStart - tssStop; \
        } \
        else { \
            trueDownstreamGeneDistance = downstreamGeneStop - tssStop; \
        } \
        if (trueUpstreamGeneDistance > trueDownstreamGeneDistance) { \
            print tssRegion"|"downstreamGene; \
        } \
        else { \
            print tssRegion"|"upstreamGene; \
        } \
    }' - \
    > answer.bed
ADD COMMENTlink modified 5.5 years ago • written 5.5 years ago by Alex Reynolds26k

Hmm, I had looked into closest-features, but this method won't work on:


 chr3    3    4    TSSE


 chr3    1    2    geneE    0    -
 chr3    21    22    geneF    0    +

The result associates TSSE with geneF (instead of geneE, as desired).

However, I think by combining Istvan's answer and your answer, I believe I need to use bedops closest-features to get the nearest genes and then roll my own script to pick correctly between the two nearest genes.

ADD REPLYlink modified 5.5 years ago • written 5.5 years ago by jxchong160

Neither of those two inputs are correct BED (per specification). You'll probably want to fix them before passing them to apps or scripts.

ADD REPLYlink written 5.5 years ago by Alex Reynolds26k

Oops, fixed now. (I'll need to add "chr" in, run bedops, strip out chr later so I can combine the data with my vcf) :P But... that's a separate issue.

ADD REPLYlink modified 5.5 years ago • written 5.5 years ago by jxchong160

It's not clear to me what your rules are, such that geneE would be picked. But you can certainly add logic inside the awk block (or whichever language you prefer instead) to pick an element based on rules of your choice.

ADD REPLYlink modified 5.5 years ago • written 5.5 years ago by Alex Reynolds26k

Since TSSE has no strand info, it could be acting on either the + or - strand? TSSE is upstream of geneE on the - strand and upstream of geneF on the + strand, but it is closer to geneE on the - strand than it is to geneF on the + strand.

ADD REPLYlink written 5.5 years ago by jxchong160

That's easy, then. You use a split() call on tssElement, upstreamGene and downstreamGene within the awk block (the TSS and both genes are delimited with tab characters), look at the strand and start/stop coordinate values of each element, and then decide accordingly and print.

ADD REPLYlink modified 5.5 years ago • written 5.5 years ago by Alex Reynolds26k

Thanks for this solution, very helpful.  Though it seems like awk cannot take the vertical bar ("|") as a field separator.  I get the following error when I try to run the first simple script you've written above:

awk: syntax error at source line 1
 context is
     >>> | <<<
awk: bailing out at source line 1

Is this just a matter of having to change how the closest-features output is delimited?

ADD REPLYlink written 4.6 years ago by karl.f.erhard0

You might try a BEGIN { FS = "|"; } block in the awk script. You could also certainly choose a different delimiter in closest-features.

ADD REPLYlink modified 4.6 years ago • written 4.6 years ago by Alex Reynolds26k

Dear Alex, I have Active Enhancer peaks and I want to assign each peak to nearest TSS (downstream or upstream). My sorted TSS.bed looks like:

chr1    11873   14409
chr1    14361   29370

My sorted genes.bed (active enhancer peak file) looks like:

chr1    812640  813470  Rank_108039     5       .       2.51728 2.10797 0.59423
chr1    842313  842638  Rank_154173     3       .       2.34097 1.79807 0.35120

I tried closest-features TSS.bed genes.bed > answer.bed and result are:

chr1    11873   14409|NA|chr1   812640  813470  Rank_108039     5       .       2.51728 2.10797 0.59423
chr1    14361   29370|NA|chr1   812640  813470  Rank_108039     5       .       2.51728 2.10797 0.59423
chr1    17368   17436|NA|chr1   812640  813470  Rank_108039     5       .       2.51728 2.10797 0.59423

So as you can see 1 entry from genes.bed is assigned against multiple entries of TSS.bed whereas what I want is 1-to-1 mapping. Although it is possible that 2 or 3 different TSS are close to one gene but in my case initial 25 TSS have the same gene and I think I am doing something wrong here (I actually don't know how to modify the awk for my case). Thank you.

ADD REPLYlink written 2.4 years ago by Bioinformatist Newbie230

If you want the output of nearest TSS to gene/peak, then reverse the order of arguments: closest-features genes.bed TSS.bed:

USAGE: closest-features [Process-Flags] <input-file> <query-file>
   All input files must be sorted per sort-bed.
   The program accepts BED and Starch file formats
   May use '-' for a file to indicate reading from standard input (BED format only).

   For every element in <input-file>, determine the two elements from <query-file> falling
     nearest to its left and right edges (See NOTES below).  By default, echo the <input-file>
     element, followed by those left and right elements found in <query-file>.

In the reversed case, your input-file becomes genes.bed and query-file becomes TSS.bed. Each gene/peak gets two elements from TSS.bed that fall nearest to it, upstream and downstream.

In your first example, there are two peaks shown for each TSS. The first is NA because there are no peaks upstream of the first few TSS sites; the nearest peaks are all downstream. In any case, it sounds like you want the other result: the nearest TSSs to each peak, so reversing the order of file arguments should address that.

ADD REPLYlink modified 2.4 years ago • written 2.4 years ago by Alex Reynolds26k
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1087 users visited in the last hour