Question

Annotate Regions In Bed File With Nearest Downstream Gene

1

Entering edit mode

10.9 years ago

jxchong ▴ 160

I have a bed file with predicted TSS/promoter regions (by chromHMM and similar methods) and I would like to annotate each region from the bed file with the gene(s) it is likely to act upon.

I could use closestBed (bedtools) to get the closest genes purely by genomic distance, but it seems like I ought to be able to do better than that -- if a particular TSS region is downstream of the closest gene A, then we actually want to move on and find the closest gene B where TSS is upstream of B. The TSS regions are not stranded (the prediction simply gives coordinates).

Here are some examples, with the > representing direction of transcription and = representing relative genomic distance between regions:

======== >geneA> == TSS ==== >geneB-> ========
Desired result: TSS annotated with geneB

======== <geneA< ===== TSS == >geneB-> ========
Desired result: TSS annotated with geneB

Is there a way to do this with bedtools/other software without rolling my own script? Am I over-thinking this?

promoter bed • 9.8k views

ADD COMMENT • link updated 9.9 years ago by Biostar 20 • written 10.9 years ago by jxchong ▴ 160

Ram · Answer 1 · 2013-06-19

7

Entering edit mode

10.9 years ago

Alex Reynolds 35k

Looking at your conditions, it sounds like you always want the closest downstream element, regardless of what's upstream or how near or far away that upstream element is.

By default, the BEDOPS closest-features application will report the full characteristics of both the nearest leftmost ("upstream") and rightmost ("downstream") elements.

You can therefore feed the output from this application to awk (or another interpreted language) to just report the TSS element and whatever nearest gene is downstream from it.

As a for instance, first sort your input files:

$ sort-bed unsorted.TSS.bed > TSS.bed
$ sort-bed unsorted.genes.bed > genes.bed

Then run the two sorted BED files through closest-features, piping the results to awk with the correct field separator to pick the downstream gene:

$ closest-features TSS.bed genes.bed \
    | awk FS="|" '{ \
        tssElement = $1; \
        upstreamGene = $2; \
        downstreamGene = $3; \  
        print tssElement"|"downstreamGene; \
    }' - \
    > answer.bed

The file answer.bed will be a sorted BED file that contains results in the following format:

[ tss-1 ] | [ nearest-downstream-gene-to-tss-1 ]
[ tss-2 ] | [ nearest-downstream-gene-to-tss-2 ]
...
[ tss-n ] | [ nearest-downstream-gene-to-tss-n ]

If you have a different condition for picking a nearest element, you could set up those rules in the awk block above. (Feel free to modify your question if you had something else in mind.)

To test features on the basis of strand, you could modify the awk block as follows:

$ closest-features TSS.bed genes.bed \
    | awk FS="|" '{ \
        tssRegion = $1; \
        upstreamGene = $2; \
        downstreamGene = $3; \  
        split(tssRegion, tssElements, "\t"); \
        split(upstreamGene, upstreamGeneElements, "\t"); \
        split(downstreamGene, downstreamGeneElements, "\t"); \
        tssStart = tssElements[1]; \
        tssStop = tssElements[2]; \
        upstreamGeneStart = upstreamGeneElements[1]; \
        upstreamGeneStop = upstreamGeneElements[2]; \
        upstreamGeneStrand = upstreamGeneElements[4]; \
        downstreamGeneStart = downstreamGeneElements[1]; \
        downstreamGeneStop = downstreamGeneElements[2]; \
        downstreamGeneStrand = downstreamGeneElements[4]; \
        trueUpstreamGeneDistance = 0; \
        trueDownstreamGeneDistance = 0; \
        if (upstreamGeneStrand == "+") { \
            trueUpstreamGeneDistance = tssStart - upstreamGeneStart; \
        } \
        else { \
            trueUpstreamGeneDistance = tssStart - upstreamGeneStop; \
        } \
        if (downstreamGeneStrand == "+") { \
            trueDownstreamGeneDistance = downstreamGeneStart - tssStop; \
        } \
        else { \
            trueDownstreamGeneDistance = downstreamGeneStop - tssStop; \
        } \
        if (trueUpstreamGeneDistance > trueDownstreamGeneDistance) { \
            print tssRegion"|"downstreamGene; \
        } \
        else { \
            print tssRegion"|"upstreamGene; \
        } \
    }' - \
    > answer.bed

ADD COMMENT • link 10.8 years ago by Alex Reynolds 35k

0

Entering edit mode

Hmm, I had looked into closest-features, but this method won't work on:

TSS.bed

 chr3    3    4    TSSE

genes.bed

 chr3    1    2    geneE    0    -
 chr3    21    22    geneF    0    +

The result associates TSSE with geneF (instead of geneE, as desired).

However, I think by combining Istvan's answer and your answer, I believe I need to use bedops closest-features to get the nearest genes and then roll my own script to pick correctly between the two nearest genes.

ADD REPLY • link 10.8 years ago by jxchong ▴ 160

0

Entering edit mode

Neither of those two inputs are correct BED (per specification). You'll probably want to fix them before passing them to apps or scripts.

ADD REPLY • link 10.8 years ago by Alex Reynolds 35k

0

Entering edit mode

Oops, fixed now. (I'll need to add "chr" in, run bedops, strip out chr later so I can combine the data with my vcf) :P But... that's a separate issue.

ADD REPLY • link 10.8 years ago by jxchong ▴ 160

0

Entering edit mode

It's not clear to me what your rules are, such that geneE would be picked. But you can certainly add logic inside the awk block (or whichever language you prefer instead) to pick an element based on rules of your choice.

ADD REPLY • link 10.8 years ago by Alex Reynolds 35k

0

Entering edit mode

Since TSSE has no strand info, it could be acting on either the + or - strand? TSSE is upstream of geneE on the - strand and upstream of geneF on the + strand, but it is closer to geneE on the - strand than it is to geneF on the + strand.

ADD REPLY • link 10.8 years ago by jxchong ▴ 160

0

Entering edit mode

That's easy, then. You use a split() call on tssElement, upstreamGene and downstreamGene within the awk block (the TSS and both genes are delimited with tab characters), look at the strand and start/stop coordinate values of each element, and then decide accordingly and print.

ADD REPLY • link 10.8 years ago by Alex Reynolds 35k

0

Entering edit mode

Thanks for this solution, very helpful. Though it seems like awk cannot take the vertical bar (|) as a field separator. I get the following error when I try to run the first simple script you've written above:

awk: syntax error at source line 1
 context is
     >>> | <<<
awk: bailing out at source line 1

Is this just a matter of having to change how the closest-features output is delimited?

ADD REPLY • link updated 4.3 years ago by Ram 43k • written 10.0 years ago by karl.f.erhard • 0

1

Entering edit mode

You might try a BEGIN { FS = "|"; } block in the awk script. You could also certainly choose a different delimiter in closest-features.

ADD REPLY • link 10.0 years ago by Alex Reynolds 35k

0

Entering edit mode

Dear Alex, I have Active Enhancer peaks and I want to assign each peak to nearest TSS (downstream or upstream). My sorted TSS.bed looks like:

chr1    11873   14409
chr1    14361   29370

My sorted genes.bed (active enhancer peak file) looks like:

chr1    812640  813470  Rank_108039     5       .       2.51728 2.10797 0.59423
chr1    842313  842638  Rank_154173     3       .       2.34097 1.79807 0.35120

I tried closest-features TSS.bed genes.bed > answer.bed and result are:

chr1    11873   14409|NA|chr1   812640  813470  Rank_108039     5       .       2.51728 2.10797 0.59423
chr1    14361   29370|NA|chr1   812640  813470  Rank_108039     5       .       2.51728 2.10797 0.59423
chr1    17368   17436|NA|chr1   812640  813470  Rank_108039     5       .       2.51728 2.10797 0.59423

So as you can see 1 entry from genes.bed is assigned against multiple entries of TSS.bed whereas what I want is 1-to-1 mapping. Although it is possible that 2 or 3 different TSS are close to one gene but in my case initial 25 TSS have the same gene and I think I am doing something wrong here (I actually don't know how to modify the awk for my case). Thank you.

ADD REPLY • link 7.8 years ago by Bioinformatist Newbie ▴ 270

1

Entering edit mode

If you want the output of nearest TSS to gene/peak, then reverse the order of arguments: closest-features genes.bed TSS.bed:

USAGE: closest-features [Process-Flags] <input-file> <query-file>
   All input files must be sorted per sort-bed.
   The program accepts BED and Starch file formats
   May use '-' for a file to indicate reading from standard input (BED format only).

   For every element in <input-file>, determine the two elements from <query-file> falling
     nearest to its left and right edges (See NOTES below).  By default, echo the <input-file>
     element, followed by those left and right elements found in <query-file>.

In the reversed case, your input-file becomes genes.bed and query-file becomes TSS.bed. Each gene/peak gets two elements from TSS.bed that fall nearest to it, upstream and downstream.

In your first example, there are two peaks shown for each TSS. The first is NA because there are no peaks upstream of the first few TSS sites; the nearest peaks are all downstream. In any case, it sounds like you want the other result: the nearest TSSs to each peak, so reversing the order of file arguments should address that.

ADD REPLY • link 7.8 years ago by Alex Reynolds 35k

score 4 · Answer 2 · 2013-06-18

4

Entering edit mode

10.9 years ago

Istvan Albert 100k

IMO you are solving the wrong problem.

Your problem is not what bedtools does but how to reconcile potentially conflicting results. But the solution for that is not to reimplement the intersect operation - that would be just doing a lot of work that you don't need to do. Just intersect the upstream regions for both, produce an output file then write a script that process this output. With that the heavy lifting has already been done and you can concentrate your efforts in sorting out which TSS to keep.

ADD COMMENT • link 10.9 years ago by Istvan Albert 100k

0

Entering edit mode

Thanks, but no, that isn't what I want to do -- I don't want to detect the TSS for a given gene, I want to annotate all TSS with their nearest gene (because these are computational predictions, sometimes there are multiple separated TSS regions that are all clearly for the same gene. Additionally alternatively spliced transcripts for the same gene can have different TSS.)

Edit: Actually, upon re-reading your answer, I agree that a complete solution doesn't exist. I will use bedops to get halfway there and then write my own script to pick correctly between the nearest upstream/downstream genes.

ADD REPLY • link 10.8 years ago by jxchong ▴ 160