Question

Looking for a tool for finding 'nearest feature' given genomic coordinates?

1

Entering edit mode

8.7 years ago

Daniel ★ 4.0k

I have a batch of 'noteworthy positions' (50,000+) and I'm trying to determine the best way to batch annotate them with my annotated genome (GTF format or anything which it can be converted to).

Does anyone know a package that already does this? I have considered writing a basic script but perhaps such a thing exists already.

My intention is to provide the coordinates and return columns such as:

Nearest    Is within    Intron/exon    Distance    positional annotation
gene       gene/UTR/                   from TSS    (if exists)
           Promotor?

Thanks

ChIP-Seq genome • 4.6k views

ADD COMMENT • link updated 19 months ago by Ram 43k • written 8.7 years ago by Daniel ★ 4.0k

2

Entering edit mode

The answer here provides three alternatives, they all probably do what you want.

ADD REPLY • link updated 19 months ago by Ram 43k • written 8.7 years ago by h.mon 35k

0

Entering edit mode

Perfect, thank you.

ADD REPLY • link updated 19 months ago by Ram 43k • written 8.7 years ago by Daniel ★ 4.0k

Ram · Accepted Answer · 2015-08-20

To get the nearest gene, you could use BEDOPS closest-features:

$ closest-features --closest positions.bed genes.bed > answer.bed

To answer if within gene/UTR/promoter/intron/exon, you could use BEDOPS bedmap --indicator (or similar options, depending on what level of detail you need):

$ bedmap --echo --indicator positions.bed genes.bed > answer_genes.bed
$ bedmap --echo --indicator positions.bed UTR.bed > answer_UTR.bed
$ bedmap --echo --indicator positions.bed promoter.bed > answer_promoter.bed

etc.

To get the distance from the nearest TSS, you could use BEDOPS closest-features again:

$ closest-features --closest --dist positions.bed TSS.bed > answer.bed

To get annotations of any overlapping elements, you could use BEDOPS bedmap --echo-map:

$ bedmap --echo --echo-map positions.bed annotations.bed > answer.bed

If your annotations are in a non-BED format, you could use BEDOPS convert2bed to turn them into BED:

$ gtf2bed < annotations.gtf > annotations.bed
$ gff2bed < annotations.gff > annotations.bed
$ vcf2bed < annotations.vcf > annotations.bed

etc.

Because BEDOPS tools use Unix I/O streams, you can convert inside a larger BEDOPS operation pipeline:

$ gff2bed < annotations.gff | closest-features --closest --dist positions.bed - > answer.bed

Or you can use process substitution, if you use bash:

$ closest-features --closest --dist positions.bed <(gff2bed < annotations.gff) > answer.bed

To turn BED-formatted annotations into BED-formatted lists of UTRs, introns, exons and promoters, etc. is easy and a cursory search of Biostars or seqanswers can help locate small scripts in awk or Perl etc. that can do this for various annotation subcategories.

For instance, to turn a BED-formatted list of stranded gene annotations into 1k proximal promoter regions:

$ awk '{ \
        if ($6=="+") { \
            print $1"\t"($2 - 1000)"\t"$2"\t"$4"\t"$5"\t"$6; \
        } \
        else { \
            print $1"\t"$3"\t"($3 + 1000)"\t"$4"\t"$5"\t"$6; \
        } \
    }' genes.bed \
    > promoters.bed

Then it is a matter of using promoters.bed in downstream BEDOPS operations.