Looking for a tool for finding 'nearest feature' given genomic coordinates?
1
1
Entering edit mode
8.7 years ago
Daniel ★ 4.0k

I have a batch of 'noteworthy positions' (50,000+) and I'm trying to determine the best way to batch annotate them with my annotated genome (GTF format or anything which it can be converted to).

Does anyone know a package that already does this? I have considered writing a basic script but perhaps such a thing exists already.

My intention is to provide the coordinates and return columns such as:

Nearest    Is within    Intron/exon    Distance    positional annotation
gene       gene/UTR/                   from TSS    (if exists)
           Promotor?

Thanks

ChIP-Seq genome • 4.6k views
ADD COMMENT
2
Entering edit mode

The answer here provides three alternatives, they all probably do what you want.

ADD REPLY
0
Entering edit mode

Perfect, thank you.

ADD REPLY
3
Entering edit mode
8.7 years ago

To get the nearest gene, you could use BEDOPS closest-features:

$ closest-features --closest positions.bed genes.bed > answer.bed

To answer if within gene/UTR/promoter/intron/exon, you could use BEDOPS bedmap --indicator (or similar options, depending on what level of detail you need):

$ bedmap --echo --indicator positions.bed genes.bed > answer_genes.bed
$ bedmap --echo --indicator positions.bed UTR.bed > answer_UTR.bed
$ bedmap --echo --indicator positions.bed promoter.bed > answer_promoter.bed

etc.

To get the distance from the nearest TSS, you could use BEDOPS closest-features again:

$ closest-features --closest --dist positions.bed TSS.bed > answer.bed

To get annotations of any overlapping elements, you could use BEDOPS bedmap --echo-map:

$ bedmap --echo --echo-map positions.bed annotations.bed > answer.bed

If your annotations are in a non-BED format, you could use BEDOPS convert2bed to turn them into BED:

$ gtf2bed < annotations.gtf > annotations.bed
$ gff2bed < annotations.gff > annotations.bed
$ vcf2bed < annotations.vcf > annotations.bed

etc.

Because BEDOPS tools use Unix I/O streams, you can convert inside a larger BEDOPS operation pipeline:

$ gff2bed < annotations.gff | closest-features --closest --dist positions.bed - > answer.bed

Or you can use process substitution, if you use bash:

$ closest-features --closest --dist positions.bed <(gff2bed < annotations.gff) > answer.bed

To turn BED-formatted annotations into BED-formatted lists of UTRs, introns, exons and promoters, etc. is easy and a cursory search of Biostars or seqanswers can help locate small scripts in awk or Perl etc. that can do this for various annotation subcategories.

For instance, to turn a BED-formatted list of stranded gene annotations into 1k proximal promoter regions:

$ awk '{ \
        if ($6=="+") { \
            print $1"\t"($2 - 1000)"\t"$2"\t"$4"\t"$5"\t"$6; \
        } \
        else { \
            print $1"\t"$3"\t"($3 + 1000)"\t"$4"\t"$5"\t"$6; \
        } \
    }' genes.bed \
    > promoters.bed

Then it is a matter of using promoters.bed in downstream BEDOPS operations.

ADD COMMENT
0
Entering edit mode

This is absolutely perfect, thank you. The holy trinity of upvote, bookmark and accept!

ADD REPLY

Login before adding your answer.

Traffic: 2002 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6