Question

Variant annotation using several .BED files

0

Entering edit mode

9.5 years ago

drautuna ▴ 60

So, I have a data file containing several hundred variants in the following format:

CHR #       START POS      END POS     VARIANT ID    
1              100                   1000             rs1
1              1200                  1400            rs2

I ran the latter through Annovar to get the gene each variant was in (or its nearest gene and its distance) as well as the region (intronic, exonic, etc) each variant was in. The output had the following columns

GENE/NEARESTGENE   REGION  CHR#   STARTPOS    ENDPOS    REFALLELE   ALTALLELE  VARIANTID
SOMEGENE                   exonic     1         100                 1000           A                  G                 rs1
SOMEGENE2                 intergenic 1        1200                1400           G                 T                    rs2

I moved some columns around to make a file with the following columns, lets call this file.txt

CHR#   STARTPOS    ENDPOS   GENE/NEARESTGENE   REGION  VARIANTID
1           100              1000           SOMEGENE                 exonic           rs1
1           1200             1400           SOMEGENE2               intergenic      rs2

Now, I have several database files - something around 10 - of promoters, TSS, enhancers, etc, all of which in .bed format looking like the following -> lets call these database1.txt ... database10.txt

database1.bed

CHR #       STARTPOS     ENDPOS      LABEL--FOR-THE-REGION/NAME
1               500                600                promoter1
...

database10.bed

CHR #       STARTPOS     ENDPOS      LABEL--FOR-THE-REGION/NAME
1                1100              1250               enhancer1

I want to annotate the "file.txt" using the files from databases1..10.bed. For each database, I would like to add a column to file.txt for that database, and then show the name for that annotation

desired output

CHR#   STARTPOS    ENDPOS   GENE/NEARESTGENE   REGION  VARIANTID   database1.txt  ... database10.txt
1           100              1000           SOMEGENE                 exonic           rs1            promoter1                 -----
1           1200             1400           SOMEGENE2               intergenic      rs2               ---                     enhancer1

The closest thing I can get to this is bedtools annotate, but it doesnt give the specific label (such as promoter1, or enhancer1) for each annotation - instead it just gives the number of intervals overlapped, or the proportion of intervals overlapped out of the total number in the database

Bedtools intersect does a better job, but it only shows the variants that intersected with the database regions. Since I'm working with multiple .bed databases, I'd have to run each one separately, separates the column, and concatenate all of them together to get my desired output. It also doesn't handle the region and gene columns well.

Annovar does the same thing bedtools intersect does, except with less functionality in that it doesn;t even show the specific label.

Thanks for the help.

annovar intersect annotation bedtools • 4.4k views

ADD COMMENT • link updated 3.1 years ago by Ram 43k • written 9.5 years ago by drautuna ▴ 60

Ram · Answer 1 · 2014-10-29

bedtools intersect has new functionality to allow intersections against multiple files. Basic docs are here.

cat q.bed
chr1    0    50

cat d1.bed
chr1    12    18    a1
chr1    17    25    a2

cat d2.bed
chr1    28    30    b1

cat d3.bed
chr1    40    50    c1

# basic intersection with all three files (no annotation)
bedtools intersect -a q.bed -b d1.bed d2.bed d3.bed
chr1    12    18
chr1    17    25
chr1    28    30
chr1    40    50

# report the full record from each file. column 4 is the file _number_
bedtools intersect -a q.bed -b d1.bed d2.bed d3.bed -wa -wb
chr1    0    50    1    chr1    12    18    a1
chr1    0    50    1    chr1    17    25    a2
chr1    0    50    2    chr1    28    30    b1
chr1    0    50    3    chr1    40    50    c1

# report the full record from each file. column 4 is now the file _label_
bedtools intersect -a q.bed -b d1.bed d2.bed d3.bed -wa -wb -names d1 d2 d3

chr1    0    50    d1    chr1    12    18    a1
chr1    0    50    d1    chr1    17    25    a2
chr1    0    50    d2    chr1    28    30    b1
chr1    0    50    d3    chr1    40    50    c1

# report the full record from each file. column 4 is now the filename
bedtools intersect -a q.bed -b d1.bed d2.bed d3.bed -wa -wb -filenames
chr1    0    50    d1.bed    chr1    12    18    a1
chr1    0    50    d1.bed    chr1    17    25    a2
chr1    0    50    d2.bed    chr1    28    30    b1
chr1    0    50    d3.bed    chr1    40    50    c1

Ram · Answer 2 · 2014-10-29

BEDOPS bedmap with bedops can probably help, e.g. assuming sorted inputs:

$ bedops --everything database*.bed | bedmap --echo --echo-map-id-uniq file.bed - > annotated_file.bed

Basically, you use bedops to take the union of your sorted database annotations, and you pipe the union-ed set to bedmap to map against your sorted regions-of-interest (file.bed).

BEDOPS bedops handles more than two files, so you can write --everything from all your annotation files to a UNIX standard output stream, piping it to other tools (including BEDOPS tools) that handle standard input. This lets you do downstream operations in one pass.

The options passed to bedmap specify what attributes or details you want back from any mapped elements.

Here, we use --echo to print out the regions-of-interest, and --echo-map-id-uniq to print out a non-redundant list of IDs from the merged database annotations, which overlap each region-of-interest. The answer is written to a standard file called annotated_file.bed.

More information and links to binaries are available in the docs. Source is available on Github.