Quick way to annotate a target exome BED file with gene name information
0
0
Entering edit mode
4.2 years ago

I have a whole exome target BED file which looks something like this

1   879287  879533
1   880073  880180
1   880436  880526
1   880897  881033
1   881552  881666

I want to annotate this file with gene name information something like this:

1   879287  879533  gene A
1   880073  880180  gene B
1   880436  880526  gene C
1   880897  881033  gene D
1   881552  881666  gene E

I could see several posts on biostars which I found quite close to a real solution. Can someone point to a working one?

What I tried is downloading a BED format file from UCSC table browser and intersecting that with my BED file using bedtools intersect however several regions are missing.

exon bed annotate gene • 2.8k views
ADD COMMENT
0
Entering edit mode

What I tried is downloading a BED format file from UCSC table browser and intersecting that with my BED file using bedtools intersect however several regions are missing.

how do you know it ? show us the mismatching lines please . Check the files are sorted and the chromsomes names are the same ('1'!='chr1')

ADD REPLY
0
Entering edit mode

What I did was

  1. Downloaded GTF from Ensembl FTP (Homo_sapiens.GRCh37.87.gtf)

  2. Then I made a genes.bed file from it

    awk -F "\t" '$3=="gene" {print $1 "\t"  $4 "\t" $5 "\t" $9 }' Homo_sapiens.GRCh37.87.gtf | sed 's/;/\t/g' | awk -F "\t" '{print $1 "\t"  $2 "\t" $3 "\t" $6}' | sed 's/gene_name "//g' | sed 's/"//g' | grep -v GL  > genes.bed
    
  3. Now I am intersecting this file with my target_exome.bed file and there are around 1200 regions which do not overlap to any gene

    bedtools intersect -a Exome_Target_hg19.bed -b genes.bed -C  | awk -F "\t" '$4==0{print}' | head
    
    1   1262290 1262412 0
    1   1262620 1263143 0
    1   32042749    32043059    0
    1   32044804    32044868    0
    1   32048749    32048842    0
    1   32049061    32049176    0
    1   32050278    32050402    0
    1   32050486    32050637    0
    1   32050751    32050941    0
    1   32051040    32051086    0
    
ADD REPLY
0
Entering edit mode
{print $1 "\t"  $4 "\t" $5 "\t" $9 }'

should be

{print $1 "\t"  (int($4)-1) "\t" $5 "\t" $9 }'
ADD REPLY
0
Entering edit mode

Thanks Pierre Lindenbaum for the edit; however, the problem remains as it is

ADD REPLY
0
Entering edit mode

lakhujanivijay, BEDTools will do this via any GTF from GENCODE, Ensembl, NCBI, etc. You could also obtain annotation from UCSC and use that. BEDTools has many different functions and parameters - you'll get the right combination eventually.

ADD REPLY
0
Entering edit mode

Is this BED file based on GENCODE/Ensembl? This could explain why certain genes are missing since GENCODE/Ensembl is more comprehensive than RefSeq.

ADD REPLY
0
Entering edit mode

https://github.com/imgag/ngs-bits/blob/master/doc/tools/BedAnnotateGenes.md - however it takes some time to install the database.

Many R packages do gene-based annotation.

ADD REPLY
0
Entering edit mode

Are you looking for gene symbols (MTOR), or Ensembl identifiers (ENSG*)?

We should be able to help you using the Table Browser (http://genome.ucsc.edu/cgi-bin/hgTables) or Data Integrator (http://genome.ucsc.edu/cgi-bin/hgIntegrator). There are also computational solutions using the mysql server or premade gtf files (http://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/genes/) depending on what output you are looking for.

If you email us your question with your attached file to genome@soe.ucsc.edu we can take a look. Or if the file is too large, then the 1200 regions that don't match, as well as an example of your desired output.

ADD REPLY

Login before adding your answer.

Traffic: 2077 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6