Question: Quick way to annotate a target exome BED file with gene name information
0
gravatar for lakhujanivijay
5 weeks ago by
lakhujanivijay4.8k
India
lakhujanivijay4.8k wrote:

I have a whole exome target BED file which looks something like this

1   879287  879533
1   880073  880180
1   880436  880526
1   880897  881033
1   881552  881666

I want to annotate this file with gene name information something like this:

1   879287  879533  gene A
1   880073  880180  gene B
1   880436  880526  gene C
1   880897  881033  gene D
1   881552  881666  gene E

I could see several posts on biostars which I found quite close to a real solution. Can someone point to a working one?

What I tried is downloading a BED format file from UCSC table browser and intersecting that with my BED file using bedtools intersect however several regions are missing.

gene exon bed annotate • 109 views
ADD COMMENTlink written 5 weeks ago by lakhujanivijay4.8k

What I tried is downloading a BED format file from UCSC table browser and intersecting that with my BED file using bedtools intersect however several regions are missing.

how do you know it ? show us the mismatching lines please . Check the files are sorted and the chromsomes names are the same ('1'!='chr1')

ADD REPLYlink written 5 weeks ago by Pierre Lindenbaum127k

What I did was

  1. Downloaded GTF from Ensembl FTP (Homo_sapiens.GRCh37.87.gtf)

  2. Then I made a genes.bed file from it

    awk -F "\t" '$3=="gene" {print $1 "\t"  $4 "\t" $5 "\t" $9 }' Homo_sapiens.GRCh37.87.gtf | sed 's/;/\t/g' | awk -F "\t" '{print $1 "\t"  $2 "\t" $3 "\t" $6}' | sed 's/gene_name "//g' | sed 's/"//g' | grep -v GL  > genes.bed
    
  3. Now I am intersecting this file with my target_exome.bed file and there are around 1200 regions which do not overlap to any gene

    bedtools intersect -a Exome_Target_hg19.bed -b genes.bed -C  | awk -F "\t" '$4==0{print}' | head
    
    1   1262290 1262412 0
    1   1262620 1263143 0
    1   32042749    32043059    0
    1   32044804    32044868    0
    1   32048749    32048842    0
    1   32049061    32049176    0
    1   32050278    32050402    0
    1   32050486    32050637    0
    1   32050751    32050941    0
    1   32051040    32051086    0
    
ADD REPLYlink modified 5 weeks ago • written 5 weeks ago by lakhujanivijay4.8k
{print $1 "\t"  $4 "\t" $5 "\t" $9 }'

should be

{print $1 "\t"  (int($4)-1) "\t" $5 "\t" $9 }'
ADD REPLYlink written 5 weeks ago by Pierre Lindenbaum127k

Thanks Pierre Lindenbaum for the edit; however, the problem remains as it is

ADD REPLYlink written 5 weeks ago by lakhujanivijay4.8k

Is this BED file based on GENCODE/Ensembl? This could explain why certain genes are missing since GENCODE/Ensembl is more comprehensive than RefSeq.

ADD REPLYlink modified 5 weeks ago • written 5 weeks ago by ATpoint31k

https://github.com/imgag/ngs-bits/blob/master/doc/tools/BedAnnotateGenes.md - however it takes some time to install the database.

Many R packages do gene-based annotation.

ADD REPLYlink written 5 weeks ago by German.M.Demidov1.5k

Are you looking for gene symbols (MTOR), or Ensembl identifiers (ENSG*)?

We should be able to help you using the Table Browser (http://genome.ucsc.edu/cgi-bin/hgTables) or Data Integrator (http://genome.ucsc.edu/cgi-bin/hgIntegrator). There are also computational solutions using the mysql server or premade gtf files (http://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/genes/) depending on what output you are looking for.

If you email us your question with your attached file to genome@soe.ucsc.edu we can take a look. Or if the file is too large, then the 1200 regions that don't match, as well as an example of your desired output.

ADD REPLYlink written 4 weeks ago by Luis Nassar360
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1074 users visited in the last hour