Question: Annotate .bed file with gene names
3
6.3 years ago by
stevenlang123180
United States
stevenlang123180 wrote:

Hi y'all, 

I have a .bed file currently formatted like this: 

chr     start   stop
1    14520    14812
1    65409    65725
1    65731    66073
1    69381    69700
1    721281    722042
1    752816    753135 

I would like to get something that looks like this (where, for overlapping exons, the exon boundary is used, unless that leaves less than 10 bp, in which case the probe is expanded to include at least 10 bp of sequence): 

chr start   stop    name
1   69090   70008   OR4F5
1   565876  566576  
1   801642  802733  
1   861321  861393  SAMD11
1   865534  865716  SAMD11 

Is there a way that I can use UCSC or another tool to accomplish this?

next-gen assembly genome • 11k views
ADD COMMENTlink modified 3.6 years ago by windsur20 • written 6.3 years ago by stevenlang123180
4
6.3 years ago by
Manvendra Singh2.1k
Berlin, Germany
Manvendra Singh2.1k wrote:

You can download a BED file from UCSC that carries gene names in a separate column (or download the GTF and convert it into BED format).

Once you have that file, extend the intervals in your own BED file by 10 base pairs on each side with an awk command:

awk '{ print $1,$2-10,$3+10}' OFS="\t" your_file.bed > new_file.bed 

(If you have strand information, take the strand into account when extending.)
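The padding step can be tried on a toy file first. This is only a sketch: the toy coordinates are made up, and skipping a header row and clamping starts at 0 are assumptions added here, not part of the command above.

```shell
# Toy input with a header row, as in the question (tab-separated; values are illustrative).
printf 'chr\tstart\tstop\n1\t5\t40\n1\t14520\t14812\n' > your_file.bed

# Skip the header, pad both ends by 10 bp, and never let a start go below 0.
awk 'BEGIN { FS = OFS = "\t" }
NR == 1 && $2 !~ /^[0-9]+$/ { next }   # assumption: a non-numeric start marks a header row
{
    start = $2 - 10
    if (start < 0) start = 0
    print $1, start, $3 + 10
}' your_file.bed > new_file.bed

cat new_file.bed
```

Ends are not clamped to chromosome lengths here; if that matters, bedtools slop with a genome file handles both ends.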

Then intersect with the downloaded file:

bedtools intersect -a new_file.bed -b downloaded_file.bed -f 1 -wa -wb > results.bed

ADD COMMENTlink modified 6.3 years ago • written 6.3 years ago by Manvendra Singh2.1k

Thank you very much!!

ADD REPLYlink written 6.3 years ago by stevenlang123180

If you separated the GTF and BED files by strand (+ and -), is there a way to merge the resulting files together? Also, what if there are multiple features that match your new_file? I am also interested in getting 'exon, intron, 3'UTR, 5'UTR' features.

ADD REPLYlink modified 2.6 years ago • written 2.6 years ago by CrisMar80
4
6.3 years ago by
Alex Reynolds31k
Seattle, WA USA
Alex Reynolds31k wrote:

Grab exons, e.g. via GENCODE:

$ wget ftp://ftp.sanger.ac.uk/pub/gencode/release_21/gencode.21.annotation.gtf.gz
$ gunzip --stdout gencode.21.annotation.gtf.gz \
    | gtf2bed - \
    | grep "exon" \
    > gencode.exons.bed

Then use bedmap to map exon IDs to your regions of interest (roi.bed):

$ bedmap --echo --echo-map-id-uniq roi.bed gencode.exons.bed > answer.bed
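To see what kind of output to expect, here is a rough emulation of that mapping step in plain awk on toy data. It is not bedmap itself, just a sketch of the same idea (each region followed by the unique IDs of overlapping elements, separated from the region by "|" and from each other by ";"). File names, coordinates and gene names are made up, and the file passed in place of the exons file is assumed to be a simple four-column BED.

```shell
# Toy regions of interest (hypothetical coordinates, tab-separated BED3).
printf '1\t100\t200\n1\t500\t600\n' > toy.roi.bed
# Toy exon annotations (BED4: chr, start, end, name); names are invented.
printf '1\t150\t250\tGENE_A\n1\t180\t190\tGENE_A\n1\t190\t300\tGENE_B\n' > toy.exons.bed

# First file: load exons into arrays. Second file: report unique overlapping names per region.
awk 'BEGIN { FS = OFS = "\t" }
NR == FNR { c[NR] = $1; s[NR] = $2 + 0; e[NR] = $3 + 0; id[NR] = $4; n = NR; next }
{
    names = ""; split("", seen)          # reset the per-region state
    for (i = 1; i <= n; i++) {
        # Half-open BED intervals overlap when each starts before the other ends.
        if (c[i] == $1 && s[i] < $3 + 0 && e[i] > $2 + 0 && !(id[i] in seen)) {
            seen[id[i]] = 1
            names = (names == "" ? id[i] : names ";" id[i])
        }
    }
    print $0 "|" names
}' toy.exons.bed toy.roi.bed > toy.answer.txt

cat toy.answer.txt
```

Regions with no overlapping exon end up with an empty name field, like the unnamed rows in the question's desired output. The nested loop is fine for a toy file but far too slow for a whole annotation, which is where bedmap on sorted input comes in.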

If you need to, use awk to pre-process the exons file based on your criteria:

$ awk '{if (($3-$2) > 10) { print $0 } else { print $1"\t"$2"\t"($2+10)"\t"$4}}' gencode.exons.bed > expanded.exons.bed

Then map on that result.
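A variant of that pre-processing step, sketched on toy data: it widens short exons to 10 bp by moving the end coordinate only, and keeps every column rather than just the first four. The toy file names and the six-column layout are assumptions for illustration.

```shell
# Toy exons: one 5 bp exon and one 100 bp exon (BED6, tab-separated; values are invented).
printf '1\t100\t105\tGENE_A\t.\t+\n1\t200\t300\tGENE_B\t.\t-\n' > toy.short.exons.bed

awk 'BEGIN { FS = OFS = "\t" }
($3 - $2) < 10 { $3 = $2 + 10 }   # widen exons shorter than 10 bp, extending the end only
{ print }' toy.short.exons.bed > toy.expanded.exons.bed

cat toy.expanded.exons.bed
```

Whether the extra bases should go on the end, the start, or be split between both sides depends on how you read the 10 bp requirement, so adjust the assignment to suit.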

ADD COMMENTlink written 6.3 years ago by Alex Reynolds31k

Improving this with a pipe from curl to gunzip (without writing gtf.gz):

curl -s "ftp://ftp.sanger.ac.uk/pub/gencode/release_21/gencode.21.annotation.gtf.gz" |
   gunzip -c |
   gtf2bed - |
   <your code>

 

:-)

 

ADD REPLYlink modified 6.3 years ago • written 6.3 years ago by PoGibas4.8k

Thanks for the feedback. It is sometimes useful to keep the annotation file on disk, in compressed form to save space, and to extract it only when an analysis needs it. Downloading a large file over the network can be a costly part of an analysis in terms of time, especially if the download is repeated unnecessarily. In any case, there are lots of ways to use wget or curl to follow either approach.

ADD REPLYlink written 6.3 years ago by Alex Reynolds31k
0
3.6 years ago by
windsur20
windsur20 wrote:

Is it possible to do the same thing starting from a BAM file? I generated the BED file using bamtobed, but I do not know which option I should choose to get the name of the gene and the exon. Thanks!

ADD COMMENTlink written 3.6 years ago by windsur20