Question: how to get gene names and gene id given co ordinates from ensembl
0
gravatar for novicebioinforesearcher
14 months ago by

I need to get annotations for co ordinates that are from junctions.bed file (tophat) eg.

file1:

  1 4764597 4767606 -
  1 4764597 4766491 -
  1 4766882 4767606 -
  1 4767729 4772649 -
  1 4767729 4768829 -
  1 4767729 4775654 -
  1 4772382 4772649 -
  1 4772814 4774159 -

using ensembl mouse.37.67 gtf or gff what is the best way to get the gene id and gene name for these co ordinates(~211000 rows)?

couple of questions regarding this 1) read that bedtools can do this but not sure if it takes gtf (from documentation it only takes BAM/BED/GFF/VCF) 2) can i convert gtf to gff? 3) if i have to code myself how do i factor the (-) or (+) strand

any pointers appreciated.

rna-seq ensembl • 672 views
ADD COMMENTlink modified 14 months ago by Emily_Ensembl15k • written 14 months ago by novicebioinforesearcher40
3
gravatar for Devon Ryan
14 months ago by
Devon Ryan81k
Freiburg, Germany
Devon Ryan81k wrote:
  1. bedtools intersect accepts a GTF file.
  2. GTF is a vague subversion (sometimes called 2.2) of GFF.
  3. bedtools can do strand-specific intersects, though whether that makes sense or not is up to you.
ADD COMMENTlink written 14 months ago by Devon Ryan81k

thanks devon, I am confused with the results,

command used

bedtools intersect -wa -a mm9.gff -b junction.bed > bedtools.out.txt
cat junction.bed | wc -l
211618
cat bedtools.out.txt | wc -l 
746085

am I not suppose to get the same number of lines?

ADD REPLYlink modified 14 months ago • written 14 months ago by novicebioinforesearcher40

I imagine it intersects everything, so filter out the exons/CDS/transcript entries from the results.

ADD REPLYlink written 14 months ago by Devon Ryan81k
3
gravatar for Alex Reynolds
14 months ago by
Alex Reynolds24k
Seattle, WA USA
Alex Reynolds24k wrote:

Via BEDOPS bedmap and gtf2bed:

$ bedmap --echo --echo-map-id-uniq --delim '\t' junctions.bed <(gtf2bed < annotations.gtf) > answer.bed

Replace --echo-map-id-uniq with --echo-map if you want the entire annotation.

Replace gtf2bed with gff2bed if your annotations are in GFF format.

You can use awk to easily split UCSC BED files by stranded subsets, before operation.

$ awk '($6=="+")' in.bed > in.forwardStranded.bed

Replace the 6 in $6 with the column number of your strand, if it is not in the sixth column (per UCSC).

ADD COMMENTlink modified 14 months ago • written 14 months ago by Alex Reynolds24k

the question is do we need to account for strand while performing such overlap and annotation.

ADD REPLYlink written 14 months ago by novicebioinforesearcher40
1

It depends on your junctions.bed and annotations files. If your reference file is stranded and the annotations are stranded, then separate the reference file by strand, separate the annotations by strand, and run the set operations as shown on the separated files.

For the forward-stranded sets:

$ awk '($6=="+")' junctions.bed > junctions.fw.bed
$ gff2bed < annotations.gff | awk '($6=="+")' > annotations.fw.bed
$ bedmap --echo --echo-map-id-uniq --delim '\t' junctions.fw.bed annotations.fw.bed > answer.fw.bed

For the reverse-stranded sets:

$ awk '($6=="-")' junctions.bed > junctions.rv.bed
$ gff2bed < annotations.gff | awk '($6=="-")' > annotations.rv.bed
$ bedmap --echo --echo-map-id-uniq --delim '\t' junctions.rv.bed annotations.rv.bed > answer.rv.bed

Then you could do a multiset union operation on the stranded answers to get a single file with all your overlaps:

$ bedops --everything answer.fw.bed answer.rv.bed > answer.bed
ADD REPLYlink modified 14 months ago • written 14 months ago by Alex Reynolds24k

awesome thanks will try..

ADD REPLYlink written 14 months ago by novicebioinforesearcher40
1

Please remember to change the column number when running awk on junctions.bed, if the strand is not in the sixth column.

ADD REPLYlink written 14 months ago by Alex Reynolds24k

`/bedmap --echo --echo-map-id-uniq --delim '\t' junctions.bed <(gtf2bed < Mus_musculus.NCBIM37.67.gtf) > answer.bed Usage: gtf2bed FILE.gtf

gtf2bed: error: Expected single argument (GTF file)`

not sure why is it throwing this error.

ADD REPLYlink written 14 months ago by novicebioinforesearcher40

The convert2bed binary (what runs the BEDOPS version of gtf2bed) would not print any error messages with that text. Do you possibly have another non-BEDOPS script called gtf2bed in your environment's PATH? You might do something like which gtf2bed or locate gtf2bed to find out where this non-BEDOPS script is, and rename or move it out of your PATH.

ADD REPLYlink written 14 months ago by Alex Reynolds24k

removed the "<" sign and it did not throw any error but my answer.bed file just has co ordinates no meta info, ./bedmap --echo --echo-map-id-uniq --delim '\t' junctions.bed <(gtf2bed Mus_musculus.NCBIM37.67.gtf) > answer.bed input file top few lines

11      98566330        98566433        -       
11      98566295        98566433        -       
11      98566581        98566836        -       
11      98566916        98567693        -       
11      98567792        98568140        -       
11      98568230        98568432        -       
11      98568230        98568851        -       
11      98568613        98568851        -

output file is the same..

ADD REPLYlink written 14 months ago by novicebioinforesearcher40

You must pass input to gtf2bed via standard input. In other words, gtf2bed in.gtf will fail if you are using the BEDOPS version of gtf2bed. You must specify gtf2bed < in.gtf if you are using the BEDOPS version of gtf2bed. If you are using a non-BEDOPS version of gtf2bed, I don't know what it produces, so I don't know if or how it will work.

ADD REPLYlink written 14 months ago by Alex Reynolds24k

If you want to be more explicit and also avoid problems with some other non-BEDOPS tool on your system, you can also use convert2bed --input=gtf --output=bed < in.gtf in place of gtf2bed < in.gtf. Also, you should have BEDOPS tools findable in your PATH (e.g. copied to /usr/local/bin or similar), or this command will fail.

ADD REPLYlink modified 14 months ago • written 14 months ago by Alex Reynolds24k

worked like charm, thank you for prompt response!

  11    98566330    98566433    -   ENSMUSG00000017210
  11    98566295    98566433    -   ENSMUSG00000017210
  11    98566581    98566836    -   ENSMUSG00000017210
  11    98566916    98567693    -   ENSMUSG00000017210
  11    98567792    98568140    -   ENSMUSG00000017210
  11    98568230    98568432    -   ENSMUSG00000017210
  11    98568230    98568851    -   ENSMUSG00000017210
ADD REPLYlink modified 14 months ago • written 14 months ago by novicebioinforesearcher40

I had a question regarding the output, some places show blank against the co ordinates and some of them have multiple gene names(i was wondering that junctions should be unique), but when run it via biomart they have geneid and name for eg bedmap output

  1 6235060 6235239 +   ENSMUSG00000025907
  1 6235425 6235506 +   ENSMUSG00000025907
  1 6235587 6235679 +   ENSMUSG00000025907
  1 6235740 6237787 +   ENSMUSG00000025907
  1 6235740 6235832 +   
  1 6237890 6238147 +   ENSMUSG00000025907
  1 6238273 6238357 +   ENSMUSG00000025907
  1 6240248 6251011 +   ENSMUSG00000025907
  1 6240248 6241012 +   ENSMUSG00000025907
  1 6240248 6240636 +   
  1 6251176 6252915 +   ENSMUSG00000025907
  1 4857613 4868108 +   ENSMUSG00000025903;ENSMUSG00000033813
  1 4868213 4876825 +   ENSMUSG00000025903;ENSMUSG00000033813;ENSMUSG00000062588

ensembl output

   Ensembl Gene ID  Associated Gene Name
   ENSMUSG00000025907   Rb1cc1
ADD REPLYlink modified 14 months ago • written 14 months ago by novicebioinforesearcher40
1

Some regions will be blank if there are no annotations which overlap those regions.

Some regions will have more than one annotation mapped to them, if there are multiple overlaps.

In bedmap, you could use --skip-unmapped to remove elements without overlaps. You can also add the --fraction-map 1 option to ensure that 100% of the annotation overlaps — falls entirely within — your region of interest.

This can be too stringent, so you could adjust this fractional value between >0.0 (almost no overlap) and 1.0 (full overlap).

Take a look at the --fraction-* options in bedmap --help or the online docs for a fuller explanation of these constraint operations.

ADD REPLYlink written 14 months ago by Alex Reynolds24k
1

Also, your regions of interest are not sorted correctly, which can cause issues with the results.

Use sort-bed junctions.bed > junctions.sorted.bed and then use junctions.sorted.bed for any further downstream set operations, or you could get incorrect output.

ADD REPLYlink written 14 months ago by Alex Reynolds24k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1576 users visited in the last hour