Question: How to extract the genomic positions and chromosome number for a list of genes
Star20 wrote:

Hi all,

I have a list of genes with Ensembl IDs like "ENSG00000272379.1", and I want to retrieve the corresponding chromosome number and the start and end of each gene's location on the chromosome. I have tried using Ensembl BioMart (http://asia.ensembl.org/biomart/martview/e96b1b88c4e0cbaf1b9d7442ed9f9b68) but it does not process all the genes in the text file. For example, I have 6000 genes and it outputs results for only 1200 genes. My input file looks like this:

ENSG00000272379.1
ENSG00000175600.11
ENSG00000224017.1
ENSG00000112137.12

I don't know what I am doing wrong. Any advice would be appreciated. Thanks.

ADD COMMENTlink modified 16 months ago by tiago2112871.2k • written 16 months ago by Star20
finswimmer13k wrote:

Hello,

You are supplying IDs with version numbers. Some of those versions are no longer available in the current release. What you can do is:

  1. Filter by Gene stable ID(s) without the version number (cut -d"." -f1 ensg.txt > ensg_noversion.txt can be used to create a file without version numbers).
  2. Go to grch37.ensembl.org and try there. But be aware that the positions you receive will be based on GRCh37/hg19 and not GRCh38/hg38.
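Step 1 can be sketched roughly as follows. This is only an illustration: it writes the four IDs from the question into ensg.txt so the snippet runs on its own, and the extra sort -u (not part of the one-liner above) is an assumption that duplicate IDs are unwanted once the versions are gone.

```shell
# Sample input: the four versioned Ensembl gene IDs from the question.
printf '%s\n' ENSG00000272379.1 ENSG00000175600.11 ENSG00000224017.1 ENSG00000112137.12 > ensg.txt

# Keep only the part before the first dot, then sort and drop duplicates.
cut -d"." -f1 ensg.txt | sort -u > ensg_noversion.txt

# Show the unversioned IDs that would be pasted into or uploaded to BioMart.
cat ensg_noversion.txt
```

The resulting file contains one unversioned ID per line, which is the form the unversioned Gene stable ID(s) filter expects.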

fin swimmer

ADD COMMENTlink written 16 months ago by finswimmer13k

Thank you for the reply! I did it using R as well as the BioMart interface. I had a total of 6605 genes. However, I get the required information for only 6414 genes. The data for 191 genes are missing (using GRCh38.p12). Is there any way to get the required information for all the genes?

ADD REPLYlink written 16 months ago by Star20

Could you please provide some of the IDs that are missing?

ADD REPLYlink written 16 months ago by finswimmer13k

Hi, a few of the genes missing from the data are listed below. When I searched for these IDs in the GRCh37 build, they mapped to the respective genes, but no results were available for GRCh38.p12.

ENSG00000264469.1
ENSG00000179837.6
ENSG00000272216.1
ENSG00000225490.1
ENSG00000186275.7
ADD REPLYlink written 16 months ago by Star20
tiago2112871.2k wrote:

The numbers after the dot are the gene version. It might be that you have very old gene versions.

You can try using R to get your information, removing the Ensembl gene version first, like this:

# Install and load biomaRt (BiocManager replaces the retired biocLite installer)
if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager")
BiocManager::install("biomaRt")
library(biomaRt)

# Optional: browse the available marts, datasets, filters and attributes
biolist <- as.data.frame(listMarts())
ensembl <- useMart("ensembl")
esemblist <- as.data.frame(listDatasets(ensembl))
ensembl <- useDataset("hsapiens_gene_ensembl", mart = ensembl)
filters <- listFilters(ensembl)
attributes <- listAttributes(ensembl)

# Download ID, versioned ID, chromosome, start and end for every human gene
t2g <- getBM(attributes = c("ensembl_gene_id", "ensembl_gene_id_version", "chromosome_name", "start_position", "end_position"), mart = ensembl)

# Your versioned IDs, plus a column with the version suffix stripped
my_ids <- data.frame(ensembl_gene_id_version = c("ENSG00000272379.1", "ENSG00000175600.11", "ENSG00000224017.1", "ENSG00000112137.12"))
my_ids$ensembl_gene_id <- gsub("\\..*", "", my_ids$ensembl_gene_id_version)

# Join on the unversioned ID
my_ids.version <- merge(my_ids, t2g, by = "ensembl_gene_id")

> my_ids.version
  ensembl_gene_id ensembl_gene_id_version.x ensembl_gene_id_version.y chromosome_name start_position end_position
1 ENSG00000112137        ENSG00000112137.12        ENSG00000112137.17               6       12716805     13290484
2 ENSG00000175600        ENSG00000175600.11        ENSG00000175600.15               7       40134977     40860763
3 ENSG00000224017         ENSG00000224017.1         ENSG00000224017.1               7       41101604     41133507
4 ENSG00000272379         ENSG00000272379.1         ENSG00000272379.1               6       13290018     13290490
ADD COMMENTlink modified 16 months ago • written 16 months ago by tiago2112871.2k

Thank you for the code, tiago211287. But at the end my data frame (my_ids.version) is returned empty. I had replaced the line

> my_ids <- data.frame(ensembl_gene_id_version=c("ENSG00000272379.1","ENSG00000175600.11","ENSG00000224017.1","ENSG00000112137.12"))

with

`test <- read.table("MetaXcanOutput-BiomartInput.txt")
 my_ids <- data.frame(ensembl_gene_id_version=c(test$v1))`

where "MetaXcanOutput-BiomartInput.txt" is the file containing almost 6000 gene IDs along with their version numbers. Moreover, the data frame object "my_ids" contains two identical columns, as follows.

ensembl_gene_id_version      ensembl_gene_id
          6544                6544
          4060                4060
          5340                5340

I am pretty new to data science and R. But as far as I understand, the last line, "my_ids.version <- merge(my_ids, t2g, by= 'ensembl_gene_id')", cannot find matching 'ensembl_gene_id' values, which is why "my_ids.version" is empty. Am I right? Can you suggest anything further?

ADD REPLYlink modified 16 months ago • written 16 months ago by Star20

You need a column with your original versioned gene IDs, called ensembl_gene_id_version, and another column, ensembl_gene_id, with the version suffix removed, created with gsub:

my_ids$ensembl_gene_id <- gsub("\\..*","", my_ids$ensembl_gene_id_version)

Only then can you merge, using:

my_ids.version <- merge(my_ids, t2g, by= 'ensembl_gene_id')
ADD REPLYlink modified 16 months ago • written 16 months ago by tiago2112871.2k

It's done! Thank you so much. But there is still a problem. In the input file I have 6605 gene IDs, but I get results for only 6414 genes (I tried the same with the BioMart online tool as well). 191 genes are missing, which suggests that those genes are not present in the database. What could be a possible solution? How can I get the information for the remaining genes?

ADD REPLYlink modified 16 months ago • written 16 months ago by Star20

Can you please share a sample of the remaining IDs?
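One way to pull out the remaining IDs is to compare the submitted list against what BioMart returned. A minimal shell sketch, assuming hypothetical file names (submitted_ids.txt, biomart_result.tsv) and a tab-separated result with no header whose first column is the unversioned gene stable ID; the sample rows reuse IDs and coordinates quoted elsewhere in this thread:

```shell
# Hypothetical inputs: the unversioned IDs that were submitted, and a
# tab-separated BioMart result (gene ID, chromosome, start, end).
printf '%s\n' ENSG00000112137 ENSG00000175600 ENSG00000264469 > submitted_ids.txt
printf 'ENSG00000112137\t6\t12716805\t13290484\nENSG00000175600\t7\t40134977\t40860763\n' > biomart_result.tsv

# IDs that came back from BioMart, sorted because comm needs sorted input.
cut -f1 biomart_result.tsv | sort -u > returned_ids.txt

# comm -23 prints lines that occur only in the first input,
# i.e. submitted IDs with no row in the result.
sort -u submitted_ids.txt | comm -23 - returned_ids.txt > missing_ids.txt

cat missing_ids.txt
```

With a real export, a header line would have to be skipped first (for example with tail -n +2) before taking the first column.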

ADD REPLYlink written 16 months ago by tiago2112871.2k

Hi, a few of the genes missing from the data are listed below. When I searched for these IDs in the GRCh37 build, they mapped to the respective genes, but no results were available for GRCh38.p12.

ENSG00000264469.1
ENSG00000179837.6
ENSG00000272216.1
ENSG00000225490.1
ENSG00000186275.7
ADD REPLYlink modified 16 months ago • written 16 months ago by Star20

Is there a way to query GRCh37 using biomaRt in R?

ADD REPLYlink written 16 months ago by Star20

You can try to convert your retired IDs using the ID Conversion tool for GRCh37. Or maybe this Biostar link can help you to access the GRCh37 using biomaRt.

This Ensembl page contains information about converting between both assemblies.

ADD REPLYlink modified 16 months ago • written 16 months ago by tiago2112871.2k

Hello aammarah.632

All of the above suggestions are great. BioMart will work with versioned IDs, but you need to select the correct format, and if the IDs do not exist in the current database (regardless of the version) then no results will be found.

I took a look at a couple of your IDs and can see that they are not in the dedicated GRCh37 site, which is the database we continue to update with new data. However, I could find them in the archive site for GRCh37 from 2014, which is not updated and so remains a snapshot of the data from 2014. This suggests to me that these genes are no longer in the current database, probably because the annotation has been reviewed and they were found to no longer be correct as new data (e.g. cDNA, protein, or EST) has become available.

You could pass your list of lost IDs through the archive's BioMart, either on the website or through the R package. You can see how to do the latter here (you need to specify release 75): How To Use Archived Version Of Ensembl In Biomart. If you want to, you can then extract the coordinates and map them to GRCh38 using our Assembly Converter.
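If the coordinates are to be uploaded to the Assembly Converter as a BED file, a rough sketch of the reshaping step could look like this. The file names are hypothetical, the export is assumed to be tab-separated with columns gene ID, chromosome, start, end and no header, and the single sample row is copied from the output earlier in this thread purely to show the column layout. Ensembl positions are 1-based and inclusive while BED starts are 0-based, so the start is shifted by one.

```shell
# Hypothetical BioMart export: gene ID, chromosome, start, end (no header).
printf 'ENSG00000112137\t6\t12716805\t13290484\n' > positions.tsv

# BED is 0-based and half-open: subtract 1 from the 1-based start,
# keep the end as is, and carry the gene ID along as the name column.
awk 'BEGIN { FS = OFS = "\t" } { print $2, $3 - 1, $4, $1 }' positions.tsv > positions.bed

cat positions.bed
```

After conversion, the same shift has to be undone (add 1 to the start) if 1-based Ensembl-style coordinates are needed again.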

ADD REPLYlink modified 16 months ago • written 16 months ago by Erin_Ensembl410

Hmm, right. Thank you. I have extracted the positions of all the genes from GRCh37. Thanks for the help, tiago211287 and Erin_Ensembl.

ADD REPLYlink modified 16 months ago • written 16 months ago by Star20