Question: Converting Ensembl gene ID / versioned gene ID to HGNC symbol using the biomaRt R package

Hi,

I need to convert TCGA data from versioned Ensembl gene IDs (ensembl_gene_id_version) to HGNC symbols using the biomaRt R package. After creating a data frame containing all the Ensembl gene IDs, I tried this loop:

 for (i in 1:length(data[,1])) {
        data[i,1] <- getBM(attributes=c('hgnc_symbol'),filters = 'ensembl_gene_id', values =
        sub("\\..*", "", data[i,1]), mart = ensembl) 
      }

But I keep getting this error message:

 Error in x[[jj]][iseq] <- vjj : replacement has length zero

I also tried this code:

 hgnc_id <- getBM(attributes=c('hgnc_symbol'),filters = 'ensembl_gene_id_version', values =  data[,1], mart = ensembl)

In this case I only get 15000 out of the 60000 genes

 hgnc_id <- getBM(attributes=c('hgnc_symbol'),filters = 'ensembl_gene_id', values = sub("\\..*", "", data[,1]), mart = ensembl)

In this case I only get 30000 out of the 60000 genes

Has anyone had a similar problem, or can anyone offer a solution?

ensembl gene id tcga R • 357 views
modified 12 weeks ago by RamRS • written 12 weeks ago by sahar850

Side note: It's ensembl, there's no e at the end of the word.

written 12 weeks ago by RamRS

First off, I'd recommend using parameter names when you call functions, so commands are explicit. This is especially useful with sub and gsub, as x, pattern and replacement are really weirdly positioned in these functions.

Does sub(pattern="\\..*", replacement="", x=data[1:15,1]) give you the expected output in the expected format (vector)? I recall needing to use sapply to get an unlisted vector of results from gsub.
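For what it's worth, a quick check in base R suggests that sub is vectorised over x and returns a plain character vector, so sapply should not be needed for this step. A minimal sketch, assuming the IDs are stored as character strings (the IDs and version suffixes below are illustrative, not taken from the poster's data):

```r
# Example versioned Ensembl gene IDs (illustrative values only)
ids <- c("ENSG00000000003.14", "ENSG00000036549.11", "ENSG00000000005")

# sub() is vectorised over x: one output element per input element.
# The pattern removes the first literal dot and everything after it.
stripped <- sub(pattern = "\\..*", replacement = "", x = ids)

is.character(stripped)  # the result is a character vector, not a list
length(stripped)        # same length as the input
print(stripped)
```

IDs without a version suffix pass through unchanged, since the pattern simply does not match them.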

written 12 weeks ago by RamRS

Thanks for the tip, I will add the parameter names (it's actually my first time using R). On the current subject: sub works fine. I just tried running the code on parts of the data, and the error occurs at element 532, which is ENSG00000036549.11 (ENSG00000036549 after the sub). I really can't see why it stopped specifically there; all the elements before it got their HGNC symbols.

I will try to use tryCatch so that the loop skips the indices that trigger this error (if that's possible in R...), but if someone has a better solution it would be helpful.
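A minimal sketch of that idea, under stated assumptions: "replacement has length zero" is the error R raises when a zero-length value is assigned into a single cell, which is consistent with (though not confirmed as) a lookup that returned no rows for that ID. The helper below returns NA both when the lookup throws an error and when it comes back empty. fake_lookup is a hypothetical stand-in for a getBM call so the example runs offline, and safe_symbol is likewise a made-up name:

```r
# Hypothetical stand-in for a getBM() call: returns a one-column data frame
# (hgnc_symbol) with zero rows when the ID is not in the toy table.
fake_lookup <- function(id) {
  known <- c(ENSG00000000003 = "TSPAN6")  # toy table with a single entry
  hit <- known[id]
  hit <- hit[!is.na(hit)]
  data.frame(hgnc_symbol = unname(hit), stringsAsFactors = FALSE)
}

# Return the first symbol found, or NA when the lookup errors or is empty.
safe_symbol <- function(id, lookup = fake_lookup) {
  res <- tryCatch(lookup(id), error = function(e) NULL)
  if (is.null(res) || nrow(res) == 0) {
    return(NA_character_)
  }
  res$hgnc_symbol[1]
}

ids <- c("ENSG00000000003", "ENSG99999999999")  # second ID is deliberately fake
symbols <- vapply(ids, safe_symbol, character(1), USE.NAMES = FALSE)
print(symbols)
```

With a real mart, the lookup argument would wrap the actual getBM call; one query per ID is slow, though, so this is mainly useful for seeing which IDs come back empty.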

modified 12 weeks ago • written 12 weeks ago by sahar850

What happens when you query with just ENSG00000036549? Compare that to a couple of calls made with different gene ids, and you should see where your code breaks.
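One way to make that comparison easier, sketched under assumptions: request ensembl_gene_id together with hgnc_symbol in a single vectorised getBM call, then match the result back to the input IDs. getBM only returns rows for IDs it finds, and not necessarily in input order, so matching explicitly makes the IDs with no symbol show up as NA instead of silently disappearing. The helper name map_symbols and the toy table are made up for illustration, and the real getBM call is left commented out because it needs network access:

```r
# Map IDs to symbols given a lookup table shaped like getBM() output with
# attributes c("ensembl_gene_id", "hgnc_symbol"). Unmatched or blank -> NA.
# match() keeps only the first row per ID if an ID appears more than once.
map_symbols <- function(ids, lookup) {
  sym <- lookup$hgnc_symbol[match(ids, lookup$ensembl_gene_id)]
  sym[!is.na(sym) & sym == ""] <- NA_character_
  sym
}

# The real table would come from biomaRt (not run here):
# library(biomaRt)
# ensembl <- useMart("ensembl", dataset = "hsapiens_gene_ensembl")
# lookup <- getBM(attributes = c("ensembl_gene_id", "hgnc_symbol"),
#                 filters = "ensembl_gene_id", values = ids, mart = ensembl)

# Toy table standing in for the getBM() result (shuffled order; the
# ENSG9... ID is fake and has a blank symbol on purpose)
lookup <- data.frame(
  ensembl_gene_id = c("ENSG00000000005", "ENSG00000000003", "ENSG99999999998"),
  hgnc_symbol     = c("TNMD", "TSPAN6", ""),
  stringsAsFactors = FALSE
)

ids <- c("ENSG00000000003", "ENSG99999999998", "ENSG00000036549", "ENSG00000000005")
mapped <- map_symbols(ids, lookup)
print(data.frame(ensembl_gene_id = ids, hgnc_symbol = mapped))
```

Counting is.na(mapped) then shows how many input IDs have no symbol, which may account for part of the gap between 60000 inputs and the number of rows returned.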

written 12 weeks ago by RamRS

It was the first thing I did; I get the same error message listed above...

written 12 weeks ago by sahar850

What is your R version?

written 11 weeks ago by RamRS

My R version is 3.5.0.

written 11 weeks ago by sahar850

IMO 3.5 might not be mature yet - I've had problems working on 3.5 too. Can you try working on 3.4.1 maybe? You can use conda to install 3.4.1 without affecting your 3.5 installation:

conda create --name r341
source activate r341
conda install -c bioconda r=3.4.1

Once done, you can check which R, ensure it points to the conda-environment-specific R, and install Bioconductor.
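From inside R itself, a quick way to confirm which installation is actually running (for example after activating the r341 environment) is to print the version string and the R home directory. Both are base R and nothing here is specific to conda; the variable names are just for illustration:

```r
# Report the version and install location of the R session currently running
r_version <- R.version.string  # human-readable version string
r_home <- R.home()             # directory of the running R installation

cat(r_version, "\n")
cat(r_home, "\n")
```

If the home directory is not inside the conda environment, the shell is probably still picking up the system-wide R.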

modified 11 weeks ago • written 11 weeks ago by RamRS