Question

Converting Ensembl gene ID / versioned gene ID to HGNC symbol using the biomaRt R package

0

5.6 years ago

sahar850 • 0

Hi,

I need to convert TCGA data from versioned Ensembl gene IDs to HGNC symbols using the biomaRt R package. After creating a data frame containing all the Ensembl gene IDs, I tried this loop:

 for (i in 1:length(data[,1])) {
        data[i,1] <- getBM(attributes=c('hgnc_symbol'),filters = 'ensembl_gene_id', values =
        sub("\\..*", "", data[i,1]), mart = ensembl) 
      }

But I keep getting this error message:

 Error in x[[jj]][iseq] <- vjj : replacement has length zero

I also tried this code:

 hgnc_id <- getBM(attributes=c('hgnc_symbol'),filters = 'ensembl_gene_id_version', values =  data[,1], mart = ensembl)

In this case I only get 15000 out of the 60000 genes

 hgnc_id <- getBM(attributes=c('hgnc_symbol'),filters = 'ensembl_gene_id', values = sub("\\..*", "", data[,1]), mart = ensembl)

In this case I only get 30000 out of the 60000 genes

Has anyone had a similar problem, or can anyone offer a solution?

R tcga ensembl gene id • 4.4k views

ADD COMMENT • link updated 5.6 years ago by Ram 43k • written 5.6 years ago by sahar850 • 0

1

Side note: It's ensembl, there's no e at the end of the word.

ADD REPLY • link 5.6 years ago by Ram 43k

0

First off, I'd recommend using parameter names when you call functions, so commands are explicit. This is especially useful with sub and gsub, as pattern, replacement and x are positioned unintuitively in these functions.

Does sub(pattern="\\..*", replacement="", x=data[1:15,1]) give you the expected output in the expected format (vector)? I recall needing to use sapply to get an unlisted vector of results from gsub.
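As a quick check of the point above, here is a minimal base R sketch of the named-argument call. The first ID is the one discussed in this thread; the second (and its version suffix) is only an illustrative assumption. In base R, sub is vectorized over x, so it should return a plain character vector without needing sapply.

```r
# First ID is taken from this thread; the second one is illustrative only.
ids <- c("ENSG00000036549.11", "ENSG00000000003.14")

# sub() is vectorized over x, so no loop or sapply() is needed here.
stripped <- sub(pattern = "\\..*", replacement = "", x = ids)

# The result is a character vector with the version suffix removed.
print(is.character(stripped))
print(stripped)
```

If the column holds factors rather than strings, sub still returns a character vector, since it coerces its input.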

ADD REPLY • link 5.6 years ago by Ram 43k

0

Thanks for the tip, I will add the parameter names (it's actually my first time using R). Back to the subject: sub works fine. I tried running the code on parts of the data, and the error occurs at element 532, which is ENSG00000036549.11 (ENSG00000036549 after the sub). I really can't see why it stopped specifically there; all the elements before it got their HGNC symbols.

I will try to use tryCatch so the loop skips the indices that trigger this error (if that's possible in R...), but if someone has a better solution it would be helpful.
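A likely cause of "replacement has length zero" is a lookup that returns a data frame with zero rows, which cannot be assigned into a single cell. The sketch below shows one way to guard against that without tryCatch. It runs offline: lookup_symbol and the small known table are hypothetical stand-ins for a per-ID getBM call, and the table leaves out ENSG00000036549 purely to simulate an empty result, not because that gene is known to lack a symbol.

```r
# Hypothetical stand-in for a per-ID getBM() call: returns a data frame with
# one hgnc_symbol column, and zero rows when the ID is not in the toy table.
known <- c(ENSG00000000003 = "TSPAN6", ENSG00000000005 = "TNMD")
lookup_symbol <- function(id) {
  hit <- known[names(known) == id]
  data.frame(hgnc_symbol = unname(hit), stringsAsFactors = FALSE)
}

# Toy input; version suffixes are illustrative.
data <- data.frame(
  gene = c("ENSG00000000003.14", "ENSG00000036549.11", "ENSG00000000005.5"),
  stringsAsFactors = FALSE
)

# Write symbols to a new column so the original IDs are not overwritten.
data$symbol <- NA_character_

for (i in seq_len(nrow(data))) {
  id <- sub(pattern = "\\..*", replacement = "", x = data$gene[i])
  res <- lookup_symbol(id)
  # Assign only when the lookup returned at least one row; otherwise keep NA.
  if (nrow(res) > 0) {
    data$symbol[i] <- res$hgnc_symbol[1]
  }
}

print(data)
```

Keeping the symbols in a separate column also makes it easy to see afterwards which IDs came back empty.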

ADD REPLY • link 5.6 years ago by sahar850 • 0

0

What happens when you query with just ENSG00000036549? Compare that to a couple of calls made with different gene ids, and you should see where your code breaks.
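One way to make that comparison for many IDs at once is to keep the gene ID next to the symbol and join the result back onto the query list. The sketch below runs offline and is only an illustration: bm is a hypothetical stand-in for the result of a single batch getBM call that requests both 'ensembl_gene_id' and 'hgnc_symbol', and its contents are a toy table rather than real query output.

```r
# Hypothetical stand-in for one batch getBM() result that includes the
# gene ID alongside the symbol.
bm <- data.frame(
  ensembl_gene_id = c("ENSG00000000003", "ENSG00000000005"),
  hgnc_symbol = c("TSPAN6", "TNMD"),
  stringsAsFactors = FALSE
)

# Toy query list; version suffixes are illustrative.
query <- data.frame(
  versioned_id = c("ENSG00000000003.14", "ENSG00000036549.11", "ENSG00000000005.5"),
  stringsAsFactors = FALSE
)
query$ensembl_gene_id <- sub(pattern = "\\..*", replacement = "", x = query$versioned_id)

# Left join: all.x = TRUE keeps query IDs that have no row in bm.
mapped <- merge(x = query, y = bm, by = "ensembl_gene_id", all.x = TRUE)

# IDs that came back without a symbol.
unmatched <- mapped$versioned_id[is.na(mapped$hgnc_symbol)]
print(mapped)
print(unmatched)
```

Because the join keeps every queried ID, the rows with a missing symbol show directly which IDs the lookup did not resolve.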

ADD REPLY • link 5.6 years ago by Ram 43k

0

It was the first thing I did; I get the same error message listed above...

ADD REPLY • link 5.6 years ago by sahar850 • 0

0

What is your R version?

ADD REPLY • link 5.6 years ago by Ram 43k

0

My R version is 3.5.0.

ADD REPLY • link 5.6 years ago by sahar850 • 0

0

IMO 3.5 might not be mature yet - I've had problems working on 3.5 too. Can you try working on 3.4.1 maybe? You can use conda to install 3.4.1 without affecting your 3.5 installation:

conda create --name r341
source activate r341
conda install -c bioconda r=3.4.1

Once done, run which R to confirm that it points to the R inside the conda environment, then install Bioconductor.

ADD REPLY • link 5.6 years ago by Ram 43k