getGeneLengthAndGCContent in EDASeq complains about Ensembl gene ID
3.8 years ago

I am trying to use the getGeneLengthAndGCContent function from the EDASeq library to retrieve gene lengths for C. elegans. I retrieved Ensembl gene IDs for C. elegans from BioMart and I'm using those as input to getGeneLengthAndGCContent (I picked "ensembl_gene_id" from the filters of the worm annotation Mart). Here's an example of how I'm trying to use the function:

getGeneLengthAndGCContent("WBGene00001042", org, mode=c("biomart", "org.db"))

When I try to run this, I get

Error in getGeneLengthAndGCContent("WBGene00001042", org, mode = c("biomart",  : 
  Only ENTREZ or ENSEMBL gene IDs are supported.

However, it does seem to me that WBGene00001042 is an Ensembl ID (here's an example of a C. elegans gene in Ensembl: http://uswest.ensembl.org/Caenorhabditis_elegans/Gene/Summary?g=WBGene00001042;r=III:5415556-5419565;t=W03A5.7.1). I don't see any other possible IDs in Ensembl (I tried the gene name; that didn't work either).

EDIT: At the bottom of that Ensembl page there is a table which includes "coding sequence length". I am now trying to figure out how to retrieve this independently of getGeneLengthAndGCContent.

gene R • 2.2k views
3.8 years ago
ATpoint 81k

The source code, which you can download from Bioconductor, shows that this error is thrown when the function cannot auto-detect the ID type as either ENTREZ or Ensembl. The check is in the script getLengthAndGC.R, lines 19-21.

The auto-detection function for the gene IDs is:

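# Classify a gene ID by its pattern; returns NA if nothing matches.
.autoDetectGeneIdType <- function(id)
{
    type <- NA
    # Ensembl: "ENS", optional species letters, "G", then digits
    if(grepl("^[Ee][Nn][Ss][A-Za-z]{0,3}[Gg][0-9]+", id)) type <- "ensembl"
    # Entrez: digits only
    else if(grepl("^[0-9]+$", id)) type <- "entrez"
    else if(grepl("^[Yy][A-Za-z]{2}[0-9]{3}[A-Za-z]", id)) type <- "sgd"
    else if(grepl("^[Aa][Tt][0-9][A-Za-z][0-9]{5}", id)) type <- "tair"
    return(type)
}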

Your gene ID escapes this classification: WBGene00001042 matches none of the four patterns, so the function returns NA and the error is raised.
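To see this for yourself, here is a minimal sketch in base R. It copies the detection logic quoted above under a local, made-up name (detect_gene_id_type) and runs it on a few example IDs. The example IDs other than yours are just illustrative inputs chosen to hit each pattern.

```r
# Local copy of the quoted detection logic; the name is only for this demo.
detect_gene_id_type <- function(id)
{
    type <- NA
    if(grepl("^[Ee][Nn][Ss][A-Za-z]{0,3}[Gg][0-9]+", id)) type <- "ensembl"
    else if(grepl("^[0-9]+$", id)) type <- "entrez"
    else if(grepl("^[Yy][A-Za-z]{2}[0-9]{3}[A-Za-z]", id)) type <- "sgd"
    else if(grepl("^[Aa][Tt][0-9][A-Za-z][0-9]{5}", id)) type <- "tair"
    return(type)
}

# One illustrative ID per pattern, plus the worm ID from the question.
ids <- c("ENSG00000139618", "672", "YAL001C", "AT1G01010", "WBGene00001042")

# sapply (not vapply) because the no-match case returns a logical NA.
detected <- sapply(ids, detect_gene_id_type)
print(detected)

# The worm ID is the only one left unclassified.
print(names(detected)[is.na(detected)])
```

The first four IDs are classified and the WBGene ID comes back as NA, which is exactly the condition that triggers the stop() call.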

A simple workaround is to turn off this check by replacing lines 19-21 with:

#id.type <- .autoDetectGeneIdType(id[1])
id.type <- "ensembl"
#if(is.na(id.type))
#    stop("Only ENTREZ or ENSEMBL gene IDs are supported.")
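If you would rather not hard-code id.type, a narrower alternative might be to teach the detector about worm IDs. The sketch below is hypothetical and not part of EDASeq: the function name detect_gene_id_type_wb and the WBGene pattern are my own, and it assumes WormBase gene IDs are "WBGene" followed by 8 digits and should be treated like Ensembl gene IDs.

```r
# Hypothetical patched detector; the WBGene branch is NOT in EDASeq.
# Assumption: WormBase gene IDs ("WBGene" + 8 digits) can be handled
# the same way as Ensembl gene IDs downstream.
detect_gene_id_type_wb <- function(id)
{
    type <- NA
    if(grepl("^[Ee][Nn][Ss][A-Za-z]{0,3}[Gg][0-9]+", id)) type <- "ensembl"
    else if(grepl("^WBGene[0-9]{8}$", id)) type <- "ensembl"
    else if(grepl("^[0-9]+$", id)) type <- "entrez"
    else if(grepl("^[Yy][A-Za-z]{2}[0-9]{3}[A-Za-z]", id)) type <- "sgd"
    else if(grepl("^[Aa][Tt][0-9][A-Za-z][0-9]{5}", id)) type <- "tair"
    return(type)
}

print(detect_gene_id_type_wb("WBGene00001042"))
print(detect_gene_id_type_wb("not_a_gene_id"))
```

This keeps the error for IDs that are genuinely unrecognised, whereas hard-coding "ensembl" silently accepts anything.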

You don't even need the entire package; this function alone is sufficient to retrieve the information you want:

> getGeneLengthAndGCContent("WBGene00001042", "cel")
Connecting to BioMart ...
Downloading sequence ...
      length           gc 
1329.0000000    0.3069977