Question: Mapping From Gene Names To Gene Lengths In R?
2
gravatar for John St. John
7.4 years ago by
John St. John1.2k
San Francisco, CA, Cancer Therapeutics Innovation Group
John St. John1.2k wrote:

I have a list of genes, and their ENTREZ and HUGO Ids, along with raw counts of RNA-seq reads mapping to those genes. I want to come up with a quick approximation of either RPKM or TPM.

To do that all I am missing is the length of each gene. Is there a quick way to get at this information in R? Maybe something like mapping ENTREZ to RefSeq ID, and then grabbing the length of whatever the first RefSeq transcript happens to be for each gene?

Thanks!

gene R length entrez • 7.3k views
ADD COMMENTlink modified 2.6 years ago by Biostar ♦♦ 20 • written 7.4 years ago by John St. John1.2k
9
gravatar for lefebvrf
7.4 years ago by
lefebvrf100
Montreal
lefebvrf100 wrote:

Quick and dirty total exonic lengths, leveraging GenomeFeatures and the Hs org package:

  hg19GeneLengths <- function(symbols)
    {
        require(TxDb.Hsapiens.UCSC.hg19.knownGene) 
        require(org.Hs.eg.db)
        exons.db = exonsBy(TxDb.Hsapiens.UCSC.hg19.knownGene, by='gene')    
        egs    = unlist(  mget(symbols[ symbols %in% keys(org.Hs.egSYMBOL2EG) ],org.Hs.egSYMBOL2EG) )
        sapply(egs,function(eg)
        {
            exons = exons.db[[eg]]
            exons = reduce(exons)
            sum( width(exons) )
        })
    }
    hg19GeneLengths( c('STAT1','CXCL10','ACTB','PDCD1') )


 STAT1 CXCL10   ACTB  PDCD1 
  4943   1216   2253   2112
ADD COMMENTlink written 7.4 years ago by lefebvrf100
2
gravatar for David W
7.4 years ago by
David W4.7k
New Zealand
David W4.7k wrote:

John,

You could consider rentrez, which is a pretty low-level R library for the entrez API (which I wrote!).

In this case you would want to find links to other databases from your gene IDs, and then get the summary file for the refseq record:

library(rentrez)

links <- entrez_link(db = "all", ids = 353174, dbfrom = "gene")
names(links)


##  [1] "gene_bioproject"            "gene_biosystems_all"       
##  [3] "gene_cdd"                   "gene_dbvar"                
##  [5] "gene_genome"                "gene_geoprofiles"          
##  [7] "gene_gtr"                   "gene_homologene"           
##  [9] "gene_nuccore"               "gene_nuccore_mgc"          
## [11] "gene_nuccore_pos"           "gene_nuccore_refseqrna"    
## [13] "gene_nucest"                "gene_nucest_clust"         
## [15] "gene_nucleotide"            "gene_nucleotide_clust"     
## [17] "gene_nucleotide_mgc"        "gene_nucleotide_mgc_url"   
## [19] "gene_nucleotide_pos"        "gene_omim"                 
## [21] "gene_pcassay"               "gene_pcassay_proteintarget"
## [23] "gene_pccompound"            "gene_pcsubstance"          
## [25] "gene_pmc"                   "gene_probe"                
## [27] "gene_protein"               "gene_protein_refseq"       
## [29] "gene_pubmed"                "gene_pubmed_citedinomim"   
## [31] "gene_pubmed_rif"            "gene_snp"                  
## [33] "gene_snp_genegenotype"      "gene_taxonomy"             
## [35] "gene_unigene"               "gene_unists"               
## [37] "file"

Once you have ID for the refseq record, you can fetch an (unparsed) XML file with entrez_summary:

e_sum <- entrez_summary(db = "nuccore", id = links$gene_nuccore_refseqrna)
print(e_sum)

## 
## http://www.ncbi.nlm.nih.gov/entrez/query/DTD/eSummary_041029.dtd">
## <eSummaryResult>
##   <DocSum>
##     <Id>206725455</Id>
##     <Item Name="Caption" Type="String">NM_180990</Item>
##     <Item Name="Title" Type="String">Homo sapiens zinc activated ligand-gated ion channel (ZACN), mRNA</Item>
##     <Item Name="Extra" Type="String">gi|206725455|ref|NM_180990.3|[206725455]</Item>
##     <Item Name="Gi" Type="Integer">206725455</Item>
##     <Item Name="CreateDate" Type="String">2003/05/15</Item>
##     <Item Name="UpdateDate" Type="String">2012/11/11</Item>
##     <Item Name="Flags" Type="Integer">512</Item>
##     <Item Name="TaxId" Type="Integer">9606</Item>
##     <Item Name="Length" Type="Integer">1473</Item>
##     <Item Name="Status" Type="String">live</Item>
##     <Item Name="ReplacedBy" Type="String"/>
##     <Item Name="Comment" Type="String"></Item>
##   </DocSum>
## </eSummaryResult>
##

Presmusing this is the process you would go through, then wrapping it up in a function (and using Xpath to extract information from the XML) is straightforward:

get_entrez_length <- function(e_id) {
    links <- entrez_link(db = "nuccore", ids = e_id, dbfrom = "gene")
    summary_xml <- entrez_summary(db = "nuccore", id = links$gene_nuccore_refseqrna)
    len <- xpathSApply(summary_xml, "//Item[@Name='Length']", xmlValue)
    return(as.integer(len))
}

get_entrez_length(353174)

## [1] 1473

If you do use rentrez and have any questions/suggestions/bugs let me know - still developing it so happy to make it more useful.

ADD COMMENTlink modified 7.4 years ago • written 7.4 years ago by David W4.7k
0
gravatar for ed.liaw
7.4 years ago by
ed.liaw90
ed.liaw90 wrote:

Hey John :D

I don't have an exact solution for you, but maybe this information would help (wouldn't necessarily be quick though):

It looks like someone was coding up an R interface to eutils: http://code.google.com/p/reutils/source/browse/branches/lite/eutils.r?r=24 . I looks to me like esearch and esummary are functional enough to take queries into an XML tree. Might be more of a hassle than its worth, though.

ADD COMMENTlink modified 7.4 years ago • written 7.4 years ago by ed.liaw90
0
gravatar for Irsan
7.4 years ago by
Irsan7.2k
Amsterdam
Irsan7.2k wrote:

biomaRt?

ADD COMMENTlink written 7.4 years ago by Irsan7.2k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 802 users visited in the last hour