Question: Convert RefseqID to EntrezID
7 days ago by
tomoya0
tomoya0 wrote:

Hi, I have a set of genes with RefSeq IDs (e.g. XM_020713141.1) and I want to convert them to Entrez Gene IDs (e.g. 101165603) for further analysis. I found a similar question that said clusterProfiler is suitable for this purpose: [GeneBank accession 2 Entrez gene id][1]. However, I work with medaka (Oryzias latipes), and when I looked for a medaka annotation database among the Bioconductor annotation packages, I found packages only for the major species. Is there a way to access the NCBI medaka annotation database to convert the IDs? Or could you suggest another method to solve this problem? I would be grateful for any help.

R gene • 212 views
ADD COMMENTlink modified 6 days ago • written 7 days ago by tomoya0

I am not sure, but you could try converting with DAVID, or use Ensembl: http://grch37.ensembl.org/index.html

ADD REPLYlink written 7 days ago by zhao030

Try this: https://www.biotools.fr/

ADD REPLYlink written 7 days ago by Susmita Mandal40

Sorry, I don't think it has Oryzias latipes.

ADD REPLYlink written 7 days ago by Susmita Mandal40

UniProt's Retrieve/ID mapping tool is user-friendly and might help: https://www.uniprot.org/uploadlists/ You can submit a list of RefSeq IDs, add an Entrez column to the output table, and then download it.

ADD REPLYlink written 7 days ago by rimgubaev120

Moving this to a comment. Once you select RefSeq id as input, the only output option is UniProtKB id. So this may require two passes if it works at all.

ADD REPLYlink modified 7 days ago • written 7 days ago by genomax67k

Thank you for the many suggestions! They were very useful, and I successfully retrieved almost all of the Entrez IDs using biomaRt.

However, I still have some questions. Although I got almost all of the Entrez IDs, some are missing (the results show NA), for example XR_002293119.2, XM_004081009.3 and XM_023961859.1. But when I search for these accessions on the NCBI website, I find that their Entrez IDs are 101158738, 101170377, 101155047.

I also tried changing the attribute from entrezid to wikigene_id, but the results were the same (all NA). Do you think this is caused by a difference in database versions, and is there a way to obtain these Entrez IDs?

ADD REPLYlink written 6 days ago by tomoya0
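
To see exactly which accessions came back without an Entrez ID, it can help to export the lookup table and filter it on the command line. The following is only a sketch under stated assumptions: it expects a two-column, tab-separated table (accession, Entrez ID) with no header, in which a missing ID is either empty or the literal string NA. The function name `missing_accessions` and the file names in the usage comment are made up for illustration.

```shell
# Hypothetical helper: print the accessions whose second column is empty or NA.
# Input: tab-separated "accession<TAB>entrez_id" lines on stdin, no header.
missing_accessions() {
    awk -F '\t' '$2 == "" || $2 == "NA" { print $1 }'
}

# Usage, assuming the table was saved as mapping.tsv (a made-up file name):
#   missing_accessions < mapping.tsv > missing.txt
```

The resulting list contains only the unresolved accessions, so it can be looked up separately with another tool.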

Since you are interested in Entrez IDs and starting with RefSeq accessions, why not use an NCBI tool? EDirect works fine for this.

printf 'XR_002293119.2\nXM_004081009.3\nXM_023961859.1' \
    | epost -db nuccore -format acc \
    | elink -db nuccore -target gene -name nuccore_gene \
    | esummary -format uid
101170377
101158738
101155047
ADD REPLYlink modified 6 days ago • written 6 days ago by vkkodali1.1k

I think it is because those genes are not part of the current Ensembl release (so either you wait for an update or use vkkodali's method): http://www.ensembl.org/Multi/Search/Results?q=XM_023961859

ADD REPLYlink written 6 days ago by benformatics840

Thank you very much, both of you, for your comments. I understand that this is because these genes are not included in the current Ensembl release, and that EDirect can solve this.

Thanks to vkkodali's comment, I noticed that if I want to do the same thing from R, I can use reutils or rentrez. Adapting the command above, I tried the following:

library(reutils)

multiple.ids <- c("XR_002293119.2","XM_004081009.3","XM_023961859.1")
## upload the accessions to the Entrez history server
refseq <- epost(multiple.ids, "nuccore")
## follow the nuccore -> gene links
refseq2 <- elink(refseq, dbFrom = "nuccore", dbTo = "gene", linkname = "nuccore_gene")
esummary(refseq2)

But I can't get the Entrez IDs as in the output above.

I was wondering if you could help me again. Thank you.

ADD REPLYlink written 6 days ago by tomoya0

Please use ADD COMMENT/ADD REPLY when responding to existing posts to keep threads logically organized. SUBMIT ANSWER is for new answers to original question.

esummary(refseq2)

You have to specifically ask for esummary -format uid. I am not sure how you do that in R.

ADD REPLYlink modified 6 days ago • written 6 days ago by genomax67k

Sorry for posting twice. I see now that I need to reply like this.

I see. I need to specify the format, but I am still struggling with how to specify "uid" in reutils.

ADD REPLYlink modified 5 days ago • written 5 days ago by tomoya0

Have you checked to see what is in refseq2?

Edit: I see

> head(refseq2)
List of linked UIDs from database ‘nuccore’ to ‘gene’.
[1] "101170377" "101158738" "101155047"
ADD REPLYlink modified 5 days ago • written 5 days ago by genomax67k

Oh, I already had the results. Thank you for pointing that out!

Sorry to ask again, but I have one more question. The order of the outputs is not the same as the order of the inputs. I'd like either to keep the input order or to extract both IDs together (RefSeq ID and Entrez ID in the same order), so that I can tell which RefSeq ID is linked to which Entrez ID. I thought the "correspondence" option would keep the order, but it doesn't work.

ADD REPLYlink modified 5 days ago • written 5 days ago by tomoya0
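
One way to keep track of which accession maps to which gene is to query one accession at a time and print the pair. The sketch below reuses the EDirect pipeline quoted earlier in this thread; the function names `lookup_gene_id` and `map_accessions` are made up for illustration, and it assumes EDirect is installed and on the PATH. Sending one request per accession is slow for long lists, and NCBI limits how many requests can be sent per second, so treat this as a starting point rather than a finished tool.

```shell
# Hypothetical helper: look up the Gene ID(s) linked to ONE RefSeq accession.
# Same EDirect pipeline as in the thread, fed a single accession.
lookup_gene_id() {
    printf '%s\n' "$1" \
        | epost -db nuccore -format acc \
        | elink -db nuccore -target gene -name nuccore_gene \
        | esummary -format uid
}

# Hypothetical helper: read accessions (one per line) from stdin and print
# "accession<TAB>gene_id" in input order. Several linked IDs are joined
# with commas; NA is printed when nothing is returned.
map_accessions() {
    local acc ids
    while IFS= read -r acc || [ -n "$acc" ]; do
        [ -z "$acc" ] && continue            # skip blank lines
        # </dev/null keeps the lookup from consuming the loop's input
        ids=$(lookup_gene_id "$acc" < /dev/null | paste -sd, -)
        printf '%s\t%s\n' "$acc" "${ids:-NA}"
    done
}

# Usage (needs network access and EDirect):
#   printf 'XR_002293119.2\nXM_004081009.3\nXM_023961859.1\n' | map_accessions
```

Because each accession is queried separately, the output follows the input order, and accessions without a linked gene show up as NA instead of silently disappearing.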
7 days ago by
benformatics840
ETH Zurich
benformatics840 wrote:

This assumes that the IDs you have are all RefSeq predicted mRNAs (e.g. XM_####).

R solution:

library(biomaRt)

mart <- useMart("ENSEMBL_MART_ENSEMBL", dataset = "olatipes_gene_ensembl", host = "www.ensembl.org")
## note: newer Ensembl releases may name the first attribute 'entrezgene_id'; check listAttributes(mart)
BM.info <- getBM(attributes = c('entrezgene', 'refseq_mrna_predicted'), mart = mart)

## make a function to strip the version suffix (e.g. ".1" or ".12") from your accessions
trim.numbers <- function(name){ sub("\\.[0-9]+$", "", name) }

## match your trimmed refseq IDs to the dataframe and pull out the corresponding entrez id - example below
BM.info$entrezgene[match(trim.numbers('XM_020713141.1'),BM.info$refseq_mrna_predicted)]
[1] 101165603
ADD COMMENTlink written 7 days ago by benformatics840
## how it can be used with multiple ids...
## select ids
multiple.ids <- c("XM_020704464","XM_011491436","XM_020702270","XM_023957409","XM_011476326")
## find entrez ids
BM.info$entrezgene[match(trim.numbers(multiple.ids),BM.info$refseq_mrna_predicted)]
[1] 101165143 101173426 101155179 101167210 101162526
ADD REPLYlink written 7 days ago by benformatics840
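
The lookup above matches accessions without their version suffix, so the suffix has to be removed before matching. If the accession list lives in a text file, this can also be done before the IDs reach R. The sketch below is a minimal shell example; the function name `strip_version` is made up for illustration, and it assumes one accession per line with Unix line endings. The pattern matches one or more digits at the end of the line, so multi-digit versions such as `.12` are removed in full.

```shell
# Hypothetical helper: strip a trailing RefSeq version suffix (".1", ".12", ...)
# from each accession read on stdin, one per line.
strip_version() {
    sed -E 's/\.[0-9]+$//'
}

# Usage:
#   printf 'XM_020713141.1\nXM_004081009.3\n' | strip_version
```

Accessions that carry no version suffix pass through unchanged.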
7 days ago by
vkkodali1.1k
United States
vkkodali1.1k wrote:

Point-and-click

  1. Go to Batch Entrez and upload your list of RefSeq accessions. Choose 'Nucleotide' as the database. Click the 'Retrieve' button.
  2. Once you are on the results page, you will find the 'Find related data' widget on the right-hand side. From the drop-down list, choose 'Gene'. Click the 'Find Items' button.
  3. If you just want the list of the unique identifiers, use the 'Send To' menu on the top right corner and choose 'UI List' as the format.

EDirect

Check out bit.ly/entrez-direct for more information. The command to use here would be this:

epost -db nuccore -input <input_file> -format acc \
    | elink -db nuccore -target gene -name nuccore_gene \
    | esummary -format uid

If you need to do this using R, you may want to check out packages such as reutils and rentrez.

ADD COMMENTlink written 7 days ago by vkkodali1.1k