Question: Retrieving Pubchem Ids
2.1 years ago by
Nitin60
Nitin60 wrote:

Hi all,

I have a list of compound names for which I want to retrieve PubChem CIDs. To achieve this I wrote a Biopython script as follows, but it doesn't seem to be working:

from Bio import Entrez

Entrez.email = "sainitin7@gmail.com"

infile = open("data", "r")
out_put = open("ids_data.csv", "w")

for line in infile.readlines():
  single_id = line

  # Post list of ids to database
  handle = Entrez.epost("pccompound", names=single_id)
  record = Entrez.read(handle)

  # History
  webEnv = record["WebEnv"]
  queryKey = record["QueryKey"]

  # Retrieving information
  data = Entrez.esummary(db="pccompound", webenv=webEnv, query_key=queryKey)
  res = Entrez.read(data)

  for compound in res:
    Name = compound["SynonymList"]
    Cid = compound["Id"]
    print "%s:%s" % (Name, Cid)
    out_put.write("%s:%s\n" % (Name, Cid))

out_put.close()

Ideally I want output as follows:

Biruvidine : 446727

Can anybody help?

Thanks in advance

Nit

ADD COMMENTlink modified 2.1 years ago by Peter3.8k • written 2.1 years ago by Nitin60

Can you fix the formatting? The example is very hard to read, and you didn't show the current output. It sounds like, given a PubChem identifier like SID 74891762, you want to get back 'Brivudine: CID446727' - is that right?

ADD REPLYlink written 2.1 years ago by Peter3.8k

And what do you mean by "it doesn't seem to be working"? What is the error message, if any?

ADD REPLYlink written 2.1 years ago by Neilfws41k

Thanks for fixing the formatting. Could you also include an example of the text in ids_data.csv so we have both sample input AND the desired output?

ADD REPLYlink written 2.1 years ago by Peter3.8k
2.1 years ago by
Wolf Ihlenfeldt140 wrote:

a) You are not using the right tools. Here is a simpler and more robust solution using the Cactvs Chemoinformatics toolkit (see www.xemistry.com/academic for the free academic version):

foreach name [split [string trim [read_file data]] "\n"] {
        if {[catch {ens create $name} eh]} {
                puts "$name : not resolved"
        } elseif {[catch {ens get $eh E_CID} cid]} {
                puts "$name: no CID"
                ens delete $eh
        } else {
                puts "$name: $cid"
                ens delete $eh
        }
}

b) Even this script does not work with "Biruvidine", because the proper name of that compound is "Brivudine". The correct name resolves easily.

Interactive lookup of the name set for a CID:

cactvs>ens create CID446727
ens0
cactvs>ens get ens0 E_NAMESET
{5-[(E)-2-bromoethenyl]-1-[(2R,4S,5R)-4-hydroxy-5-(hydroxymethyl)oxolan-2-yl]pyrimidine-2,4-dione} {5-[(E)-2-bromovinyl]-1-[(2R,4S,5R)-4-hydroxy-5-(hydroxymethyl)tetrahydrofuran-2-yl]pyrimidine-2,4-dione} {5-[(E)-2-bromovinyl]-1-[(2R,4S,5R)-4-hydroxy-5-(hydroxymethyl)-2-tetrahydrofuranyl]pyrimidine-2,4-dione} {5-[(E)-2-bromovinyl]-1-[(2R,4S,5R)-4-hydroxy-5-methylol-tetrahydrofuran-2-yl]pyrimidine-2,4-quinone} 69304-47-8 (E)-5-(2-Bromovinyl)-2'-deoxyuridine (E)-5-(2-Bromovinyl)-deoxyuridine BVDU {Brivudina [INN-Spanish]} Brivudine {Brivudine [INN]} {Brivudinum [INN-Latin]} {CCRIS 2831} Helpin {NSC 633770} {Uridine, 5-(2-bromoethenyl)-2'-deoxy-, (E)-} {Uridine, 5-(2-bromovinyl)-2'-deoxy-, (E)-} trans-5-(2-Bromovinyl)-2'-deoxyuridine Lopac0_000175 (E)-5-(2-Bromovinyl)-dUrd AIDS-070967 AIDS070967 BV-dUrd BrVdUrd Brivudin UA-618 EU-0100175 A-176 5-BROMOVINYLDEOXYURIDINE BVD Bromovinyldeoxyuridine RP-101 Zostex
ADD COMMENTlink modified 2.1 years ago • written 2.1 years ago by Wolf Ihlenfeldt140

What do you mean 'You are not using the right tools'? He's using the NCBI Entrez API to query the NCBI PubChem database which seems like a sensible idea.

ADD REPLYlink written 2.1 years ago by Peter3.8k

Yes, of course it is. But he is burdening himself with all the details, and that can be avoided.

Of course the Cactvs solution uses the same API behind the scenes for the structure-to-CID part. The name resolution actually relies primarily on the more extensive NCI resolver and uses PubChem/Entrez only as a fallback, so this solution will also work with compound names that are not in PubChem. The toolkit code implements error checking, has implicit retry and timeout handling, etc. Entrez is not exactly the most robust interface in practical operation.

ADD REPLYlink written 2.1 years ago by Wolf Ihlenfeldt140
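
For anyone staying with plain Python, the retry handling mentioned above can be approximated with a small wrapper. This is only a sketch under stated assumptions: `with_retries` and its parameters are hypothetical names, not part of Biopython or Cactvs, and the attempt count and delays are arbitrary.

```python
import time


def with_retries(func, attempts=3, delay=1.0, backoff=2.0, sleep=time.sleep):
    """Call func() up to `attempts` times, pausing between failed tries.

    Hypothetical helper for illustration; not part of any library.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for attempt in range(attempts):
        try:
            return func()
        except Exception as error:  # in real use, catch only network errors
            last_error = error
            if attempt < attempts - 1:
                sleep(delay)      # wait before the next try
                delay *= backoff  # lengthen the wait each time
    raise last_error
```

A network call would be wrapped as `with_retries(lambda: Entrez.esummary(db="pccompound", id="446727"))`. The `sleep` parameter exists only so the pause can be replaced when testing.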
2.1 years ago by
Peter3.8k
Scotland, UK
Peter3.8k wrote:

I would have expected to do this with EFetch, but the NCBI don't seem to support it for this database: http://eutils.ncbi.nlm.nih.gov/corehtml/query/static/efetch_help.html

Here's how I would do it for one ID, a PubChem identifier like CID 446727. You can of course generalise this to read the IDs from a file etc. and use epost and the history as you were doing above.

from Bio import Entrez
Entrez.email = "sainitin7@gmail.com"
record = Entrez.read(Entrez.esummary(db="pccompound", id="446727", retmode="xml"))
for entry in record:
    print entry['SynonymList']

I presume that the first synonym is the one you want. In this case the list you get back is:

['Brivudine', 'BVDU', 'Helpin', 'Brivudin', "(E)-5-(2-Bromovinyl)-2'-deoxyuridine", 'Bromovinyldeoxyuridine', 'Brivudinum [INN-Latin]', 'Brivudina [INN-Spanish]', 'CCRIS 2831', '69304-47-8', 'Brivudine (INN)', 'Brivudine [INN]', "Uridine, 5-(2-bromoethenyl)-2'-deoxy-, (E)-", '(E)-5-(2-Bromovinyl)-deoxyuridine', 'NSC 633770', "trans-5-(2-Bromovinyl)-2'-deoxyuridine", "Uridine, 5-(2-bromovinyl)-2'-deoxy-, (E)-", 'Zostex', 'BVD', '5-BVDU', 'E-5-(2-bromovinyl)-dUrd', 'Z-5-(2-bromovinyl)-dUrd', 'Brivudinum', 'Brivudina', 'BrVdUrd', 'NSC633770', 'BV-dUrd', "5-(2-bromovinyl)-2'-deoxyuridine", 'Bromvinyldesoxyuridin', "5-(2-bromoethenyl)-2'-deoxyuridine", 'Zostex (TN)', "(Z)-5-(2-bromovinyl)-2'-deoxyuridine", 'Lopac0_000175', 'C11H13BrN2O5', 'CHEMBL31634', '5-BROMOVINYLDEOXYURIDINE', 'AC1L9K12', '(E)-5-(2-Bromovinyl)-dUrd', 'UNII-2M3055079H', 'RP-101', 'UA-618', 'CCG-204270', 'NCGC00093656-01', 'NCGC00093656-02', 'NCGC00093656-03', 'LS-160809', 'A-176', 'EU-0100175', 'B 9647', 'D07249', '5-[(E)-2-bromoethenyl]-1-[(2R,4S,5R)-4-hydroxy-5-(hydroxymethyl)oxolan-2-yl]pyrimidine-2,4-dione']
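
A rough sketch of that generalisation, assuming one CID per line in the input and Python 3 syntax. The helper names `read_ids`, `summarise_cids` and `first_synonym_lines` are made up for illustration. The `Id` and `SynonymList` keys are the ones used in the question and in the example above.

```python
def read_ids(lines):
    """Strip whitespace and drop blank lines from an iterable of ID strings."""
    return [line.strip() for line in lines if line.strip()]


def summarise_cids(cids, email):
    """Post CIDs to the history server, then fetch all summaries in one call."""
    from Bio import Entrez  # imported here so the other helpers work without Biopython

    Entrez.email = email
    # EPost takes a comma-separated list of IDs and returns a history reference.
    posted = Entrez.read(Entrez.epost("pccompound", id=",".join(cids)))
    handle = Entrez.esummary(db="pccompound",
                             webenv=posted["WebEnv"],
                             query_key=posted["QueryKey"])
    return Entrez.read(handle)


def first_synonym_lines(records):
    """Format each summary as 'first synonym : CID', as asked for in the question."""
    lines = []
    for entry in records:
        synonyms = entry["SynonymList"]
        name = synonyms[0] if synonyms else "(no synonym)"
        lines.append("%s : %s" % (name, entry["Id"]))
    return lines
```

With a file of CIDs this would be used roughly as `first_synonym_lines(summarise_cids(read_ids(open("data")), "you@example.com"))`, where the e-mail address is a placeholder.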
ADD COMMENTlink modified 2.1 years ago • written 2.1 years ago by Peter3.8k

Just download the complete CID-synonym file from the Compound "Extras" folder and grep for the name to find all associated CIDs. Or use eUtils, which is supported.

ADD REPLYlink written 2.1 years ago by Evan Bolton40
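
The same lookup can be done in Python instead of grep. The sketch below assumes the downloaded file holds one `CID<TAB>synonym` pair per line, which is an assumption about its layout. `cids_for_names` is a hypothetical helper name.

```python
def cids_for_names(lines, names):
    """Map each requested name to the CIDs whose synonym matches it exactly.

    `lines` is any iterable of 'CID<TAB>synonym' strings; matching ignores case.
    """
    # Index the requested names by lower-cased form, keeping the original spelling.
    wanted = {name.strip().lower(): name.strip() for name in names if name.strip()}
    found = {original: [] for original in wanted.values()}
    for line in lines:
        cid, sep, synonym = line.rstrip("\n").partition("\t")
        if not sep:
            continue  # skip lines that do not contain a tab
        original = wanted.get(synonym.strip().lower())
        if original is not None:
            found[original].append(cid)
    return found
```

For a gzip-compressed download, the lines can be streamed with `gzip.open(path, "rt")` so the whole file never has to be held in memory. Names with no exact match, such as the misspelled "Biruvidine", come back with an empty list.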

I think this post answers the reverse of the question - going from CID to name set. The problem was getting the CID from a name.

ADD REPLYlink written 2.1 years ago by Wolf Ihlenfeldt140

You're probably right. It would have helped if the question had included an example input as well as the hoped-for output.

ADD REPLYlink written 2.1 years ago by Peter3.8k

The basic idea is correct, though, i.e. use esummary.

ADD REPLYlink written 2.1 years ago by Neilfws41k

Starting with a name, use esearch?

ADD REPLYlink written 2.1 years ago by Peter3.8k
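
A minimal sketch of the esearch route, in Python 3 syntax. It assumes that searching the `pccompound` database with the quoted name as the term returns the matching CIDs in `IdList`. The optional `field` qualifier (for example `Synonym`) is an assumption about the PubChem search index and should be checked before relying on it. The helper names are hypothetical.

```python
def build_term(name, field=None):
    """Quote the name so that multi-word names are searched as one phrase."""
    term = '"%s"' % name.strip()
    if field:
        term += "[%s]" % field  # optional Entrez field qualifier (assumed)
    return term


def name_to_cids(name, email, field=None):
    """Return the list of CIDs that ESearch reports for a compound name."""
    from Bio import Entrez  # imported here so the other helpers work without Biopython

    Entrez.email = email
    handle = Entrez.esearch(db="pccompound", term=build_term(name, field))
    result = Entrez.read(handle)
    handle.close()
    return list(result["IdList"])


def format_line(name, cids):
    """Format one output line in the 'name : CID' style asked for in the question."""
    if not cids:
        return "%s : not found" % name
    return "%s : %s" % (name, ",".join(cids))
```

Looping over the names in the input file and writing `format_line(name, name_to_cids(name, email))` for each would give the requested output. A name can match several CIDs or none, so both cases are handled explicitly.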