Find Amino Acid Change For SNP Using Eutils

13.8 years ago

Dpsguy ▴ 140

Hi, I am just starting out exploring Entrez Eutils with Biopython. I need to find the amino acid change for each rsID in a list of missense SNPs, but I cannot figure out how to do that. I guess the answer lies in the XML generated by this query:

handle = Entrez.efetch(db="snp", id="6046", retmode="xml")

But when I try

record = Entrez.read(handle)

It gives me an error like: The Bio.Entrez parser cannot handle XML data that make use of XML namespaces.

I don’t know why this is happening. Maybe I am missing something obvious here…

Is it even possible to get my required information using eutils? If not, can you suggest any other means (except doing it manually for every SNP)?

Thanks in advance.

eutils biopython snp dbsnp • 7.5k views

ADD COMMENT • link updated 12.1 years ago by Daniel E Cook ▴ 280 • written 13.8 years ago by Dpsguy ▴ 140

13.8 years ago

Martijn Vermaat ▴ 190

This works for me:

from xml.dom import minidom
from Bio import Entrez
response = Entrez.efetch(db='SNP', id='6046', rettype='flt', retmode='xml')
doc = minidom.parseString(response.read())

ADD COMMENT • link 13.8 years ago by Martijn Vermaat ▴ 190

There is possibly more than one amino acid change associated with the SNP, but you can get the annotated ones from your response by looking in the RsStruct elements (or from the HGVS descriptions on NP references in the hgvs elements). E.g. calling .getElementsByTagName('hgvs') on the parsed document could be the first step. Consult some general documentation on XML DOM navigation if you need more information.
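As a minimal sketch of that first step, the snippet below runs getElementsByTagName('hgvs') on a small hand-written XML fragment and keeps only the values that name an NP_ reference. The fragment, its root element, namespace URI and accession strings are assumptions made for illustration; the real efetch response is much larger and may be laid out differently.

```python
from xml.dom import minidom

# Hand-written stand-in for an efetch response (hypothetical layout and values).
SAMPLE_XML = """<ExchangeSet xmlns="http://www.ncbi.nlm.nih.gov/SNP/docsum">
  <Rs rsId="6046">
    <hgvs>NM_000131.3:c.1238G>A</hgvs>
    <hgvs>NP_000122.1:p.Arg413Gln</hgvs>
  </Rs>
</ExchangeSet>"""


def protein_hgvs(xml_text):
    """Return the text of every <hgvs> element that names an NP_ reference."""
    doc = minidom.parseString(xml_text)
    values = []
    for node in doc.getElementsByTagName("hgvs"):
        # An empty element has no firstChild, so guard before reading it.
        if node.firstChild is not None:
            text = node.firstChild.nodeValue.strip()
            if text.startswith("NP_"):
                values.append(text)
    return values


print(protein_hgvs(SAMPLE_XML))
```

Each returned string is an HGVS protein description, so the reference residue, position and new residue can then be read off the part after "p.".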

ADD REPLY • link 13.8 years ago by Martijn Vermaat ▴ 190

Thanks for the tip! It seems etree can also do the job. But back to my original question: how do I get the amino acid change from this XML? I am not very familiar with XML and was relying on the Entrez parser to do the job for me. I have no experience with etree or minidom.
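For readers in the same position, here is a rough sketch of the etree route. It assumes the response puts its elements in a default XML namespace (which is what the Bio.Entrez error suggests), so tags are compared by local name. The sample document, namespace URI and accession strings are hypothetical stand-ins, not real efetch output.

```python
import xml.etree.ElementTree as ET

# Hypothetical miniature of a namespaced SNP response.
SAMPLE_XML = """<ExchangeSet xmlns="http://www.ncbi.nlm.nih.gov/SNP/docsum">
  <Rs rsId="6046">
    <hgvs>NM_000131.3:c.1238G>A</hgvs>
    <hgvs>NP_000122.1:p.Arg413Gln</hgvs>
  </Rs>
</ExchangeSet>"""


def local_name(tag):
    """Strip the '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def hgvs_values(xml_text):
    """Collect the text of every element whose local name is 'hgvs'."""
    root = ET.fromstring(xml_text)
    return [
        el.text.strip()
        for el in root.iter()
        if local_name(el.tag) == "hgvs" and el.text
    ]


print(hgvs_values(SAMPLE_XML))
```

Comparing local names avoids hard-coding the namespace URI, which is convenient when the exact URI is not known in advance.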

ADD REPLY • link 13.8 years ago by Dpsguy ▴ 140

13.8 years ago

Peter 6.0k

Which version of Biopython do you have? Mine is the latest and it says:

NotImplementedError: The Bio.Entrez parser cannot handle XML data that make use of XML namespaces

You can try another Python XML parser instead. For some reason the NCBI give very different XML back for the SNP database than all their other databases, and the Bio.Entrez parser can't cope: https://redmine.open-bio.org/issues/2771

Interestingly, you can try putting http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=snp&id=6046&retmode=xml into validators like http://www.validome.org/xml/validate/ (says it might be OK) or http://validator.w3.org/ which says it's invalid.

ADD COMMENT • link updated 5.8 years ago by Ram 45k • written 13.8 years ago by Peter 6.0k

I don’t think using another parser would help. From http://eutils.ncbi.nlm.nih.gov/entrez/query/static/esoap_help.html :

eFetch utility generates an invalid XML for SNP, so currently it doesn't work through SOAP. The bug is being fixed.

This page seems to have been last updated in 2009, though. Too long a time to get a bug fixed.

So what other options do I have?

ADD REPLY • link updated 5.6 years ago by Ram 45k • written 13.8 years ago by Dpsguy ▴ 140

First of all, tell the NCBI about this. It will help them rank priorities if they know how many people are having trouble with it. Also check out what other formats they offer for the SNP database...

ADD REPLY • link 13.8 years ago by Peter 6.0k

I wrote to NCBI and the reply was: "SNP data is also available through SOAP web service, which requires this snp specific efetch wsdl: http://eutils.ncbi.nlm.nih.gov/soap/v2.0/efetch_snp.wsdl How the XML object is requested and parsed by the bio.python is more a question for its developers since we do not have resources to trouble shoot third party software."

The best direct query according to them is http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=snp&id=6046&rettype=xml&retmode=text
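As a small illustration, that query URL can be assembled from its parameters rather than typed by hand. The helper name snp_efetch_url is made up for this sketch; the base URL and parameter values are the ones quoted above, and nothing is fetched here.

```python
from urllib.parse import urlencode

# Base URL as quoted in the thread.
EFETCH_URL = "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def snp_efetch_url(rs_id, rettype="xml", retmode="text"):
    """Build the direct efetch query URL for one rsID (hypothetical helper)."""
    params = {"db": "snp", "id": rs_id, "rettype": rettype, "retmode": retmode}
    # urlencode keeps the dict's insertion order and escapes the values.
    return EFETCH_URL + "?" + urlencode(params)


print(snp_efetch_url("6046"))
```

The resulting string can be passed to any HTTP client, and the response body handed to an XML parser of your choice.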

ADD REPLY • link updated 5.6 years ago by Ram 45k • written 13.8 years ago by Dpsguy ▴ 140

@Peter: Yes, you are right... it does give the error that you mentioned. I have edited my question accordingly.

ADD REPLY • link 13.8 years ago by Dpsguy ▴ 140

But all this talk about invalid xml and parsers does nothing to answer my original question that is in the title. Now that I have the parsed xml using minidom (see below), how do I use that to get the amino acid change for a mutation?

ADD REPLY • link 13.8 years ago by Dpsguy ▴ 140

12.1 years ago

Daniel E Cook ▴ 280

I wrote a function to parse the data from flat files. This is a work in progress, but maybe this can be of some help to someone:

from Bio import Entrez


def pull_vars(var_set, line_start, line, multi=False):
    """
    This function parses data from flat files in one of three ways:

    1.) Pulls variables out of a particular line when defined as "variablename=[value]"
    2.) Pulls variables based on a set position within a line.
    3.) Defines variables that can be identified based on a limited possible set of values.

    """
    # Keep only the lines of the requested type, split into ' | '-delimited fields
    lineset = [x.split(' | ') for x in line if x.startswith(line_start)]
    if len(lineset) == 0:
        return
    # If the same line exists multiple times - place results into an array
    if multi:
        pulled_vars = []
        for line in lineset:
            cur_set = {}
            for k, v in var_set.items():
                if isinstance(v, str):
                    # "key=" prefix: take the first field that starts with it
                    try:
                        cur_set[k] = [x for x in line if x.startswith(v)][0].replace(v, '')
                    except IndexError:
                        pass
                elif isinstance(v, int):
                    # Fixed position within the line
                    try:
                        cur_set[k] = line[v]
                    except IndexError:
                        pass
                else:
                    # Limited set of possible values
                    try:
                        cur_set[k] = [x for x in line if x in v][0]
                    except IndexError:
                        pass
            pulled_vars.append(cur_set)
        return pulled_vars
    else:
        # Else if the line is always unique, output single dictionary
        line = lineset[0]
        pulled_vars = {}
        for k, v in var_set.items():
            if isinstance(v, str):
                try:
                    pulled_vars[k] = [x for x in line if x.startswith(v)][0].replace(v, '')
                except IndexError:
                    pass
            elif isinstance(v, int):
                try:
                    pulled_vars[k] = line[v]
                except IndexError:
                    pass
            else:
                try:
                    pulled_vars[k] = [x for x in line if x in v][0]
                except IndexError:
                    pass
        return pulled_vars


def get_snp(q):
    """
    This function takes as input a list of snp identifiers and returns
    a parsed dictionary of their data from Entrez.
    """

    response = Entrez.efetch(db='SNP', id=','.join(q), rettype='flt', retmode='flt').read()
    r = {}  # Return dictionary variable
    # Parse flat file response
    for snp_info in filter(None, response.split('\n\n')):
        # Parse the First Line. Details of rs flat files available here:
        # ftp://ftp.ncbi.nlm.nih.gov/snp/specs/00readme.txt
        snp = snp_info.split('\n')
        # Parse the 'rs' line:
        rsId = snp[0].split(" | ")[0]
        r[rsId] = {}

        # rs vars
        rs_vars = {"organism": 1,
                   "taxId": 2,
                   "snpClass": 3,
                   "genotype": "genotype=",
                   "rsLinkout": "submitterlink=",
                   "date": "updated "}

        # ss vars
        ss_vars = {"ssId": 0,
                   "handle": 1,
                   "locSnpId": 2,
                   "orient": "orient=",
                   "exemplar": "ss_pick=",
                   }

        # SNP line variables:
        SNP_vars = {"observed": "alleles=",
                    "value": "het=",
                    "stdError": "se(het)=",
                    "validated": "validated=",
                    "validProbMin": "min_prob=",
                    "validProbMax": "max_prob=",
                    "validation": "suspect=",
                    "AlleleOrigin": ['unknown', 'germline', 'somatic', 'inherited', 'paternal', 'maternal', 'de-novo', 'bipaternal', 'unipaternal', 'not-tested', 'tested-inconclusive'],
                    "snpType": ['notwithdrawn', 'artifact', 'gene-duplication', 'duplicate-submission', 'notspecified', 'ambiguous-location;', 'low-map-quality']}

        # CLINSIG line variables:
        CLINSIG_vars = {"ClinicalSignificance": ['probable-pathogenic', 'pathogenic', 'other']}

        # GMAF line variables
        GMAF_vars = {"allele": "allele=",
                     "sampleSize": "count=",
                     "freq": "MAF="}

        # CTG line variables
        CTG_vars = {"groupLabel": "assembly=",
                    "chromosome": "chr=",
                    "physmapInt": "chr-pos=",
                    "asnFrom": "ctg-start=",
                    "asnTo": "ctg-end=",
                    "loctype": "loctype=",
                    "orient": "orient="}

        # LOC line variables
        LOC_vars = {"symbol": 1,
                    "geneId": "locus_id=",
                    "fxnClass": "fxn-class=",
                    "allele": "allele=",
                    "readingFrame": "frame=",
                    "residue": "residue=",
                    "aaPosition": "aa_position="}

        # SEQ line variables
        SEQ_vars = {"gi": 1,
                    "source": "source-db=",
                    "asnFrom": "seq-pos=",
                    "orient": "orient="}

        r[rsId]['rs'] = pull_vars(rs_vars, "rs", snp)
        r[rsId]['ss'] = pull_vars(ss_vars, "ss", snp, True)
        r[rsId]['SNP'] = pull_vars(SNP_vars, "SNP", snp)
        r[rsId]['CLINSIG'] = pull_vars(CLINSIG_vars, "CLINSIG", snp)
        r[rsId]['GMAF'] = pull_vars(GMAF_vars, "GMAF", snp)
        r[rsId]['CTG'] = pull_vars(CTG_vars, "CTG", snp, True)
        r[rsId]['LOC'] = pull_vars(LOC_vars, "LOC", snp, True)
        r[rsId]['SEQ'] = pull_vars(SEQ_vars, "SEQ", snp, True)
    return r


snp = get_snp(["12009", "122"])
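For the amino acid question specifically, the relevant fields are residue= and aa_position= on the LOC lines. The sketch below shows the idea on a single line without any network access. The function parse_loc_line and the sample line are hypothetical: the field names are taken from LOC_vars above, but the values are made up and real flat-file lines may carry additional fields.

```python
def parse_loc_line(line):
    """Split one '|'-delimited LOC line into a dict of its key=value fields."""
    fields = [f.strip() for f in line.split("|")]
    if fields[0] != "LOC":
        return None  # not a LOC line
    # The second field is the gene symbol; the rest are key=value pairs.
    record = {"symbol": fields[1] if len(fields) > 1 else None}
    for field in fields[2:]:
        key, sep, value = field.partition("=")
        if sep:  # skip fields that are not key=value
            record[key] = value
    return record


# Made-up example line using the field names from LOC_vars.
SAMPLE_LOC = ("LOC | F7 | locus_id=2155 | fxn-class=missense | allele=A "
              "| frame=2 | residue=Q | aa_position=413")

rec = parse_loc_line(SAMPLE_LOC)
print(rec["symbol"], rec["residue"], rec["aa_position"])
```

A missense SNP would typically have one LOC line per allele and transcript, so comparing the residue values across the LOC lines of one rsID gives the change at that aa_position.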

ADD COMMENT • link updated 5.6 years ago by Ram 45k • written 12.1 years ago by Daniel E Cook ▴ 280

13.7 years ago

Dpsguy ▴ 140

I guess I found a workable solution using the hints provided by Martijn Vermaat. I reproduce my code below:

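import re
from xml.dom import minidom
from Bio import Entrez

flag = 0
rsid = '6046'
res = minidom.parseString(Entrez.efetch(db='snp', id=rsid, retmode='xml').read())
nodes = res.getElementsByTagName('hgvs')
for node in nodes:
    # Only HGVS descriptions on protein (NP_) references carry the amino acid change
    if 'NP_' in node.firstChild.nodeValue:
        flag = 1
        val = node.firstChild.nodeValue
        regex1 = r'[A-Z][a-z]+'  # three-letter residue codes, e.g. Arg, Gln
        regex2 = r'[0-9]+'       # digit runs: accession number, version, position
        aa = re.findall(regex1, val)
        pos = re.findall(regex2, val)
        print(aa[0] + " > " + aa[1] + " Position: " + pos[2])
if flag == 0:
    print("SNP not in coding region")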

The output is the following:

Arg > Gln Position: 413
Arg > Leu Position: 413
Arg > Pro Position: 413
Arg > Gln Position: 391
Arg > Leu Position: 391
Arg > Pro Position: 391
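The pos[2] lookup in the code above relies on the accession contributing exactly two digit runs before the position. A slightly stricter alternative, sketched here under the assumption that the values look like "NP_<number>.<version>:p.<Ref><pos><Alt>" with three-letter residue codes, is to match the whole description with one pattern. The function name and example strings are illustrative, and other HGVS forms (stop codons, synonymous changes, frameshifts) are deliberately not handled.

```python
import re

# Accession, then 'p.', reference residue, position, and new residue.
HGVS_PROTEIN = re.compile(
    r"^(NP_\d+(?:\.\d+)?):p\.([A-Z][a-z]{2})(\d+)([A-Z][a-z]{2})$"
)


def parse_protein_hgvs(value):
    """Return (accession, ref, position, alt) or None if the form is unexpected."""
    match = HGVS_PROTEIN.match(value.strip())
    if match is None:
        return None
    accession, ref, position, alt = match.groups()
    return accession, ref, int(position), alt


for text in ["NP_000122.1:p.Arg413Gln", "NM_000131.3:c.1238G>A"]:
    print(text, "->", parse_protein_hgvs(text))
```

Returning None for anything unexpected makes it easy to skip non-protein descriptions instead of failing on an index error.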

If anyone can provide a better method or code, your suggestions are most welcome.

ADD COMMENT • link updated 5.6 years ago by Ram 45k • written 13.7 years ago by Dpsguy ▴ 140

Guess I'll be immodest and accept my own answer.

ADD REPLY • link 13.7 years ago by Dpsguy ▴ 140