Question: Biopython Reporting "NotImplementedError" When Parsing XML Data Returned From SNP Database In NCBI
Paddie wrote:

I'm working on a simple SNP caller using NCBI's SNP database. Using Biopython's excellent library, I've successfully queried the SNP I'm interested in and verified that the query returns meaningful data. What I'm most interested in, however, is converting the returned data into Python objects, hence the Entrez.read/parse functions. I've lifted this example pretty much directly from the docs:

>>> from Bio import Entrez
>>> handle = Entrez.efetch("snp", id="1805007",rettype="xml")
>>> rec = Entrez.read(handle)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "Bio/Entrez/__init__.py", line 351, in read
    record = handler.read(handle)
  File "Bio/Entrez/Parser.py", line 169, in read
    self.parser.ParseFile(handle)
  File "Bio/Entrez/Parser.py", line 254, in startNamespaceDeclHandler
    raise NotImplementedError("The Bio.Entrez parser cannot handle XML data that make use of XML namespaces")
NotImplementedError: The Bio.Entrez parser cannot handle XML data that make use of XML namespaces

Is the database simply not returning proper XML (that seems to be the case according to Google Chrome), or is this simply a TODO in the library? Is there a different format that the library definitely understands?

Thanks!

Tags: python, biopython, snp
modified 5.8 years ago by Daniel E Cook • written 8.6 years ago by Paddie

The SNP database uses quite different XML from the rest of NCBI. Check out this question for some workarounds: http://biostar.stackexchange.com/questions/12262/find-amino-acid-change-for-snp-using-eutils

written 8.6 years ago by David W

Brilliant! That's exactly what I needed - can't believe I didn't find it on my first search. Or think about using a different parser :P

written 8.6 years ago by Paddie
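
One way to follow the "different parser" idea from this thread is the standard library's xml.etree.ElementTree, which accepts namespaced XML. The sketch below is only an illustration: the sample document, its namespace URI, and the element and attribute names are made up and are not the real dbSNP schema. With real data, the string would be whatever Entrez.efetch(...).read() returns, and the namespace and paths would need to match that response.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for a namespaced response; the namespace URI,
# element names and attributes are illustrative, not the real dbSNP schema.
SAMPLE_XML = """<?xml version="1.0"?>
<ExchangeSet xmlns="http://example.org/snp-docsum">
  <Rs rsId="1805007" snpClass="snp">
    <Sequence><Observed>C/T</Observed></Sequence>
  </Rs>
  <Rs rsId="12009" snpClass="snp">
    <Sequence><Observed>A/G</Observed></Sequence>
  </Rs>
</ExchangeSet>
"""

# Map a short prefix to the document's default namespace so that the
# search paths below stay readable.
NS = {"snp": "http://example.org/snp-docsum"}


def parse_snps(xml_text):
    """Return {rsId: observed alleles} from a namespaced XML string."""
    root = ET.fromstring(xml_text)
    result = {}
    for rs in root.findall("snp:Rs", NS):
        observed = rs.findtext("snp:Sequence/snp:Observed",
                               default="", namespaces=NS)
        result[rs.get("rsId")] = observed
    return result


snps = parse_snps(SAMPLE_XML)
print(snps)
```

The prefix in NS is arbitrary; only the URI has to match the xmlns declared in the document being parsed.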
Daniel E Cook wrote:

I wrote a function that may be of some help:

from Bio import Entrez


def _extract(var_set, fields):
    """Apply every rule in var_set to one line that was split on ' | '."""
    found = {}
    for k, v in var_set.items():
        try:
            if isinstance(v, str):
                # Rule 1: field of the form "variablename=value"
                found[k] = [x for x in fields if x.startswith(v)][0].replace(v, '')
            elif isinstance(v, int):
                # Rule 2: field at a fixed position within the line
                found[k] = fields[v]
            else:
                # Rule 3: field that is one of a limited set of values
                found[k] = [x for x in fields if x in v][0]
        except IndexError:
            # The field is missing from this line, so skip it
            pass
    return found


def pull_vars(var_set, line_start, line, multi=False):
    """
    This function parses data from flat files in one of three ways:

    1.) Pulls variables out of a particular line when defined as "variablename=[value]"
    2.) Pulls variables based on a set position within a line.
    3.) Defines variables that can be identified based on a limited possible set of values.

    `line` is the list of lines of one record; only the lines that start
    with `line_start` are parsed. Returns None if there is no such line.
    """
    lineset = [x.split(' | ') for x in line if x.startswith(line_start)]
    if len(lineset) == 0:
        return None
    if multi:
        # The same line type can occur multiple times: return a list of dicts
        return [_extract(var_set, fields) for fields in lineset]
    # The line type is always unique: return a single dictionary
    return _extract(var_set, lineset[0])

def get_snp(q):
    """ 
    This function takes as input a list of snp identifiers and returns 
    a parsed dictionary of their data from Entrez.
    """

    response = Entrez.efetch(db='SNP', id=','.join(q), rettype='flt', retmode='flt').read()
    r = {} # Return dictionary variable
    # Parse flat file response
    for snp_info in filter(None,response.split('\n\n')):
        # Parse the First Line. Details of rs flat files available here:
        # ftp://ftp.ncbi.nlm.nih.gov/snp/specs/00readme.txt
        snp = snp_info.split('\n')
        # Parse the 'rs' line:
        rsId = snp[0].split(" | ")[0]
        r[rsId] = {}

        # rs vars
        rs_vars = {"organism":1,
                   "taxId":2,
                   "snpClass":3,
                   "genotype":"genotype=",
                   "rsLinkout":"submitterlink=",
                   "date":"updated "}

        # ss vars
        ss_vars = {"ssId":0,
                   "handle":1,
                   "locSnpId":2,
                   "orient":"orient=",
                   "exemplar":"ss_pick=",
                   }

        # SNP line variables:
        SNP_vars = {"observed":"alleles=",
                    "value":"het=",
                    "stdError":"se(het)=",
                    "validated":"validated=",
                    "validProbMin":"min_prob=",
                    "validProbMax":"max_prob=",
                    "validation":"suspect=",
                    "AlleleOrigin":['unknown','germline','somatic','inherited','paternal','maternal','de-novo','bipaternal','unipaternal','not-tested','tested-inconclusive'],
                    "snpType":['notwithdrawn','artifact','gene-duplication','duplicate-submission','notspecified','ambiguous-location;','low-map-quality']}

        # CLINSIG line variables:
        CLINSIG_vars = {"ClinicalSignificance":['probable-pathogenic','pathogenic','other']}

        # GMAF line variables
        GMAF_vars = {"allele":"allele=",
                     "sampleSize":"count=",
                     "freq":"MAF="}

        # CTG line variables
        CTG_vars = {"groupLabel":"assembly=",
                    "chromosome":"chr=",
                    "physmapInt":"chr-pos=",
                    "asnFrom":"ctg-start=",
                    "asnTo":"ctg-end=",
                    "loctype":"loctype=",
                    "orient":"orient="}

        # LOC line variables
        LOC_vars = {"symbol":1,
                    "geneId":"locus_id=",
                    "fxnClass":"fxn-class=",
                    "allele":"allele=",
                    "readingFrame":"frame=",
                    "residue":"residue=",
                    "aaPosition":"aa_position="}

        # SEQ line variables
        SEQ_vars = {"gi":1,
                    "source":"source-db=",
                    "asnFrom":"seq-pos=",
                    "orient":"orient="}

        r[rsId]['rs']       = pull_vars(rs_vars,"rs",snp)
        r[rsId]['ss']       = pull_vars(ss_vars,"ss",snp,True)
        r[rsId]['SNP']      = pull_vars(SNP_vars,"SNP",snp)
        r[rsId]['CLINSIG']  = pull_vars(CLINSIG_vars,"CLINSIG",snp)
        r[rsId]['GMAF']     = pull_vars(GMAF_vars,"GMAF",snp)
        r[rsId]['CTG']      = pull_vars(CTG_vars,"CTG",snp,True)
        r[rsId]['LOC']      = pull_vars(LOC_vars,"LOC",snp,True)
        r[rsId]['SEQ']      = pull_vars(SEQ_vars,"SEQ",snp,True)
    return r


snp = get_snp(["12009","122"])
written 7.4 years ago by Daniel E Cook
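
To make the three lookup rules described in the pull_vars docstring concrete, here is a condensed, offline sketch applied to a single line. It is not the function from the answer: pull_fields is a hypothetical helper written for this illustration, and the sample line is invented rather than taken from a real Entrez response.

```python
def pull_fields(var_set, fields):
    """Apply the three lookup rules to one already-split line."""
    out = {}
    for key, rule in var_set.items():
        if isinstance(rule, str):
            # Rule 1: "name=value" style field, matched by its prefix.
            hits = [f[len(rule):] for f in fields if f.startswith(rule)]
        elif isinstance(rule, int):
            # Rule 2: fixed position within the line.
            hits = fields[rule:rule + 1]
        else:
            # Rule 3: whichever field is one of the allowed values.
            hits = [f for f in fields if f in rule]
        if hits:
            out[key] = hits[0]
    return out


# Hypothetical flat-file line, for illustration only.
line = "SNP | alleles=C/T | het=0.12 | se(het)=0.2 | germline"
fields = line.split(" | ")

snp_vars = {
    "label": 0,                                         # rule 2
    "observed": "alleles=",                             # rule 1
    "value": "het=",                                    # rule 1
    "AlleleOrigin": ["unknown", "germline", "somatic"],  # rule 3
    "validated": "validated=",  # absent from the line, so it is skipped
}

parsed = pull_fields(snp_vars, fields)
print(parsed)
```

Keys whose rule finds nothing are left out of the result, which mirrors how the answer's function silently skips missing fields.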

Hi Daniel, I can't find HGVS for a SNP. How can I retrieve HGVS data for a SNP with your custom function?

Thanks for your reply.

written 5.8 years ago by alexey.zf
Daniel E Cook wrote:

Hey Alexey - I am just seeing this reply. Did you ever figure it out? I may be able to track down an answer for you...

Dan

written 5.8 years ago by Daniel E Cook