Question: Using STDIN with BioPython's PDB methods
2
gravatar for jrj.healey
7 months ago by
jrj.healey12k
United Kingdom
jrj.healey12k wrote:

I'm writing some code to get sequences back from PDB files, wrapping the BioPython functionality.

Here's the code at the moment:

#!/usr/bin/env python
'''Convert a PDB file in to its representative sequence.'''

import argparse
import sys
import os

try:
    from Bio.PDB import *
    from Bio import SeqIO
except ImportError:
    sys.stderr.write('Could not import from the Bio module. Try installing with `python -m pip install biopython`')
    sys.exit(1)


def get_args():
    """Parse command line arguments"""

    try:
        parser = argparse.ArgumentParser(
            description='Convert a PDB file to its representative sequence(s) via BioPython.')
        parser.add_argument('-i', '--input', action='store', default=sys.stdin,
                            help='Input PDB file (defaults from STDIN for pipes).')
        parser.add_argument('-o', '--output', action='store', default=sys.stdout,
                            help='Output sequence file (defaults to STDOUT for pipes).')
        if len(sys.argv) == 1:
            parser.print_help(sys.stderr)
            exit(1)
    except:
        sys.stderr.write("An exception occurred with argument parsing. Check your provided options.")

    return parser.parse_args()


def main():
    args = get_args()
    PDB, PPB = PDBParser(), PPBuilder()
    if args.input is sys.stdin:
        structure = PDB.get_structure("File from stdin", args.input)
    else:
        structure = PDB.get_structure(os.path.basename(args.input), args.input)

    sequences = [peptide.get_sequence() for peptide in PPB.build_peptides(structure)]
    for each in sequences:
        print(each)

if __name__ == '__main__':
    main()

Now, the problem arises that I want to be able to pipe a PDB file in to the script as well as be able to pass files. Ordinarily, this would be pretty trivial via sys.stdin, however BioPython seemingly does something a little unusual with it's file objects.

If I pass open(sys.stdin), it (rightly) protests that it expects a string for coercing to Unicode etc, but recieved a file object. Ok, says I, it should be enough to pass sys.stdin to the get_structure() constructor method, as per the line

            structure = PDB.get_structure("File from stdin", args.input)

However this results in the following error:

  File "pdb2seq.py", line 55, in <module>
    main()
  File "pdb2seq.py", line 46, in main
    structure = PDB.get_structure("File from stdin", args.input)
  File "/home/wms_joe/bin/miniconda2/lib/python2.7/site-packages/Bio/PDB/PDBParser.py", line 86, in get_structure
    self._parse(handle.readlines())
ValueError: I/O operation on closed file

Seems obvious enough, the file is closed some how. The BioPython source for this method does the following:

  def get_structure(self, id, file): 
 69          """Return the structure. 
 70   
 71          Arguments: 
 72           - id - string, the id that will be used for the structure 
 73           - file - name of the PDB file OR an open filehandle 
 74   
 75          """ 
 76          with warnings.catch_warnings(): 
 77              if self.QUIET: 
 78                  warnings.filterwarnings("ignore", category=PDBConstructionWarning) 
 79   
 80              self.header = None 
 81              self.trailer = None 
 82              # Make a StructureBuilder instance (pass id of structure as parameter) 
 83              self.structure_builder.init_structure(id) 
 84   
 85              with as_handle(file, mode='rU') as handle: 
 86                  self._parse(handle.readlines()) 
 87   
 88              self.structure_builder.set_header(self.header) 
 89              # Return the Structure instance 
 90              structure = self.structure_builder.get_structure() 
 91   
 92          return structure

So my guess is that BioPython's as_handle() method is doing something to the file handle resulting in it being closed prematurely? However, I can't see anything obvious in as_handle()'s source that would result in the handle being closed?

46  def as_handle(handleish, mode='r', **kwargs): 
 47      r"""Context manager to ensure we are using a handle. 
 48   
 49      Context manager for arguments that can be passed to SeqIO and AlignIO read, write, 
 50      and parse methods: either file objects or path-like objects (strings, pathlib.Path 
 51      instances, or more generally, anything that can be handled by the builtin 'open' 
 52      function). 
 53   
 54      When given a path-like object, returns an open file handle to that path, with provided 
 55      mode, which will be closed when the manager exits. 
 56   
 57      All other inputs are returned, and are *not* closed. 
 58   
 59      Arguments: 
 60       - handleish  - Either a file handle or path-like object (anything which can be 
 61                      passed to the builtin 'open' function: str, bytes, and under 
 62                      Python >= 3.6, pathlib.Path, os.DirEntry) 
 63       - mode       - Mode to open handleish (used only if handleish is a string) 
 64       - kwargs     - Further arguments to pass to open(...) 
 65   
 66      Examples 
 67      -------- 
 68      >>> with as_handle('seqs.fasta', 'w') as fp: 
 69      ...     fp.write('>test\nACGT') 
 70      >>> fp.closed 
 71      True 
 72   
 73      >>> handle = open('seqs.fasta', 'w') 
 74      >>> with as_handle(handle) as fp: 
 75      ...     fp.write('>test\nACGT') 
 76      >>> fp.closed 
 77      False 
 78      >>> fp.close() 
 79   
 80      Note that if the mode argument includes U (for universal new lines) 
 81      this will be removed under Python 3 where is is redundant and has 
 82      been deprecated (this happens automatically in text mode). 
 83   
 84      """ 
 85      # If we're running under a version of Python that supports PEP 519, try 
 86      # to convert `handleish` to a string with `os.fspath`. 
 87      if hasattr(os, 'fspath'): 
 88          try: 
 89              handleish = os.fspath(handleish) 
 90          except TypeError: 
 91              # handleish isn't path-like, and it remains unchanged -- we'll yield it below 
 92              pass 
 93   
 94      if isinstance(handleish, basestring): 
 95          if sys.version_info[0] >= 3 and "U" in mode: 
 96              mode = mode.replace("U", "") 
 97          if 'encoding' in kwargs: 
 98              with codecs.open(handleish, mode, **kwargs) as fp: 
 99                  yield fp 
100          else: 
101              with open(handleish, mode, **kwargs) as fp: 
102                  yield fp 
103      else: 
104          yield handleish

Anyone know what (if anything) I can do to make this work?

pipes biopython stdin i/o pdb • 388 views
ADD COMMENTlink modified 7 months ago • written 7 months ago by jrj.healey12k
2

You have forgotten to say what version of Biopython, Python, etc you are using. Also could you simplify your example?

Does this work for you - I was using Python 2.7.10 or Python 3.7.0rc1 with Biopython 1.73dev0 (i.e. a prerelease of 1.73 from git) and things seem fine?

$ python -c "import sys; from Bio.PDB import PDBParser; print(PDBParser().get_structure('XXX', sys.stdin))" < 1A8O.pdb
<Structure id=XXX>

It is unlikely to be the problem, but does that example file PDB file work for you? It is from Tests/PDB/ in the Biopython source code.

ADD REPLYlink modified 7 months ago • written 7 months ago by Peter5.8k

Ah good point thanks Peter:

Python 2.7.12 :: Continuum Analytics, Inc. and Python 3.6.0 :: Continuum Analytics, Inc. give the same issues.

Biopython is 1.72 on Py3 and 1.71 on Py2.

Your example code does work for me, and also works for the PDB file I was attempting previously. The only thing that occurs to me is that there is then perhaps an issue with how argparse yields its default instead?

It seems the following code omitting argparse works:

import sys
import os
from Bio.PDB import *
from Bio import SeqIO

def main():
    PDB, PPB = PDBParser(), PPBuilder()
#    if args.input is sys.stdin:
#        structure = PDB.get_structure("File from stdin", args.input)
#    else:
#        structure = PDB.get_structure(os.path.basename(args.input), args.input)
    structure = PDB.get_structure("File from stdin", sys.stdin)
    sequences = [peptide.get_sequence() for peptide in PPB.build_peptides(structure)]
    for each in sequences:
        print(each)

if __name__ == '__main__':
    main()
ADD REPLYlink modified 7 months ago • written 7 months ago by jrj.healey12k

Strange but perhaps you are right and something happens inside argparse? I would suggest setting the defaults to the string "-" (minus sign) in the command line API, and when you come to call get_structure use:

if args.input == "-":
     structure = PDB.get_structure("File from stdin", sys.stdin)
else:
     structure = PDB.get_structure(os.path.basename(args.input), args.input)

Some tools support the minus sign as short hand for stdin or stdout, so this is a bonus.

ADD REPLYlink written 7 months ago by Peter5.8k

Yeah I had thought about offering - as a shorthand instead, though it seemed to me it should be possible to do without it altogether. That certainly works as a work around, though I'm surprised argparse interferes like this. I've never come across an issue like that before.

ADD REPLYlink written 7 months ago by jrj.healey12k

I can see from the code you care about the peptide fragments as inferred from the structural data (breaking at discontinuities), otherwise it would be one line to get the PDB sequence out as FASTA format:

$ python -c "import sys; from Bio import SeqIO; SeqIO.convert(sys.stdin, 'pdb-atom', sys.stdout, 'fasta')" < 1A8O.pdb
>1A8O:A
MDIRQGPKEPFRDYVDRFYKTLRAEQASQEVKNWMTETLLVQNANPDCKTILKALGPGAT
LEEMMTACQG
ADD REPLYlink written 7 months ago by Peter5.8k

Yep, in this particular case I care about preserving the fragments (though do not care about heteroatoms etc).

ADD REPLYlink modified 7 months ago • written 7 months ago by jrj.healey12k
3
gravatar for jrj.healey
7 months ago by
jrj.healey12k
United Kingdom
jrj.healey12k wrote:

I haven't gotten to the bottom of what it is that BioPython doesn't like about argparse, but the solution I've come up with to allow cat A123.pdb | python script.py to work is the following (using Peter's suggestion):

# Imports etc abridged...
def get_args():    
    try:
        parser = argparse.ArgumentParser(
            description='Convert a PDB file to its representative sequence(s) via BioPython.')
        parser.add_argument('-i', '--input', action='store',
                            help='Input PDB file (defaults from STDIN for pipes).')
        parser.add_argument('-o', '--output', default=sys.stdout, action='store',
                            help='Output sequence file (defaults to STDOUT for pipes).')
    except:
        sys.stderr.write("An exception occurred with argument parsing. Check your provided options.")
        sys.exit(1)

    return parser.parse_args()


def main():
    args = get_args()
    PDB, PPB = PDBParser(), PPBuilder()
    if (args.input is None) or (args.input == '-'):
        structure = PDB.get_structure("XXX", sys.stdin)
    else:
        structure = PDB.get_structure(os.path.basename(args.input), args.input)

    sequences = [peptide.get_sequence() for peptide in PPB.build_peptides(structure)]
    for each in sequences:
        print(each)

if __name__ == '__main__':
    main()
ADD COMMENTlink modified 7 months ago • written 7 months ago by jrj.healey12k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1209 users visited in the last hour