Hello everyone,
I'm helping to write a Python application for storage and simple processing of RNA/DNA/protein sequences. I come from a software background and have no formal biology training at all (beyond high school). I just have a question about how/why Biopython pulls information from GenBank.
I'm using Biopython to get a SeqRecord object for a GenBank ID using this function:
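from Bio import Entrez, SeqIO

def get_seqrecord(gid, db="nucleotide"):
    # Fetch the GenBank flat file for this ID and parse it into a SeqRecord.
    try:
        handle = Entrez.efetch(db=db, id=gid, rettype="gb", retmode="text")
        return SeqIO.read(handle, "genbank")
    except Exception:
        # Any fetch or parse failure yields None.
        return None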
This is fine, but the user also wants to store the "molecule type" for the sequence, from the LOCUS line, e.g.
LOCUS JX978171 39269 bp DNA linear INV 14-MAR-2013
would store "DNA"
LOCUS AY994149 2957 bp mRNA linear INV 01-JUN-2005
would store "mRNA".
From what I can tell, Biopython parses that line, extracts the molecule type, and uses it to set the sequence alphabet type. In the process, the precise molecule type is lost.
My questions are:
- Is there a reason it does this? Should I not be relying on the molecule type from the LOCUS line?
- Short of manually parsing the LOCUS line (which I would just do by copy-pasting the relevant Biopython code into my own function), is there a way I can get this information for a particular GenBank ID?
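For reference, the manual fallback I have in mind is roughly the sketch below. It is my own whitespace-splitting approach rather than Biopython's code, and locus_molecule_type is just a name I made up. I'm assuming the molecule type is the token right after the "bp" unit, and that protein records (which use "aa") carry no molecule type, so it may well miss older or unusual LOCUS layouts.

```python
def locus_molecule_type(locus_line):
    """Return the molecule type token from a GenBank LOCUS line, or None."""
    tokens = locus_line.split()
    if not tokens or tokens[0] != "LOCUS":
        return None
    # Assumption: the molecule type directly follows the "bp" length unit.
    # Protein records use "aa" instead and have no molecule type.
    if "bp" not in tokens:
        return None
    idx = tokens.index("bp") + 1
    if idx >= len(tokens):
        return None
    candidate = tokens[idx]
    # If the molecule type is absent, the next token is the topology.
    if candidate in ("linear", "circular"):
        return None
    return candidate


print(locus_molecule_type("LOCUS JX978171 39269 bp DNA linear INV 14-MAR-2013"))
print(locus_molecule_type("LOCUS AY994149 2957 bp mRNA linear INV 01-JUN-2005"))
```

The idea would be to feed it the first line of the text returned by efetch before handing the rest to SeqIO, but I'd much rather not maintain my own parser if Biopython already exposes this somewhere.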
Thanks in advance for any help. Sorry if this is a really stupid question, I'm still learning huge swathes of this stuff.