Question

Sequence Parsed Through Biopython Different Than In The Database

0

Entering edit mode

10.7 years ago

Mateusz ▴ 70

Hi,

I am trying to access gene sequences of a given organism using BioPython. After search on a Gene database using e.g. "Escherichia coli[organism]" as query, I am downloading XML file containing genes locus and identifiers. Now, using those information I want to retrieve sequence from the entry in nucleotide database. For this purpose I type:

In [144]: handle = Entrez.efetch(db="nucleotide", id="NC_005327", rettype='gb', retmode='text')
In [145]: rec = SeqIO.read(handle, 'genbank')
In [146]: rec.seq
Out[146]: UnknownSeq(92353, alphabet = IUPACAmbiguousDNA(), character = 'N')
In [147]: print rec.seq.tostring()
Out[147]: NNNNNNNNNNNNNNNNNNNN (.....) NNNNNNNNNNNNNNNNNNNNNNNNN

When I am accessing nucleotide db entry of a given accession through the website, sequence is there and respective genes can be mapped onto it. What's wrong? Am I screwing something during parsing?

Thanks in advance for any insight!

sequence parsing biopython entrez • 3.5k views

ADD COMMENT • link updated 10.7 years ago by Peter 6.0k • written 10.7 years ago by Mateusz ▴ 70

score 1 · Answer 1 · 2013-08-20

The NCBI changed the Entrez behaviour so now retype='gb' can give your the annotation WITHOUT the actual sequence (GenBank lite?). Change it to retype='gbwithparts'.

You can see this for yourself by:

>>> print Entrez.efetch(db="nucleotide", id="NC_005327", rettype='gb', retmode='text').read()
...
     CDS             91849..92250
                     /gene="pemK"
                     /locus_tag="pC15-1a_102"
                     /codon_start=1
                     /transl_table=11
                     /product="hypothetical protein"
                     /protein_id="NP_957647.1"
                     /db_xref="GI:41057027"
                     /db_xref="GeneID:2716471"
                     /translation="MLKYQLKNENGWMHRRLVRRKSDMERGEIWLVSLDPTAGHEQQG
                     TRPVLIVTPAAFNRVTRLPVVVPVTSGGNFARTAGFAVSLDGVGIRTTGVVRCDQPRT
                     IDMKARGGKRLERVPETIMNEVLGRLSTILT"
CONTIG      join(AY458016.1:1..92353)
//

Versus:

>>> print Entrez.efetch(db="nucleotide", id="NC_005327", rettype='gbwithparts', retmode='text').read()
...
    92221 cttggccgcc tgtccactat tctgacttga acatggggtt tgaggggcaa ctggatgaaa
    92281 acgtacgatt tagggcacta aaaccgctgt tgtcccacca ttctggtgat tcccaaacgt
    92341 tatttggcta aaa
//