SeqIO.index does not contain all records downloaded from NCBI


4.8 years ago

stueckmann.daniel • 0

Hello,

I'm trying to download a GenBank file containing about 20,000 sequences and then parse it in Python with Biopython's SeqIO.parse(). However, some of the records appear to be malformed, which causes the parser to throw an error, so I've been using SeqIO.index() instead to debug the files.
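One way to narrow down which records are malformed is to cut the file into per-record chunks before parsing, so that a single bad record cannot stop the whole run. The following is only a minimal sketch: `split_genbank_records` is a hypothetical helper name, and it assumes the usual GenBank flat-file layout in which each record ends with a line containing only `//`.

```python
from typing import Iterable, Iterator


def split_genbank_records(lines: Iterable[str]) -> Iterator[str]:
    """Yield the raw text of each record, one record at a time.

    Assumption: every complete record ends with a line that contains
    only "//" (the standard GenBank record terminator).
    """
    buffer = []
    for line in lines:
        buffer.append(line)
        if line.rstrip() == "//":
            # Terminator reached: emit the record collected so far.
            yield "".join(buffer)
            buffer = []
    # Leftover non-blank text without a terminator is emitted as well,
    # since it usually points to a truncated final record.
    if any(part.strip() for part in buffer):
        yield "".join(buffer)
```

Each chunk could then be wrapped in `io.StringIO` and passed to `SeqIO.read(handle, "genbank")` inside a `try`/`except`, which would show which individual records fail to parse.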

If I download the sequences as a GenBank file and run

from Bio import SeqIO

d = SeqIO.index('filepath.gb', 'gb')
len(d)

it gives me a length of 211 instead of over 20,000. Stranger still, if I download the same sequences as a GenBank (full) file, the same code gives me a length of 740.
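To check whether the downloads themselves are incomplete, the index length can be compared with a raw count of record markers in the file. This is a rough sketch rather than a definitive check: `count_genbank_markers` is a hypothetical helper, and it assumes that each record starts with a `LOCUS` line and ends with a `//` line.

```python
from typing import Tuple


def count_genbank_markers(path: str) -> Tuple[int, int]:
    """Return (number of LOCUS lines, number of "//" terminator lines).

    Assumption: one LOCUS line and one "//" line per complete record.
    The file is streamed line by line, so multi-gigabyte files are fine.
    """
    locus_count = 0
    terminator_count = 0
    # errors="replace" keeps stray undecodable bytes from aborting the scan.
    with open(path, "r", errors="replace") as handle:
        for line in handle:
            if line.startswith("LOCUS "):
                locus_count += 1
            elif line.rstrip() == "//":
                terminator_count += 1
    return locus_count, terminator_count
```

If the LOCUS count is also far below 20,000, the download was probably cut short. If the two counts differ from each other, at least one record is likely truncated or missing its terminator.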

My questions are: 1) why would these two files have different numbers of entries, and 2) how can I get access to all 20,000 sequences?

If anyone is interested in re-creating this, here is the search I ran in the Nucleotide database:

(("Bacillus"[porgn txid1386] AND ( "300000"[SLEN] : "10000000"[SLEN] ))) NOT project

The corresponding GenBank files are 790 MB and 8.7 GB (full). Any help would be greatly appreciated.

biopython genbank sequence ncbi • 643 views

updated 4.7 years ago by Biostar 20 • written 4.8 years ago by stueckmann.daniel • 0
