Hello,
I'm trying to download a GenBank file containing about 20,000 sequences and then parse it in Python using Biopython's SeqIO.parse(). However, some of the entries are malformed, which causes the parser to throw an error, so I've instead been using SeqIO.index() to debug the files.
If I download the sequences as a GenBank file and run
from Bio import SeqIO
d = SeqIO.index('filepath.gb', 'gb')
len(d)
it gives me a length of 211 instead of over 20,000. Even stranger, if I download the sequences as a GenBank (full) file, the same code gives me a length of 740.
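As a cross-check that does not depend on Biopython at all, the file can be scanned for record markers directly. The following is only a minimal sketch: it assumes every GenBank record starts with a `LOCUS` line in the first column and ends with a line containing just `//`, and `count_genbank_markers` is a hypothetical helper name, not part of Biopython. The in-memory sample is made up for illustration.

```python
import io


def count_genbank_markers(handle):
    """Count 'LOCUS' header lines and '//' terminator lines in a text handle.

    Assumption: each record begins with a LOCUS line at column 0 and
    ends with a line that contains only '//'.
    """
    locus = 0
    terminators = 0
    for line in handle:  # streams line by line, so multi-GB files are fine
        if line.startswith("LOCUS") and line[5:6].isspace():
            locus += 1
        elif line.rstrip() == "//":
            terminators += 1
    return locus, terminators


# Tiny made-up sample: two record headers, but the second record has no
# terminator, as might happen if a download were cut short.
SAMPLE = (
    "LOCUS       FAKE000001\n"
    "DEFINITION  first made-up entry.\n"
    "//\n"
    "LOCUS       FAKE000002\n"
    "DEFINITION  second made-up entry.\n"
)
counts = count_genbank_markers(io.StringIO(SAMPLE))

# On the real file this would be something like:
# with open("filepath.gb") as handle:
#     print(count_genbank_markers(handle))
```

If the two counts are far below 20,000, that might point to an incomplete download rather than a parsing problem. If they are close to 20,000 while len(d) is not, the discrepancy is more likely on the parsing side.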
My questions are: 1) why would these two files have different numbers of entries, and 2) how can I get access to all 20,000 sequences?
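For question 2, one approach that might work is to split the file into per-record chunks on the `//` terminator lines and parse each chunk separately, so that a malformed entry is reported and skipped instead of stopping the whole run. This is a sketch under assumptions, not a tested fix for these particular files: `iter_record_chunks`, `parse_tolerantly`, and `parse_with_biopython` are hypothetical helper names, and the only Biopython call used is `SeqIO.read()`.

```python
import io


def iter_record_chunks(handle):
    """Yield the raw text of each record, split on '//' terminator lines."""
    lines = []
    for line in handle:
        lines.append(line)
        if line.rstrip() == "//":
            yield "".join(lines)
            lines = []
    # Leftover text with no terminator (for example a truncated final record).
    if any(l.strip() for l in lines):
        yield "".join(lines)


def parse_tolerantly(handle, parse_one):
    """Yield (index, record, error) for every chunk in the handle.

    `parse_one` is any callable that turns one record's text into an object.
    Exceptions are caught broadly on purpose, since the goal is debugging.
    """
    for index, chunk in enumerate(iter_record_chunks(handle)):
        try:
            yield index, parse_one(chunk), None
        except Exception as exc:
            yield index, None, str(exc)


def parse_with_biopython(text):
    """Parse a single record's text with Biopython (imported lazily)."""
    from Bio import SeqIO  # requires Biopython to be installed

    return SeqIO.read(io.StringIO(text), "genbank")


# Possible usage on the real file:
# with open("filepath.gb") as handle:
#     for index, record, error in parse_tolerantly(handle, parse_with_biopython):
#         if error is not None:
#             print(f"record {index} failed: {error}")
```

Because the results are yielded one at a time, the 8.7 GB file never has to be held in memory, and the list of failing indices shows which entries to inspect by hand.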
If anyone is interested in re-creating the search, I ran the following query against the Nucleotide database:
(("Bacillus"[porgn txid1386] AND ( "300000"[SLEN] : "10000000"[SLEN] ))) NOT project
The corresponding GenBank files are 790 MB and 8.7 GB (full). Any help would be greatly appreciated.