SeqIO.index does not contain all records downloaded from NCBI
4.8 years ago

Hello,

I'm trying to download a GenBank file containing 20,000 sequences and then parse it in Python using Biopython's SeqIO.parse(). However, some of the files are malformed, which causes SeqIO.parse() to throw an error, so I've instead been using SeqIO.index() to debug the files.

If I download the sequences as a GenBank file and run

from Bio import SeqIO
d = SeqIO.index('filepath.gb', 'gb')
len(d)

This gives a length of 211 instead of over 20,000. Even stranger, if I download the sequences as a GenBank (full) file, the same code gives a length of 740.
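One way to check whether the downloaded file itself is short, independently of Biopython, might be to count record boundaries directly. The sketch below is only an illustration: `count_genbank_markers` is a hypothetical helper name, and it assumes an uncompressed GenBank flat file in which each record starts with a LOCUS line and ends with a line containing only `//`.

```python
def count_genbank_markers(path):
    """Count LOCUS header lines and '//' terminator lines in a GenBank flat file.

    Hypothetical helper for illustration; assumes a plain (uncompressed) file.
    """
    locus_count = 0
    terminator_count = 0
    # Read in binary mode, line by line, so a multi-GB file is never held
    # in memory and odd bytes cannot raise a decoding error.
    with open(path, "rb") as handle:
        for line in handle:
            if line.startswith(b"LOCUS "):
                locus_count += 1
            elif line.rstrip() == b"//":
                terminator_count += 1
    return locus_count, terminator_count

# Example call (the path is a placeholder):
# locus_count, terminator_count = count_genbank_markers('filepath.gb')
```

If the LOCUS count is close to 20,000 while len(d) is not, the problem would seem to lie in how the file is being indexed; if the LOCUS count is itself small, or larger than the terminator count, the download may have been truncated.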

My questions are: 1) why would these two files have different numbers of entries, and 2) how can I get access to all 20,000 sequences?

If anyone is interested in re-creating the search, I ran the following query against the Nucleotide database:

(("Bacillus"[porgn txid1386] AND ( "300000"[SLEN] : "10000000"[SLEN] ))) NOT project

The corresponding GenBank files are 790 MB and 8.7 GB (full). Any help would be greatly appreciated.

biopython genbank sequence ncbi • 643 views