I've been trying to generate a single file containing all the bacterial RefSeq sequences from NCBI. I followed a similar previous discussion, Ncbi Refseq Viral Genomes, and slightly modified the Perl script posted there by bw, changing:
$organism = 'viruses' to $organism= 'bacteria'
This script picked up over 2 million sequences, and then I realized that it was picking up not only complete genomic assemblies but also every contig from shotgun sequencing. Any help modifying bw's script to exclude the contigs, or any alternative method to accomplish this, would be appreciated.
I've been trying to generate a single file containing all the bacterial RefSeq
I would not recommend storing all bacterial full-length genomic sequences in a single FASTA file. You will not be able to handle such a huge file efficiently in practice. For any large data collection you need an index. The easiest way to create an index is to exploit the file system: create a directory and store each sequence in a separate file. The filenames in that directory are then the index.
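If you already have a multi-FASTA file, the one-file-per-sequence layout can be produced with a short script. The following is only a sketch in Python (not the Perl script from the linked thread); the function name `split_fasta`, the `.fasta` extension, and the choice of the first header token as the filename are assumptions made for illustration.

```python
import re
from pathlib import Path


def split_fasta(fasta_path, out_dir):
    """Write every record of a multi-FASTA file to its own file in out_dir.

    The filename is the first whitespace-delimited token of the header
    line (typically the accession). Returns the paths written, in order.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    handle = None
    try:
        with open(fasta_path) as fasta:
            for line in fasta:
                if line.startswith(">"):
                    if handle is not None:
                        handle.close()
                    tokens = line[1:].split()
                    seq_id = tokens[0] if tokens else "unnamed"
                    # Replace characters that are unsafe in filenames
                    safe_id = re.sub(r"[^A-Za-z0-9._-]", "_", seq_id)
                    path = out_dir / f"{safe_id}.fasta"
                    # Note: a repeated ID overwrites the earlier file
                    handle = open(path, "w")
                    written.append(path)
                if handle is not None:
                    # Lines before the first header are skipped
                    handle.write(line)
    finally:
        if handle is not None:
            handle.close()
    return written
```

Calling `split_fasta("all.fasta", "genomes")` (both names hypothetical) would leave one file per record in `genomes/`, so a sequence can be found by its accession without scanning the whole collection.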
$organism='viruses' to $organism='bacteria'
You have to find an appropriate query term for Eutils that returns only the sequences you are interested in.
'Bacteria[Organism]' will restrict search to eubacterial sequences
'complete[Properties]' will restrict search to sequences tagged as complete (including WGS)
'WGS[Properties]' will restrict search to contigs from WGS genomes
'srcdb_refseq[prop]' will restrict search to sequences which have been promoted into the non-redundant NCBI RefSeq database
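Filters like those listed above are combined with boolean operators into a single Entrez term and passed to ESearch. The sketch below, in Python, only builds the term and the request URL without contacting NCBI. The helper names `build_term` and `esearch_url` are hypothetical, and the `nuccore` database is an assumption; check it against what your script actually queries.

```python
from urllib.parse import urlencode

# Base URL of the NCBI E-utilities ESearch endpoint
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_term(include, exclude=()):
    """Join filters with AND, then append each exclusion with NOT."""
    term = " AND ".join(include)
    for flt in exclude:
        term += f" NOT {flt}"
    return term


def esearch_url(term, db="nuccore"):
    """Return an ESearch URL that stores the hits on the history server.

    retmax=0 asks only for the hit count; usehistory=y keeps the result
    set on the server so that it can be fetched later in batches.
    """
    params = {"db": db, "term": term, "usehistory": "y", "retmax": 0}
    return f"{ESEARCH}?{urlencode(params)}"


term = build_term(
    ["Bacteria[Organism]", "complete[Properties]", "srcdb_refseq[prop]"],
    exclude=["WGS[Properties]"],
)
print(esearch_url(term))
```

`urlencode` takes care of escaping the brackets and spaces in the term, which is easy to get wrong when the URL is assembled by hand.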
Thus you may use the query "Bacteria[Organism] AND complete[Properties] NOT WGS[Properties] AND srcdb_refseq[prop]". You can try it on the command line: