A strategy to get bacterial NCBI RefSeq sequences, excluding contigs, in one file
9.1 years ago
akh22 ▴ 110

I've been trying to generate a single file containing all the bacterial RefSeq sequences from NCBI. I followed a similar previous discussion, Ncbi Refseq Viral Genomes, and slightly modified the Perl script posted there by bw, changing:

$organism = 'viruses' to $organism = 'bacteria'

This script picked up over 2 million sequences, and I then realized that it was picking up both complete genomic assemblies and the individual contigs from shotgun sequencing. Any help modifying bw's script to exclude contigs, or any alternative method to accomplish this, would be appreciated.

9.1 years ago
piet ★ 1.8k

I've been trying to generate a single file containing all the bacterial RefSeq

I would not recommend storing all bacterial full-length genomic sequences in a single FASTA file. You will not be able to handle such a huge file efficiently in practice. Any large data collection needs an index. The easiest way to create one is to exploit the file system: create a directory and store each sequence in a separate file. The filenames in that directory then serve as the index.
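As a minimal sketch of this file-system index, the Python function below splits an existing multi-FASTA file into one file per record. The function name split_fasta and the ".fasta" suffix are illustrative choices, and it assumes that the first whitespace-delimited token of each header line is a unique identifier (for example an accession).

```python
from pathlib import Path


def split_fasta(fasta_path, out_dir):
    """Write each record of a multi-FASTA file to its own file in out_dir.

    Returns the list of file stems (one per record) in input order.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    names = []
    handle = None
    try:
        with open(fasta_path) as fasta:
            for line in fasta:
                if line.startswith(">"):
                    # A new record starts: close the previous output file.
                    if handle is not None:
                        handle.close()
                    # Assumption: the first header token is a unique identifier.
                    tokens = line[1:].split()
                    raw = tokens[0] if tokens else f"record_{len(names) + 1}"
                    # Replace characters that are awkward in file names.
                    name = raw.replace("|", "_").replace("/", "_")
                    names.append(name)
                    handle = open(out_dir / f"{name}.fasta", "w")
                # Lines before the first header are skipped.
                if handle is not None:
                    handle.write(line)
    finally:
        if handle is not None:
            handle.close()
    return names
```

Calling split_fasta("all_bacteria.fasta", "refseq_index") (both names hypothetical) would leave one file per sequence in refseq_index, so a single genome can be looked up by file name without scanning the whole collection.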

$organism='viruses' to $organism='bacteria'

You have to find an appropriate query term for Eutils that returns only the sequences you are interested in.

  • 'Bacteria[Organism]' will restrict search to eubacterial sequences
  • 'complete[Properties]' will restrict search to sequences tagged as complete (including WGS)
  • 'WGS[Properties]' will restrict search to contigs from WGS genomes
  • 'srcdb_refseq[prop]' will restrict search to sequences which have been promoted into the non-redundant NCBI RefSeq database

Thus you may use the query "Bacteria[Organism] AND complete[Properties] NOT WGS[Properties] AND srcdb_refseq[prop]". You can try it on the command line:

wget -O - 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=nuccore&rettype=count&term=Bacteria[Organism]+AND+complete[Properties]+NOT+WGS[Properties]+AND+srcdb_refseq[prop]'
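The same count request can be sketched in Python using only the standard library. The function names below are illustrative, and the parser assumes that the esearch XML response carries the hit count in a Count element directly under the root element.

```python
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET

ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

# The query proposed above, written with plain spaces; urlencode escapes it.
QUERY = ("Bacteria[Organism] AND complete[Properties] "
         "NOT WGS[Properties] AND srcdb_refseq[prop]")


def build_esearch_url(term, db="nuccore"):
    """Build an esearch URL that asks only for the number of hits."""
    params = {"db": db, "rettype": "count", "term": term}
    return ESEARCH + "?" + urllib.parse.urlencode(params)


def parse_count(xml_text):
    """Extract the hit count from an esearch XML response.

    Assumption: the count is in a <Count> child of the root element.
    """
    root = ET.fromstring(xml_text)
    text = root.findtext("Count")
    if text is None:
        raise ValueError("no Count element in esearch response")
    return int(text)


def fetch_count(term):
    """Send the query to NCBI and return the number of matching records."""
    with urllib.request.urlopen(build_esearch_url(term), timeout=30) as resp:
        return parse_count(resp.read().decode("utf-8"))
```

Running fetch_count(QUERY) needs network access. Checking the count first is a cheap way to test a query term before downloading any sequences.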

Hi Piet,

Thanks for your response. The one giant FASTA file of all the bacterial RefSeq sequences will be used as the reference for a read assembly and subsequent data mining. It is simpler to have all 10631 reference sequences in a single file than to go through each of the 10631 sequences individually.
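If the sequences are stored one per file, as suggested in the answer above, a single reference file can still be produced on demand. The sketch below concatenates every file matching a pattern into one FASTA file. The function name concatenate_fasta and the "*.fasta" pattern are assumptions made for illustration.

```python
from pathlib import Path
import shutil


def concatenate_fasta(in_dir, out_path, pattern="*.fasta"):
    """Concatenate all files in in_dir matching pattern into out_path.

    Returns the number of files that were merged.
    """
    in_dir = Path(in_dir)
    out_path = Path(out_path)
    merged = 0
    with open(out_path, "wb") as out:
        for path in sorted(in_dir.glob(pattern)):
            # Skip the output file itself if it lives in the same directory.
            if path.resolve() == out_path.resolve():
                continue
            with open(path, "rb") as src:
                shutil.copyfileobj(src, out)
                # Make sure the next header starts on a fresh line.
                if src.tell() > 0:
                    src.seek(-1, 2)
                    if src.read(1) != b"\n":
                        out.write(b"\n")
            merged += 1
    return merged
```

This keeps the per-sequence directory as the primary store and treats the combined file as a derived product that can be rebuilt whenever the collection changes.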
