Question: A strategy to get bacterial NCBI RefSeqs, excluding contigs, in one file
akh22 (United States) wrote, 4.3 years ago:

I've been trying to generate a single file containing all the bacterial RefSeq sequences from NCBI. I followed this similar previous discussion:

Ncbi Refseq Viral Genomes

and slightly modified the Perl script posted by bw:

A: Ncbi Refseq Viral Genomes

where I changed:

  1. $organism = 'viruses' to $organism = 'bacteria'

This script picked up over 2 million sequences, and then I realized that it was picking up complete genomic assemblies as well as every contig from shotgun sequencing. Any help modifying bw's existing script to exclude contigs, or any alternative method to accomplish this, would be appreciated.

Thanks.

piet (planet earth) wrote, 4.3 years ago:

>I've been trying to generate a single file containing all the bacterial RefSeq

I would not recommend storing all bacterial full-length genomic sequences in a single FASTA file. You will not be able to handle such a huge file efficiently in practice. Any large data collection needs an index, and the easiest way to create one is to exploit the file system: create a directory and store each sequence in a separate file. The filenames in that directory then serve as the index.
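
A minimal sketch of the directory-as-index idea, in Python rather than the Perl of bw's script. It assumes the first whitespace-delimited token of each FASTA header is a unique identifier (for example an accession); the function name split_fasta and the ".fasta" suffix are illustrative choices, not something from the original script.

```python
import os
import re


def split_fasta(fasta_path, out_dir):
    """Write each record of a multi-FASTA file to its own file in out_dir.

    Returns the list of file paths written, in input order.
    """
    os.makedirs(out_dir, exist_ok=True)
    written = []
    out = None
    with open(fasta_path) as handle:
        for line in handle:
            if line.startswith(">"):
                if out is not None:
                    out.close()
                # Assumption: the first token of the header is a unique ID.
                tokens = line[1:].split()
                seq_id = tokens[0] if tokens else f"record_{len(written) + 1}"
                # Replace characters that are awkward in file names.
                safe_id = re.sub(r"[^A-Za-z0-9._-]", "_", seq_id)
                path = os.path.join(out_dir, safe_id + ".fasta")
                out = open(path, "w")
                written.append(path)
            if out is not None:
                # Header and sequence lines are copied unchanged.
                out.write(line)
    if out is not None:
        out.close()
    return written
```

Looking up a sequence is then just opening out_dir/<identifier>.fasta, and listing the directory gives the full set of identifiers.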

>$organism='viruses' to $organism='bacteria'

You have to find an appropriate query term for Eutils that returns only the sequences you are interested in.

  • 'Bacteria[Organism]' will restrict search to eubacterial sequences
  • 'complete[Properties]' will restrict search to sequences tagged as complete (including WGS)
  • 'WGS[Properties]' will restrict search to contigs from WGS genomes
  • 'srcdb_refseq[prop]' will restrict search to sequences which have been promoted into the non-redundant NCBI RefSeq database

Thus you may use the query "Bacteria[Organism] AND complete[Properties] NOT WGS[Properties] AND srcdb_refseq[prop]". You can try it on the command line:

wget -O - 'http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=nuccore&rettype=count&term=Bacteria[Organism]+AND+complete[Properties]+NOT+WGS[Properties]+AND+srcdb_refseq[prop]'
<eSearchResult>
        <Count>10631</Count>
</eSearchResult>
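
The same count query can be sketched in Python using only the standard library. This is an illustration, not part of bw's script: the function names are made up here, and the https scheme is an assumption (the command above uses http).

```python
import urllib.request
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

# Assumption: same host and path as the wget command, but over https.
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_term(organism="Bacteria"):
    """Assemble the query term from the four filters listed above."""
    return (f"{organism}[Organism] AND complete[Properties] "
            "NOT WGS[Properties] AND srcdb_refseq[prop]")


def esearch_count_url(term, db="nuccore"):
    """Build an esearch URL that asks only for the number of hits."""
    return ESEARCH + "?" + urlencode({"db": db, "rettype": "count", "term": term})


def parse_count(xml_text):
    """Extract the <Count> value from an eSearchResult document."""
    root = ET.fromstring(xml_text)
    return int(root.findtext("Count"))


def fetch_count(term):
    """Run the query against NCBI (requires network access)."""
    with urllib.request.urlopen(esearch_count_url(term)) as resp:
        return parse_count(resp.read())
```

Calling fetch_count(build_term()) should report the current number of matching records; it will differ from the 10631 shown above as the database grows.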

 

akh22 (United States) wrote, 4.3 years ago:

Hi Piet, 

Thanks for your response. The one giant FASTA file of all the bacterial RefSeq sequences will be used as a reference for read assembly and subsequent data mining. It is simpler to have all 10631 reference sequences in a single file than to go through each of the 10631 sequences individually.

Aki
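
For the single-file use case, one possible approach is to fetch the search results in batches and append each batch to the same FASTA file. The sketch below is not bw's script; it assumes the search was run through esearch with usehistory=y, so that a WebEnv string and a query_key are available, and the batch size of 50 and the half-second pause are arbitrary illustrative values.

```python
import shutil
import time
import urllib.request
from urllib.parse import urlencode

# Assumption: efetch is reached over https on the same host as esearch.
EFETCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def batch_starts(total, batch_size):
    """Return the retstart offsets needed to cover `total` records."""
    return list(range(0, total, batch_size))


def efetch_url(webenv, query_key, retstart, retmax, db="nuccore"):
    """Build an efetch URL for one batch of FASTA records."""
    params = {
        "db": db,
        "query_key": query_key,
        "WebEnv": webenv,
        "retstart": retstart,
        "retmax": retmax,
        "rettype": "fasta",
        "retmode": "text",
    }
    return EFETCH + "?" + urlencode(params)


def download_all(webenv, query_key, total, out_path, batch_size=50, pause=0.5):
    """Append every batch to one FASTA file (requires network access)."""
    with open(out_path, "wb") as out:
        for start in batch_starts(total, batch_size):
            url = efetch_url(webenv, query_key, start, batch_size)
            with urllib.request.urlopen(url) as resp:
                # Stream the response to disk instead of holding it in memory.
                shutil.copyfileobj(resp, out)
            # Pause between requests to avoid hammering the server.
            time.sleep(pause)
```

The resulting file can be very large, so an index built over it (or the per-sequence directory layout suggested earlier in the thread) may still be worth having alongside it.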

 

 
