Tool: SubDBer: Flexible taxonomy-based subsampling of sequence databases
4.8 years ago, qiyunzhu (United States) wrote:

Dear community,

I present SubDBer, a small program that builds customized sequence databases by subsampling large, standard databases. The resulting database contains only the taxa of interest, sampled as densely or as sparsely as the user specifies.

The sampling criteria are flexible. For example, starting from NCBI nr, you may want to include only Proteobacteria sequences and keep just one organism per genus, while still retaining all Escherichia spp. The program can do this.

This program may be useful in various situations. It saves considerable computation time by discarding large amounts of unwanted sequences, and it concentrates the search on the narrow taxonomic range that actually interests the user (an evenly truncated database cannot do this). Moreover, because user-defined databases are easy to create, users can keep their toolkits up to date and standardized.



python -in nt -out sub_nt -outfmt blast -within 2 -exclude 1117 -rank genus -size 1 -keep 816,838,1263


This command takes the NCBI nt database as input -> starts with all organisms from Bacteria (TaxID: 2) except for Cyanobacteria (1117) -> picks one representative organism per genus -> except for Bacteroides (816), Prevotella (838) and Ruminococcus (1263), for which all organisms are included -> creates a new BLAST database, sub_nt, that contains sequence data from the selected organisms.
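The selection logic described above can be sketched in a few lines of Python. This is only an illustration of the within / exclude / rank / size / keep idea, not SubDBer's actual code: the toy taxonomy, every TaxID of 9000 or above, the function names, and the rule for choosing a representative (lowest TaxID first) are all assumptions made for the example.

```python
from collections import defaultdict

# Toy taxonomy: child TaxID -> parent TaxID. IDs 2, 1117 and 816 come from the
# post; every ID >= 9000 is hypothetical, and intermediate ranks are omitted.
PARENT = {
    2: 1,                               # Bacteria -> root
    1117: 2,                            # Cyanobacteria -> Bacteria
    816: 2,                             # Bacteroides (genus) -> Bacteria
    9000: 2,                            # hypothetical genus
    9100: 1117,                         # hypothetical cyanobacterial genus
    9001: 9000, 9002: 9000, 9003: 9000, # organisms in genus 9000
    9101: 9100,                         # organism in genus 9100
    9201: 816, 9202: 816,               # organisms in Bacteroides
}
RANK = {816: "genus", 9000: "genus", 9100: "genus"}
ORGANISMS = [9001, 9002, 9003, 9101, 9201, 9202]


def lineage(taxid, parent):
    """Return taxid followed by all of its ancestors, up to the root."""
    path = [taxid]
    while path[-1] in parent:
        path.append(parent[path[-1]])
    return path


def subsample(organisms, parent, rank, within, exclude, group_rank, size, keep):
    """Select organisms under `within`, drop those under `exclude`, keep at
    most `size` per `group_rank` taxon, and keep everything under `keep`."""
    kept_whole = []
    groups = defaultdict(list)
    for org in sorted(organisms):
        path = lineage(org, parent)
        if within not in path or any(t in exclude for t in path):
            continue                    # outside the scope, or excluded
        if any(t in keep for t in path):
            kept_whole.append(org)      # exempt from subsampling
            continue
        group = next((t for t in path if rank.get(t) == group_rank), None)
        if group is None:
            kept_whole.append(org)      # assumption: no such rank -> retain
        else:
            groups[group].append(org)
    # Assumption: the representatives are simply the lowest TaxIDs per group.
    sampled = [org for members in groups.values() for org in members[:size]]
    return sorted(kept_whole + sampled)


selected = subsample(ORGANISMS, PARENT, RANK, within=2, exclude={1117},
                     group_rank="genus", size=1, keep={816})
print(selected)  # [9001, 9201, 9202]
```

In this toy run, the cyanobacterial organism is dropped, genus 9000 is reduced to a single representative, and both Bacteroides organisms are retained. A real run would then extract the sequences belonging to the selected organisms and build the output database from them.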

The program is available on GitHub. It has no dependencies.



Is there a publication with more information about your program?

Reply written 4.0 years ago by sarahschmedes