Question

Tool: SubDBer: Flexible taxonomy-based subsampling of sequence databases

8.9 years ago

qiyunzhu ▴ 130

Dear community,

I present SubDBer, a small program that lets users build customized sequence databases containing only the data of interest, at a chosen level of comprehensiveness, by resampling large, standard databases.

The sampling criteria are flexible. For example, starting with NCBI nr, a user may want to include only Proteobacteria sequences and keep just one organism per genus, while still retaining all Escherichia spp. The program can do this.

This program may be useful in various situations. It not only saves considerable computation time by discarding large numbers of unwanted sequences, but also concentrates the search on the narrow range that actually interests the user (evenly truncated databases cannot do this). Moreover, because user-defined databases are easy to create, users can keep their toolkit up to date and standardized.

Example:

python subDBer.py -in nt -out sub_nt -outfmt blast -within 2 -exclude 1117 -rank genus -size 1 -keep 816,838,1263

This command takes the NCBI nt database as input -> starts with all organisms from Bacteria (TaxID: 2) except for Cyanobacteria (1117) -> picks one representative organism per genus -> except for Bacteroides (816), Prevotella (838) and Ruminococcus (1263), in which all organisms are included -> creates a new BLAST database sub_nt that contains sequence data from the selected organisms.

The program is available on GitHub. It has no dependencies.

subsampling blast genome taxonomy • 1.8k views

updated 15 months ago by Ram 43k • written 8.9 years ago by qiyunzhu ▴ 130

Is there a publication with more information about your program?

8.2 years ago by sarahschmedes • 0