Question: Filter local BLAST DB by organism
daniello0 (Frankfurt) wrote 2.3 years ago:

Hi there,

There are several threads about creating a local BLAST database filtered by organism. With:

blastdbcmd -db nr -entry all -outfmt "%g %T" | awk ' { if ($2 == 9606) { print $1 } } ' | blastdbcmd -db nr -entry_batch - -out human_sequences.txt

it is possible to filter the DB for only human entries (txid: 9606). Nice!

But has anyone actually done this? Because of the file sizes, I split the job into 60 single jobs for the NR database. This works really well, except for parts 08, 15 and 34 of the NR database. Those jobs run ridiculously long, and the output files are 12 times bigger than the original DB file (at the point where I stopped the script). Also, the created files seem to contain redundant copies of some entries, which is what inflates the file size. Is this intended? Why would multiple copies of one FASTA entry be needed for creating the 'human NR' database later?

Any suggestions?


Why do you want to filter when you can restrict your search to certain organisms or IDs?

written 2.3 years ago by Prasad1.5k

That requires the -remote option, which queries the NCBI servers again. I assume this outsources the complete BLAST search, or is it only used for the organism restriction in this case?

written 2.3 years ago by daniello0

Using options such as -gilist, -seqidlist or -negative_gilist of the standalone BLAST tools, you can restrict the search locally.

written 2.3 years ago by Prasad1.5k

So I'll create a list using the first two thirds of the command above:

blastdbcmd -db nr -entry all -outfmt "%g %T" | awk ' { if ($2 == 9606) { print $1 } } ' > gi_list.list

and run with -gilist gi_list.list? That sounds so uncomplicated :) Thanks!

written 2.3 years ago by daniello0
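
The GI-list approach discussed above could be sketched roughly as follows. The function name filter_gis_by_taxid and the file names query.fasta and hits.txt are placeholders invented for this illustration; the awk filter assumes the whitespace-separated "%g %T" output (GI, then taxid) used in the commands above.

```shell
# Hypothetical helper: read "GI TAXID" pairs on stdin and print only
# the GIs whose taxid equals the first argument.
filter_gis_by_taxid() {
    awk -v taxid="$1" '$2 == taxid { print $1 }'
}

# Intended use (left commented out, since it needs a local nr database):
#   blastdbcmd -db nr -entry all -outfmt "%g %T" \
#       | filter_gis_by_taxid 9606 > gi_list.list
#   blastp -db nr -gilist gi_list.list -query query.fasta -out hits.txt
```

With this route the nr database itself stays untouched; only the search is limited to the listed GIs, so no filtered FASTA file has to be extracted first.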

You could remove duplicate GI IDs based on the protein name (except for unnamed/unknown entries). For example, 119597083, 119597084, 119597085 and 119597086 all code for "actin related protein 2/3 complex, subunit 1B, 41kDa, isoform CRA_a" and are 100% identical end to end. [I was surprised by the existence of multiple copies of the same sequence in the NR database.]

modified 2.3 years ago • written 2.3 years ago by Prasad1.5k
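
A minimal sketch of the name-based deduplication suggested above, not a definitive implementation. It assumes a tab-separated input of GI and protein title, one pair per line, which would have to be produced separately (for instance from blastdbcmd's %g and %t output fields). The function name dedup_gis_by_title is made up for this example, and matching "unnamed" or "unknown" anywhere in the title is an assumption about how such entries are labelled.

```shell
# Hypothetical helper: read "GI<TAB>title" lines on stdin and print one GI
# per distinct title. Entries whose title contains "unnamed" or "unknown"
# are always kept, because an identical placeholder name does not imply
# an identical protein.
dedup_gis_by_title() {
    awk -F '\t' '
        tolower($2) ~ /unnamed|unknown/ { print $1; next }
        !seen[$2]++ { print $1 }
    '
}
```

Note that identical names do not guarantee identical sequences, so it may be worth comparing the sequences themselves before discarding anything.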