Question: How To Get All Proteins Smaller Than 200 Amino Acids Out Of NCBI nr Database?
1
Niek De Klein (Netherlands) wrote 7.2 years ago:

I want to get all proteins from the NCBI nr database that are smaller than 200 amino acids. I want to use them to build a local database to BLAST a small target protein against. I tried downloading nr.gz from ftp://ftp.ncbi.nlm.nih.gov/blast/db/FASTA, which is described as:

Sequence databases in FASTA format for use with the stand-alone BLAST programs.
These databases must be formatted using formatdb before they can be used with BLAST.

This was the closest thing I could find to a file with all the FASTA sequences, but I have never worked with this database FASTA format, and because of the file's size (10 GB) I can only open it with less. From what I have seen, it seems to consist mostly of sequence headers.
So I'm looking for a way to download all nr protein sequences, or a different way to run a BLAST search against all proteins <= 200 a.a.

Thanks, Niek

ncbi fasta protein download • 2.8k views
6
Damian Kao (USA) wrote 7.2 years ago:

You can run your blast with an entrez query string of:

1:200[slen]

That'll restrict your subject sequence to be between 1 and 200 amino acids.


Does this lower the e-values because the database size gets smaller?

Reply written 7.2 years ago by Niek De Klein
1
Miguel Pignatelli wrote 7.2 years ago:

You can download the preformatted BLAST database and use the following command line to build a BLAST-formatted database containing all the sequences no longer than 200 aa:


fastacmd -p T -D 1 |
gawk '{if(substr($1,1,1) == ">") {if (NR>1) {printf "\n%s\t", substr($1,1,length($1)-1)} else {printf "%s\t", substr($1,1,length($1)-1)}} else {printf "%s", $0}} END{printf "\n"}' |
gawk 'BEGIN{OFS="\n"}length($2) < 201{print $1,$2}' |
formatdb -p T -n nr_smaller_than_200aa -i stdin

It first uses fastacmd to dump the BLAST database in FASTA format (if you already have it in FASTA, you can skip this step). The first gawk command converts the FASTA records to tab-delimited (tbl) format, one record per line. The second gawk command filters by length (<201 aa) and writes the records out in FASTA format again. Finally, formatdb converts the remaining sequences (<=200 aa) into a new database named "nr_smaller_than_200aa".
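
If you start from the nr.gz FASTA download rather than from a formatted database, the same length filter can be sketched as a small shell function. This is only a minimal sketch: the function name filter_fasta_by_length and the file names in the usage comment are made up for illustration, and it writes each kept sequence on a single line rather than preserving the original line wrapping.

```shell
# Hypothetical helper (name is illustrative): keep only FASTA records whose
# sequence is at most max_len residues long.
# Reads FASTA on stdin, writes FASTA on stdout, one sequence per line.
filter_fasta_by_length() {
    max_len="${1:-200}"
    awk -v max="$max_len" '
        # Print the buffered record if it is non-empty and short enough.
        function emit() {
            if (header != "" && length(seq) > 0 && length(seq) <= max + 0) {
                print header
                print seq
            }
        }
        # A header line closes the previous record and starts a new one.
        /^>/ { emit(); header = $0; seq = ""; next }
        # Any other line is sequence data; join wrapped lines together.
        { seq = seq $0 }
        END { emit() }
    '
}

# Example usage (file names are placeholders):
# gzip -dc nr.gz | filter_fasta_by_length 200 > nr_le200.fasta
```

Because the file is streamed record by record, it never has to fit in memory. With the newer BLAST+ tools, the filtered file could then be formatted with something like `makeblastdb -in nr_le200.fasta -dbtype prot -out nr_le200` in place of formatdb.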

0
Malcolm.Cook (Kansas, USA) wrote 7.2 years ago:

You can also BLAST against just NCBI's short nr proteins by providing the Entrez query '1:200[slen]' as a filter on the BLAST web page.

Or, if you prefer to run from the command line and don't want to download any FASTA databases, and assuming you've installed BLAST+ from NCBI, you can add these options to your blast command:

   -db nr -remote -entrez_query '1:200[slen]'
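
Put together, a complete call might look like the sketch below. The wrapper name blast_short_nr, the query file name, and the choice of tabular output (-outfmt 6) are assumptions for illustration; only the -db, -remote and -entrez_query options come from the answer above.

```shell
# Hypothetical wrapper (name is illustrative): search a protein query against
# NCBI nr on NCBI's servers, restricted to subjects of at most max_len residues.
blast_short_nr() {
    query_fasta="$1"       # path to the query protein FASTA file
    max_len="${2:-200}"    # upper bound used in the [slen] Entrez filter
    blastp -query "$query_fasta" \
           -db nr -remote \
           -entrez_query "1:${max_len}[slen]" \
           -outfmt 6
}

# Example usage (query.fasta is a placeholder):
# blast_short_nr query.fasta 200 > hits.tsv
```

The search runs remotely, so nothing has to be downloaded or formatted locally, at the cost of depending on NCBI's server load.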