Question

How To Get All Proteins Smaller Than 200 Amino Acids Out Of Ncbi Nr Database?


13.3 years ago

Niek De Klein ★ 2.6k

I want to get all proteins from the NCBI nr database that are smaller than 200 amino acids. I want to use them to build a local database to BLAST a small target protein against. I tried downloading nr.gz from ftp://ftp.ncbi.nlm.nih.gov/blast/db/FASTA, which is described as:

Sequence databases in FASTA format for use with the stand-alone BLAST programs.
These databases must be formatted using formatdb before they can be used with BLAST.

This was the closest thing I could find to getting all the FASTA sequences, but the database FASTA format is not something I have ever worked with. Because of the size of the file (10 GB) I can only manage to open it with less, and as far as I can see it seems to be mostly sequence headers.

So I'm looking for a way to download all nr protein sequences OR a different way to do a BLAST search against all proteins <= 200 a.a.

Thanks,
Niek

ncbi fasta protein • 4.8k views

updated 2.3 years ago by Ram 45k • written 13.3 years ago by Niek De Klein ★ 2.6k

score 7 · Answer 1 · 2012-03-19


13.3 years ago

Damian Kao 16k

You can run your blast with an entrez query string of:

1:200[slen]

That'll restrict your subject sequence to be between 1 and 200 amino acids.



Does this lower the e-values because the database size gets smaller?

13.3 years ago by Niek De Klein ★ 2.6k

score 1 · Answer 2 · 2012-03-19

You can download the preformatted BLAST database and use the following command line to build a BLAST-formatted database containing all the sequences of at most 200 aa:

fastacmd -p T -D 1 | gawk '{if(substr($1,1,1) == ">") {if (NR>1) {printf "\n%s\t", substr($1,1,length($1)-1)} else {printf "%s\t", substr($1,1,length($1)-1)}} else {printf "%s", $0}} END{printf "\n"}' | gawk 'BEGIN{OFS="\n"}length($2) < 201{print $1,$2}' | formatdb -p T -n nr_smaller_than_200aa -i stdin

It first uses fastacmd to dump the BLAST database in FASTA format (if you already have it in FASTA, you can skip this step). The first gawk command converts the FASTA records to tab-delimited (tbl) format, one record per line. The second gawk filters by length (<201 aa) and writes the records out in FASTA format again. Finally, formatdb converts the remaining sequences (<=200 aa) into a new database named "nr_smaller_than_200aa".
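For the case where you already have the sequences in FASTA (for example the nr.gz download), the length filter can be written as a single awk pass. This is only a sketch: the function name filter_fasta_by_length and the file names in the comments are made up for illustration, and the makeblastdb and blastdbcmd lines assume you have BLAST+ installed instead of the legacy formatdb/fastacmd tools.

```shell
# Keep only FASTA records whose sequence has at most MAXLEN residues.
# Usage: filter_fasta_by_length MAXLEN < in.fasta > out.fasta
# (filter_fasta_by_length is a hypothetical helper name, not part of BLAST.)
filter_fasta_by_length() {
  awk -v max="$1" '
    # Print the buffered record if it is non-empty and short enough.
    function emit() {
      if (header != "" && length(seq) > 0 && length(seq) <= max)
        print header "\n" seq
    }
    # A header line closes the previous record and starts a new one.
    /^>/ { emit(); header = $0; seq = ""; next }
    # Sequence lines are joined, so wrapped sequences are measured whole.
    { gsub(/[ \t\r]/, ""); seq = seq $0 }
    END { emit() }
  '
}

# Example use (file names are placeholders, assuming BLAST+ is installed):
#   gzip -dc nr.gz | filter_fasta_by_length 200 > nr_le200.fasta
#   makeblastdb -in nr_le200.fasta -dbtype prot -out nr_le200
# Or, starting from a preformatted database instead of nr.gz:
#   blastdbcmd -db nr -entry all | filter_fasta_by_length 200 > nr_le200.fasta
```

Unlike the pipeline above, this keeps the full header line and writes each sequence on a single line, which BLAST database builders accept.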

score 0 · Answer 3 · 2012-03-20

You can also blast against just NCBI's short nr proteins by providing the entrez query '1:200[slen]' as a filter on the blast web page.

Or, if you prefer to run from the command line and don't want to download any FASTA databases, then, assuming you've installed BLAST+ from NCBI, you can add these options to your blast command:

   -db nr -remote -entrez_query '1:200[slen]'
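Put together, a full remote search might look like the sketch below. The wrapper name blast_short_nr and the query file name are placeholders I made up, and the choice of blastp with tabular output (-outfmt 6) is an assumption, so adjust it to the program and format you actually need.

```shell
# Remote blastp against nr, restricted to subjects of 1..MAXLEN residues.
# Usage: blast_short_nr QUERY.fasta [MAXLEN]   (MAXLEN defaults to 200)
# (blast_short_nr is a hypothetical wrapper; the blastp flags are BLAST+ options.)
blast_short_nr() {
  blastp -query "$1" \
         -db nr -remote \
         -entrez_query "1:${2:-200}[slen]" \
         -outfmt 6
}

# Example use (query file name is a placeholder):
#   blast_short_nr my_small_protein.fasta > hits.tsv
```

Note that -entrez_query only works together with -remote, so the search runs on NCBI's servers and no local nr copy is needed.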