How to efficiently download all protein sequences (FASTA files) on NCBI that contain a specific short amino acid sequence, such as AKIAE?
4.2 years ago
taojincs ▴ 50

How can I find and download all the protein sequences (FASTA files) on NCBI that contain a specific short amino acid sequence or motif, such as AKIAE? How can this be done efficiently?

I tried http://research.bioinformatics.udel.edu/peptidematch/index.jsp and http://www.genome.jp/tools/motif/MOTIF2.html, but they returned too few sequences. How can I find as many matching sequences as possible on NCBI? Thanks.

NCBI Protein • 2.8k views
4.2 years ago
Bioaln ▴ 350

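1.) Download the protein sequences (ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/complete/uniprot_sprot.fasta.gz).

2.) grep the motif in each file. There are many options here. For separate files you can use, e.g., a simple bash for loop: for j in *; do grep 'motif' "$j"; echo "$j"; done. Alternatively, use a FASTA parser in, e.g., Python. Note that sequences in FASTA files are usually wrapped over several lines, so a line-by-line grep misses any motif that spans a line break; a parser that joins the lines of each record first does not have this problem.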
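The FASTA parser option could look roughly like the sketch below. It uses only the Python standard library and assumes the file from step 1 has already been downloaded. The function names `read_fasta` and `filter_by_motif` and the output file name are made up for illustration. Each record's sequence lines are joined before the search, so a motif that spans a line break is still found.

```python
import gzip


def read_fasta(handle):
    """Yield (header, sequence) pairs from an open text handle."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new record starts: emit the previous one, if any.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def filter_by_motif(in_path, out_path, motif, width=60):
    """Write every record whose sequence contains `motif`; return the hit count."""
    # Read gzip-compressed input transparently, based on the file extension.
    opener = gzip.open if in_path.endswith(".gz") else open
    hits = 0
    with opener(in_path, "rt") as src, open(out_path, "w") as dst:
        for header, seq in read_fasta(src):
            if motif in seq:  # exact substring match on the joined sequence
                dst.write(f">{header}\n")
                for i in range(0, len(seq), width):
                    dst.write(seq[i:i + width] + "\n")
                hits += 1
    return hits
```

For example, `filter_by_motif("uniprot_sprot.fasta.gz", "akiae_hits.fasta", "AKIAE")` would write all matching Swiss-Prot entries to `akiae_hits.fasta`. This is an exact-match search only; a degenerate motif would need a regular expression instead of the `in` test.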

4.2 years ago

You can also try UniProt's peptide search service at http://www.uniprot.org/peptidesearch, although it uses the same underlying service as the interface you tried at U Delaware.

If you have any sequences that are returned at NCBI but not by the UniProt service, it would be interesting to see examples. Databases have different scopes, and different redundancy criteria, and in my opinion sheer numbers should not be the only way to assess completeness or significance of your results.

Please feel free to contact the UniProt helpdesk with examples that you can find at NCBI but not in UniProt.