How to BLAST more than 5 sequences against UniProt database?
8 months ago
Riq ▴ 50

I have more than 100 sequences that I want to BLAST against the UniProt database. However, the web version of BLAST (https://www.uniprot.org/blast) limits the number of sequences to 5. Is there a programmatic way to BLAST against UniProt, or another approach to BLAST many sequences at once?

BLAST Protein UniProt • 1.4k views

If you are willing to forgo the TrEMBL part, you could search the Swiss-Prot database with NCBI's protein web BLAST.


Thanks! This is a fast way to check whether the manually annotated entries are enough as a database.

8 months ago

For any number of query sequences, if you have room on your hard drive and are willing to use the command line, you can download the UniRef FASTA files from https://ftp.uniprot.org/pub/databases/uniprot/current_release/uniref, decompress them, format them with makeblastdb and run BLASTP/BLASTX against them with something like:

makeblastdb -in uniref50.fasta -dbtype prot
blastp -query input.fasta -db /path/to/uniref50.fasta

Note that UniRef sets contain clusters of UniProt and UniParc sequences; read more at https://www.uniprot.org/help/uniref. They are available at 100%, 90% or 50% sequence identity cutoffs, and the corresponding compressed FASTA files take 99 GB, 43 GB and 12 GB respectively.
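
As a rough sketch of the steps above for a batch of queries: decompress the download, build the database once, then search every sequence in input.fasta in a single blastp run with tabular output. The function names, the output file name and the E-value, hit-count and thread settings are illustrative assumptions, not values from this thread.

```shell
# Decompress a downloaded UniRef file and index it as a protein database.
# $1 = path to e.g. uniref50.fasta.gz (the .fasta is written next to it)
build_db() {
  gunzip -c "$1" > "${1%.gz}"
  makeblastdb -in "${1%.gz}" -dbtype prot
}

# Search all sequences of a multi-FASTA query file in one blastp run.
# $1 = query FASTA, $2 = database path, $3 = output file
# -outfmt 6 writes one tab-separated line per hit; the E-value cutoff,
# number of reported targets and thread count are example settings only.
search_all() {
  blastp -query "$1" -db "$2" \
    -outfmt 6 \
    -evalue 1e-5 \
    -max_target_seqs 5 \
    -num_threads 4 \
    -out "$3"
}
```

Usage would look like `build_db uniref50.fasta.gz` followed by `search_all input.fasta uniref50.fasta hits.tsv`. With `-outfmt 6` the default columns are query ID, subject ID, percent identity, alignment length, mismatches, gap openings, query start/end, subject start/end, E-value and bit score.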


@b.contreras.moreira The uncompressed UniRef100 FASTA file is 943 GB, which is far beyond my local storage, and the uncompressed UniRef50 FASTA file is 24.4 GB. I realized that my sequences are all supposed to be human proteins, so I downloaded the human proteome (Swiss-Prot + TrEMBL) from UniProt and ran BLAST locally, which is a more efficient approach.
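
For anyone wanting to script this, here is a minimal sketch of one way to do it. It assumes the UniProt REST "stream" endpoint and the `organism_id:9606` query for human; both are my assumptions and worth checking against the UniProt API help before relying on them. The function names and file names are hypothetical.

```shell
# Download all human UniProtKB entries (Swiss-Prot + TrEMBL) as FASTA.
# $1 = output FASTA path
# Assumed endpoint and query syntax; verify against the UniProt API docs.
fetch_human_fasta() {
  curl -fsS -G "https://rest.uniprot.org/uniprotkb/stream" \
    --data-urlencode "query=organism_id:9606" \
    --data-urlencode "format=fasta" \
    -o "$1"
}

# Index the downloaded proteome and search the queries against it.
# $1 = human FASTA, $2 = query FASTA, $3 = tabular output file
blast_against_human() {
  makeblastdb -in "$1" -dbtype prot
  blastp -query "$2" -db "$1" -outfmt 6 -out "$3"
}
```

For example: `fetch_human_fasta human.fasta` and then `blast_against_human human.fasta input.fasta hits.tsv`.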


Thanks, this is very helpful. Other than faster searches, will there be a difference in the final output (E-value, percent identity, query cover) if UniRef50 is used instead of UniRef100?


Each UniRef50 cluster groups sequences that share at least 50% sequence identity with the cluster's representative (seed) sequence, so it brings together a broader range of proteins, including more distant homologs. UniRef100 clusters combine only identical sequences (and their subfragments) into a single entry, so they are much narrower. Hence, the output will change depending on which UniRef set you search.


Will there be a difference in the final output (E-value, percent identity, query cover) if UniRef50 is used instead of UniRef100

Yes, since the database content will differ, as shown by the size differences noted above. BLAST depends on the database contents to generate its statistics and alignments.
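
To illustrate the statistics part: for a given bit score S', the expected number of chance hits is roughly E = m × n × 2^(−S'), where m is the query length and n the total number of residues in the database. BLAST actually uses length-corrected ("effective") values, so treat this as an approximation. The helper name and the numbers below are made up for illustration.

```shell
# Approximate E-value for a bit score, ignoring effective-length corrections.
# $1 = bit score, $2 = query length, $3 = total residues in the database
evalue() {
  awk -v s="$1" -v m="$2" -v n="$3" \
    'BEGIN { printf "%.3g\n", m * n * 2^(-s) }'
}

# Same alignment (bit score 60, 300-residue query) against two
# hypothetical database sizes that differ by a factor of ten.
evalue 60 300 1e9
evalue 60 300 1e10
```

The second E-value comes out ten times larger than the first even though the alignment is identical, which is why the same hit can look less significant in a bigger database.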

8 months ago

If you cannot use local BLAST as suggested above (which may be an excellent alternative, be it with UniProtKB, UniRef or just UniProtKB/Swiss-Prot), UniProt recommends programmatic access as described in this EBI help page: https://ebi-biows.gitdocs.ebi.ac.uk/documentation/webservices/

Go to the "Sequence similarity search" section and select NCBI BLAST+.
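
A minimal sketch of what that programmatic access can look like with curl, assuming the REST layout I remember for the EBI NCBI BLAST+ service (`/run`, `/status/<job ID>`, `/result/<job ID>/<type>`), the `uniprotkb_swissprot` database name and the `out` result type. All of these, along with the status strings and helper names, are assumptions to verify against the EBI documentation linked above.

```shell
# Assumed base URL of the EBI NCBI BLAST+ REST service.
EBI_BASE="https://www.ebi.ac.uk/Tools/services/rest/ncbiblast"

# Submit one protein sequence; prints the job ID returned by the service.
# $1 = contact e-mail address, $2 = FASTA file holding a single sequence
submit_job() {
  curl -fsS -X POST "$EBI_BASE/run" \
    --data-urlencode "email=$1" \
    --data-urlencode "program=blastp" \
    --data-urlencode "stype=protein" \
    --data-urlencode "database=uniprotkb_swissprot" \
    --data-urlencode "sequence@$2"
}

# Poll until the job is no longer queued or running; prints the last status.
# $1 = job ID; POLL_SECONDS (default 10) sets the wait between checks.
wait_for_job() {
  while :; do
    job_status=$(curl -fsS "$EBI_BASE/status/$1")
    case "$job_status" in
      RUNNING|QUEUED|PENDING) sleep "${POLL_SECONDS:-10}" ;;
      *) printf '%s\n' "$job_status"; return ;;
    esac
  done
}

# Print the plain-text BLAST report of a finished job.
# $1 = job ID
fetch_result() {
  curl -fsS "$EBI_BASE/result/$1/out"
}
```

As far as I know the service takes one sequence per job, so for 100+ queries you would split the input into one FASTA file per sequence and loop, e.g. `id=$(submit_job you@example.org seq1.fasta); wait_for_job "$id"; fetch_result "$id" > seq1.blast.txt`, while staying within the fair-use limits EBI describes.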
