Hello, I need to build a database containing all protein sequences that belong to the same Pfam family. I used blastp to retrieve them, but I would like to know whether this can be done in one step, instead of the endless job of downloading every sequence that aligns to the query. Thank you very much in advance.
If you ran the BLAST search on the NCBI site, there is an option to download all matching sequences (in a variety of formats): scroll down to the Descriptions section of the results page, select any (or all) hits, then click the Download button and choose the format you need.
Thank you! The problem is that the sequences matching my query are not all of the sequences belonging to the family. I mean, if I pick a different query and BLAST it, around 100 new sequences appear alongside those already obtained with the previous query. I do not know how to include ONLY the new ones and skip those already retrieved by the first search.
Are you referring to PSI-BLAST by any chance? If so, you may want to try DELTA-BLAST. That can save you time and also help avoid false positives.
No, it is just the default blastp. When I submit a biochemically characterized protein sequence, it returns a total of 6500 sequences, and when I submit a different, also characterized, sequence it returns 6700. So if I merge the 6500 + 6700 hits, there are a lot of duplicates that I need to remove.
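For the duplicates, one possible approach is to download the hits of each search as FASTA and merge them while keeping only the first record seen for each identifier. Below is a minimal sketch in plain Python. It assumes the first whitespace-delimited token of each header is the accession and that identical accessions mean identical sequences. The function names (`read_fasta`, `merge_unique`, `to_fasta`) and the toy records are made up for illustration, not taken from any library.

```python
from typing import Dict, Iterable, Iterator, Tuple


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from FASTA-formatted lines."""
    header = None
    chunks = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new record starts: emit the previous one, if any.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def merge_unique(*sources: Iterable[str]) -> Dict[str, Tuple[str, str]]:
    """Merge FASTA sources, keeping the first record seen for each ID.

    Assumption: the ID is the first token of the header line.
    """
    merged: Dict[str, Tuple[str, str]] = {}
    for source in sources:
        for header, seq in read_fasta(source):
            parts = header.split()
            if not parts:
                continue  # skip records with an empty header
            seq_id = parts[0]
            if seq_id not in merged:
                merged[seq_id] = (header, seq)
    return merged


def to_fasta(records: Dict[str, Tuple[str, str]], width: int = 60) -> str:
    """Format merged records as FASTA text, wrapping sequences at `width`."""
    out = []
    for header, seq in records.values():
        out.append(">" + header)
        out.extend(seq[i:i + width] for i in range(0, len(seq), width))
    return "\n".join(out)


# Toy data standing in for the two downloaded hit sets (hypothetical records).
hits_query1 = [">P1 enzyme A", "MKT", "AAL", ">P2 enzyme B", "MQQ"]
hits_query2 = [">P2 enzyme B", "MQQ", ">P3 enzyme C", "MLL"]

merged = merge_unique(hits_query1, hits_query2)

# IDs that appear only in the second search, i.e. the "new" sequences.
first_ids = merge_unique(hits_query1)
new_only = [seq_id for seq_id in merged if seq_id not in first_ids]

print(to_fasta(merged))
print("new in second search:", new_only)
```

With real downloads, open file handles can be passed in place of the lists, since `merge_unique` only needs something that iterates over lines. If the same protein can appear under different accessions, deduplicating on the sequence string instead of the ID may be the better choice.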