For example, I have two proteins from two different families. To estimate if some species have them both I would run two rounds of blast and see.
But what If I have not only one species but three hundred, how to automize this search? As output, I want to get a list of species and for each species information about the existence of each protein in this species. I work with bacteria and this co-occurrence can be a signal of interactions between these proteins. Additionally, I don't want to see co-occurrence of these proteins in every bacteria, I want to estimate what percent of bacteria that have the first protein has the second protein.
Sorry for the simple questions, I'm new in this field, and sometimes it's very hard to find the point to start. Thanks in advance)
Are you doing this with code or with the web? Probably want to do something like:
1) Do blastp (query A and query B vs database of the proteomes of the 300 species). You will get a set of hits from the BLASTP, with one or more proteins from each strain. Each hit protein has a hsp whose e-val/bit score exceeds the threshold settings you set when you run the program. "what bacteria have this protein" can be subjective depending on how you're defining a homolog - you will be defining it using these settings.
2) At this point you will probably want to consolidate your results to get the best hit for each strain. So if you consider 10 hits for a single query, five to strain A (A-protein #1 to A-protein #5) and five to strain B (B-protein #1 to B-protein #5), where A and B are two of the 300 strains, you want two output sequences (best A protein and best B protein) with respect to the single query. The number of output sequences here = number of strains mentioned below.
3) query with max hits (max_query) can be considered as the reference you're choosing to compare against for co-occurance, unless you have a specific protein in mind for this. So then % co-occurance = number of strains (max_query) / number of strains (other query) x 100.
This not something you'd be able to do with the web-based
BLASTinterface. You will have to run things locally. And you'll need to do some post-processing in
pythonto get this done.
The gist of what I'm suggesting: put all your query proteins in a single file. Then use a
for loopor something to search with these against each target bacterial proteome. Ensure you get the output as a table with the query's header, target header, and e-value at the bare minimum (I think the standard
BLASToutput should provide all of this). You'd have 300 tables, one for each species. Just concatenate all the tables together using
Then import this concatenated result table into
python. You might have to do a bit of data munging to get another column which contains the species identifier separately (use regular expressions to get this). What you'll have to do then is group the tables by species and query protein, sort those groupings by e-value, and filter to retain only the best e-value for each grouping. This will give you the candidate homolog for each query protein in each bacterial species (if it exists).
Then what you need to do is split the table into two: one for each query protein. Thereafter you could simply take the intersection of these two new tables (or more precisely the intersection of the species identifier columns from these), and that should tell you which species have homologs for both proteins.
Also I really, really recommend NOT using
BLAST. If these are all protein sequences, please use
DIAMOND. If you're using nucleotide sequences, I'd recommend using
MMseqs2at high sensitivity. Both tools I've mentioned are faster (but comparably accurate) alternatives to
Ok, I'll try. Do these applications have the only advantage in speed or are there any other pros?
MMseqs2can do a ton of other sequence-related stuff, if that interests you (e.g., sequence clustering). Not much else in terms of pros that I can think of.