I have a fairly large BLAST output from a metagenomics project with millions of lines per sample. The input sequences for the blast search were prodigal annotated proteins. I have been looking for a while now to find a suitable program that calculates the LCA of the BLAST outputs of the sequences.
I have looked into the following softwares:
MEGAN6: This is based on a UI (there is a command line interface, which is only available for certain editions). I am very hesitant to ever use any GUI, so this is not optimal.
BASTA: In principle, this tool would be suitable. However, the implementation is not ideal, since it doesn't allow for parallel processing of files (it locks the database folder) and it is incredibly slow when running it on larger files.
What alternatives are out there? I am debating whether it's worth writing everything from scratch, but usually it isn't.
Any inputs would be highly appreciated.
What did you use as your reference database? It's really a rather simple process, like just a few LOC. This is assuming that you have the lineage information of your hits at hand..
I used the
nr
diamond database (which is essentially a modified NCBI database). Thus I have the NCBI TaxIDs available in the second column of the output.So essentially all you need to do then is to decide the threshold for your hits, join your blast output with lineage information (https://ftp.ncbi.nlm.nih.gov/pub/taxonomy/new_taxdump/), and come up with a good awk oneliner..
Edit. for relatively simple things like this, I recommend doing it yourself vs. using some black(ish) box and not fully understanding what is happening. For example, if you were to use Megan, it could just skip many of your hits because they're not in its txid - taxinfo map file. Of course it wouldn't even tell you that something like that occurred (at least this used to be the case many years ago)
Hi @5heikki I realize this post is rather old. However I am really interested in this last comment from you. I wonder if you have already written the codes which would do as you describe and whether you would be willing to share it? i.e. identify LCA from a list of blast results. I am not good enough at coding to write such code from scratch but if I could see an example I might be able to modify it to fit my specific data.
Here's an example:
A list of NCBI tax ids:
The lineages of these tax ids:
A simple loop which stops after there's no more common "column":
So here the last common ancestor would be an ancestral Firmucutes bacterium. This is one way to do it. I'm sure that there are more sophisticated methods available. Note that a single horizontal gene transfer event (or e.g. including a tax id from a blast hit which shouldn't be included) to some unrelated bacterial group would take the ancestor down to Bacteria..