LCA from BLAST output
0
1
Entering edit mode
2.2 years ago
makrez ▴ 50

I have a fairly large BLAST output from a metagenomics project with millions of lines per sample. The input sequences for the blast search were prodigal annotated proteins. I have been looking for a while now to find a suitable program that calculates the LCA of the BLAST outputs of the sequences.

I have looked into the following softwares:

• MEGAN6: This is based on a UI (there is a command line interface, which is only available for certain editions). I am very hesitant to ever use any GUI, so this is not optimal.

• BASTA: In principle, this tool would be suitable. However, the implementation is not ideal, since it doesn't allow for parallel processing of files (it locks the database folder) and it is incredibly slow when running it on larger files.

What alternatives are out there? I am debating whether it's worth writing everything from scratch, but usually it isn't.

Any inputs would be highly appreciated.

lca blast • 1.1k views
0
Entering edit mode

What did you use as your reference database? It's really a rather simple process, like just a few LOC. This is assuming that you have the lineage information of your hits at hand..

0
Entering edit mode

I used the nr diamond database (which is essentially a modified NCBI database). Thus I have the NCBI TaxIDs available in the second column of the output.

0
Entering edit mode

So essentially all you need to do then is to decide the threshold for your hits, join your blast output with lineage information (https://ftp.ncbi.nlm.nih.gov/pub/taxonomy/new_taxdump/), and come up with a good awk oneliner..

Edit. for relatively simple things like this, I recommend doing it yourself vs. using some black(ish) box and not fully understanding what is happening. For example, if you were to use Megan, it could just skip many of your hits because they're not in its txid - taxinfo map file. Of course it wouldn't even tell you that something like that occurred (at least this used to be the case many years ago)

0
Entering edit mode

Hi @5heikki I realize this post is rather old. However I am really interested in this last comment from you. I wonder if you have already written the codes which would do as you describe and whether you would be willing to share it? i.e. identify LCA from a list of blast results. I am not good enough at coding to write such code from scratch but if I could see an example I might be able to modify it to fit my specific data.

1
Entering edit mode

Here's an example:

A list of NCBI tax ids:

$cat list 138119 1507783 1602889 1856264 221807 527029 529122 537377 696750 712173  The lineages of these tax ids: $ join -1 1 -2 1 <(sort list) <(sort -k1,1 fullnamelineage.dmp) | awk 'BEGIN{FS=" \\| "}{print $3}' cellular organisms; Bacteria; Terrabacteria group; Firmicutes; Clostridia; Eubacteriales; Peptococcaceae; Desulfitobacterium; Desulfitobacterium hafniense; | cellular organisms; Bacteria; Terrabacteria group; Firmicutes; Bacilli; Bacillales; Bacillaceae; Bacillus; unclassified Bacillus (in: Bacteria); | cellular organisms; Bacteria; Terrabacteria group; Firmicutes; Bacilli; Bacillales; Bacillaceae; Bacillus; unclassified Bacillus (in: Bacteria); | cellular organisms; Bacteria; Terrabacteria group; Firmicutes; Bacilli; Bacillales; Bacillaceae; Oceanobacillus; unclassified Oceanobacillus; | cellular organisms; Bacteria; Terrabacteria group; Firmicutes; Bacilli; Bacillales; Bacillaceae; Bacillus; unclassified Bacillus (in: Bacteria); | cellular organisms; Bacteria; Terrabacteria group; Firmicutes; Bacilli; Bacillales; Bacillaceae; Bacillus; Bacillus cereus group; Bacillus thuringiensis; Bacillus thuringiensis serovar pondicheriensis; | cellular organisms; Bacteria; Terrabacteria group; Firmicutes; Bacilli; Bacillales; Bacillaceae; Bacillus; Bacillus cereus group; Bacillus thuringiensis; | cellular organisms; Bacteria; Terrabacteria group; Firmicutes; Clostridia; Eubacteriales; unclassified Eubacteriales; | cellular organisms; Bacteria; Terrabacteria group; Firmicutes; Bacilli; Lactobacillales; Streptococcaceae; Streptococcus; Streptococcus dysgalactiae group; Streptococcus dysgalactiae; Streptococcus dysgalactiae subsp. equisimilis; | cellular organisms; Bacteria; Terrabacteria group; Firmicutes; Bacilli; Bacillales; Bacillaceae; Bacillus; unclassified Bacillus (in: Bacteria); |  A simple loop which stops after there's no more common "column": $ for i in $(seq 1 10); do join -1 1 -2 1 <(sort list) <(sort -k1,1 fullnamelineage.dmp) | awk 'BEGIN{FS=" \\| "}{print$3}' | cut -f"$i" -d ";" | sort -u>tmpf; if [$(wc -l < tmpf) = "1" ]; then awk '{printf "%s ;", \$0}' tmpf; else break; fi; done
cellular organisms ; Bacteria ; Terrabacteria group ; Firmicutes ;


So here the last common ancestor would be an ancestral Firmucutes bacterium. This is one way to do it. I'm sure that there are more sophisticated methods available. Note that a single horizontal gene transfer event (or e.g. including a tax id from a blast hit which shouldn't be included) to some unrelated bacterial group would take the ancestor down to Bacteria..