Question

Classify or Align?

4


12.3 years ago

Manu Prestat 4.1k

Hi, I work with a group of microbiologists with whom I conduct metagenomic analyses. Not so long ago, I did my PhD thesis on Bayesian network classification (a supervised classification approach). When I turned to more strictly bioinformatic tasks in my first postdoc, that experience made me favor classification approaches whenever possible, for instance when I want to screen my metagenome sequences in silico.

So, when coworkers ask me for a 16S rRNA (rrs gene) analysis, I propose the RDP Classifier (based on a naive Bayes method), or HMMER for a functional screening. But most of the time they don't like it and prefer BLAST. It's slower but, according to them, the results are more accurate and, most importantly, more in line with their expectations. I try to explain the differences between the two approaches (essentially, classifiers are based on learned profiles, which should be more relevant when the goal is to classify), but obviously without success.

Given my lack of arguments, and before investigating this "issue" more deeply myself, I would like to have your opinion. Thanks a lot.

classification alignment blast hmm metagenomics • 3.9k views

updated 5.3 years ago by lagartija ▴ 160 • written 12.3 years ago by Manu Prestat 4.1k

score 7 · Answer 1 · 2012-01-19

I have been in a similar position in the past, and after some arguments back and forth I chose to give people what they prefer, even if that may not have been my first choice.

I personally believe that BLAST-based methods are less accurate, especially since a second (usually very simplistic) algorithm needs to be applied to resolve matches to different taxonomic levels.
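To make that "second algorithm" concrete: one common choice is a lowest-common-ancestor rule over the top hits. The following is only a minimal sketch of that idea, assuming each BLAST hit has already been mapped to a lineage listed from domain downwards; the function name `lowest_common_ancestor` and the example hits are made up for illustration.

```python
def lowest_common_ancestor(lineages):
    """Return the longest prefix of ranks shared by every lineage."""
    if not lineages:
        return []
    shared = []
    # zip(*lineages) walks the lineages rank by rank, stopping at the shortest one.
    for ranks in zip(*lineages):
        if len(set(ranks)) != 1:
            break  # hits disagree at this rank, so stop here
        shared.append(ranks[0])
    return shared


# Hypothetical lineages of the three best hits for one read.
hits = [
    ["Bacteria", "Proteobacteria", "Gammaproteobacteria",
     "Enterobacterales", "Enterobacteriaceae", "Escherichia"],
    ["Bacteria", "Proteobacteria", "Gammaproteobacteria",
     "Enterobacterales", "Enterobacteriaceae", "Salmonella"],
    ["Bacteria", "Proteobacteria", "Gammaproteobacteria",
     "Pseudomonadales", "Pseudomonadaceae", "Pseudomonas"],
]

assignment = lowest_common_ancestor(hits)
print(" > ".join(assignment))
```

Note that every read with at least one hit gets some assignment here, however weak the hits are. Which hits are allowed into the list (score cutoff, top-percent filter) is a separate decision and strongly affects the outcome.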

Fundamentally, a BLAST alignment does not account for the information content of the bases it aligns over. A conserved region carries a lot less information about how the sequence should be classified than a variable region. Finally, a classifier that parses BLAST results will happily classify sequences even when it shouldn't; this leads to fewer unknowns, and that is what makes the method look better.
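One way to put a number on "information about how the sequence should be classified" is the mutual information between the base at an alignment column and the taxon label. This is a toy sketch under that assumption, not something BLAST or any particular classifier computes; the labels and columns are invented.

```python
import math
from collections import Counter


def mutual_information(bases, labels):
    """Mutual information in bits between the base at one column and the taxon label."""
    n = len(bases)
    base_counts = Counter(bases)
    label_counts = Counter(labels)
    joint_counts = Counter(zip(bases, labels))
    mi = 0.0
    for (base, label), count in joint_counts.items():
        # p(base, label) / (p(base) * p(label)) simplifies to count * n / (n_base * n_label)
        ratio = count * n / (base_counts[base] * label_counts[label])
        mi += (count / n) * math.log2(ratio)
    return mi


# Four hypothetical reference sequences, two per genus.
labels = ["genus_A", "genus_A", "genus_B", "genus_B"]

conserved_column = "GGGG"  # same base in every sequence
variable_column = "AATT"   # base differs between the two genera

print(mutual_information(conserved_column, labels))
print(mutual_information(variable_column, labels))
```

The conserved column scores zero bits and the variable one scores one bit, yet a match at either position contributes the same amount to a BLAST alignment score.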

I recommend that you use both and compare the results (I have often seen entire families missing from one method vs the other).

Finally, as parting advice: by and large, I have found the quality of metagenomics data (or of the accompanying experimental design) to have a far more profound effect on the quality of results than the choice of analysis method.

score 4 · Answer 2 · 2012-01-19

Hi,

As far as 16S classification goes, it is my impression that the microbiologists are right. The naive Bayesian approach is used because it is extremely fast: it is nothing more than a bunch of word counts multiplied. But note that the RDP Classifier does not try to classify below the genus level, probably because classifications get very inaccurate on as little training data as is available at the species and strain levels.
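To show what "word counts multiplied" means in practice, here is a deliberately simplified sketch of a word-based naive Bayes classifier. It is not the RDP implementation: the word length, the fixed 0.5 smoothing constant, the function names and the training sequences are all assumptions made for the example.

```python
import math
from collections import Counter


def word_set(seq, k):
    """All distinct overlapping words of length k in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def train(training, k):
    """training maps genus -> list of sequences; returns per-genus word counts."""
    model = {}
    for genus, seqs in training.items():
        counts = Counter()
        for seq in seqs:
            counts.update(word_set(seq, k))  # count each word once per sequence
        model[genus] = (counts, len(seqs))
    return model


def classify(model, query, k):
    """Pick the genus with the highest product of word probabilities (summed as logs)."""
    best_genus, best_score = None, -math.inf
    for genus, (counts, n_seqs) in model.items():
        score = sum(
            math.log((counts[word] + 0.5) / (n_seqs + 1))  # smoothed, never zero
            for word in word_set(query, k)
        )
        if score > best_score:
            best_genus, best_score = genus, score
    return best_genus


# Toy training set; real classifiers use longer words and thousands of sequences.
training = {
    "genus_A": ["ACGTACGTAC", "ACGTACGTTC"],
    "genus_B": ["TTGGCCAATT", "TTGGCCAAGT"],
}
model = train(training, k=3)
print(classify(model, "ACGTACGT", k=3))
```

With only two training sequences per genus the word probabilities are very coarse, which is the same problem that makes species- and strain-level calls unreliable.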

However, you can retrain the classifier on your own data. It is my feeling that the BLAST classifications will be more accurate, but if you actually decide to try this out, I would love to hear about the results.

On an unrelated note, RDP's SeqMatch software uses 1-NN classification on word vectors. I have never seen anyone use it, but it seems like a nice idea at least.
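For anyone unfamiliar with the idea, 1-NN on word vectors boils down to "assign the query to the reference that shares the most words with it". A rough sketch, assuming similarity is the fraction of shared words relative to the smaller word set; the actual SeqMatch scoring may differ, and the names and sequences below are invented.

```python
def words(seq, k):
    """All distinct overlapping words of length k in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def nearest_neighbour(query, references, k=4):
    """references maps name -> sequence; returns (best name, similarity in [0, 1])."""
    query_words = words(query, k)
    best_name, best_sim = None, -1.0
    for name, seq in references.items():
        ref_words = words(seq, k)
        if not query_words or not ref_words:
            sim = 0.0  # sequence shorter than k, nothing to compare
        else:
            shared = len(query_words & ref_words)
            sim = shared / min(len(query_words), len(ref_words))
        if sim > best_sim:
            best_name, best_sim = name, sim
    return best_name, best_sim


# Hypothetical reference sequences.
references = {
    "ref_A": "ACGTACGTAC",
    "ref_B": "TTGGCCAATT",
}
name, similarity = nearest_neighbour("ACGTACGA", references)
print(name, similarity)
```

Unlike the naive Bayes approach, nothing is learned per taxon: the label simply comes from the single closest reference, so the result depends entirely on how well the reference set covers your sample.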

score 0 · Answer 3 · 2019-01-10

Hi, I'm bringing up this discussion again because I have a similar problem and would like some advice. I am working on metagenomics, but this time with shotgun sequencing. Since I cannot use 16S, I'm wondering whether I should use BLAST or a dedicated classifier. I understand that most people would agree that the only advantage of classifiers is that they are faster, but I came across this: "With regard to taxonomic identification, one of the most popular tools for analysing metagenomic data is MEGAN [129], software that originally used BLAST to infer taxonomic composition. However, BLAST searching does not represent the most appropriate method for metagenomic sequence assignment. This is because alignments are local and not global, and hit similarities provide a measure of the confidence in the local sequence similarity but not of the validity of the assignment per se"

Do you agree with that? It was actually meant for 16S, but is it also true for a shotgun dataset?

Thank you