Question: Classify or Align?
Manu Prestat (3.9k), Marseille, France, wrote 7.3 years ago:

Hi, I work with a group of microbiologists with whom I conduct metagenomic analyses. Not long ago, I did my PhD thesis on Bayesian network classification (a supervised classification approach). When I moved on to more strictly bioinformatic tasks in my first postdoc, that experience left me inclined to use classification approaches whenever possible, for instance when I want to screen my metagenome sequences in silico.

So, when coworkers ask me for a 16S rRNA (rrs gene) analysis, I propose the RDP Classifier (based on a naive Bayes method), or HMMER for functional screening. But most of the time they don't like these options and prefer BLAST. It is slower but, according to them, the results are more accurate and, most importantly, closer to what they expect. I try to explain the differences between the two approaches (essentially, classifiers are based on learnt profiles, which should be more relevant when the goal is to classify), but obviously without success.

Given my lack of arguments, and before investigating this "issue" more deeply myself, I would like to have your opinion on it. Thanks a lot.

Istvan Albert ♦♦ (80k), University Park, USA, wrote 7.2 years ago:

I have been in a similar position in the past, and after some arguments back and forth I chose to give people what they prefer, even if that may not have been my primary choice.

I personally believe that BLAST-based methods are less accurate, especially since a second (usually very simplistic) algorithm needs to be applied to resolve matches to different taxonomic levels.
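To make that "second algorithm" concrete, here is a minimal sketch, assuming a lowest-common-ancestor rule applied to the lineages of the top BLAST hits. This is only the general idea behind LCA-style assignment, not the exact procedure of any particular tool. The function name and the hard-coded lineages are illustrative assumptions; in practice the lineages would come from a taxonomy lookup on the hit accessions.

```python
def lowest_common_ancestor(lineages):
    """Return the longest prefix shared by all root-to-leaf lineages."""
    shared = []
    # zip(*lineages) walks the lineages rank by rank, stopping at the shortest one.
    for ranks in zip(*lineages):
        if all(rank == ranks[0] for rank in ranks):
            shared.append(ranks[0])
        else:
            break
    return shared


# Hypothetical lineages of three top hits for one read (illustrative only).
hits = [
    ["Bacteria", "Proteobacteria", "Gammaproteobacteria", "Enterobacteriaceae", "Escherichia"],
    ["Bacteria", "Proteobacteria", "Gammaproteobacteria", "Enterobacteriaceae", "Salmonella"],
    ["Bacteria", "Proteobacteria", "Gammaproteobacteria", "Enterobacteriaceae", "Escherichia"],
]

assignment = lowest_common_ancestor(hits)
print(" > ".join(assignment))
```

Because the hits disagree at the genus level, the read is assigned at the family level. Note that such a rule always returns some assignment, however weak the underlying hits are, unless extra score or identity thresholds are added on top.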

Fundamentally, a BLAST alignment does not account for the information content of the bases that it aligns over. A conserved region carries a lot less information about how the sequence should be classified than a variable region. Finally, a classifier that parses BLAST results will happily classify sequences even when it shouldn't; this leads to fewer unknowns, and it is what makes the method look better.
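One way to picture the point about conserved versus variable regions is to measure how much each alignment column varies across reference taxa. The sketch below assumes a toy set of pre-aligned reference sequences (made up for illustration) and uses per-column Shannon entropy as a rough proxy: a column with zero entropy is identical in every reference, so a match there cannot help to tell taxa apart.

```python
import math
from collections import Counter


def column_entropy(column):
    """Shannon entropy (in bits) of the bases observed in one alignment column."""
    counts = Counter(column)
    total = sum(counts.values())
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


# Toy, already aligned reference sequences, one per taxon (illustrative only).
references = ["ACGTAC", "ACGTTC", "ACGAGC", "ACGCCC"]

# zip(*references) yields the alignment column by column.
entropies = [column_entropy(col) for col in zip(*references)]
for position, h in enumerate(entropies, start=1):
    label = "conserved" if h == 0 else "variable"
    print(f"column {position}: {h:.2f} bits ({label})")
```

A plain alignment score treats a match in column 1 exactly like a match in column 5, although only the latter discriminates between these references.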

I recommend that you use both and compare the results (I have often seen entire families missing from one method vs the other).

As a parting piece of advice: by and large, I have found that the quality of metagenomics data (or of the accompanying experimental design) has a far more profound effect on the quality of results than the choice of analysis method.


My current behavior agrees with your first comment (I don't want to waste time fighting to help people in a way they don't like)! Thanks for your answer.

Reply by Manu Prestat (3.9k), written 7.2 years ago
Asker (40) wrote 7.2 years ago:


As far as 16S classification goes, it is my impression that the microbiologists are right. The naive Bayesian approach is used because it is extremely fast; it is nothing more than a bunch of word counts multiplied together. But note that RDP does not try to classify below the genus level, probably because classifications get very inaccurate with as little training data as is available at the species and strain levels.
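The "word counts multiplied" idea can be sketched in a few lines. This is a simplified naive Bayes classifier over k-mer presence, not the RDP Classifier's actual implementation: the word length (k = 3), the smoothing constants, the function names, and the toy training sequences are all assumptions chosen for illustration. Log-probabilities are summed rather than multiplying raw probabilities, which is the same computation but avoids numerical underflow.

```python
import math
from collections import defaultdict


def kmers(seq, k):
    """Set of distinct words of length k occurring in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def train(labelled_seqs, k):
    """Count, per taxon, how many training sequences contain each word."""
    totals = defaultdict(int)
    counts = defaultdict(lambda: defaultdict(int))
    for taxon, seq in labelled_seqs:
        totals[taxon] += 1
        for word in kmers(seq, k):
            counts[taxon][word] += 1
    return counts, totals


def classify(seq, model, k):
    """Return the best-scoring taxon and all per-taxon log scores."""
    counts, totals = model
    scores = {}
    for taxon, n in totals.items():
        # Smoothed probability that a sequence of this taxon contains the word.
        scores[taxon] = sum(
            math.log((counts[taxon].get(word, 0) + 0.5) / (n + 1))
            for word in kmers(seq, k)
        )
    return max(scores, key=scores.get), scores


# Toy training set (illustrative sequences, hypothetical genus names).
training = [
    ("GenusA", "ACGTACGTAC"), ("GenusA", "ACGTACGTTC"),
    ("GenusB", "TTGGCCAATT"), ("GenusB", "TTGGCCAAGT"),
]
model = train(training, k=3)
best, scores = classify("ACGTACG", model, k=3)
print(best, scores)
```

With only a handful of training sequences per taxon, the smoothed probabilities are dominated by the smoothing terms, which is one way to see why accuracy would degrade at ranks with little training data.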

However, you can retrain the classifier on your own data. It is my feeling that the BLAST classifications will be more accurate, but if you actually decide to try this out I would love to hear about the results.

On a completely different note, their SeqMatch software uses 1-NN classification on word vectors. I have never seen anyone use it, but it seems like a nice idea at least.
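For readers unfamiliar with the idea, 1-NN classification on word vectors can be sketched as follows. This is not SeqMatch's code: the word length, the choice of cosine similarity, the function names, and the reference sequences are assumptions made for illustration. Each sequence becomes a vector of k-mer counts, and the query takes the label of the most similar reference.

```python
import math
from collections import Counter


def word_vector(seq, k):
    """Vector of k-mer counts, stored sparsely as a Counter."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def nearest_neighbour(query, references, k=3):
    """Return (similarity, label) of the reference closest to the query."""
    q = word_vector(query, k)
    return max((cosine(q, word_vector(seq, k)), label) for label, seq in references)


# Toy reference set (illustrative sequences, hypothetical genus names).
references = [("GenusA", "ACGTACGTAC"), ("GenusB", "TTGGCCAATT")]
similarity, label = nearest_neighbour("ACGTACGTTC", references)
print(label, round(similarity, 3))
```

Unlike the naive Bayes approach, nothing is learnt per taxon here; the returned similarity can be thresholded to leave poorly matching queries unclassified.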


Thanks for your comment.

Reply by Manu Prestat (3.9k), written 7.2 years ago
lagartija (60) wrote 3 months ago:

Hi, I'm bringing up this discussion again because I have a similar problem and would like some advice. I am working on metagenomics, but this time with shotgun sequencing, so I cannot use 16S, and I'm wondering whether I should use BLAST or a dedicated classification tool. I understand that most people would agree that the only advantage of classifiers is that they are faster, but I came across this: "With regard to taxonomic identification, one of the most popular tools for analysing metagenomic data is MEGAN [129], software that originally used BLAST to infer taxonomic composition. However, BLAST searching does not represent the most appropriate method for metagenomic sequence assignment. This is because alignments are local and not global, and hit similarities provide a measure of the confidence in the local sequence similarity but not of the validity of the assignment per se"

Do you agree with that? This was actually meant for 16S, but is it also true for a shotgun dataset?

Thank you
