How much can you rely on local BLAST results for gene annotation in metagenomic samples?
5.6 years ago
kelvinfrog75

I have a contig file from the assembly of Illumina shotgun reads from soil samples. I ran a local blastx against virulence genes from different bacteria. After parsing the BLAST report with MEGAN, I saw that many "assigned" reads have a low percent identity, such as the one below with only 27%. So my question is: why is the E-value (Expect) so low when the percent identity is so low? Can I say the reads really are what they are assigned to, based only on the low E-value? Is there an option in MEGAN or BLAST to filter the reads by percent identity?

>VFG002283(gi:18309707) (nanI) exo-alpha-sialidase [sialidase (VF0391)] [Clostridium
perfringens str. 13]
Length=694

Score = 129 bits (323),  Expect = 6e-33, Method: Compositional matrix adjust.
Identities = 119/447 (27%), Positives = 188/447 (42%), Gaps = 108/447 (24%)
Frame = +2

+ + H G    + YRIP L  + +GTL+   D R +  AD     +D  + RS DGG+TW

Query  257   EPMRVAMTFGEEGGLPSAQNGVGDPAILVDKKTGTIWIVAA--------WTHGMG-----  397
+  ++ M + +       ++ V D  ++ D +TG I+++          W  G+G
Sbjct  312   DEGQIIMDYPD-------KSSVIDTTLIQDDETGRIFLLVTHFPSKYGFWNAGLGSGFKN  364

Query  398   -NGRAWFNSQDGMDKNHTAQ----------------------------------------  454
+G+ +    D   K  T +
Sbjct  365   IDGKEYLCLYDSSGKEFTVRENVVYDKDGNKTEYTTNALGDLFKNGTKIDNINSSTAPLK  424

Query  455   ------LVLAKSDDDGKTWSNPINITSQVKDPSWKFLLQGPGSGITMQDGT----LVFAT  604
+ L  SDDDGKTWS P NI  QVK    KFL   PG GI +++G     +V
Sbjct  425   AKGTSYINLVYSDDDGKTWSEPQNINFQVKKDWMKFLGIAPGRGIQIKNGEHKGRIVVPV  484

Query  605   QFIDSTRVPNAGIMYSKDHGKTW----------KMHNYARTNT----------TEAQVAE  724
+ +     ++ ++YS D GK W          K+ N    N+          TE QV E
Sbjct  485   YYTNEKGKQSSAVIYSDDSGKNWTIGESPNDNRKLENGKIINSKTLSDDAPQLTECQVVE  544

Query  725   VEPGVLMLNMRDNRGGSRAVSVTKDLGKTWTEHPSNRSVLQESVCMASLIKVEAKDNVLN  904
+  G L L MR N  G   ++ + D G TW E     + + E  C  S+I    K  +
Sbjct  545   MPNGQLKLFMR-NLSGYLNIATSFDGGATWDETVEKDTNVLEPYCQLSVINYSQK--IDG  601

Query  905   KGILLFSNPNTTKGRHSITIKASLDGGL-TFPN---------EYDVLLDEGHGWGYSCLT  1054
K  ++FSNPN  + R + T++  L   + T+ N         +Y+ L+  G+ + YSCLT
Sbjct  602   KDAVIFSNPN-ARSRSNGTVRIGLINQVGTYENGEPKYEFDWKYNKLVKPGY-YAYSCLT  659

Query  1055  MIDKETVGILYEGS-TAHMVFQAVKLK  1132
+    +G+LYEG+ +  M +  + LK
Sbjct  660   ELSNGNIGLLYEGTPSEEMSYIEMNLK  686


blast gene annotation
Percent identity as a stand-alone metric is not very informative. Your queries are not guaranteed to be full-length sequences, so keep that in mind as well. If that xx% identity includes a near-perfect match over a known domain or active site (see if you can find one for sialidase in the hit above), it would be meaningful; if it does not, it may be a chance match that just happens to be there.
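If you still want to filter by percent identity, one option is to write the BLAST results in tabular form (`-outfmt 6`) and filter on several columns at once instead of identity alone. Below is a minimal Python sketch of that idea. The column names are the default `-outfmt 6` fields; the thresholds, the second subject ID, the contig names, and the gap-open and mismatch counts in the example rows are placeholders for illustration, not recommendations or real data.

```python
import csv
import io

# Default column order of BLAST tabular output (-outfmt 6).
COLUMNS = ["qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
           "qstart", "qend", "sstart", "send", "evalue", "bitscore"]


def filter_hits(handle, min_pident=40.0, min_length=100, max_evalue=1e-10):
    """Yield hits (as dicts) that pass all three thresholds.

    The default thresholds are arbitrary placeholders; pick values that
    suit your own data and question.
    """
    for row in csv.reader(handle, delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and comment lines
        hit = dict(zip(COLUMNS, row))
        hit["pident"] = float(hit["pident"])
        hit["length"] = int(hit["length"])      # alignment length in residues
        hit["evalue"] = float(hit["evalue"])
        hit["bitscore"] = float(hit["bitscore"])
        if (hit["pident"] >= min_pident
                and hit["length"] >= min_length
                and hit["evalue"] <= max_evalue):
            yield hit


# Two made-up rows: the first mimics the nanI hit above (27% identity over
# 447 columns), the second is a hypothetical high-identity hit.
example = (
    "contig_1\tVFG002283\t26.62\t447\t220\t10\t257\t1132\t312\t686\t6e-33\t129\n"
    "contig_2\tVFG_hypothetical\t85.00\t200\t30\t0\t1\t600\t10\t209\t1e-90\t300\n"
)
kept = list(filter_hits(io.StringIO(example)))
print([h["qseqid"] for h in kept])
```

With the default thresholds only the second row passes; lowering `min_pident` to 25 keeps both. Combining identity with alignment length and E-value is the point here, since a short high-identity match and a long low-identity match mean different things.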

5.6 years ago
Asaf

The result makes sense: for a lot of proteins from environmental samples, the best you can get is a partial, low-identity alignment. Interpreting the alignment is not straightforward, though. You can use other tools, such as domain prediction with InterProScan, structure prediction, etc. It also depends on your motivation.
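On why the E-value can be so low at 27% identity: the E-value is derived from the alignment score and the size of the search space, not from percent identity. A long alignment (447 columns here, with 42% positives) can accumulate a high score even when identity is modest. As a rough sketch, assuming the standard Karlin-Altschul relation E = (search space) * 2^(-bit score), and treating the rounded bit score from the report as exact:

```python
def expected_hits(bit_score, search_space):
    """E-value from a normalized bit score: E = search_space * 2**(-bit_score)."""
    return search_space * 2.0 ** (-bit_score)


def implied_search_space(bit_score, evalue):
    """Invert the same relation to estimate the effective search space."""
    return evalue * 2.0 ** bit_score


# Values from the nanI hit above: Score = 129 bits, Expect = 6e-33.
# The result is only approximate because the reported bit score is rounded.
space = implied_search_space(129, 6e-33)
print(f"implied effective search space: {space:.2e}")

# The E-value halves with every additional bit of score.
for bits in (30, 60, 129):
    print(bits, f"{expected_hits(bits, space):.1e}")
```

This only says that an alignment scoring this well is unlikely to arise by chance in a search of that size. It does not say that the contig encodes a functional sialidase, which is why the domain-level checks matter.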

My motivation is to identify pathogens in my samples. Previous studies have used blastx against virulence genes to identify pathogenic bacteria in wastewater and water, so I am testing this approach on my samples.

It's a tough decision to make based on one BLAST result. I would add some more tests. For example, does the translated protein have a better match in nr? Also, the neighbours of the original virulence gene may be part of the same pathway, in which case you would expect to find them near the hit.
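The nr check could be sketched like this in Python, assuming you have run the same contigs against both the virulence database and nr and saved both results in tabular format (`-outfmt 6`, where column 12 is the bit score). The function names, the `margin` parameter, and the example rows are hypothetical. Bit scores are compared rather than E-values because E-values depend on database size.

```python
def best_bitscores(lines):
    """Map each query ID to its highest bit score in tabular BLAST output."""
    best = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split("\t")
        query, bitscore = fields[0], float(fields[11])
        if bitscore > best.get(query, 0.0):
            best[query] = bitscore
    return best


def better_elsewhere(virulence_lines, nr_lines, margin=1.1):
    """Return queries whose best nr hit outscores their best
    virulence-database hit by more than the factor `margin` (arbitrary)."""
    vf = best_bitscores(virulence_lines)
    nr = best_bitscores(nr_lines)
    return sorted(q for q, score in vf.items() if nr.get(q, 0.0) > margin * score)


# Made-up example rows; subject IDs and scores are placeholders.
vf_hits = [
    "contig_1\tVFG002283\t26.62\t447\t220\t10\t257\t1132\t312\t686\t6e-33\t129",
    "contig_2\tVFG_hypothetical\t85.00\t200\t30\t0\t1\t600\t10\t209\t1e-90\t300",
]
nr_hits = [
    "contig_1\tnr_protein_A\t60.00\t450\t180\t0\t257\t1132\t20\t469\t1e-150\t520",
    "contig_2\tnr_protein_B\t84.00\t200\t32\t0\t1\t600\t10\t209\t1e-88\t295",
]
print(better_elsewhere(vf_hits, nr_hits))
```

Contigs flagged this way are candidates for having a closer, possibly non-virulence, homologue in nr and deserve a second look before being counted as evidence of a pathogen.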