Question: How much can you rely on local BLAST results for gene annotation in metagenomic samples?
kelvinfrog75 wrote (3.3 years ago):

I have a contig file from an assembly of Illumina shotgun reads from soil samples. I ran a local blastx against the virulence genes of different bacteria. After parsing the BLAST report with MEGAN, I saw that many "assigned" reads have a low percent identity, such as this one at only 27%. So my question is: why is the expect value (E-value) so low when the percent identity is so low? Can I say the reads really are what they are assigned to, based only on the low E-value? Is there an option in MEGAN or BLAST to filter the reads by percent identity?

>VFG002283(gi:18309707) (nanI) exo-alpha-sialidase [sialidase (VF0391)] [Clostridium 
perfringens str. 13]
Length=694

 Score = 129 bits (323),  Expect = 6e-33, Method: Compositional matrix adjust.
 Identities = 119/447 (27%), Positives = 188/447 (42%), Gaps = 108/447 (24%)
 Frame = +2

Query  80    IGVRHAGDDGVAAYRIPGLVTSNKGTLLGVYDIRYNNSADLQER-VDIGLSRSTDGGQTW  256
             + + H G    + YRIP L  + +GTL+   D R +  AD     +D  + RS DGG+TW
Sbjct  252   VDLFHPGFLNSSNYRIPALFKTKEGTLIASIDARRHGGADAPNNDIDTAVRRSEDGGKTW  311

Query  257   EPMRVAMTFGEEGGLPSAQNGVGDPAILVDKKTGTIWIVAA--------WTHGMG-----  397
             +  ++ M + +       ++ V D  ++ D +TG I+++          W  G+G     
Sbjct  312   DEGQIIMDYPD-------KSSVIDTTLIQDDETGRIFLLVTHFPSKYGFWNAGLGSGFKN  364

Query  398   -NGRAWFNSQDGMDKNHTAQ----------------------------------------  454
              +G+ +    D   K  T +                                        
Sbjct  365   IDGKEYLCLYDSSGKEFTVRENVVYDKDGNKTEYTTNALGDLFKNGTKIDNINSSTAPLK  424

Query  455   ------LVLAKSDDDGKTWSNPINITSQVKDPSWKFLLQGPGSGITMQDGT----LVFAT  604
                   + L  SDDDGKTWS P NI  QVK    KFL   PG GI +++G     +V   
Sbjct  425   AKGTSYINLVYSDDDGKTWSEPQNINFQVKKDWMKFLGIAPGRGIQIKNGEHKGRIVVPV  484

Query  605   QFIDSTRVPNAGIMYSKDHGKTW----------KMHNYARTNT----------TEAQVAE  724
              + +     ++ ++YS D GK W          K+ N    N+          TE QV E
Sbjct  485   YYTNEKGKQSSAVIYSDDSGKNWTIGESPNDNRKLENGKIINSKTLSDDAPQLTECQVVE  544

Query  725   VEPGVLMLNMRDNRGGSRAVSVTKDLGKTWTEHPSNRSVLQESVCMASLIKVEAKDNVLN  904
             +  G L L MR N  G   ++ + D G TW E     + + E  C  S+I    K  +  
Sbjct  545   MPNGQLKLFMR-NLSGYLNIATSFDGGATWDETVEKDTNVLEPYCQLSVINYSQK--IDG  601

Query  905   KGILLFSNPNTTKGRHSITIKASLDGGL-TFPN---------EYDVLLDEGHGWGYSCLT  1054
             K  ++FSNPN  + R + T++  L   + T+ N         +Y+ L+  G+ + YSCLT
Sbjct  602   KDAVIFSNPN-ARSRSNGTVRIGLINQVGTYENGEPKYEFDWKYNKLVKPGY-YAYSCLT  659

Query  1055  MIDKETVGILYEGS-TAHMVFQAVKLK  1132
              +    +G+LYEG+ +  M +  + LK
Sbjct  660   ELSNGNIGLLYEGTPSEEMSYIEMNLK  686


Percent identity as a stand-alone metric will not be very informative. Your queries are not guaranteed to be full-length sequences, so that has to be kept in mind as well. If that xx% identity happens to include a near-perfect match over a known domain or active site (see if you can find one for sialidase in the hit above), it would be meaningful; if not, it may be a random match that just happens to be there.

Comment by genomax71k, 3.3 years ago
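
On the filtering part of the question: one option, independent of MEGAN, is to post-filter the BLAST hits yourself. The sketch below is only an illustration. It assumes the search was written out in BLAST+ tabular format (`-outfmt 6`) with the default twelve columns, and it combines percent identity with alignment length and E-value, since identity alone says little for partial queries. The thresholds, the contig names, the second example row, and the gap-open count in the first row are made up; the other numbers in the first row are taken from the hit shown in the question.

```python
# Default column order of BLAST+ tabular output (-outfmt 6).
COLUMNS = ["qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
           "qstart", "qend", "sstart", "send", "evalue", "bitscore"]
# Only the fields used for filtering are converted to numbers.
NUMERIC = {"pident": float, "length": int, "evalue": float, "bitscore": float}


def parse_hit(line):
    """Turn one tab-separated hit line into a dict."""
    hit = dict(zip(COLUMNS, line.rstrip("\n").split("\t")))
    for name, convert in NUMERIC.items():
        hit[name] = convert(hit[name])
    return hit


def filter_hits(lines, min_pident=30.0, min_length=100, max_evalue=1e-5):
    """Keep hits that pass all three cutoffs (cutoff values are assumptions)."""
    kept = []
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment lines
        hit = parse_hit(line)
        if (hit["pident"] >= min_pident
                and hit["length"] >= min_length
                and hit["evalue"] <= max_evalue):
            kept.append(hit)
    return kept


# Illustrative rows only; contig names and the second row are invented.
EXAMPLE = [
    "contig_1\tVFG002283\t26.62\t447\t220\t12\t80\t1132\t252\t686\t6e-33\t129",
    "contig_2\tVFG_demo\t85.00\t40\t6\t0\t1\t120\t10\t49\t2e-04\t45.1",
]

strict = filter_hits(EXAMPLE)                     # default 30% identity cutoff
relaxed = filter_hits(EXAMPLE, min_pident=25.0)   # looser identity cutoff
print([h["qseqid"] for h in strict])
print([h["qseqid"] for h in relaxed])
```

With a 30% identity cutoff the sialidase hit from the question is dropped even though it spans 447 aligned positions, while the short high-identity hit fails the length cutoff either way. Which of those two you would rather keep is exactly the judgment call described above.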

Asaf wrote (3.3 years ago):

The result makes sense: for a lot of proteins from environmental samples, the best you can get is a partial, low-identity alignment. Interpreting the alignment is not straightforward, though. You can use other tools, such as domain prediction with InterProScan, structure prediction, etc. It also depends on your motivation.

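
To see why a low E-value and 27% identity are not in conflict, it may help to look at how the E-value is derived. In the standard Karlin-Altschul statistics used by BLAST, the E-value depends on the bit score and the size of the search space, roughly E = m * n * 2^(-S), where S is the bit score and m and n are the effective query and database lengths. Percent identity does not appear in it. A 447-column alignment with 42% positives can accumulate a high score even at low identity. The sketch below assumes that simple relation (ignoring edge-effect corrections) and uses the rounded values from the report in the question; the function names are only for illustration.

```python
def expect_value(bit_score, search_space):
    """Expected number of chance hits with at least this bit score.

    search_space stands for m * n (effective query length times
    effective database length).
    """
    return search_space * 2.0 ** (-bit_score)


def implied_search_space(bit_score, evalue):
    """Invert the relation above to recover the search space a report implies."""
    return evalue * 2.0 ** bit_score


# Rounded values from the hit in the question: 129 bits, Expect = 6e-33.
space = implied_search_space(129, 6e-33)
print(f"implied search space: {space:.2e}")

# Same search space, different bit scores: each extra bit halves the E-value.
for bits in (30, 50, 129):
    print(f"{bits:>4} bits -> E = {expect_value(bits, space):.1e}")
```

Because the reported bit score and E-value are rounded, the implied search space is only an order-of-magnitude estimate. The point is that the E-value tells you the alignment is unlikely to be a chance match in this database, not that the contig encodes the same function as the subject.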

My motivation is to identify pathogens in my samples. Previous studies have used blastx against virulence genes to identify pathogenic bacteria in wastewater and water, so I am testing this approach on my samples.

Reply by kelvinfrog75, 3.3 years ago

It's a tough call to make from a single BLAST result. I would add some more tests: for example, does the translated protein have a better match in nr? Also, the neighbours of the original virulence gene may be part of the same pathway, so you could look for them near the hit.

Reply by Asaf, 3.3 years ago
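
The "better match in nr" check could be scripted roughly as follows. This is a sketch under assumptions, not an established pipeline: it supposes you have already run the same contigs against both the virulence gene database and nr, and reduced each result to (query, subject, bit score) tuples. It compares bit scores rather than E-values, because E-values depend on database size and are not comparable between the two searches. The `tolerance` parameter, the function names, and all rows except the VFG002283 bit score are hypothetical.

```python
def best_hits(rows):
    """Return the highest-scoring (subject, bit score) per query."""
    best = {}
    for query, subject, bitscore in rows:
        if query not in best or bitscore > best[query][1]:
            best[query] = (subject, bitscore)
    return best


def virulence_call_holds(vf_rows, nr_rows, tolerance=0.9):
    """Flag queries whose virulence hit scores about as well as their best nr hit.

    A query is True when its best virulence-database bit score is at least
    `tolerance` times its best nr bit score (or when it has no nr hit at all).
    """
    vf_best = best_hits(vf_rows)
    nr_best = best_hits(nr_rows)
    result = {}
    for query, (_, vf_score) in vf_best.items():
        nr_score = nr_best.get(query, (None, 0.0))[1]
        result[query] = vf_score >= tolerance * nr_score
    return result


# Invented rows for illustration; only the 129-bit VFG002283 hit is from the thread.
vf_rows = [("contig_1", "VFG002283", 129.0), ("contig_2", "VFG_demo", 210.0)]
nr_rows = [("contig_1", "nr_protein_A", 540.0), ("contig_2", "nr_protein_B", 215.0)]

print(virulence_call_holds(vf_rows, nr_rows))
```

A contig flagged False here has a clearly better match elsewhere in nr, which suggests the virulence assignment reflects a shared domain or distant homology and not the virulence gene itself.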