Closed: Using protein structure to classify domain sequences into full-length vs. partial-length domains and false positives
4.8 years ago
Anand Rao ▴ 630

Research problem - Using protein structure to verify and classify protein domain predictions from HMMER.

Dataset: I have a test dataset of ~800 protein domain sequences that are 30-110 aa in length, most of them ~40-45 aa long. The shortest (~30 aa) and longest (~110 aa) sequences are the extremes, but many of the predicted sequences are shorter or longer than expected and look doubtful. The full dataset is much larger, at ~17K domain predictions.
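A quick first pass, before any structure prediction, could be to flag the length outliers described above. This is only a minimal sketch: the function name `flag_by_length` and the 0.8 / 1.5 cut-offs are illustrative assumptions, not validated thresholds, and it assumes the sequences are already loaded into a dict of name to sequence.

```python
from statistics import median


def flag_by_length(seqs, low_frac=0.8, high_frac=1.5):
    """Label each sequence 'short', 'typical', or 'long' relative to the median length.

    seqs: dict mapping a sequence name to its amino-acid string.
    low_frac / high_frac: hypothetical cut-offs as fractions of the median length.
    """
    med = median(len(s) for s in seqs.values())
    labels = {}
    for name, seq in seqs.items():
        ratio = len(seq) / med
        if ratio < low_frac:
            labels[name] = "short"    # candidate truncation
        elif ratio > high_frac:
            labels[name] = "long"     # candidate over-extension or false positive
        else:
            labels[name] = "typical"
    return med, labels
```

Length alone cannot separate a truncation from a false positive, so this would only prioritise which predictions to inspect with structure-based checks.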

Proposed Solution: I want to use protein structure (2D, i.e. secondary structure, or 3D) to "verify" and "classify" these domain sequence predictions into

  • full-length domains,
  • partial-length domains, i.e. truncations, and
  • false positive predictions (that do NOT match the canonical 2D and/or 3D details for this protein domain)

Structure information: This domain is found in 7 PDB entries at RCSB. When trimmed to just the domain boundaries and superimposed in 3D space, they overlap quite well, even though pairwise sequence identity can be as low as ~17%. This is not surprising, since structure is much more conserved than sequence, but it reinforces my motivation to use "protein structure" to verify and classify domain predictions that were made with "protein sequence" methods.
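For reference, a pairwise identity figure like the ~17% above depends on how it is computed. The sketch below assumes two rows of an existing pairwise alignment (gaps as `-`) and counts identity over columns where both sequences have a residue. That denominator is one common convention among several, and `percent_identity` is a hypothetical helper name.

```python
def percent_identity(aln_a, aln_b):
    """Percent identity over alignment columns where neither row has a gap.

    aln_a, aln_b: equal-length aligned sequences using '-' for gaps.
    """
    if len(aln_a) != len(aln_b):
        raise ValueError("aligned sequences must have the same length")
    # Keep only columns where both sequences have a residue.
    columns = [(x, y) for x, y in zip(aln_a, aln_b) if x != "-" and y != "-"]
    if not columns:
        return 0.0
    matches = sum(x == y for x, y in columns)
    return 100.0 * matches / len(columns)
```

Dividing by the shorter sequence length or by the full alignment length instead would give different numbers for the same alignment.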

Questions to forum members:

1. Is 2D prediction for my domain sequences sufficient for the domain verification and classification I want to carry out?

  • If yes, then which SS tool is suggested (PSIPRED for prediction from sequence, DSSP for assignment from the solved structures, something else)?
  • How do I parse those results, and
  • What do I compare the parsed results to: the SS of the 7 solved PDB structures? All of them, one of them, or some consensus SS?

2. If 2D info is necessary but not sufficient, and 3D info is required, then what tool(s) should I use?

  • Is there any Deep Learning or Machine Learning software that can do this? AlQuraishi @ Harvard said his very recent RGN/ProteinNet is not suitable since I do have templates to check against, and
  • The I-TASSER people at U-Mich said this is computationally too intensive, even though my inputs are not full-length proteins but just shorter domain sequences (to be fair, my inquiry was about analyzing the full dataset, not just the test dataset).

Based on the responses from the RGN and I-TASSER groups, I've started wondering whether 2D rather than 3D prediction would provide an acceptable solution to my problem.
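To make the 2D route concrete, here is a minimal sketch of how a per-residue SS string could be parsed and compared to a reference. It assumes three-state strings (H = helix, E = strand, C = coil) such as those produced by an SS predictor, and a reference topology string derived from the solved structures. The reference `"EEH"` in the usage example, the `min_len` filter, and both function names are hypothetical illustrations, not the actual topology of this domain.

```python
from itertools import groupby


def ss_elements(ss, min_len=3):
    """Collapse a per-residue SS string into its ordered elements.

    Runs of H or E shorter than min_len are ignored; coil is dropped.
    """
    return "".join(
        state
        for state, run in groupby(ss)
        if state in "HE" and len(list(run)) >= min_len
    )


def classify(pred_ss, ref_topology, min_len=3):
    """Classify a prediction as 'full', 'partial', or 'false_positive'.

    'full'    : element order matches the reference exactly.
    'partial' : element order is a non-empty contiguous piece of the reference.
    Anything else is treated as a false positive.
    """
    topo = ss_elements(pred_ss, min_len)
    if topo == ref_topology:
        return "full"
    if topo and topo in ref_topology:
        return "partial"
    return "false_positive"


# Usage with a hypothetical reference topology of strand-strand-helix:
print(classify("CCEEEECCEEEECCHHHHHHCC", "EEH"))  # full
print(classify("CCEEEECCHHHHHHCC", "EEH"))        # partial
print(classify("HHHHHHCCHHHHH", "EEH"))           # false_positive
```

This only compares the order of elements, so it ignores element lengths and the confidence of each predicted state. A very short fragment (for example a single strand) would also count as "partial", so a minimum number of matched elements would probably be needed in practice.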

I look forward to your answers to these questions, as well as any orthogonal thinking on how to solve this research problem. Thanks in advance!

structure domain 3D sequence protein • 168 views
This thread is not open. No new answers may be added