Question

Help with BLAST and the PDB

5.9 years ago

danielgurnon • 0

Hi everyone,

I’m hoping someone can help me understand how to use BLAST to address a specific problem. I’m relatively inexperienced with BLAST, and prior to now I’ve never tinkered with settings. So first, thank you for reading this, and thanks in advance for any guidance!

I want to obtain structural data about a particular position in a human protein. I’ll be doing this for many positions in many different proteins, so I’m working with a computer scientist to develop a script using Biopython. The script essentially does the following:

(1) obtain a position of interest from a database of mutations (for example, Ala320Thr in human LDHA)

(2) Obtain a sequence window around that position using sequence data from NCBI (for example, residues 310-330 in human LDHA)

(3) BLAST the PDB with the sequence window from (2), returning PDB IDs that contain the sequence or something very similar.

I need help understanding how best to accomplish (3):

• How can we ensure that the key position (e.g., “Ala”, at the center of the window) is always present in our hits?

• Besides the key position, other amino acids in the window can vary somewhat. But how much should the rest of the window be allowed to vary while still ensuring a reasonable match? Along the same lines, what size of a sequence window should we use?

• How should we define “reasonable match” in the BLAST results?

• How should we treat a situation where a position of interest occurs close to a terminus, such that the window would be smaller than usual, and the key position would not be centered?
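For the terminus case in the last bullet, one possible approach is to clamp the window at the ends of the sequence and carry the key residue's offset along with it, so the residue can still be located when it is no longer centered. This is only a minimal sketch: `extract_window`, `flank`, and the toy sequence are hypothetical names and data, not part of the script described above.

```python
def extract_window(sequence, position, flank=10):
    """Return (window, offset) for a 1-based residue position.

    offset is the 0-based index of the key residue inside the window,
    so the caller can still find it when the window is truncated at a
    terminus and the residue is no longer centered.
    """
    if not 1 <= position <= len(sequence):
        raise ValueError("position lies outside the sequence")
    index = position - 1                          # convert to 0-based
    start = max(0, index - flank)                 # clamp at the N-terminus
    end = min(len(sequence), index + flank + 1)   # clamp at the C-terminus
    return sequence[start:end], index - start


# Toy sequence (not LDHA), used only to illustrate the behaviour.
toy = "MATLKDQLIYNLLKEEQTPQ"
window, offset = extract_window(toy, position=3, flank=5)
print(window, offset, window[offset])
```

With `flank=10` the window matches the 310-330 example in step (2); near a terminus it simply comes back shorter, and `offset` says where the key residue sits.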

Thanks, Dan

alignment • 1.2k views

5.9 years ago by danielgurnon • 0

Only a partial answer, and possibly not a very useful one at that, but defining what counts as a reasonable match is really impossible in general. You have to define what you consider reasonable based on what's known about that protein. If you're looking for active sites and functional domains, you might suppose that the conservation will be reasonably high, but only you (or people intimately familiar with that family of proteins) will have a good grasp of this.

The long answer would be to try a bunch of different cutoffs and see at what point you start getting what you might consider junk. It will also depend on the question: are you interested in the diversity of all similar proteins, for instance? If so, you might have to be less strict with your cutoffs.
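As a rough illustration of trying several cutoffs, the sketch below counts how many hits survive at each E-value threshold, which makes it easier to see where the list starts filling with junk. The hit records and the `count_hits_by_cutoff` helper are made up for the example; real values would come from the parsed BLAST output.

```python
def count_hits_by_cutoff(hits, evalue_cutoffs):
    """Map each E-value cutoff to the number of hits at or below it."""
    return {
        cutoff: sum(1 for hit in hits if hit["evalue"] <= cutoff)
        for cutoff in evalue_cutoffs
    }


# Placeholder hits for illustration only; not real BLAST results.
example_hits = [
    {"name": "hit_A", "evalue": 1e-8},
    {"name": "hit_B", "evalue": 2e-4},
    {"name": "hit_C", "evalue": 0.05},
    {"name": "hit_D", "evalue": 3.0},
]

counts = count_hits_by_cutoff(example_hits, [1e-5, 1e-2, 1, 10])
for cutoff, n in counts.items():
    print(f"E <= {cutoff}: {n} hit(s)")
```

The same loop works for percent identity or coverage if those are the measures you decide matter for your proteins.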

5.9 years ago by Joe • 21k

Thanks, that makes sense. I guess I want a list of potential matches that an expert would need to curate based on their knowledge of the protein. But I want to use a top hit for an initial display. The specific question being asked in all cases is, "Does structural data exist that would help hypothesize the effect of a mutation at this position?". The best PDB code found would be brought up in a visualization program and the position would be highlighted.

5.9 years ago by danielgurnon • 0

If you already have a close structure of interest, a more conventional approach (though maybe more computationally intensive) would be to use a tool like I-TASSER or MODELLER to build a homology model of your sequence of interest, templated against the known 3D structure.

5.9 years ago by Joe • 21k

@jrj.healey has recounted some of the difficulties. You were able to enumerate your needs above, but translating them into programmatic actions is going to be a tall order.

Instead of putting the position of interest at the center of a sequence window, why not consider the full sequence (or are you specifically looking at only a domain)? It is unclear whether the region of interest is going to be narrow, both in terms of location and in terms of the organisms it is found in. Looking at structural data with a fragment is going to be significantly different from looking at whole proteins, and any observations you make would not translate well.

Perhaps you need to look into structure similarity searches (e.g. VAST from NCBI) instead of plain BLAST.

5.9 years ago by GenoMax • 141k

I hadn't heard of VAST. Excellent. Thank you. But the problem is, I'm not starting with structural data.

5.9 years ago by danielgurnon • 0

Following genomax's thinking, I'm a big fan of HMMs for protein structure queries. Tools such as HH-suite/HHpred intrinsically break the query down into matching domains too, so you might not have to do this window-based logic so manually. HHpred also tells you the chain that was matched, and returns the sequence, PDB ID, score info, etc.

I'd be more inclined to maybe find all the hits to your sequences of interest, and filter them after the fact for sequences with residue X or Y at whatever position.
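Filtering after the fact could look something like the sketch below: walk one gapped alignment, find what the hit has opposite the query position of interest, and keep the hit only if the residue matches. The function name and the toy alignment are hypothetical. With Biopython's `NCBIXML` parser the four inputs would, I believe, correspond to `hsp.query`, `hsp.sbjct`, `hsp.query_start` and `hsp.sbjct_start`, but check that against your own parsed records.

```python
def subject_residue_at(query_aln, sbjct_aln, query_start, sbjct_start, position):
    """Find what the hit has opposite a given 1-based query position.

    query_aln / sbjct_aln are the gapped alignment strings of one HSP;
    query_start / sbjct_start are the 1-based coordinates of their first
    residues. Returns (residue, subject_position), or None if the query
    position is outside the alignment or aligned to a gap in the subject.
    """
    q_pos = query_start - 1
    s_pos = sbjct_start - 1
    for q_char, s_char in zip(query_aln, sbjct_aln):
        if q_char != "-":
            q_pos += 1
        if s_char != "-":
            s_pos += 1
        if q_char != "-" and q_pos == position:
            return None if s_char == "-" else (s_char, s_pos)
    return None


# Toy alignment for illustration only; not real BLAST output.
result = subject_residue_at("ACD-EF", "AC-GEF", query_start=5,
                            sbjct_start=12, position=8)
keep_hit = result is not None and result[0] == "E"  # "E" is the expected residue
print(result, keep_hit)
```

Two caveats: if you BLAST a window rather than the full protein, `position` has to be given in window coordinates, and the returned subject position indexes the hit's sequence as BLAST reports it, which may not match the residue numbering used inside the PDB file.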

Another related option might be to look at Pfam domains rather than BLASTing directly against the PDB, and build up a list of domains that fit your criteria.