Question

Filtering AA sequences of PDB structures by sequence identity

1

Entering edit mode

21 months ago

Jonathan Lefebre ▴ 70

Dear all,

I came across a problem that I thought would be easier to solve. What I want to do is accessing the PDB and download structures associated with a certain protein family. I then want to align those structures. So far so easy. BUT of course for some members of the protein family there is several structures, of which I want only one. It doesn't matter which one, but it has to be a unique set of members of the protein family. I am not posting my code here because this is not where I need help. I would need advise to a approach that is feasible and biologically acceptable, if that makes sense.

My approach so far:

Query PDB for structures associated with protein family. This gives me a list of PDB IDs
Download Fasta files associated with PDB IDs
Construct MSA using Clustalo
calculate pairwise sequence identity for each pair of sequences in the alignment --> yields a numpy array with 0 --> 1 for sequence identity between pairs
filter array by cutoff set to 0.95 (yes some members of the family have very high sequence similarity) and delete everything higher than 0.95 sequence identity
extract indices from array to use as unique set

Here is the problem: I significantly reduce the number of sequences, but I end up with some duplicates, because sometimes the sequences of the PDB entries of the same protein have different length etc., leading to lower sequence identity. Meaning I miss those.

If anyone has a smart idea to filter a list of PDBs to obtain one unique structure for every protein, please don't hesitate to post it here. Any input is highly welcome!

identity python sequence • 868 views

ADD COMMENT • link updated 21 months ago by Mensur Dlakic ★ 27k • written 21 months ago by Jonathan Lefebre ▴ 70

0

Entering edit mode

Couple of thoughts that may be in some way useful:

Can you change your alignment process/filter your sequences such that you require at least one gap/mismatch so you never obtain an exact duplicate? (Not sure if this will affect your downstream task). It sounds like %ID isn't sufficient alone for this.
Maybe clustering via CD-HIT or doing reciprocal alignments would be a better alternative? The perfect substring matches should identify as high scoring in at least one 'direction' of the search?
Do some pre-filtering set manipulations to obtain just the longest sequence for any given PDB ID, so you don't have any duplicate PDB IDs with shorter seqs?

ADD REPLY • link 21 months ago by Joe 21k

score 2 · Answer 1 · 2022-07-08

I think your solution is sound in general, but it doesn't strike me as very efficient. As you discovered, it is also prone to some issues due to different lengths of otherwise similar structures.

I suggest you start with all FASTA sequences extracted from PDB structures. Those are in the following file (current as of July 1st, 2022):

https://ftp.wwpdb.org/pub/pdb/derived_data/pdb_seqres.txt

You can use that file to perform a BLAST search from a given sequence or a set of sequences, which will give you a set of PDB ID matches. If you already know all the IDs, you can skip this part.

People are continuously culling PDB sequences using various quality and sequence identity criteria. Those datasets can be found on PISCES server:

http://dunbrack.fccc.edu/pisces/

If you click on "Download precompiled ..." link, there will be many files to choose from based on sequence identity you wish and various quality criteria.

Let's take this file:

cullpdb_pc50.0_res0.0-1.5_noBrks_len40-10000_R0.25_Xray_d2022_06_19_chains4865.fasta

This means all protein chains with length 40-10000 residues (len40-10000), without any breaks (noBrks), resolution 0-1.5 angstroms (res0.0-1.5), R factor =< 0.25 (R0.25), and with at most 50% identity (pc50.0). There are 4865 such protein chains, and you can compare the list of IDs obtained in the first step to this group, which will automatically guarantee in this case that your protein sequences are at most 50% identical, plus all the other quality criteria. It is up to you to select a group of sequences that best matches your needs.