Question: How to extract all Pfam Sequences with cleared PDB Structure?
19 months ago by
twinstar20
twinstar20 wrote:

I downloaded and extracted the Pfam-A.full.ncbi.gz from ftp://ftp.ebi.ac.uk/pub/databases/Pfam/releases/Pfam31.0/

I split the huge file into smaller files, one per family, like this: https://www.dropbox.com/sh/t0b173oa8odvsne/AABZLhiaq-jZtE5PrULvBm9na?dl=0

Now I am trying to extract all sequences from the alignments that have an observed (solved) structure in the PDB. Luckily, Pfam provides the pdbmap file, which links PDB IDs to Pfam IDs. I can extract all Pfam families that have observed structures in the PDB, but I cannot extract the corresponding sequences. Any help on how to accomplish this would be much appreciated.

PS: I currently use Python 2.7 with BioPython, but it can't handle files of this size.

pfam python pdb • 783 views
19 months ago by
twinstar20
twinstar20 wrote:

Okay, for anyone with the same problem: I found the answer!

  1. Extract all UniProtKB IDs (column 5) from the pdbmap file into a list (you may have to split the list into smaller sub-lists due to server limitations)
  2. Upload the list here: http://www.uniprot.org/mapping/ and select "UniProtKB AC/ID" as input and "EMBL/GenBank/DDBJ CDS" as output
  3. Rejoin the mapped results into a single target_id list
  4. Iterate over all Pfam files and discard every sequence that is not in your target_id list
  5. Let it run for 2 days
  6. ????
  7. profit
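Step 1 can be sketched roughly as below. This is only an illustration, not the script I actually ran: it assumes each pdbmap line is a series of `;`-terminated fields with the UniProtKB accession in the fifth one, and the helper names `read_uniprot_accessions` and `chunked`, as well as the chunk size, are made up for the example. Check a few lines of your own pdbmap file and adjust `column` if the layout differs.

```python
def read_uniprot_accessions(lines, column=4):
    """Collect unique accessions from pdbmap-style lines.

    Assumption: fields are separated by ';' and the UniProtKB accession
    sits at zero-based index `column` (i.e. the fifth field).
    """
    seen = set()
    ordered = []
    for line in lines:
        fields = [field.strip() for field in line.split(";")]
        if len(fields) <= column or not fields[column]:
            continue  # skip blank or malformed lines
        accession = fields[column]
        if accession not in seen:
            seen.add(accession)
            ordered.append(accession)
    return ordered


def chunked(items, size):
    """Split a list into consecutive sub-lists of at most `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]


# Tiny in-memory demo; the lines below are illustrative, not real pdbmap content.
demo_lines = [
    "1ABC;\tA;\tFam_one;\tPF00001;\tP11111;\t7-113;\n",
    "2DEF;\tB;\tFam_one;\tPF00001;\tP11111;\t7-113;\n",
    "3GHI;\tA;\tFam_two;\tPF00002;\tQ22222;\t1-50;\n",
]
accessions = read_uniprot_accessions(demo_lines)
batches = chunked(accessions, 1)
```

For the real file you would pass an open file handle instead of `demo_lines` and write each batch to its own text file for uploading.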

This will get you roughly 70,000 sequences.
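The filtering in step 4 does not need BioPython at all if you read the files line by line, which also avoids loading a whole alignment into memory. A minimal sketch, assuming the per-family files are Stockholm-style with one `name/start-end  aligned_sequence` row per sequence (not interleaved) and that the part of the name before the `/` is the accession; `filter_stockholm` and `ungap` are hypothetical helper names, and the version-suffix handling is a guess at how the mapped IDs may differ from the names in the file.

```python
def filter_stockholm(lines, target_ids):
    """Yield (name, aligned_sequence) for rows whose accession is in target_ids.

    Assumptions: markup lines start with '#', records end with '//', and
    every other non-empty line is 'name<whitespace>aligned_sequence'.
    """
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#") or line.startswith("//"):
            continue
        parts = line.split(None, 1)
        if len(parts) != 2:
            continue
        name, aligned = parts
        accession = name.split("/", 1)[0]   # drop the '/start-end' suffix
        unversioned = accession.split(".", 1)[0]  # drop a '.1'-style version
        if accession in target_ids or unversioned in target_ids:
            yield name, aligned.strip()


def ungap(aligned):
    """Remove alignment gap characters ('-' and '.') and upper-case the rest."""
    return aligned.replace("-", "").replace(".", "").upper()


# Tiny in-memory demo with made-up identifiers.
demo_alignment = [
    "# STOCKHOLM 1.0\n",
    "#=GF ID   Fam_one\n",
    "AAA00001.1/5-12   ..acDEF-GH..\n",
    "BBB00002.1/1-8    MKL--TTVA...\n",
    "//\n",
]
targets = {"AAA00001"}  # use a set so membership tests stay fast
kept = [(name, ungap(seq)) for name, seq in filter_stockholm(demo_alignment, targets)]
```

Keeping `target_ids` in a set rather than a list matters here: with tens of thousands of IDs and millions of alignment rows, list lookups are probably a large part of why a naive run takes days.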
