Question

Search RCSB with a list of protein names?

0

7 months ago

Joseph • 0

Hi, I have a text file with 1000 different proteins (EDIT: proteins, not PDB-IDs), and I would like to gather all the structural information related to them from RCSB. I have never done any high-throughput work with RCSB, and the list is a bit too long to manually curate.

Does anyone know any existing methods/packages that could help me out? Ideally I'd be able to download all potentially relevant PDBs, but anything would help as a starting point.

EDIT: clarified that I have gene names for the proteins, not PDB IDs. There could be many PDB entries in RCSB that contain the same protein, and my goal is to find all of them in one go.

Thanks so much, Joe

RCSB protein • 1.0k views

ADD COMMENT • link updated 7 months ago by GenoMax 142k • written 7 months ago by Joseph • 0

1

You can download various types of bulk data files from RCSB here: https://www.rcsb.org/docs/programmatic-access/file-download-services

You can batch download data as shown here: https://www.rcsb.org/docs/programmatic-access/batch-downloads-with-shell-script
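To illustrate what a batch download boils down to, here is a small Python sketch that turns a list of PDB IDs into download URLs. It assumes the `https://files.rcsb.org/download/<ID>.<ext>` pattern described on the file download services page; the function name is illustrative only.

```python
BASE_URL = "https://files.rcsb.org/download"


def download_urls(pdb_ids, ext="pdb"):
    """Build one download URL per PDB ID (ext could be 'pdb' or 'cif')."""
    urls = []
    for pdb_id in pdb_ids:
        pdb_id = pdb_id.strip().upper()
        if pdb_id:  # ignore empty entries, e.g. from a trailing comma
            urls.append(f"{BASE_URL}/{pdb_id}.{ext}")
    return urls


# Example with a comma-separated ID string, as used by the batch script.
for url in download_urls("1hsg,4hhb,".split(",")):
    print(url)
```

Very large structures are only distributed in mmCIF format, so `ext="cif"` is the safer choice when the ID list is mixed.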

ADD REPLY • link 7 months ago by GenoMax 142k

0

Hey, this is so close to what I'm looking for - I clarified in my edit. This is a great way to batch download lots of PDB entries from RCSB by their PDB ID. However, the same protein can appear in many entries. I'm looking for a method to download all entries given a specific protein name/sequence.

ADD REPLY • link 7 months ago by Joseph • 0

1

7 months ago

Mensur Dlakic ★ 27k

This is not the most elegant solution, but it should work with previous suggestions.

In this remote directory:

http://ftp.wwpdb.org/pub/pdb/derived_data/

there is a file, pdb_seqres.txt, containing the sequences of all chains in the PDB (nucleic acid chains are included too, marked mol:na in the header). You could BLAST your sequences against this database, find hits that are 100% identical (or whatever identity threshold is acceptable), and use their IDs as a starting point for the programs that are listed above.
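As a minimal sketch of the filtering step, assuming the BLAST search was run with tabular output (`-outfmt 6`) and that subject IDs keep the `<pdbid>_<chain>` form used in the pdb_seqres.txt headers (for example `1hsg_A`), the Python below collects the PDB IDs whose hits meet an identity threshold. The function name and the sample rows are made up for illustration.

```python
from collections import defaultdict


def pdb_ids_from_blast(lines, min_identity=100.0):
    """Map each query ID to the sorted PDB IDs of hits at or above min_identity.

    Expects BLAST tabular output (-outfmt 6): qseqid, sseqid, pident, ...
    Assumes sseqid looks like '1hsg_A' (PDB ID, underscore, chain).
    """
    hits = defaultdict(set)
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split("\t")
        query, subject, identity = fields[0], fields[1], float(fields[2])
        if identity >= min_identity:
            pdb_id = subject.split("_")[0].upper()
            hits[query].add(pdb_id)
    return {query: sorted(ids) for query, ids in hits.items()}


# Made-up example rows: only the first three columns matter here.
example = [
    "P53_HUMAN\t1tup_A\t100.000\t195",
    "P53_HUMAN\t1tup_B\t100.000\t195",
    "P53_HUMAN\t2xyz_A\t87.500\t190",
]
print(pdb_ids_from_blast(example))  # {'P53_HUMAN': ['1TUP']}
```

Percent identity is computed over the aligned region only, so a short perfect match also passes this filter; comparing the alignment length column against the query length is one way to tighten it.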

ADD COMMENT • link 7 months ago by Mensur Dlakic ★ 27k

0

7 months ago

bk11 ★ 2.4k

You can use Bio3D, an R package, for this.

http://thegrantlab.org/bio3d_v2/tutorials/installing-bio3d

http://thegrantlab.org/bio3d/articles/online/intro_vignette/Bio3D_introduction.html

library(bio3d)
# A four-character PDB ID makes read.pdb() fetch the entry online
pdb <- read.pdb("1hsg")

pdb

 Call:  read.pdb(file = "1hsg")

   Total Models#: 1
     Total Atoms#: 1686,  XYZs#: 5058  Chains#: 2  (values: A B)

     Protein Atoms#: 1514  (residues/Calpha atoms#: 198)
     Nucleic acid Atoms#: 0  (residues/phosphate atoms#: 0)

     Non-protein/nucleic Atoms#: 172  (residues: 128)
     Non-protein/nucleic resid values: [ HOH (127), MK1 (1) ]

   Protein sequence:
      PQITLWQRPLVTIKIGGQLKEALLDTGADDTVLEEMSLPGRWKPKMIGGIGGFIKVRQYD
      QILIEICGHKAIGTVLVGPTPVNIIGRNLLTQIGCTLNFPQITLWQRPLVTIKIGGQLKE
      ALLDTGADDTVLEEMSLPGRWKPKMIGGIGGFIKVRQYDQILIEICGHKAIGTVLVGPTP
      VNIIGRNLLTQIGCTLNF

+ attr: atom, xyz, seqres, helix, sheet,
        calpha, remark, call

ADD COMMENT • link 7 months ago by bk11 ★ 2.4k

0

This is a really neat package, but I don't think it does exactly what I want, so I've clarified my question. It seems to take a PDB ID and return relevant information about that entry, similar to what "fetch" does in PyMOL. But the same protein could be featured in many PDB entries in RCSB. So, I'm looking for a way to find all entries in RCSB that contain an input protein name/sequence.

ADD REPLY • link 7 months ago by Joseph • 0

0

7 months ago

Jiyao Wang ▴ 370

You can use NCBI esearch to search the protein names against the structure database to get the PDB IDs, then retrieve the structures.

ADD COMMENT • link 7 months ago by Jiyao Wang ▴ 370

score 4 · Accepted Answer · 2023-10-01

Using EntrezDirect as noted to get structure accessions (output truncated to save space):

$ esearch -db structure -query TP53 | esummary | xtract -pattern DocumentSummary -element PdbAcc,string
8U4U    Homo sapiens
8SWJ    Homo sapiens
8GJS    Danio rerio
8E7B    Homo sapiens

$ esearch -db structure -query dnaA | esummary | xtract -pattern DocumentSummary -element PdbAcc,string
7S3L    Bacillus subtilis PY79
7RZM    Stenotrophomonas maltophilia K279a
6T66    Vibrio cholerae
6JIR    Caulobacter vibrioides CB15

To download the data using those accessions do the following:

$ esearch -db structure -query dnaA | esummary | xtract -pattern DocumentSummary -element PdbAcc | tr "\n" "," >  acc.txt

Use the batch download script from RCSB ( https://www.rcsb.org/scripts/batch_download.sh ) to download the PDB files.

bash batch_download.sh -f acc.txt -p
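
For a whole list of names, one option is to run the same pipeline once per name and merge the accessions into a single file. The Python sketch below assumes EntrezDirect is installed and on the PATH, and that `proteins.txt` (a hypothetical file name) holds one name per line; the function names are illustrative.

```python
import shlex
import subprocess


def build_pipeline(name):
    """Return the EntrezDirect shell pipeline for one protein/gene name."""
    return (
        f"esearch -db structure -query {shlex.quote(name)} | esummary | "
        "xtract -pattern DocumentSummary -element PdbAcc"
    )


def merge_accessions(outputs):
    """Combine raw pipeline outputs into a sorted, de-duplicated ID list."""
    ids = set()
    for text in outputs:
        ids.update(token.upper() for token in text.split())
    return sorted(ids)


def fetch_accessions(names):
    """Run the pipeline for each name (needs EntrezDirect and network access)."""
    outputs = []
    for name in names:
        result = subprocess.run(
            build_pipeline(name), shell=True, capture_output=True, text=True
        )
        outputs.append(result.stdout)
    return merge_accessions(outputs)


def write_id_list(ids, path="acc.txt"):
    """Write IDs as one comma-separated line, the format used for acc.txt above."""
    with open(path, "w") as handle:
        handle.write(",".join(ids))
```

Typical use would be to read the names with `names = [line.strip() for line in open("proteins.txt") if line.strip()]`, call `write_id_list(fetch_accessions(names))`, and then run the batch download command on the resulting acc.txt. A free-text query can also match unrelated entries, so the merged list is best treated as a starting point to review.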