Question

Protein Sequence Alignment From Protein Databank To Cosmic Or Uniprot

2

Entering edit mode

12.2 years ago

alphaace ▴ 40

I would like to match up PDB files from the Protein Databank to canonical AA sequences for the protein as displayed in Cosmic or Uniprot. Specifically, what I need to do is pull from the pdb file, the carbon alpha atoms in the backbone and their xyz positions. I also need to pull their actual order in the proteins sequence. For structure 3GFT (Kras - Uniprot Accession Number P01116), this is easy, I can just take the ResSeq number. However, for some other proteins, I can't figure out how this is possible.

For example, for structure (2ZHQ) (protein F2 - Uniprot Accession Number P00734), the Seqres has the ResSeq numbers repeated for numbers "1" and "14" and only differs in the Icode entry. Further the icode entries are not in lexographic order so it's hard to tell what order to extract.

It get's even worse if you consider structure 3V5Q (Uniprot Accession Number Q16288). For most of the protein, the ResSeq number matches the actual amino acid from a source like COSMIC or UNIPROT. Howver after Position 711, it jumps to position 730. When looking at REMARK 465 (the missing atoms), it shows that for chain A , 726-729 are missing. However after matching it up to the protein, those AA actually are 712-715.

I've attached code that works fro the simple 3GFT example but if someone is an expert in pdb files and can help me get the rest of it figured out, I would be much obliged.

library(gdata)

#answer<- get.positions("http://www.pdb.org/pdb/files/2ZHQ.pdb","L")
answer<- get.positions("http://www.pdb.org/pdb/files/3GFT.pdb","A")


#This function reads a pdb file and returns the appropriate data structure
get.positions <- function(sourcefile, chain_required = "A"){
N <- 10^5
AACount <- 0
positions = data.frame(Residue=rep(NA, N),AtomCount=rep(0, N),SideChain=rep(NA, N),XCoord=rep(0, N),YCoord=rep(0, N),ZCoord=rep(0, N),stringsAsFactors=FALSE)     

visited = list()
filedata <- readLines(sourcefile, n= -1)
for(i in 1: length(filedata)){
input = filedata[i]
id = substr(input,1,4)
if(id == "ATOM"){
  type = substr(input,14,15)
  if(type == "CA"){
    resSerial = strtoi(substr(input, 7,11))
    residue = substr(input,18,20)
    type_of_chain = substr(input,22,22)
    resSeq = strtoi(substr(input, 23,26))
    altLoc = substr(input,17,17)

    if(resSeq >=1){ #does not include negative residues
      if(type_of_chain == chain_required && !(resSerial %in% visited)  && (altLoc == " " || altLoc == "A") ){
        visited <- c(visited, resSerial)
        AACount <- AACount + 1
        position_string =list()
        position_string[[1]]= as.numeric(substr(input,31,38))
        position_string[[2]] =as.numeric(substr(input,39,46))
        position_string[[3]] =as.numeric(substr(input,47,54))
        #print(input)
        positions[AACount,]<- c(residue, resSeq, type_of_chain, position_string[[1]], position_string[[2]], position_string[[3]])
      }
    }
  }
}

  } 
  positions<-positions[1:AACount,]
  positions[,2]<- as.numeric(positions[,2])
  positions[,4]<- as.numeric(positions[,4])
  positions[,5]<- as.numeric(positions[,5])
  positions[,6]<- as.numeric(positions[,6])
  return (positions)
}

protein pdb uniprot • 4.8k views

ADD COMMENT • link updated 10.2 years ago by Jose Manuel Duarte ▴ 340 • written 12.2 years ago by alphaace ▴ 40

Ram · Answer 1 · 2014-05-29

PDB to UniProt mapping can indeed be tricky. Fortunately the SIFTS project is dedicated to that, they provide residue-level mapping of PDB to UniProt, for instance see this tab delimited file, here the first few lines:

PDB     CHAIN   SP_PRIMARY      RES_BEG RES_END PDB_BEG PDB_END SP_BEG  SP_END
101m    A       P02185  1       154     0       153     1       154
102l    A       P00720  1       40      1       40      1       40
102l    A       P00720  42      165     41      164     41      164
102m    A       P02185  1       154     0       153     1       154
103l    A       P00720  1       40      1       40      1       40
103l    A       P00720  44      167     41      164     41      164
103m    A       P02185  1       154     0       153     1       154

There the start and end coordinates for each mapping are provided in the last 6 columns: RES_BEG/RES_END are for residue numbers matching the SEQRES of the PDB files (from 1 to n), PDB_BEG/PDB_END for the residue numbers as they appear in ATOM lines (i.e. can have insertion codes, can have jumps etc) and SP_BEG/SP_END are the UniProt coordinates (SP is for swissprot the old name of UniProt).

Regarding mappings of ATOM lines residue number to SEQRES numbers it is easy to find them in the mmCIF or XML files provided by the PDB. The pdbx_poly_seq_scheme section contains the mapping. See for instance the corresponding file for 2zhq: ftp://ftp.wwpdb.org/pub/pdb/data/structures/divided/mmCIF/zh/2zhq.cif.gz

score 0 · Answer 2 · 2012-06-21

Not sure if this helps but it is my understanding that SEQRES and ATOM sequences do not always match up. Some residues in the ATOM section could not be resolved in the crystal structure. If you simply want to check the sequence and correct order, you could use http://bioinf.org.uk/pdbsws/ This server matches PDB code to uniprot id. If you check for the PDB in question: 2ZHQ, the corresponding uniprot id given by the server is: Q69EZ8.

If you check the sequences, they seem to be the same. This does not resolve you issue about extraction of the atomic coordinates of alpha carbons however. I do not really understand what is going on with 2ZHQ, maybe there is a genuine reason for this structure, hopefully a crystallographer can help you. All the best.