The PDB offers a downloadable FASTA file containing one sequence per individual chain. This file is very useful, and I use it routinely to check whether there is an available structure related to my sequences of interest.
Every now and then, a sequence search finds a decent hit, but when I download the structure, I notice that the sequence region matching my query is not present in the coordinate file. I am not much of a structure expert, so I can't give a definitive explanation for this. I assume that the downloadable FASTA sequence is the construct the scientists used in their experiments, and the missing parts are the bits that were not resolved. Other explanations may also apply. What I am interested in is a FASTA file corresponding to the part of each PDB entry for which there is an actual structure (with coordinates and everything).
Edit: Here is one example: the PDB file 2H9E (chain C) is reported to have the sequence:
The SIFTS annotation, as proposed by Khader, maps this to UniProt entry Q16938, residues 8-91 (which results in the same sequence as that shown above). So far, so good. However, when I look at the structure (or even try to use it for homology modeling), I notice that the structure starts at CGE... (without the first 5 residues) and ends with ..PGTR (without the last residue). Moreover, there are some residues missing in the middle of the structure. I completely understand WHY they are missing, but it would be nice if there were a way to extract the usable sequence. Ideally this would work on a database-wide scale, because that would allow me to run searches and look for the optimal template (maybe there is a better-resolved structure). Failing that, it would be nice to extract the usable sequence from a given PDB entry. It is possible that the 'joy' package recommended by Khader does just that, but it may be available only to people in academia. No joy for me.
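To make the request concrete: for a single entry, the "usable" sequence is essentially the list of residues that have coordinate records. Below is a minimal sketch in plain Python, assuming the classic fixed-column PDB format (not mmCIF). The function name `observed_sequences` and the lookup table are made up for illustration, only the 20 standard residues plus selenomethionine (MSE) are translated, and only the first model of a multi-model file is read.

```python
# Three-letter to one-letter codes; MSE (selenomethionine) is mapped to M.
THREE_TO_ONE = {
    "ALA": "A", "ARG": "R", "ASN": "N", "ASP": "D", "CYS": "C",
    "GLN": "Q", "GLU": "E", "GLY": "G", "HIS": "H", "ILE": "I",
    "LEU": "L", "LYS": "K", "MET": "M", "PHE": "F", "PRO": "P",
    "SER": "S", "THR": "T", "TRP": "W", "TYR": "Y", "VAL": "V",
    "MSE": "M",
}


def observed_sequences(pdb_lines):
    """Return {chain_id: sequence} for residues that have coordinates.

    pdb_lines is any iterable of lines in fixed-column PDB format.
    Residues absent from the ATOM/HETATM records (unresolved termini,
    disordered loops) simply do not appear in the result.
    """
    residues = {}   # chain id -> list of one-letter codes
    seen = set()    # (chain, resSeq + insertion code) already counted
    for line in pdb_lines:
        record = line[:6].strip()
        if record == "ENDMDL":
            break   # stop after the first model
        if record not in ("ATOM", "HETATM"):
            continue
        res_name = line[17:20].strip()
        if res_name not in THREE_TO_ONE:
            continue   # waters, ligands, unknown residues
        chain = line[21]
        key = (chain, line[22:27])   # columns 23-27: resSeq and iCode
        if key in seen:
            continue   # further atoms of a residue already recorded
        seen.add(key)
        residues.setdefault(chain, []).append(THREE_TO_ONE[res_name])
    return {chain: "".join(codes) for chain, codes in residues.items()}
```

Calling `observed_sequences(open("2h9e.pdb"))` on a locally downloaded file (the filename is just an example) and writing each chain out with a `>2H9E_C`-style header would give a FASTA file of the resolved residues only. Gaps in the middle of a chain are silently closed up in this sketch, so the residue numbers would have to be kept alongside if the gap positions matter.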
While we're at it: is anybody aware of a simple tool that can renumber the residues in PDB coordinate files? The authors don't appear to be consistent in their residue numbering: sometimes they start at 1, and sometimes the numbers refer to the whole protein (and not just the part present in the structure). These two questions are related in a way, as I would like to work with coordinates that agree with the FASTA file of the sequence.