I have seen cases where one PDB entry has multiple biounit files, and I am not quite sure why multiple biounits are made. Some biounits contain only the binding peptide. Could I just use the first biounit file, or do I need to be careful about which one to choose?
Because I want to do this at large scale, a rule like "choose the one with the largest number of residues" would be better.
There are two main reasons for having multiple biounits in the PDB:
When there are multiple equivalent copies of the biounit in the crystal, they are annotated as several biounits. A very typical example is 4re6: a dimeric protein with 4 chains in the asymmetric unit. In that example biounit 1 is the dimer formed by chains A and B, and biounit 2 is the dimer formed by chains C and D. Both are equivalent, but because the dimer appears twice in the asymmetric unit it is given as 2 biounits.
Another reason is that the PDB tries to accommodate different opinions about the biounit. The annotations can come from the authors or from software predictions (mainly PISA). So you will often encounter both: biounit 1 from the authors and biounit 2 from PISA. A biounit can also be annotated by both PISA and the authors when they agree. The 4re6 example above shows this too: the first 2 biounits come from both the authors and PISA, whilst biounit 3 is PISA only (predicting a tetramer). As you can see, even PISA can make multiple predictions, and in some cases like 4re6 they are all added to the PDB.
An important point is that coming up with a correct biounit experimentally is not a simple task. From crystallographic data alone you can't really do much; you need other experimental methods such as gel filtration, analytical ultracentrifugation, light scattering, mutagenesis, etc. to be sure of the protein's oligomeric state in solution. Even those methods are sometimes not conclusive, which complicates things.
As to which biounit to use, there is no simple, straightforward answer. One widely accepted method is to simply use biounit 1 and ignore the rest. This relies on PDB annotators often setting biounit 1 to their main preference (because of good author data or confident software predictions). Another method is to use the first biounit annotated by the authors, trusting that the authors did their job correctly.
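For large-scale use, the two strategies above can be written as one small selection rule. The following is only a sketch: the `Assembly` record and the `pick_assembly` function are hypothetical names made up for illustration, and it assumes you have already worked out, for each biounit, whether it was annotated by the authors and/or by software.

```python
from typing import NamedTuple, Optional, Sequence


class Assembly(NamedTuple):
    """Hypothetical record describing one biounit annotation."""
    assembly_id: int
    from_author: bool
    from_software: bool


def pick_assembly(assemblies: Sequence[Assembly],
                  prefer_author: bool = True) -> Optional[Assembly]:
    """Return the first author-annotated biounit, else the lowest-numbered one."""
    if not assemblies:
        return None
    ordered = sorted(assemblies, key=lambda a: a.assembly_id)
    if prefer_author:
        for assembly in ordered:
            if assembly.from_author:
                return assembly
    # Fall back to "just use biounit 1" (the lowest id present).
    return ordered[0]
```

With `prefer_author=False` this reduces to always taking biounit 1. Neither rule protects against annotation errors, so it is worth spot-checking a sample of the results.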
A last word of caution about errors in annotations: there are quite a few errors in biounit annotations in the PDB; see for instance our paper, where we analyse the problem in some detail.
Thanks for the information! This is very helpful. Are the annotations also in the biounit file, or are there other resources to look into?
In order to know whether a particular biounit comes from the authors or PISA, you will need to look into the original PDB/mmCIF file. The biounit file only contains the result of applying the symmetry operators present in the REMARK 350 section (or the _pdbx_struct_oper_list field in mmCIF files) and does not annotate the source (authors/PISA).
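As a rough illustration of reading the source from a legacy PDB file: within REMARK 350, each `BIOMOLECULE: n` block is normally followed by `AUTHOR DETERMINED BIOLOGICAL UNIT` and/or `SOFTWARE DETERMINED QUATERNARY STRUCTURE` lines. The sketch below relies on those line prefixes; `biounit_sources` is a hypothetical helper name, and the sample lines are made up for the example rather than copied from a real entry.

```python
import re


def biounit_sources(pdb_lines):
    """Map biounit number -> {'author': bool, 'software': bool} from REMARK 350."""
    sources = {}
    current = None
    for line in pdb_lines:
        if not line.startswith("REMARK 350"):
            continue
        body = line[10:].strip()
        match = re.match(r"BIOMOLECULE:\s*(\d+)", body)
        if match:
            # A new biounit block starts here.
            current = int(match.group(1))
            sources[current] = {"author": False, "software": False}
        elif current is not None:
            if body.startswith("AUTHOR DETERMINED"):
                sources[current]["author"] = True
            elif body.startswith("SOFTWARE DETERMINED"):
                sources[current]["software"] = True
    return sources


# Illustrative input only (not taken from a real PDB entry).
EXAMPLE_LINES = [
    "REMARK 350 BIOMOLECULE: 1",
    "REMARK 350 AUTHOR DETERMINED BIOLOGICAL UNIT: DIMERIC",
    "REMARK 350 SOFTWARE DETERMINED QUATERNARY STRUCTURE: DIMERIC",
    "REMARK 350 SOFTWARE USED: PISA",
    "REMARK 350 BIOMOLECULE: 2",
    "REMARK 350 SOFTWARE DETERMINED QUATERNARY STRUCTURE: TETRAMERIC",
    "REMARK 350 SOFTWARE USED: PISA",
]
```

Calling `biounit_sources(EXAMPLE_LINES)` marks biounit 1 as annotated by both authors and software, and biounit 2 as software only. Older entries may lack these lines entirely, in which case both flags stay `False` and you have to decide how to treat them.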
Alternatively, you can parse only the PDB/mmCIF file and apply the symmetry operators yourself to generate the biounit. You can then also parse the source of the biounit annotation from the same file. If you are interested in coding in Java, a lot of this is already implemented in BioJava; see for instance the bioassembly part of the BioJava tutorial.
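If you go the do-it-yourself route, applying one operator is just a rotation plus a translation on every atom of the chains it applies to. In REMARK 350 the three BIOMT1/BIOMT2/BIOMT3 rows of an operator give the rows of the 3x3 rotation matrix and the translation vector. A minimal sketch in plain Python, where `apply_operator` is a hypothetical name and parsing the operator and selecting the right chains are left out:

```python
def apply_operator(rotation, translation, coords):
    """Apply x' = R * x + t to a list of (x, y, z) coordinates.

    rotation    -- 3x3 matrix as three rows of three floats
    translation -- sequence of three floats
    coords      -- iterable of (x, y, z) tuples
    """
    transformed = []
    for x, y, z in coords:
        transformed.append(tuple(
            rotation[i][0] * x + rotation[i][1] * y + rotation[i][2] * z
            + translation[i]
            for i in range(3)
        ))
    return transformed


# Example operator: a two-fold rotation about z plus a shift of 10 along x.
TWO_FOLD_Z = [[-1.0, 0.0, 0.0],
              [0.0, -1.0, 0.0],
              [0.0, 0.0, 1.0]]
moved = apply_operator(TWO_FOLD_Z, (10.0, 0.0, 0.0), [(1.0, 2.0, 3.0)])
# moved == [(9.0, -2.0, 3.0)]
```

The full biounit is the union of the transformed copies over all operators listed for that biounit, with each operator applied only to the chains it is declared for.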