Question

Finding Single Domain Proteins

3

12.5 years ago

Fernando ▴ 30

Hi,

I am just beginning to find my way through protein science, and I have a question: I want a list of all single-domain proteins in the PDB. Is there a list like that?

I tried both CATH and SCOP but I am not getting anywhere. Does someone have a list of all the single-domain proteins? It does not matter whether they are all-alpha or mixed; I just need a list of them.

What I mean is: let's say a domain is as defined by SCOP (or CATH); I just want a list of single-domain proteins.

Thanks, Fernando

domain • 4.2k views

ADD COMMENT • link updated 23 months ago by Ram 43k • written 12.5 years ago by Fernando ▴ 30

1

Can you clarify what you mean by single-domain proteins? (It might help to state what your research question is.)

Peptides which only have 1 functional domain, ignoring overlaps. These would be identifiable by Pfam or RPS-BLAST search against the PDBAA sequence database for domain architecture.
3D structures that show only 1 domain, ignoring small ligands.
Something else?

ADD REPLY • link 12.5 years ago by Eric T. ★ 2.8k

0

For example, lysozyme is a single-domain protein, so I define a single domain as something that cannot be further divided (unlike hemoglobin, which has 4 domains).

Actually, is there a way to find all small globular proteins? These are usually single domains (~100-150 residues).

I am sorry if these questions sound trivial!

I am looking to compare the structures of different small globular proteins (not using RMSD or FASTA, just visual comparison in PyMOL).

ADD REPLY • link 12.5 years ago by Fernando ▴ 30

Ram · Answer 1 · 2011-11-08

The following Java program scans UniProt and searches for entries that have a cross-reference to PDB and exactly one cross-reference to PROSITE.

First, generate the XML unmarshaller classes with:

 xjc -d . "http://www.uniprot.org/docs/uniprot.xsd"

Then compile (javac Biostar14046.java) and run (java Biostar14046) the following program:

import java.net.URL;
import java.util.zip.GZIPInputStream;

import javax.xml.bind.JAXBContext;
import javax.xml.bind.Unmarshaller;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.events.XMLEvent;

import org.uniprot.uniprot.DbReferenceType;
import org.uniprot.uniprot.Entry;

public class Biostar14046
    {
    void run() throws Exception
        {
        // JAXB context for the classes generated by xjc from uniprot.xsd
        JAXBContext jc = JAXBContext.newInstance("org.uniprot.uniprot");
        Unmarshaller u=jc.createUnmarshaller();
        XMLInputFactory factory = XMLInputFactory.newInstance();
        factory.setProperty(XMLInputFactory.IS_NAMESPACE_AWARE, Boolean.TRUE);
        factory.setProperty(XMLInputFactory.IS_VALIDATING, Boolean.FALSE);
        factory.setProperty(XMLInputFactory.IS_COALESCING, Boolean.TRUE);
        factory.setProperty(XMLInputFactory.IS_REPLACING_ENTITY_REFERENCES, Boolean.TRUE);
        // stream the gzipped Swiss-Prot XML so the whole file is never held in memory
        XMLEventReader r= factory.createXMLEventReader(new GZIPInputStream(new URL("ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/complete/uniprot_sprot.xml.gz").openStream()));
        while(r.hasNext())
            {
            // skip every event until the start of the next <entry> element
            XMLEvent evt=r.peek();
            if(!(evt.isStartElement() && evt.asStartElement().getName().getLocalPart().equals("entry")))
                {
                r.next();
                continue;
                }
            // unmarshal one complete <entry> into its JAXB object
            Entry entry=(Entry)u.unmarshal(r);
            int countprosite=0;
            String pdb=null;
            for(DbReferenceType ref:entry.getDbReference())
                {
                if(ref.getType().equals("PDB") && ref.getId()!=null)
                    {
                    // only the last PDB cross-reference of the entry is kept
                    pdb=ref.getId();
                    }
                else if(ref.getType().equals("PROSITE"))
                    {
                    countprosite++;
                    }
                }
            // keep entries with exactly one PROSITE cross-reference and at least one PDB structure
            if(countprosite!=1 || pdb==null) continue;

            // getAccession() returns a list, so the accessions are printed in brackets
            System.out.println(entry.getAccession()+"\t"+pdb);
            }
        }
    public static void main(String[] args) throws Exception
        {
        new Biostar14046().run();
        }
    }

Result:

[Q58097]    2Z61
[P49777, Q9URU7]    1IUF
[P02718]    1OLK
[Q08AH3, B3KTT9, O75202]    3GPC
[P26276]    3C04
[Q9ZCD3]    3MX6
[O35381, P97437]    2JQD
[Q9NQW6, Q5CZ78, Q6NSK5, Q9H8Y4, Q9NVN9, Q9NVP0]    2Y7B
[O43747, O75709, O75842, Q9UG09, Q9Y3U4]    1IU1
[P53068, D6VV95]    1GQP
[P07741, Q3KP55, Q68DF9]    1ZN9
[O50202]    2WFW
[P63590, Q48ZH6, Q9A0E5]    2OCZ
[P0AC38, P04422, P78140, Q2M6G5]    1JSW
[P0ABB8, P39168, Q2M665]    3GWI
[P33447]    1BW0
[P56547]    1RKR
[Q9X108]    1UP7
[P52664]    1HZO
[P0C2P0, P78986, Q0CGS9]    2Z3J
[P14315]    3LK4
[P57730, A2RRF8]    1DGN
[A5JTM5]    1NZY
[Q28960]    1N5D
[P80075, A0AV77, P78388]    1ESR
[P18181, Q545K2]    2PTV
[P31997, O60399, Q16574]    2DKS
[P30429, Q5BHI5]    3LQR
[P36222, B2R7B0, P30923, Q8IVA4, Q96HI7]    1NWU
[Q5PXQ6]    1TMX
[P01524]    1GIB
[Q96LI5, Q9UF92]    3NGQ
[Q9DBL7, A2BFA8, Q3TVZ2, Q8K3Y4]    2F6R
[P49347]    1CNV
[P02526, A2TJU8]    4GCR
[P32081, P41017, Q45690]    2I5M
[P01443]    1KBT
[Q6F495, Q3MV17]    2D04
(...)

score 0 · Answer 3 · 2011-11-08

Some combination of these strategies might do:

Filter NCBI-PDBAA for sequences with length less than, say, 600.
Fetch the FASTA records or PDB files for those matching sequences. Use Biopython to filter for proteins that have (a) one sequence in the FASTA record (via Bio.SeqIO), or (b) one chain in the structure (via Bio.PDB). This will lose a lot of PDB entries where the biological unit is monomeric but the crystal was solved with multiple identical chains -- but I think that's OK for your purposes.
Run RPS-BLAST or HMMER on the PDBAA database, and use a script to keep only the sequences that have a single distinct domain. Use a fairly stringent E-value cutoff to reduce the number of overlapping hits you get. (Overlapping hits and multiple profile matches for a single domain can make this tricky.)
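
As a rough illustration of the first two strategies (length cutoff plus "one sequence per entry"), here is a minimal Python sketch that uses only the standard library rather than Biopython. It assumes, and this is only an assumption, that each FASTA header starts with an identifier of the form <pdbid>_<chain> (for example 1abc_A); real PDBAA headers may be laid out differently, so the ID parsing would need adapting. The function names parse_fasta and single_chain_entries, and the identifiers in the demo, are made up for this example.

```python
from collections import defaultdict


def parse_fasta(text):
    """Yield (header, sequence) pairs from FASTA-formatted text."""
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # a new record starts: emit the previous one, if any
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def single_chain_entries(text, max_len=600):
    """Return {pdb_id: sequence} for entries with exactly one chain shorter than max_len.

    ASSUMPTION: the first word of each header looks like '<pdbid>_<chain>'.
    """
    chains = defaultdict(list)
    for header, seq in parse_fasta(text):
        fields = header.split()
        if not fields:
            continue  # header with no identifier: skip the record
        pdb_id = fields[0].split("_")[0].lower()
        chains[pdb_id].append(seq)
    return {
        pdb_id: seqs[0]
        for pdb_id, seqs in chains.items()
        if len(seqs) == 1 and len(seqs[0]) < max_len
    }


if __name__ == "__main__":
    # made-up records, for illustration only
    demo = """>1aaa_A hypothetical small protein
MKTAYIAKQR
QISFVKSHFS
>2bbb_A hypothetical chain A
MKV
>2bbb_B hypothetical chain B
MKV
"""
    print(single_chain_entries(demo, max_len=600))
```

Like strategy (a)/(b) above, this drops every entry deposited with more than one chain, including monomers that crystallized with several identical copies. Lowering max_len to about 150 would narrow the list towards the small globular proteins mentioned in the question, but chain length is only a proxy for "single domain", so the result would still need checking against a domain annotation.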