Hi,
My goal is to identify the list of transmembrane genes that have at least one domain sticking out into the extracellular matrix. My approach was to use the COMPARTMENTS database. I downloaded the knowledge base from COMPARTMENTS. It has the following format:
ensembl_peptide_id hgnc_symbol GO GO_Type Source Evidence_Code
ENSP00000000233 ARF5 GO:0005576 Extracellular region UniProtKB IDA
ENSP00000000442 ESRRA GO:0005576 Extracellular region HPA IDA
ENSP00000001008 FKBP4 GO:0005576 Extracellular region UniProtKB IDA
ENSP00000002125 NDUFAF7 GO:0005576 Extracellular region UniProtKB IDA
ENSP00000002165 FUCA2 GO:0005576 Extracellular region UniProtKB IDA
ENSP00000002829 SEMA3F GO:0005576 Extracellular region ProtInc TAS
Score (last column, wrapped): 5, 4, 5, 5, 5
My approach is a pretty simple one: filter the list by GO_Type being Plasma membrane, Cell surface, Extracellular region or Extracellular matrix (these are just a few out of many possibilities). Then filter by score>=3, or by score>=4 if I am being stringent. A score greater than 4 means it is curated; the lower the score, the lower the confidence. However, this approach seems too simplistic to me. I was also thinking of passing the resulting list of genes to a domain finder. I tried the web API of SMART, but its output is not very data-mining friendly.
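The filter described above could be sketched roughly as follows. This is only a minimal sketch: it assumes the download is tab-delimited with a header row using the column names shown in the excerpt, and the sample rows, the `SURFACE_TERMS` set and the `filter_compartments` helper are hypothetical names made up for illustration.

```python
import csv
import io

# Hypothetical rows in the column layout shown above.
# The tab delimiter and the header row are assumptions about the file.
SAMPLE = (
    "ensembl_peptide_id\thgnc_symbol\tGO\tGO_Type\tSource\tEvidence_Code\tScore\n"
    "ENSP_A\tGENE_A\tGO:0005576\tExtracellular region\tSRC\tIDA\t5\n"
    "ENSP_B\tGENE_B\tGO:0005886\tPlasma membrane\tSRC\tIDA\t2\n"
    "ENSP_C\tGENE_C\tGO:0005634\tNucleus\tSRC\tIDA\t5\n"
    "ENSP_A\tGENE_A\tGO:0005886\tPlasma membrane\tSRC\tTAS\t4\n"
)

# Localization terms to keep, compared case-insensitively.
SURFACE_TERMS = {
    "plasma membrane",
    "cell surface",
    "extracellular region",
    "extracellular matrix",
}


def filter_compartments(handle, terms=SURFACE_TERMS, min_score=3):
    """Return {gene symbol: best score} for rows matching terms and score."""
    hits = {}
    for row in csv.DictReader(handle, delimiter="\t"):
        if row["GO_Type"].strip().lower() not in terms:
            continue
        score = float(row["Score"])
        if score < min_score:
            continue
        gene = row["hgnc_symbol"]
        # Keep the highest score seen for each gene.
        hits[gene] = max(score, hits.get(gene, 0.0))
    return hits


if __name__ == "__main__":
    print(filter_compartments(io.StringIO(SAMPLE), min_score=3))
```

For a real file, `open(path, newline="")` would replace the `io.StringIO` handle. Keeping the best score per gene makes it easy to tighten the cutoff later without re-reading the file.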
Is there a better tool/approach that can help identify genes with domains in extracellular matrix with some confidence value?
Any thoughts would be much appreciated.
Thanks for the suggestions. I looked them over and am now running TOPCONS and CCTOP. It is probably going to take a long time because I have about 15196 genes :|
TOPCONS and CCTOP can be installed locally (see Download), and TOPCONS also has a batch API: http://topcons.cbr.su.se/pred/help-wsdl-api/. InterProScan also runs a membrane topology predictor, so you can check for that annotation. Further, you could pre-filter your proteins by requiring at least one or two TMHMM-predicted transmembrane domains; these annotations should already be available in Ensembl BioMart (Query here). There are about 6000 genes with transmembrane domains, so 15000 seems a bit high.
My main goal was to identify whether they have any domain that is in the extracellular matrix. TOPCONS gives me this kind of result, but I don't know how to make sense of it, except that o means outer membrane. And CCTOP just tells me whether a protein is TM or not, which I already know, given that I got the list of TM proteins from COMPARTMENTS. I can find a similar list of genes using biomaRt as well. I might just go ahead with biomaRt to get genes matching GO terms like extracellular matrix and transmembrane. Thanks for the help though!
o := outside, i := inside, M := membrane
CCTOP gives you a similar prediction in the image it makes. These annotations are important to estimate the size and orientation of domains.
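To get segment sizes and orientation out of such a per-residue topology string, something like the following might work. It is a sketch under assumptions: it only relies on the o/i/M legend above, the example string is made up, and `segments` and `outside_segments` are hypothetical helper names, not part of TOPCONS or CCTOP.

```python
from itertools import groupby


def segments(topology):
    """Split a topology string into (label, start, end) runs, 1-based inclusive."""
    runs = []
    pos = 1
    for label, group in groupby(topology):
        length = len(list(group))
        runs.append((label, pos, pos + length - 1))
        pos += length
    return runs


def outside_segments(topology, min_len=1):
    """Return (start, end) of every 'o' run that is at least min_len residues long."""
    return [
        (start, end)
        for label, start, end in segments(topology)
        if label == "o" and end - start + 1 >= min_len
    ]


if __name__ == "__main__":
    # Made-up topology: short inside tail, TM helix, long outside loop, and so on.
    topo = "i" * 5 + "M" * 21 + "o" * 40 + "M" * 21 + "i" * 3 + "M" * 21 + "o" * 2
    print(outside_segments(topo))              # every outside segment
    print(outside_segments(topo, min_len=30))  # only those long enough to hold a domain
```

A length cutoff such as `min_len=30` is an arbitrary illustration, not a recommended value. Since nearly every TM protein has some short "o" stretch, it is the length of the outside runs that separates a real extracellular domain from a short loop. Note also that "outside" is relative to whichever membrane the protein sits in, so this is probably best combined with a localization filter.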
Yes, I understand that, but my problem is interpreting the output. Firstly, the output is not very easy to parse. Secondly, as far as I could check, all the query proteins have some i, o and M regions. I just wanted to find out whether there are TM proteins that have (or don't have) an extracellular domain. Also, I believe this tool is mainly for predicting whether there is a TM domain in your query sequence (which does not necessarily mean extracellular). The results can be accessed here: http://topcons.cbr.su.se/pred/result/rst_xXfRMl/.