Finding Proteins that have NO known Domains
1
1
7.5 years ago
ddofer ▴ 30

I want to extract a list of proteins for a given organism that have no (high confidence) predicted domains by PFAM or the like.

(Alternatively, getting a list of predictions for a list of proteins would also be good).

I know HMMER and Pfam and the like (CCD-Hit) have various tools for searching for domains, but I don't know how to work with the emailed file outputs, and I'm specifically interested in just finding which proteins DON'T have predicted domains.

Is there an easy/simple way to do this? (Even a tool whose output I can copy-paste into a text editor or Excel and then filter by column would do.)

Thanks!

sequence pfam domain batch protein • 2.3k views
0

What is the emailed file output? I think that after you BLAST against a domain database, all the sequences with no hits are considered to be sequences without domains. Am I missing something?

0

I was working then with the HMMER and/or PFAM search results, which are returned as a plaintext email. Yuch.

That said, even with the offline tool, I don't know how to parse the command-line output properly; it just prints everything on screen.
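For the offline route, one option is to have HMMER write a machine-readable table instead of parsing what it prints on screen: `hmmscan` accepts `--tblout <file>`, which saves one whitespace-separated line per protein/model hit. Below is a minimal sketch, assuming `hmmscan` was run against a pressed Pfam database so that column 3 is the protein (query) name and column 5 is the full-sequence E-value. The function names, the sample IDs and the `1e-5` cutoff are illustrative assumptions, not anything prescribed by HMMER or Pfam.

```python
def read_fasta_ids(lines):
    """Return sequence IDs (first word after '>') from FASTA lines, in order."""
    ids = []
    for line in lines:
        if line.startswith(">"):
            words = line[1:].split()
            if words:
                ids.append(words[0])
    return ids


def read_tblout_queries(lines, max_evalue=1e-5):
    """Return the set of query names with at least one hit at or below max_evalue.

    Assumes hmmscan --tblout layout: target name, target accession,
    query name, query accession, full-sequence E-value, ...
    (With hmmsearch the roles of target and query are swapped.)
    """
    hits = set()
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue  # skip header/footer comments and blank lines
        fields = line.split()
        query, evalue = fields[2], float(fields[4])
        if evalue <= max_evalue:
            hits.add(query)
    return hits


def proteins_without_domains(fasta_lines, tblout_lines, max_evalue=1e-5):
    """Proteins in the FASTA input that have no confident hit in the table."""
    hits = read_tblout_queries(tblout_lines, max_evalue)
    return [pid for pid in read_fasta_ids(fasta_lines) if pid not in hits]


# Tiny in-memory demo with made-up IDs and scores.
demo_fasta = [">protA example", "MKV", ">protB", "MAA", ">protC", "MTT"]
demo_tbl = [
    "# target name  accession  query name  accession  E-value  score  bias",
    "PF00001  PF00001.1  protA  -  1e-30  100.0  0.1",
    "PF00002  PF00002.1  protB  -  0.5  5.0  0.0",
]
print(proteins_without_domains(demo_fasta, demo_tbl))
```

With real data, open file handles can be passed directly, for example `proteins_without_domains(open("proteins.fasta"), open("hits.tbl"))`, where both file names are placeholders. Proteins that never appear in the table, or appear only with weak E-values, end up in the returned list, which is the "no hits means no domains" idea from the comment above.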

2
7.5 years ago

You could query the UniProt Knowledgebase for proteins with no cross-references to InterPro:

active:yes not database:interpro

http://www.uniprot.org/uniprot/?query=+active%3Ayes+not+database%3Ainterpro&sort=score
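If you want to try variations of the query without hand-encoding the URL each time, the link can be assembled with the standard library. This is only a sketch: it builds the URL string in the same form as the link above and makes no request, the helper name `build_query_url` is made up for illustration, and whether the site still accepts this exact URL scheme is an assumption.

```python
from urllib.parse import urlencode

# Base address copied from the link above.
BASE_URL = "http://www.uniprot.org/uniprot/"


def build_query_url(query, sort="score"):
    """Return a search URL with the query string percent-encoded."""
    return BASE_URL + "?" + urlencode({"query": query, "sort": sort})


print(build_query_url("active:yes not database:interpro"))
```

Swapping in a different query string is then a one-line change.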

0

Interpro has many annotations though, not just domains...

(And I'm working on offline sequences which aren't necessarily in UniProt, or even NCBI.)

As for your approach on a database, wouldn't it make more sense to just search for proteins with "NOT domain:*"? Your query has proteins with annotated domains right on the first page of results :P