I am doing a domain analysis on a set of protein sequences retrieved from an HMM search using the profile of a specific TF family. To keep only the sequences that actually contain the (entire) domain, I am now using CD-Search from NCBI (https://www.ncbi.nlm.nih.gov/Structure/cdd/wrpsb.cgi). According to the help page (https://www.ncbi.nlm.nih.gov/Structure/cdd/cdd_help.shtml#CDSearch_help_contents), it searches a collection of domain models from several databases (e.g. CDD, SMART, Pfam, etc.), which sounds pretty cool to me, as I can handle a single output instead of querying each of these databases separately and then integrating their results. What I find odd (and this is the core of my question) is the type of output this search produces.
It states that it returns three hit types: specific hits (a high-confidence association between a protein query sequence and a conserved domain), non-specific hits (if a specific hit is NOT found on a query protein sequence, but the protein has an otherwise statistically significant hit (E-value cutoff of 0.01) to any domain model in CDD, the domain model is regarded as a non-specific hit) and superfamily hits.
1) What I really don't understand is why non-specific hits are included in the output (by the way, they are only present in the "full" output and NOT in the "concise" output). What can we learn from a non-specific hit?
2) Which hits would you retain from the concise output file (considering the specific and superfamily hit types)?
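To make question 2 concrete, here is a minimal Python sketch of the kind of filter I have in mind: keep only queries with a specific hit to my domain that is not flagged as truncated. It assumes the tab-delimited hit table from Batch CD-Search with columns named "Query", "Hit type", "Accession" and "Incomplete", where "-" in "Incomplete" means the full domain was matched. Those column names and the "-" convention are my assumptions about the file layout, and the sample rows and the accession cd99999 are made up for illustration.

```python
import csv
import io


def complete_specific_hits(hitdata_text, domain_accession):
    """Return query labels that have a complete, specific hit to one domain.

    Assumes a tab-delimited table with 'Query', 'Hit type', 'Accession'
    and 'Incomplete' columns (assumed layout, check your own file).
    """
    # Drop comment and blank lines so the first remaining line is the header.
    lines = [ln for ln in hitdata_text.splitlines()
             if ln.strip() and not ln.startswith("#")]
    reader = csv.DictReader(io.StringIO("\n".join(lines)), delimiter="\t")

    kept = []
    for row in reader:
        is_specific = row["Hit type"].strip().lower() == "specific"
        is_complete = row["Incomplete"].strip() == "-"  # assumed: "-" = not truncated
        is_target = row["Accession"].strip() == domain_accession
        if is_specific and is_complete and is_target and row["Query"] not in kept:
            kept.append(row["Query"])
    return kept


# Made-up example table; cd99999 / cl99999 are hypothetical accessions.
HEADER = ["Query", "Hit type", "PSSM-ID", "From", "To", "E-Value",
          "Bitscore", "Accession", "Short name", "Incomplete", "Superfamily"]
ROWS = [
    ["Q#1 - >seqA", "specific", "1", "10", "120", "1e-30", "110.0", "cd99999", "DemoTF", "-", "cl99999"],
    ["Q#2 - >seqB", "specific", "1", "5", "60", "1e-10", "55.0", "cd99999", "DemoTF", "N", "cl99999"],
    ["Q#3 - >seqC", "non-specific", "1", "8", "118", "5e-03", "30.0", "cd99999", "DemoTF", "-", "cl99999"],
    ["Q#4 - >seqD", "superfamily", "2", "8", "118", "1e-12", "60.0", "cl99999", "DemoTF superfamily", "-", "-"],
]
SAMPLE = "# made-up example\n" + "\n".join("\t".join(r) for r in [HEADER] + ROWS)

print(complete_specific_hits(SAMPLE, "cd99999"))  # ['Q#1 - >seqA']
```

With this filter, seqB is dropped because its hit is flagged as truncated, and seqC and seqD are dropped because they only have non-specific or superfamily hits. Whether dropping those last two is the right call is exactly what I am unsure about.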
I really hope you have experience in this.
Thanks in advance for any help.