I am interested in a particular protein domain and would like to extract all instances of this domain from our whole-exome data. Is this theoretically possible? Are there any protocols or references on generating protein sequences for specific coding regions from exome data?
Thanks in advance!
Khader, what is your input: CHROM, POS, REF, ALT? I've got some Java code to build an mRNA and find the domains.
I am a bit puzzled by this question; it is unclear to me what you mean by generating protein sequences from exome data. Is it only me? If you did whole-exome sequencing, you have the reference genome, correct? And you sequenced genomic DNA enriched for exons, correct? Do you want to infer the 'real' coding sequence and translate it (what extra information would the exome sequence provide then, since you must already know all the exons to do the enrichment)? Or do you want to detect variations (e.g. SNPs) and infer their effect in terms of amino acid changes?
@Pierre Yes, I have that input. Please let me know about your approach.
@Michael: This is more of a conceptual approach. We found a missense mutation in a very well-known protein domain in our exome data. This domain is part of multiple proteins, and it could carry other non-deleterious mutations. All I am trying to do is derive the protein sequence from the exome, so that I can align the protein sequences and see how this particular domain is affected in a personal exome.
I see. But then I would just run the sequence through transeq for all six frames, then maybe use Pfam to see how much of the domain is intact?
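For readers unfamiliar with six-frame translation, here is a minimal pure-Python sketch of the idea. It is not transeq itself: the function names and frame labels (`+1` to `+3`, `-1` to `-3`) are my own, the standard genetic code is assumed, and any codon containing a non-ACGT base is written as `X`. Note that transeq may number reverse frames differently.

```python
from itertools import product

# Standard genetic code (NCBI table 1), built from the usual TCAG ordering.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence (A, C, G, T only)."""
    return seq.upper().translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(seq):
    """Translate complete codons from position 0; unknown codons become 'X'."""
    seq = seq.upper()
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, len(seq) - 2, 3)
    )


def six_frame(seq):
    """Translate all three forward and all three reverse-strand frames."""
    rc = reverse_complement(seq)
    frames = {}
    for offset in range(3):
        frames[f"+{offset + 1}"] = translate(seq[offset:])
        frames[f"-{offset + 1}"] = translate(rc[offset:])
    return frames


if __name__ == "__main__":
    for label, protein in sorted(six_frame("ATGGCCTAA").items()):
        print(label, protein)
```

Each of the six peptide strings could then be searched against the domain model, which is where the problem of choosing the right frame and transcript comes in.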
Thanks, Michael. I thought of a similar idea, but I would rather avoid the step of six-frame translation and selection (or assumption) of the coding transcript. IMHO, the best approach could be similar to what Pierre suggested: merge data from existing annotations with the variants specific to the personal exome.
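To make the annotation-plus-variants idea concrete, here is a rough Python sketch. It is not Pierre's Java code, and everything in it is an assumption for illustration: the helper names (`build_cds`, `protein_changes`), the toy reference and exon coordinates, forward-strand transcripts only, and single-nucleotide variants given as POS mapped to (REF, ALT). A real pipeline would read exons from an annotation file and would need to handle reverse-strand genes and indels.

```python
from itertools import product

# Standard genetic code (NCBI table 1), built from the usual TCAG ordering.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def translate(cds):
    """Translate complete codons of a coding sequence; unknown codons become 'X'."""
    cds = cds.upper()
    return "".join(
        CODON_TABLE.get(cds[i:i + 3], "X") for i in range(0, len(cds) - 2, 3)
    )


def build_cds(reference, exons, snvs=None):
    """Splice coding exons from the reference, substituting SNVs where present.

    reference: chromosome sequence as a string (toy-sized here)
    exons:     sorted list of (start, end), 1-based inclusive, forward strand
    snvs:      dict mapping POS -> (REF, ALT), single bases only
    """
    snvs = snvs or {}
    bases = []
    for start, end in exons:
        for pos in range(start, end + 1):
            base = reference[pos - 1]
            if pos in snvs:
                ref, alt = snvs[pos]
                if base != ref:
                    raise ValueError(f"REF mismatch at {pos}: {base} != {ref}")
                base = alt
            bases.append(base)
    return "".join(bases)


def protein_changes(ref_protein, alt_protein):
    """List (residue number, reference aa, variant aa) where the proteins differ."""
    return [
        (i + 1, a, b)
        for i, (a, b) in enumerate(zip(ref_protein, alt_protein))
        if a != b
    ]


if __name__ == "__main__":
    reference = "ATGGCCGTAAGTAAATGA"   # toy: exon, 6-base intron, exon
    exons = [(1, 6), (13, 18)]
    personal_snvs = {5: ("C", "A")}    # hypothetical variant, POS=5 C>A

    ref_protein = translate(build_cds(reference, exons))
    alt_protein = translate(build_cds(reference, exons, personal_snvs))
    print(ref_protein, alt_protein, protein_changes(ref_protein, alt_protein))
```

The two protein strings can then be aligned against the domain, and the transcript choice comes from the annotation rather than from guessing a reading frame.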