I am having some difficulty reconciling the approaches that have been suggested to me with the reality of using them.
Basically, I want to look at variation in proteins homologous to a protein I am interested in. A hundred different people have told me to "just do it this way" and bingo, the world is perfect. However, the data and the procedure do not reflect that ease of use. The suggested procedure is:
1 - Run blastp with my protein to find homologous proteins, and align them with clustal-o.
2 - Get the gene names for each of the BLAST hits.
3 - Get allele data or polymorphism data for each of these genes from a genomic database such as gnomAD.
However, how does my clustal-o alignment of the proteins relate to the alleles in the genomic sequence? And how does a protein variant seen at aligned residue 50 relate to the allele at a particular position in a gene?
- Do I have to map the gene's codons to the translated protein residues, so that if there is a polymorphism in one codon I can say it is associated with the variant seen at the amino acid that codon encodes? I.e. how does a polymorphism relate to a protein variant?
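To make the question concrete, here is a minimal sketch of the mapping as I understand it: alignment column to ungapped residue number, residue number to codon position in the coding sequence (CDS), and a single-base change in that codon to its amino acid consequence. The function names and the toy sequences are made up for illustration. It assumes the polymorphism is already given as a position in the CDS on the coding strand; converting a genomic coordinate (as reported by a database such as gnomAD) to a CDS position needs the transcript's exon structure and is not attempted here.

```python
# Standard genetic code, built from the conventional TCAG codon ordering.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def column_to_residue(aligned_seq, column):
    """Map a 1-based alignment column to the 1-based residue number in the
    ungapped sequence. Returns None if this sequence has a gap there."""
    if aligned_seq[column - 1] == "-":
        return None
    # Count the non-gap characters up to and including this column.
    return sum(1 for ch in aligned_seq[:column] if ch != "-")


def residue_to_cds_range(residue):
    """Map a 1-based residue number to the 1-based (start, end) positions
    of its codon in the CDS."""
    start = 3 * (residue - 1) + 1
    return start, start + 2


def protein_consequence(cds, cds_pos, alt_base):
    """Return (residue number, reference amino acid, altered amino acid)
    for a single-base substitution at a 1-based CDS position."""
    residue = (cds_pos - 1) // 3 + 1
    codon_start = 3 * (residue - 1)          # 0-based index into cds
    ref_codon = cds[codon_start:codon_start + 3]
    offset = (cds_pos - 1) % 3               # position within the codon
    alt_codon = ref_codon[:offset] + alt_base + ref_codon[offset + 1:]
    return residue, CODON_TABLE[ref_codon], CODON_TABLE[alt_codon]


# Toy example: a CDS encoding M-A-E-stop, and its row in an alignment.
cds = "ATGGCTGAATAA"
aligned_row = "M-AE"

residue = column_to_residue(aligned_row, 3)      # column 3 is residue 2 (A)
codon_range = residue_to_cds_range(residue)      # CDS positions 4..6 (GCT)
missense = protein_consequence(cds, 5, "T")      # GCT -> GTT, A -> V
synonymous = protein_consequence(cds, 6, "C")    # GCT -> GCC, A -> A
```

In this toy case the change at CDS position 5 alters the amino acid at residue 2, while the change at position 6 in the same codon does not, which is why a polymorphism and a protein variant are not interchangeable.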
I get many homologous proteins for which the databases have no data at all about the associated gene, so the allele frequency in these cases is N/A, which cannot be used in a numerical analysis.
- What do you do with missing data?
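One option I can think of, sketched below, is to keep "no data" explicitly separate from a frequency of zero, leave the missing entries out of any summary statistic, and report how many values the statistic is actually based on. The gene names and frequencies are invented for illustration, and whether dropping missing values is appropriate for a given analysis is exactly the kind of thing I would like discussed.

```python
import math


def mean_ignoring_missing(freqs):
    """Mean of the available frequencies, skipping None and NaN.
    Returns (mean, number of values used); mean is None if nothing is available."""
    present = [f for f in freqs if f is not None and not math.isnan(f)]
    if not present:
        return None, 0
    return sum(present) / len(present), len(present)


# Hypothetical allele frequencies per gene; None means the database had no record.
allele_freq = {"geneA": 0.5, "geneB": None, "geneC": 0.25}

mean_freq, n_used = mean_ignoring_missing(allele_freq.values())
n_missing = len(allele_freq) - n_used
```

Here the mean is computed over two genes and one gene is counted as missing, rather than being treated as a frequency of 0.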
I am at a loss as to how to relate all the data available. Whilst there is a lot of information about the data, there is little that associates one data source with another.
AND I'M NOT LOOKING FOR ANSWERS - I WANT SOME SORT OF DISCUSSION OR EXPLANATION OF THE RATIOCINATION BEHIND THE PROCEDURES INVOLVED. WHAT'S THE REASON FOR ASSOCIATING OR NOT ASSOCIATING PROTEIN VARIANTS WITH GENETIC ALLELES, FOR EXAMPLE? I AM NOT ASKING YOU TO DO THIS FOR ME! - this isn't yelling, just trying to stress my point. I'm not into this instant gratification culture, so I would like a discussion.