I selected 147 functional SNPs in a set of genes using Genomatix and analyzed their polymorphic status. Of these, 47 were polymorphic and located in transcription factor binding sites (TFBS). Can anyone suggest bioinformatics methods for prioritizing these polymorphic SNPs, so that I can reduce the number of SNPs for further high-throughput genotyping?
Montgomery et al., in "A survey of genomic properties for the detection of regulatory polymorphisms", report that "distance to transcription start site, local repetitive content, sequence conservation, minor and derived allele frequencies, and presence of a CpG island" have discriminatory potential for identifying regulatory SNPs (rSNPs).
MAPPER is our tool of choice as well, because it uses both TRANSFAC and JASPAR motifs. Here's how we've analyzed SNPs with MAPPER:
Take a 41-bp segment of the genome with your SNP at position 21, that is, 20 bp of genomic sequence on either side of the SNP. I use 20 because the biggest models MAPPER uses are about 15 bp. Make a copy of this 41-bp segment and append it to the original, with a single N between the two copies (I use the N as a spacer or punctuation mark). Put allele 1 at position 21 and allele 2 at position 63. You now have a sequence of 83 bp (41 + 1 + 41) in the following format:
(20 bp of genome, or bases 1-20)-allele 1-(next 20 bp of genome, or bases 22-41)-N-(20 bp of genome, or 1-20)-allele 2-(next 20 bp of genome, or 22-41)
In this manner I can assay one sequence to cover both alleles. Other approaches will work as well - e.g. two queries each with a different allele. Do as you wish.
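The construction above can be sketched in a few lines of Python. This is only an illustration of the layout described here: the function names `build_two_allele_query` and `allele_positions` are made up for this example and are not part of MAPPER, and the sketch assumes a single-base SNP with equal-length flanks.

```python
def build_two_allele_query(left_flank, allele1, allele2, right_flank, spacer="N"):
    """Return: left + allele1 + right + spacer + left + allele2 + right.

    Hypothetical helper for illustration; assumes a single-base SNP.
    """
    if len(left_flank) != len(right_flank):
        raise ValueError("flanks must have the same length")
    for allele in (allele1, allele2):
        if len(allele) != 1:
            raise ValueError("this sketch only handles single-base alleles")
    segment1 = left_flank + allele1 + right_flank
    segment2 = left_flank + allele2 + right_flank
    return segment1 + spacer + segment2


def allele_positions(flank_len):
    """1-based positions of allele 1 and allele 2 in the combined query."""
    segment_len = 2 * flank_len + 1          # 41 when flank_len is 20
    first = flank_len + 1                    # 21
    second = segment_len + 1 + flank_len + 1  # 41 + spacer + 21 = 63
    return first, second
```

With 20-bp flanks this gives an 83-bp query with the alleles at positions 21 and 63, matching the format above. For the two-query alternative, you would simply submit the two 41-bp segments separately.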
Run MAPPER and save your results. I filter the results by score and E-value to retain only the most likely predictions.
I then look for allele-specific binding of transcription factors that are relevant to the phenotypes we're following. This means that I delete predictions for plant and invertebrate TFs, and I also set aside the many TFs that have no role in our research topics (e.g., obesity and diabetes). For me, the predictions by MAPPER must encompass the positions of the SNP alleles in the query sequence, that is, positions 21 and 63.
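The filtering step could be scripted roughly as follows. This is a sketch under several assumptions: the record fields (`factor`, `start`, `end`, `score`, `evalue`) are hypothetical and would have to be mapped from whatever columns your saved MAPPER results contain, the thresholds are yours to choose, and it assumes that a higher score and a lower E-value indicate a better prediction. Here "allele-specific" is taken to mean that a factor has a surviving hit over one allele position but not the other.

```python
def allele_specific_hits(hits, min_score, max_evalue, snp_positions=(21, 63)):
    """Return {factor: set of allele numbers} for factors hitting only one allele.

    `hits` is a list of dicts with hypothetical keys:
    factor, start, end (1-based, inclusive), score, evalue.
    """
    alleles_by_factor = {}
    for hit in hits:
        # Keep only the most likely predictions.
        if hit["score"] < min_score or hit["evalue"] > max_evalue:
            continue
        # Which allele positions does this predicted site cover?
        covered = {
            index + 1
            for index, position in enumerate(snp_positions)
            if hit["start"] <= position <= hit["end"]
        }
        if not covered:
            continue  # the site does not span the SNP at all
        alleles_by_factor.setdefault(hit["factor"], set()).update(covered)
    # A factor seen over exactly one allele is a candidate allele-specific site.
    return {
        factor: alleles
        for factor, alleles in alleles_by_factor.items()
        if len(alleles) == 1
    }
```

Removing plant and invertebrate TFs, or TFs unrelated to your phenotype, is a judgment call and is easiest to do by hand or with your own list of factors of interest before or after this step.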
I can highly recommend this approach as it has given us many good associations, even several that show interactions with components of the environment that drive activation of the TFs predicted by MAPPER.