Hey guys,
I'm currently working on my final project for college, and it's about finding SNPs. The basic idea is to implement a method to find functional and rare SNPs based on dbSNP and ClinVar. I'm only allowed to use the downloaded VCF files of these two databases. Does anybody have related experience, especially on the best and most efficient way to find rare SNPs in those VCF files? Some of the SNPs come with a MAF (Minor Allele Frequency), but plenty have none, so I'm afraid I'll miss a lot if I filter on MAF alone. Ideas?
Can you clarify a bit more what you mean by finding SNPs? If you are working ONLY with dbSNP and ClinVar, then all of the entries in those databases ARE SNPs (unless there are also CNVs in ClinVar now). If you are searching those databases for rare SNPs, well, the only thing that defines how rare a SNP is is its Minor Allele Frequency (<1% or <0.5% are the typical cut-offs used).
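To make the MAF cut-off concrete, and to deal with the worry about entries that have no MAF at all, here is a minimal Python sketch. It sorts records into rare, common, and "no frequency" bins instead of silently dropping the last group. The INFO key name `MAF` and the function names are placeholders of mine, not something either database guarantees; check the `##INFO` header lines of your file for the tag that actually carries the frequency.

```python
def parse_info(info):
    """Turn a VCF INFO string like 'A=1;B;C=x' into a dict (flags map to True)."""
    out = {}
    for item in info.split(";"):
        if not item or item == ".":
            continue
        key, sep, value = item.partition("=")
        out[key] = value if sep else True
    return out


def classify_by_maf(vcf_lines, maf_key="MAF", cutoff=0.01):
    """Split variant IDs into (rare, common, unknown) lists.

    maf_key is an ASSUMED INFO tag name; cutoff=0.01 is the 1% threshold
    mentioned above. Records without a usable number go to 'unknown'.
    """
    rare, common, unknown = [], [], []
    for line in vcf_lines:
        if line.startswith("#") or not line.strip():
            continue  # skip header and blank lines
        fields = line.rstrip("\n").split("\t")
        variant_id = fields[2]
        raw = parse_info(fields[7]).get(maf_key)
        try:
            if not isinstance(raw, str):
                raise ValueError("missing or flag-only value")
            maf = float(raw)
        except ValueError:
            unknown.append(variant_id)
            continue
        (rare if maf < cutoff else common).append(variant_id)
    return rare, common, unknown
```

Since `vcf_lines` is just an iterable of strings, an open file handle works, e.g. `classify_by_maf(open("my_dbsnp.vcf"))` (the file name is hypothetical). Keeping the "unknown" bin separate lets you decide later whether to treat those entries as potentially rare or to look up their frequency elsewhere.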
If you are working with an exome or whole-genome sequence and need to find rare variants in that dataset, using only dbSNP and ClinVar as resources, then that is a different story and amounts to following some sort of best-practices protocol (GATK has one) for calling variants and filtering them. Of course, if you are doing that, dbSNP and ClinVar are only starting points; there are other population frequency databases for SNPs out there whose frequencies aren't necessarily included in dbSNP...
The point is to cut out common SNPs (for example, MAF >5%) and find those that are not common (based on dbSNP and later maybe more databases) and potentially pathogenic (based on ClinVar). I'm focusing only on exomic data, so the first step would be to filter those out of the databases.
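A rough Python sketch of that two-step idea: collect the variants dbSNP marks as common, then keep ClinVar entries with a pathogenic classification that are not in that set. Several things here are assumptions rather than facts about the files: the frequency tag name `MAF` is a placeholder, `CLNSIG` is assumed to hold text labels such as `Pathogenic` or `Likely_pathogenic`, and both files are assumed to use the same chromosome names and coordinates. Multi-allelic records are handled naively (every ALT allele inherits the record's single frequency value).

```python
import re


def info_dict(info):
    """Parse key=value pairs from a VCF INFO string (flags are ignored)."""
    out = {}
    for item in info.split(";"):
        key, sep, value = item.partition("=")
        if sep:
            out[key] = value
    return out


def records(vcf_lines):
    """Yield ((chrom, pos, ref, alt), info) for every ALT allele of every record."""
    for line in vcf_lines:
        if line.startswith("#") or not line.strip():
            continue
        f = line.rstrip("\n").split("\t")
        info = info_dict(f[7])
        for alt in f[4].split(","):
            yield (f[0], f[1], f[3], alt), info


def rare_pathogenic(dbsnp_lines, clinvar_lines,
                    maf_key="MAF", common_cutoff=0.05, sig_key="CLNSIG"):
    """Return ClinVar (key, significance) pairs that are not common in dbSNP.

    maf_key is a HYPOTHETICAL tag name. Variants with no usable frequency
    are NOT treated as common, so they are kept rather than lost.
    """
    common = set()
    for key, info in records(dbsnp_lines):
        try:
            if float(info[maf_key]) > common_cutoff:
                common.add(key)
        except (KeyError, ValueError):
            pass  # no frequency available: cannot call it common
    wanted = {"pathogenic", "likely_pathogenic"}
    hits = []
    for key, info in records(clinvar_lines):
        sig = info.get(sig_key, "")
        labels = set(re.split(r"[/|,]", sig.lower()))
        if labels & wanted and key not in common:
            hits.append((key, sig))
    return hits
```

Matching on exact labels rather than a substring avoids pulling in entries like "Conflicting_interpretations_of_pathogenicity". Restricting to exonic positions would be an extra filter on top of this, for example against a list of exon intervals, which is not shown here.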
OK, the key point there was the last little bit, that you are working with exome data. You didn't specify that in your original post, which made it ambiguous: you could have been looking through the databases themselves and just doing filtering or data reduction.