6.4 years ago
Biomed
I am interested in finding a comprehensive (SNV + CNV) and reliable list of all known disease-causing genes, and preferably also the variants in those genes. ClinVar comes to mind as an obvious option, but is it really the best source out there, or should one combine the information in ClinVar with other data sets to get the most comprehensive set? Thanks.
I was also going to mention HGMD. However, using recent versions of it requires a paid licence. Also, your question is premature... When one considers that we simply don't know the exact role of the vast majority of genetic variants in relation to disease, there cannot yet exist a database that is all-encompassing in the sense that you appear to want.
+1 on the "cannot know" comment. HGMD lags by 3-6 months between Pro and Free versions, but the difference is not too much and can be addressed by scanning recent papers. It does depend on the number of genes under study, though. Whole-genome reconciliation would be near impossible.
The question is about genes and variants that are known to be disease-causing, such as a nonsense variant in an ACMG 59 gene. I am not interested in genes and variants of unknown significance. I hope this clarifies the question. Thanks for the comment.
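To make "known to be disease-causing" concrete, here is a minimal Python sketch of one way to filter a ClinVar-style tab-delimited summary down to well-supported pathogenic variants in a gene panel. The column names (GeneSymbol, ClinicalSignificance, ReviewStatus, Assembly, Name) are assumed to match ClinVar's tab-delimited variant summary and should be checked against the current file. The gene panel, the review-status cut-off, and the sample rows are all hypothetical placeholders, not real data.

```python
import csv
import io

# Hypothetical gene panel; substitute the real list (e.g. the ACMG genes).
PANEL = {"GENE_A", "GENE_B"}

# Clinical significance labels accepted as "known disease-causing" (an assumption).
ACCEPTED_SIGNIFICANCE = {
    "Pathogenic",
    "Likely pathogenic",
    "Pathogenic/Likely pathogenic",
}

# Review statuses treated as strong enough; where to draw this line is a judgment call.
STRONG_REVIEW = {
    "practice guideline",
    "reviewed by expert panel",
    "criteria provided, multiple submitters, no conflicts",
}


def select_known_pathogenic(handle, panel=PANEL, assembly="GRCh38"):
    """Return rows that pass the gene, assembly, significance and review filters."""
    kept = []
    for row in csv.DictReader(handle, delimiter="\t"):
        if row["Assembly"] != assembly:
            continue  # the same variant may be listed once per assembly
        if row["GeneSymbol"] not in panel:
            continue
        if row["ClinicalSignificance"] not in ACCEPTED_SIGNIFICANCE:
            continue
        if row["ReviewStatus"] not in STRONG_REVIEW:
            continue
        kept.append(row)
    return kept


# Synthetic rows for illustration only; none of these are real variants.
SAMPLE = "\n".join(
    "\t".join(fields)
    for fields in [
        ("Name", "GeneSymbol", "ClinicalSignificance", "ReviewStatus", "Assembly"),
        ("variant_1", "GENE_A", "Pathogenic", "reviewed by expert panel", "GRCh38"),
        ("variant_2", "GENE_A", "Uncertain significance",
         "criteria provided, single submitter", "GRCh38"),
        ("variant_3", "GENE_B", "Pathogenic", "no assertion criteria provided", "GRCh38"),
        ("variant_4", "GENE_C", "Pathogenic", "practice guideline", "GRCh38"),
        ("variant_1", "GENE_A", "Pathogenic", "reviewed by expert panel", "GRCh37"),
        ("variant_5", "GENE_B", "Likely pathogenic",
         "criteria provided, multiple submitters, no conflicts", "GRCh38"),
    ]
)

if __name__ == "__main__":
    for row in select_known_pathogenic(io.StringIO(SAMPLE)):
        print(row["Name"], row["GeneSymbol"], row["ClinicalSignificance"])
```

For a real download, the same function could be given a handle from `gzip.open(path, "rt")` in place of the in-memory sample. Loosening or tightening `STRONG_REVIEW` is the main lever for trading coverage against confidence.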
From my experience, pathogenicity of a mutation/variant is a dynamic, changing annotation. We do not truly know which variants are definitely damaging, but by current standards, ClinVar + PolyPhen-2 + HGMD together will be the strongest evidence you will get that a variant is damaging.
It is easier to determine that a variant is not damaging, but much more difficult to say for sure it causes X or Y phenotype.
I agree again with Ram. This is very much a 'work in progress'. The ACMG have done a lot to attempt to define pathogenicity, but I think that it's currently an impossible task, because we don't have the information needed to say with 100% certainty that this or that variant will be pathogenic. For example, we can sequence germline DNA from B lymphocytes and discover a whole bunch of previously reported 'pathogenic' variants, but these may have minimal relevance in our tissue of interest. They may also have the highest scores among the in silico prediction tools, yet not prove 'damaging' at all. How do we even define 'damaging' and 'pathogenic' when most diseases / phenotypes are determined by very complex genetics that we do not yet understand? We are still attempting to define how to correctly annotate splice isoforms, and we just don't yet have data on expression in different tissues.
Some well-studied cases are out there, though, such as germline variants in TP53 resulting in Li-Fraumeni syndrome, BRCA1 germline variants and variants upstream of CCND1 in breast cancer susceptibility, ORMDL3 in asthma susceptibility, etc. Even in many of these well-studied cases, though, we have not properly defined the disease mechanisms at play.
I write so much on this because I have a review coming out on this topic.