Question

Which Reference Index to Use for Clinical Variant (SNP and INDEL) Detection?

9.8 years ago

cvu ▴ 180

Hello Everyone,

I've generated two VCF files: one using the canonical hg38 (only the 25 core human chromosomes) and one using the full hg38 (all chromosomes, scaffolds, and contigs released as part of the full build from the data source).

Now I'm getting different numbers of SNPs in the two VCF files:

Canonical: 1501928 SNPs

Full: 1463299 SNPs

I'm also getting different SNP QUAL values in some cases.
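To see where the two call sets actually diverge, the records can be compared site by site. The following is only a minimal sketch using the Python standard library: it assumes plain-text, tab-separated VCF input, and the helper names (`read_snps`, `compare_calls`) and the toy records are made up for illustration, not taken from the real files.

```python
import io

def read_snps(handle):
    """Return {(chrom, pos, ref, alt): qual} for SNP alleles in a VCF text stream."""
    snps = {}
    for line in handle:
        if line.startswith("#") or not line.strip():
            continue  # skip header lines and blank lines
        chrom, pos, _id, ref, alt, qual = line.rstrip("\n").split("\t")[:6]
        for allele in alt.split(","):
            # keep single-base substitutions only; indels and '*' alleles are skipped
            if len(ref) == 1 and len(allele) == 1 and ref in "ACGT" and allele in "ACGT":
                snps[(chrom, int(pos), ref, allele)] = None if qual == "." else float(qual)
    return snps

def compare_calls(canonical, full, tol=1e-6):
    """Split SNPs into those unique to each call set and shared ones whose QUAL differs."""
    shared = canonical.keys() & full.keys()
    qual_differs = sorted(
        key for key in shared
        if canonical[key] is not None
        and full[key] is not None
        and abs(canonical[key] - full[key]) > tol
    )
    return {
        "only_canonical": sorted(canonical.keys() - full.keys()),
        "only_full": sorted(full.keys() - canonical.keys()),
        "qual_differs": qual_differs,
    }

# Toy records, invented for illustration only.
HEADER = "##fileformat=VCFv4.2\n#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n"
canonical_vcf = HEADER + (
    "chr1\t100\t.\tA\tG\t50\tPASS\t.\n"
    "chr1\t200\t.\tC\tT\t30\tPASS\t.\n"
    "chr2\t300\t.\tG\tGA\t40\tPASS\t.\n"  # an insertion, so not counted as a SNP
)
full_vcf = HEADER + (
    "chr1\t100\t.\tA\tG\t45\tPASS\t.\n"
    "chr1_KI270706v1_random\t10\t.\tT\tC\t20\tPASS\t.\n"
)

canonical_snps = read_snps(io.StringIO(canonical_vcf))
full_snps = read_snps(io.StringIO(full_vcf))
report = compare_calls(canonical_snps, full_snps)

for name, keys in report.items():
    print(name, len(keys))
```

Grouping the `only_canonical` and `only_full` keys by contig would show whether the missing calls cluster in regions that also have alternate or unplaced sequence in the full build.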

Could someone please suggest which reference index I should use to detect clinical variants?

Assembly SNP next-gen alignment genome • 2.3k views

updated 9.8 years ago by vlaufer ▴ 290 • written 9.8 years ago by cvu ▴ 180

Ram · Answer 1 · 2014-07-08

Prediction of phenotype from genotype is an entire field of bioinformatics at this point. The answer is that it depends on your goals, your disease of interest, etc.

ClinVar will be highly specific, but not sensitive. That is, if you use only ClinVar and your variants of interest are found there, it is a safe bet that they are related to a phenotype. However, most deleterious variation will NOT be found in ClinVar. So, if you are dealing with a highly penetrant Mendelian condition, perhaps ClinVar would be a good place to start.

Almost regardless of disease state, any causative variant actually present in your data is at least as likely to be absent from a database as to be found in one. So if you search only by annotating against a database, you are somewhere between somewhat likely and highly likely to miss the causative variant. Two anecdotes along those lines:

This is why CADD writes out the entire sample space for SNVs: it is a recognition that any mutation is possible, and that we lack empirical data on whether the vast majority of them are harmless or harmful (Greg Cooper, senior author; Nature Genetics, 2014).

Further, a recent paper on IRF5 shows that two private SNPs (seen in only one person in a cohort of 8700) produce changes consistent with lupus. So it is quite possible that nothing whatsoever is known about the causative variant(s) in whatever you are studying.

So, we cannot answer your question without a great deal more information.

Perhaps the place to start is annotating against some database; perhaps it is studying papers in your field. We cannot know.

Some questions that will help you reflect on the next steps:

Do you have any idea where to look? I.e., are you looking in only a few genes? Do you know which part of the gene(s) is likely altered?
How much does the literature say about the contribution of those gene(s)? Perhaps there are pre-existing data to guide you.
Consider annotating with CADD.
Cross-index with ClinVar as a final step. It is not likely to pay off, but if it does, you are relatively lucky.
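
The steps above could be strung together roughly as follows. This is only a sketch under stated assumptions: the regions, CADD scores, and ClinVar membership are toy values invented for illustration, the names (`Variant`, `prioritise`) are hypothetical, and the PHRED-scaled cutoff of 20 is an assumed, adjustable threshold rather than a recommendation. Note that variants missing from ClinVar or without a score are kept in the output, not discarded, since the causative variant may have no annotation at all.

```python
from typing import NamedTuple

class Variant(NamedTuple):
    chrom: str
    pos: int
    ref: str
    alt: str

def in_regions(variant, regions):
    """True if the variant falls inside any (chrom, start, end) region, inclusive."""
    return any(
        chrom == variant.chrom and start <= variant.pos <= end
        for chrom, start, end in regions
    )

def prioritise(variants, regions, cadd_scores, clinvar_keys, cadd_cutoff=20.0):
    """Restrict to regions of interest, attach CADD scores, then cross-index ClinVar last."""
    rows = []
    for variant in variants:
        if not in_regions(variant, regions):
            continue
        score = cadd_scores.get(variant)  # None when no score is available
        known = variant in clinvar_keys
        rows.append({
            "variant": variant,
            "cadd": score,
            "in_clinvar": known,
            "flag": known or (score is not None and score >= cadd_cutoff),
        })
    # ClinVar hits first, then by descending CADD score; unscored variants go last
    rows.sort(key=lambda row: (not row["in_clinvar"], -(row["cadd"] or 0.0)))
    return rows

# Toy inputs, invented for illustration only.
variants = [
    Variant("chr7", 1000, "A", "G"),
    Variant("chr7", 1500, "C", "T"),
    Variant("chr7", 1800, "G", "A"),
    Variant("chr9", 500, "T", "C"),   # outside the region of interest
]
regions = [("chr7", 900, 2000)]
cadd_scores = {variants[0]: 25.3, variants[1]: 4.1}
clinvar_keys = {variants[1]}

ranked = prioritise(variants, regions, cadd_scores, clinvar_keys)
for row in ranked:
    print(row["variant"], row["cadd"], row["in_clinvar"], row["flag"])
```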