Question: Associate SNPs with Clinical Data
9 months ago by
carrara.matt0 wrote:


I am fairly new to SNP analysis and am struggling to grasp some details of how to test for association between SNPs and clinical data.

I currently have a whole-exome sequencing (WES) dataset from patients with Alzheimer's disease, together with their clinical data. The patients are unrelated, and we have no pedigree or family-history information. I called the variants with GATK, annotated them with ANNOVAR, and filtered them on several annotation parameters. This gave us a small list of candidates to validate.

I have now been asked whether I can find groups of SNPs that associate with specific clinical data of the patients. The rationale is to find, within the population, groups of SNPs that may positively or negatively affect parameters such as memory loss, because they consistently occur in individuals with high or low scores for that parameter.

My review of the literature and tutorials over the last few days suggests I should use either GWAS or QTL mapping. However, it appears that: 1) GWAS requires matched controls, which we do not have; 2) QTL mapping requires a map of markers spread evenly across the genome, rather than just the full list of SNPs.

I came across this post, but I'm not sure which PLINK function it refers to or whether it is exactly what I am looking for. I am now learning the suggested software, MERLIN, but I was wondering what the gold standard for this kind of analysis is (if there is one), or what software you use.

Thank you

modified 9 months ago by JC10k • written 9 months ago by carrara.matt0

Just do a simple statistical analysis. For each feature (SNV), divide your dataset into two groups: samples that carry the SNV and samples that do not (test and controls). Then regress it against the clinical data using a (generalized) linear model with an appropriate link function, and don't forget to correct for multiple testing at the end. I liked the following post for revealing the potential problems (it also links to the paper; the methodology there was quite OK apart from the fact that the paper was wrong, but don't worry, apply the same approach).

This approach has quite a lot of drawbacks, but I think the solutions will become clear once you start.

written 9 months ago by German.M.Demidov1.6k

Couldn't you use a large population study such as UK Biobank as your controls? It has some WES data (50K or so participants at this time, although all 500K should be done eventually). Alternatively, you could convert the variant calls from your WES data to the SNP alleles that match the UK Biobank microarray (data available for all participants) and do a GWAS. There are a couple of caveats if you take this approach: microarray data is best suited to variants with a minor allele frequency > 1% (rarer alleles are often false positives), and many of the microarray SNPs will be intergenic and so probably not covered by the WES data. The alternative is to use the available UK Biobank WES data, although again this will only cover the exome. UK Biobank does have pretty comprehensive phenotype data, which may be compatible with your memory loss parameters (as an example), as well as primary care records, hospital episode statistics and self-reported data on conditions such as Alzheimer's.
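The two caveats can be checked up front before attempting such a GWAS. The sketch below keeps only WES variants that are also on the array and have a minor allele frequency above 1% in the cohort. The variant IDs, the genotype layout (alternate-allele counts of 0, 1 or 2 per patient) and the function names are all hypothetical, and no UK Biobank file format is assumed.

```python
def minor_allele_frequency(alt_allele_counts):
    """MAF from per-patient alternate-allele counts (0, 1 or 2), diploid sites."""
    alt_freq = sum(alt_allele_counts) / (2 * len(alt_allele_counts))
    return min(alt_freq, 1 - alt_freq)


def variants_usable_with_array(wes_genotypes, array_variant_ids, min_maf=0.01):
    """Return sorted IDs of WES variants present on the array with MAF > min_maf."""
    keep = []
    for variant_id, counts in wes_genotypes.items():
        on_array = variant_id in array_variant_ids
        if on_array and minor_allele_frequency(counts) > min_maf:
            keep.append(variant_id)
    return sorted(keep)


# Toy example with made-up IDs: only "rsA" is both on the array and common enough.
wes = {"rsA": [0, 1, 2, 0], "rsB": [0, 0, 0, 0], "rsC": [1, 1, 0, 0]}
usable = variants_usable_with_array(wes, {"rsA", "rsB"})
```

In a small patient cohort the MAF estimate is noisy, so a reference-population frequency may be a better basis for the filter.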

written 9 months ago by Garan620
9 months ago by
JC10k wrote:

Use Exomiser to find variants linked (or apparently linked) to your phenotypes (the clinical information); the phenotypes need to be described as HPO terms.

written 9 months ago by JC10k