Based on my work over the last few days, I can say that in many cases it just isn't possible to use data from published GWAS to find the risk allele. Many studies do not publish enough information: they either skip allele information altogether and publish only p-values, or they publish an odds ratio (OR) or beta without stating how it was calculated.
However, there are a few tricks/heuristics that can be used to get the data:
- Sometimes the way the OR was calculated (e.g. with respect to the minor allele) is given in the methods or table description. The allele they identify as the coded allele is then allele1 for the OR calculation, and thus the risk allele is the coded allele if the OR is greater than 1 (and the other allele if the OR is less than 1).
- A large number of studies provide an allele1, allele2, and OR. In these cases it is almost always the case that the OR was calculated with respect to allele1, and thus the risk allele is allele1 if the OR is greater than 1. However, as the authors give no explicit information in this case, you need to treat this data with a large grain of salt; you could easily have it backwards.
- Many studies give the MAF for both cases and controls along with the minor allele, sometimes with an OR as well. In this case the risk allele is the minor allele if the MAF is higher in cases than in controls. When studies report both the MAFs and the OR, in my experience of a few dozen studies, the MAF comparison and the OR direction always match, which gives me more confidence in the previous heuristic (allele1 with an unexplained OR).
- Sometimes studies do not provide the minor allele, but just provide the MAFs. In these cases you can still get the risk allele if you can query the SNP on dbSNP. The key is that the population studied should be the same as the population data in dbSNP, and dbSNP should have only one alternate allele for that SNP. Sometimes there are multiple alternate alleles, and in that case you can never be confident that you have the right minor allele. Where there is obviously one major and one minor allele for your population, though, you can use that to pick the risk allele with the MAF comparison described in the previous bullet.
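The OR-direction and MAF-comparison heuristics above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names and arguments are hypothetical, the SNP is assumed to be biallelic, and the OR is assumed to be calculated with respect to allele1 (which, as noted, the paper may not actually confirm). Treating the other allele as the risk allele when the OR is below 1 is my reading of the same rule, not something a study will necessarily state.

```python
from typing import Optional


def risk_allele_from_or(allele1: str, allele2: str, odds_ratio: float) -> Optional[str]:
    """Guess the risk allele from an OR.

    Assumption: the OR was calculated with respect to allele1 and the
    SNP is biallelic. Returns None when the OR is exactly 1.
    """
    if odds_ratio > 1:
        return allele1
    if odds_ratio < 1:
        return allele2
    return None


def risk_allele_from_maf(minor: str, major: str,
                         maf_cases: float, maf_controls: float) -> Optional[str]:
    """Guess the risk allele by comparing the MAF in cases vs. controls.

    The minor allele is the risk allele if it is more frequent in cases.
    Returns None when the two frequencies are equal.
    """
    if maf_cases > maf_controls:
        return minor
    if maf_cases < maf_controls:
        return major
    return None


def calls_agree(or_call: Optional[str], maf_call: Optional[str]) -> bool:
    """Sanity check: do both heuristics point at the same allele?"""
    return or_call is not None and or_call == maf_call


# Hypothetical example values, not taken from any real study.
or_call = risk_allele_from_or("A", "G", 1.25)
maf_call = risk_allele_from_maf("A", "G", maf_cases=0.31, maf_controls=0.26)
print(or_call, maf_call, calls_agree(or_call, maf_call))
```

When a study reports both the MAFs and an OR, running both functions and checking that they agree is a cheap way to catch a flipped allele before trusting the result.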
Using these methods I am able to get the risk allele for around 80% of the studies that I have searched, which isn't bad. I think only around 10% actually explicitly state what the risk/coded/effect allele is, which is kind of mindblowing to me.