I have the genotype info of an individual in a VCF file. From that, I extracted the relevant fields: the REF, ALT and GT columns. I used a GWAS summary stats file for type 2 diabetes (T2D) from the Diamante consortium, which contains the risk allele and effect size at each SNP. Since the effect sizes were given as raw odds ratios (OR), I converted them to ln(OR), as I read this is equivalent to the beta coefficient.
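For illustration, the extraction and conversion described above could look roughly like the sketch below. The function names are hypothetical, and it assumes biallelic sites with the summary-stats alleles already on the same strand as the VCF; strand flips and multi-allelic records are not handled.

```python
import math

def risk_allele_count(gt, ref, alt, risk_allele):
    """Count copies of the risk allele in a biallelic VCF GT string
    such as '0/1' or '1|1'. Returns None if the genotype is missing
    or the risk allele matches neither REF nor ALT."""
    alleles = gt.replace("|", "/").split("/")
    if "." in alleles:
        return None  # missing call, e.g. './.'
    if risk_allele == ref:
        code = "0"   # REF is encoded as 0 in GT
    elif risk_allele == alt:
        code = "1"   # first ALT is encoded as 1 in GT
    else:
        return None  # allele mismatch (possibly a strand issue)
    return sum(1 for a in alleles if a == code)

def log_odds(odds_ratio):
    """Convert an odds ratio to ln(OR), the logistic-regression beta."""
    return math.log(odds_ratio)

# Example with made-up values: heterozygous carrier of risk allele G
count = risk_allele_count("0/1", ref="A", alt="G", risk_allele="G")
beta = log_odds(1.2)
```

Returning None for missing or mismatched genotypes keeps those SNPs out of both the sum and the non-missing SNP count later on.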
For the PRS calculation, I applied the formula used in PLINK (summation of no. of risk alleles multiplied by effect size for each SNP) and divided by ploidy * no. of non-missing SNPs, which I believe is the normalization step. The formula I referred to is from this tutorial: https://choishingwan.github.io/PRS-Tutorial/plink/
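As a minimal sketch of that formula (not PLINK's actual implementation), the raw score is the sum of risk-allele count times ln(OR), and the averaged score divides it by ploidy times the number of non-missing SNPs. The function name and the toy inputs are assumptions made for the example.

```python
def polygenic_score(counts, betas, ploidy=2):
    """Return (raw_sum, averaged_score) for one individual.

    counts: risk-allele counts per SNP (0, 1, 2, or None if missing)
    betas:  matching effect sizes, e.g. ln(OR)
    """
    # Drop SNPs with a missing genotype before summing
    pairs = [(c, b) for c, b in zip(counts, betas) if c is not None]
    raw = sum(c * b for c, b in pairs)
    n_nonmissing = len(pairs)
    if n_nonmissing == 0:
        return raw, float("nan")
    return raw, raw / (ploidy * n_nonmissing)

# Toy data: four SNPs, one of them missing
raw, avg = polygenic_score([2, 1, None, 0], [0.1, 0.2, 0.5, 0.3])
```

Note that dividing by ploidy * number of SNPs only rescales the sum to a per-allele average; it does not place the score on a population scale.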
For my sample, the result shows:
Raw PRS: 2639.535275313198
Normalized PRS: 0.0106
How do I compare this value to a standardized score for T2D disease risk? Can I compare the normalized PRS value or do I need to convert it to a standardized score? I referred to UK Biobank for standardized scores but am having trouble figuring out how to interpret and compare the risk score I got to the average risk for T2D. I'm also not entirely sure where this average score is available.
Much thanks for your reply. If I could ask - what would be the best way or what kind of data is typically needed to obtain a PRS score that's actually meaningful? And in what case could this PRS score I've generated be of any use? Unfortunately this is all I've been given to work with and I'm at my wits' end trying to figure out how to make do with it.
Currently, we still haven't figured out how best to use PRS at the individual level. What is your research question? Why do you need to use a PRS? Without this information, we cannot provide any useful recommendations.
The objective was to develop an algorithm for PRS calculation based on a given sample's SNP data. I chose to test their risk for type 2 diabetes using publicly available GWAS summary stats data. However, from what I understand now from reading your reply and several other forum posts, PRS being a relative measure within a population means I cannot compare it directly to the average PRS for diabetes from a database which used their own QC and PRS calc pipelines.
Am I correct in understanding that to obtain a PRS that has any scientific validity, I'd need raw SNP data of a large number of samples within the same cohort, apply the same pipeline of PRS calc to them, and interpret an individual’s PRS only in the context of that internal population distribution? And that it would make no sense to compare this score to, say UK Biobank's average PRS (e.g. for diabetes), because of ancestry mismatch, as well as different QC and PRS calc pipelines that were used to generate their average score?
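To make the idea of interpreting a score against an internal distribution concrete, here is a rough sketch. It assumes every cohort score was produced with the same SNP set, QC and scoring pipeline as the individual's score; the function name and the cohort values are invented for the example.

```python
from statistics import mean, stdev

def standardize(score, cohort_scores):
    """Return (z_score, percentile) of `score` relative to a
    reference cohort scored with the identical pipeline."""
    mu = mean(cohort_scores)
    sd = stdev(cohort_scores)  # sample standard deviation
    z = (score - mu) / sd
    # Percentile: share of the cohort scoring at or below this individual
    at_or_below = sum(1 for s in cohort_scores if s <= score)
    percentile = 100.0 * at_or_below / len(cohort_scores)
    return z, percentile

# Made-up cohort scores, for illustration only
cohort = [1.0, 2.0, 3.0, 4.0, 5.0]
z, pct = standardize(3.0, cohort)
```

A real reference cohort would need to be far larger and ancestry-matched for the z-score or percentile to carry any meaning.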
Again, I apologize as this is the only info I have. I was introduced to the concept of PRS just about a week ago, so I've had a lot to catch up on and my understanding of its application is still quite muddled.
I'd suggest reading up on the basics of PRS before you get started. A good starting place would be this: https://www.nature.com/articles/s41596-020-0353-1
This was one of the first papers I read while doing my research. Will go through more sources. Thank you.