Hi everybody,
I am new to bioinformatics, and I don't understand many concepts and routines. I'm learning by myself, looking for information, and I found this community... I would like to know how I could calculate the risk of a disease in a person knowing the genotype of several SNPs related to this disease. For example, I know a person's genotype at 22 SNPs related to thrombophilia. How can I determine whether this person is at low, moderate or high risk of suffering thrombotic events? I read about PRS and GWAS, but I am not sure whether those tools can help me with the question above. And I haven't found a good tutorial that explains the PRS and GWAS procedures step by step. Everything I have read is not for beginners...
Thank you for your help.
Imputing the combined risk from multiple genotypes based on published data on individual genotype risks is a bit of a holy grail but kudos to your ambition. Consider a systems biology approach and see a recent Nature article
I thought that's what a polygenic risk score or GWAS calculates... And if the risk of a SNP for a disease can be estimated mathematically, why can't the risk of several SNPs be calculated as well? I have several SNPs related to a disease, and I would like to know the risk score for that person to have events related to that disease (low, moderate, high) based on those SNPs. An example of data:
SNP     CHR  Allele1  Allele2
rs1234  2    A        A
rs5678  19   G        C
rs9012  X    T        A
rs3456  6    C        C
Thank you for your help.
'Polygenic risk score' is actually just a generic term, and there are many ways to generate a 'risk score' from multiple genotypes. The algorithms that I have seen (including my own) start with the beta coefficients, which are obtained once you fit a regression model to your data. As you are a beginner, you may try the PRS function that is built into PLINK - I believe there is one, no?
I don't know whether PLINK has a built-in PRS. That's what I'm trying to find out, but I can't find a tutorial for beginners...
Thank you for your help.
Take a look at PRSice: http://www.prsice.info/ The developers of both PRSice and PLINK are active on Biostars.
Perhaps consider adding both of these as tags to your question.
I took a look at PRSice, but I didn't understand much. It's not very helpful for beginners, in my opinion... I will add some more tags to my question. Let's see if someone can help me to better understand the basic concepts of PRS.
Thank you for your help.
Hi, I'd suggest you first read our guide paper, which lays out some of the challenges and problems of PRS. As for the PRSice tutorial, have you tried following our step-by-step tutorial? We are also trying to construct an independent tutorial for the guide paper, which you can find here. However, please note that it is still under construction. Do feel free to let us know if you find anything unclear or problematic. Good luck
Thanks Sam! Was waiting for you to arrive :)
First of all, I would like to thank you very much for both the answer and the help. Unfortunately, this is not very helpful for me. I downloaded the paper and found the PRSice tool just a couple of weeks ago. I didn't understand very well either the procedures or what was explained in them, which is why I decided to ask for help in this forum. As I have said, I am a beginner and I would like to learn, but I'm afraid the documentation is not for rookies like me; it is for advanced users, or at least for people who have more knowledge about that kind of tool. I know I am asking for something that can be complex, and maybe the next answer I get will be something like "study bioinformatics basics first". Of course, no one has any obligation to answer me, but it's the same as if someone asked how to change a lamp in a car and people in a mechanics forum answered "study mechanics basics first", or gave the person a document that was incomprehensible to them. Thank you anyway.
Would you mind letting us know which parts of the document you find too complex? One of the main goals of PRSice and our paper is to explain the basics of PRS to people who would like to perform PRS analysis. While we try to make the instructions as simple as possible, our background might sometimes make us blind to problems that new users find difficult. It would therefore be great to know which parts of the tutorial or the documentation are too difficult or unclear, so that we can improve them. Thanks
I would like to apologize for my late response. I've been away and haven't been able to attend to the post... On the other hand, it is difficult to say which parts of the manual are most complicated for me. I imagine it is easy for you because you know the topics, but for beginners in bioinformatics it is difficult even to start reading, because we lack too many concepts. I got stuck at the beginning, in the first paragraphs... For example, it does not say what kind of tools are necessary, or where to get the data needed for the analysis. It talks about GWAS, but what other types of files can be used? Where can we get the OR, p-value and other data from, besides GWAS? I have heard about GWAS, but I don't know how it is done, and that is not explained either. It would be great to link to a tutorial about GWAS, so that people who do not know what it is can learn what it is for and how it is done. There is a lot of literature about GWAS, but it is for people with previous knowledge, not for newbies. If you want to make a step-by-step manual for newbies, it should be structured in a simpler way, without so many sections on the left: there should be a general guide, with links from the guide to other sections of the website where each topic is expanded. In that sense the other link is better, but it still has almost no information. The colors don't help either... What are the red boxes with the dashes? I guess they are the application parameters, but that is not clear to novices. Again, it would be better to show the code differently, in its natural environment, to make it more logical. It says you have to use space-separated files; how about an example to make it totally clear? A sample file that could be downloaded to see what it looks like would not be bad... And this is only in the first paragraphs. I have not continued reading because I do not understand things well.
I hope that my comments do not come across badly, but scientists and computer scientists have a problem: we explain things as if everyone had the same level as us. In my opinion, we only show that we know things when we are able to explain them to our grandmothers so that they understand us. Here is an example of a step-by-step guide that I have followed and which worked very well for me. By following it I managed to do what it describes, without prior knowledge of anything. It is true that over time I have gradually come to understand better what I was doing, but this is what I mean by a step-by-step guide for newbies: https://github.com/freeseek/gtc2vcf. It is the best I have found on the Internet, and when I asked the author a couple of questions, he answered them with all the patience in the world and with simple explanations. I can only say woouuuh! This forum could have several sections: one in which rookies could get answers from the people who know the most, and another for more advanced people in which higher-level questions could be raised, in my humble opinion...
Thank you for your patience. Regards.
In that case, at least for the topic of PRS, I guess the closest you will get will be this tutorial we made for the guide paper. The problem with PRS analysis is that it is a slightly advanced additional analysis based on GWAS data. Without knowledge of the basics of GWAS, it will be very difficult to understand the ideas behind PRS. And as GWAS itself is a rather complicated topic, it'd be a bit too much to include in a single guide. A good starting point might be this paper.
Given your background, it might be best for you to study how GWAS are performed, the statistics behind the GWAS analysis and its assumptions, before you jump into PRS analysis. Good luck!
Thank you again for your answer. I only wanted to calculate a genetic risk score based on the genotype of several SNPs related to a disease. I agree that it would be much easier to perform a PRS analysis if you know GWAS, but going on with my previous example, I am able to change the air, fuel and oil filters of my car without any knowledge of mechanics, because someone taught me how to change them. And by the way, you didn't answer the questions above, especially the one about the file types needed by PRSice (only GWAS? what other formats?). I don't need luck, I only need a good tool to calculate the risk and someone willing to help me in an easy way... Regards.
The summary statistics files are generated from GWAS studies. Without performing a GWAS, you won't have an estimate of each SNP's effect size. GWAS files can come in many formats (there isn't any standard), but to perform PRS analysis, you will need at least the following columns: SNP_ID, P-value, Effect size, Effective allele
The thing is, as with most bioinformatic analyses, it is usually a bad idea to simply follow a tutorial and run the analysis without the background knowledge. A lot of things in bioinformatics are works in progress; for example, models have their own hypotheses and assumptions. Without understanding the problem, and without acquiring the background knowledge, you will very likely misinterpret the findings. For example, if you don't know what an effect size from a GWAS is, how can you understand the polygenic score model, which is the weighted sum of effect sizes? And if you don't know what a polygenic score is, how can you interpret the result? So while it might be good to have a fully detailed guide for a program that teaches you how to perform an analysis, it is vital for you to understand the background.
Finally, just so you know, the PRSice release comes with toy data, which allows you to follow the PRSice tutorial. Our tutorial also provides a detailed description of the expected file formats, though we don't go into how you obtain those files (performing GWAS).
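To make the "weighted sum of effect sizes" mentioned above concrete, here is a minimal sketch in Python. It is not how PRSice or PLINK are implemented. The effect sizes, p-values and effect alleles are invented for illustration, the function name `polygenic_score` is hypothetical, and the sketch ignores strand flips, missing genotypes and linkage disequilibrium, all of which real tools have to deal with.

```python
# Hypothetical summary statistics with the four columns listed above:
# SNP_ID, P-value, effect size (here a log odds ratio) and effective allele.
# All values are invented for illustration only.
SUMMARY_STATS = [
    {"snp": "rs1234", "p": 1e-8, "beta": 0.20, "effect_allele": "A"},
    {"snp": "rs5678", "p": 3e-6, "beta": -0.10, "effect_allele": "C"},
    {"snp": "rs3456", "p": 0.04, "beta": 0.05, "effect_allele": "T"},
]

# One person's genotype: SNP -> (allele1, allele2)
GENOTYPE = {
    "rs1234": ("A", "A"),
    "rs5678": ("G", "C"),
    "rs3456": ("C", "C"),
}


def polygenic_score(stats, genotype, p_threshold=1.0):
    """Sum of (effect size x number of effect alleles carried) over SNPs.

    SNPs with a GWAS p-value above p_threshold, or with no genotype
    for this person, are skipped.
    """
    score = 0.0
    for row in stats:
        if row["p"] > p_threshold:
            continue
        alleles = genotype.get(row["snp"])
        if alleles is None:
            continue
        # Dosage: how many copies (0, 1 or 2) of the effect allele are carried
        dosage = alleles.count(row["effect_allele"])
        score += row["beta"] * dosage
    return score


if __name__ == "__main__":
    print(polygenic_score(SUMMARY_STATS, GENOTYPE))
    print(polygenic_score(SUMMARY_STATS, GENOTYPE, p_threshold=5e-8))
```

Note that the number this produces is only meaningful relative to the scores of other people computed the same way; on its own it does not say whether someone is at low, moderate or high risk.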
Hi Sam, and thanks again for both your help and your explanations. The problem with GWAS is that I have samples but I don't have any controls to compare them with. I would need a tutorial or a step-by-step guide to GWAS. I would like to learn, but I can't find a good tutorial...
For binary traits, you cannot perform a GWAS without control samples.
In a GWAS, you are trying to find out whether a SNP is more likely to be observed in cases than in controls. Without the controls, you cannot perform a GWAS.
In this case, even if you could perform the GWAS with your samples, unless you have additional independent samples, you cannot perform PRS analysis, as PRS requires the genotype samples to be independent of the GWAS samples; otherwise it will lead to invalid results.
If you really want to learn, I'd suggest you google "GWAS tutorial". There are a lot of tutorials and even videos available online.
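As a rough illustration of the case/control comparison described above, the sketch below tests a single SNP using allele counts in cases and controls. It is a simplified stand-in, not a full GWAS: real analyses usually fit a regression model with covariates and apply quality control first. The function name `allelic_test` and the counts are hypothetical.

```python
import math


def allelic_test(case_alt, case_ref, ctrl_alt, ctrl_ref):
    """Compare allele counts between cases and controls for one SNP.

    Returns (odds_ratio, log_odds_ratio, chi2, p_value), where chi2 is the
    Pearson chi-square statistic of the 2x2 table (1 degree of freedom).
    """
    a, b, c, d = case_alt, case_ref, ctrl_alt, ctrl_ref
    n = a + b + c + d
    # Odds of carrying the allele in cases divided by the odds in controls
    odds_ratio = (a * d) / (b * c)
    chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # Upper tail of a chi-square distribution with 1 degree of freedom
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return odds_ratio, math.log(odds_ratio), chi2, p_value


if __name__ == "__main__":
    # Invented counts: 100 cases and 100 controls, two alleles per person
    odds_ratio, log_or, chi2, p_value = allelic_test(60, 140, 40, 160)
    print(odds_ratio, log_or, chi2, p_value)
```

The log odds ratio is the kind of per-SNP effect size that a PRS later uses as a weight. If the control counts are both zero, the odds ratio is a division by zero, which is the arithmetic version of "without the controls, you cannot perform a GWAS".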
Ok, thank you again... Do you know of any tool that could calculate a genetic risk score in an easier way? I will take a look at GWAS tutorials on the internet or in videos.
The easiest tool is PRSice. Then there are lassosum, LDpred and PRS-CS, and if you want, you can use PLINK. The tutorial I posted should contain most of the info.
Thank you again for all your help and your patience. Regards.
You clearly state that you are a beginner, but the project that you are aiming to do is somewhat advanced, at least from my perspective. Are you at least familiar with regression analysis, and do you know how to conduct it in the context of genetic variants? What data do you currently have (paste an example here)?
I'm a beginner, but I have tried PLINK. The files I have are those generated with PLINK, both the 2-file and the 3-file sets. I'm familiar neither with regression analysis nor with how to conduct it in the context of genetic variants.
An example of data:
SNP     CHR  Allele1  Allele2
rs1234  2    A        A
rs5678  19   G        C
rs9012  X    T        A
rs3456  6    C        C
Those SNPs are related to a disease. I would like to calculate that person's risk of suffering that disease (a risk score).
Thank you for your help.
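As a small first step with data shaped like the example above, the sketch below parses the whitespace-separated table and counts, for each SNP, how many copies of a given allele the person carries (0, 1 or 2). Which allele is the risk or effect allele for each SNP has to come from published GWAS results; the `EFFECT_ALLELES` mapping here is invented purely for illustration, and the function name is hypothetical.

```python
# The example table from the post, as whitespace-separated text
EXAMPLE_TABLE = """\
SNP CHR Allele1 Allele2
rs1234 2 A A
rs5678 19 G C
rs9012 X T A
rs3456 6 C C
"""

# Hypothetical effect alleles; real ones must be taken from GWAS results
EFFECT_ALLELES = {"rs1234": "A", "rs5678": "C", "rs9012": "G", "rs3456": "C"}


def effect_allele_counts(table_text, effect_alleles):
    """Return {SNP: number of copies of its effect allele carried (0-2)}."""
    lines = table_text.strip().splitlines()
    header = lines[0].split()
    counts = {}
    for line in lines[1:]:
        row = dict(zip(header, line.split()))
        effect = effect_alleles.get(row["SNP"])
        if effect is None:
            continue  # no published effect allele for this SNP
        counts[row["SNP"]] = [row["Allele1"], row["Allele2"]].count(effect)
    return counts


if __name__ == "__main__":
    print(effect_allele_counts(EXAMPLE_TABLE, EFFECT_ALLELES))
```

These counts are the genotype half of a risk score; the other half, the per-SNP effect sizes, still has to come from a GWAS.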