Hi,
I am performing GWAS analysis on human samples. My workflow:
- I performed pre-imputation QC using PLINK, then imputation on the TOPMed Michigan server.
- After imputation, I got 22 VCF files + 22 info files. I ran a post-imputation QC check on the imputed data (VCF).
- After QC, I had a separate VCF file for every chromosome (22 files).
- I merged them using the bcftools merge option.
Right now I have a merged VCF for all 22 chromosomes, which is around 4.2GB. I also have the 22 info files merged together, around 12GB. I want to check the accuracy of the imputation; basically, I want to look at plots of MAF and Rsquare values. I will get both Rsquare and MAF from the merged info file (the 12GB one).
Although I am not very good with R plots, I managed to plot MAF against the frequency of variants. But I am stuck on the histogram/scatter plot of MAF against Rsquare in RStudio. I have been trying since last week, but the file is so big that my system hangs, and I am not even sure my R script is correct.
Can anyone please help me, or refer me to a good resource for R scripts specifically for such GWAS analysis plots? I have also tried online tutorials for R plots.
Thanks AR
Thanks, Curious. I am trying to split up my large files and will then go for the plots in R. Thanks for your suggestion. But right now I am stuck on the R2 values. I have analysed a single dataset containing 96 samples. After imputation, there were around 300 million variants. But after the post-imputation QC step (R2>0.5), the number dropped drastically to 10 million.
My command:
Any help is highly appreciated.
AR
I can't say whether it's fine without seeing the data; that's up to you. But overall it's pretty normal to thin out to a few dozen or so well-imputed variants.
Thanks for the reply, Curious. After some struggle, I was able to solve it today. The problem was with plink --qual-threshold: it should be --qual-max-threshold instead of --qual-threshold. Now I am getting 292 million variants out of 300 million after post-imputation QC. I figured it out when I plotted the Rsquare values: before, all the values were below 0.5. A silly mistake, I would say.
But I have again found one very suspicious thing in my Rsquare filter output. It gives me the filtered variants (Rsquare>0.5), but it also reports that around 90,000 IDs are missing. I am not able to figure out this missing-ID problem. It only appears with plink --qual-threshold, not with any of the other plink modules I used during my post-imputation QC.
Thanks. AR
Might want to look at the qqman package
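On the 12GB info file hanging RStudio: one option is to avoid loading the whole file and instead stream it once, keeping only a small per-MAF-bin summary of Rsq. Below is a minimal sketch in Python (standard library only), not a tested pipeline. It assumes the merged info file is tab-separated with a header line containing columns named `MAF` and `Rsq`; check your own header, as it may differ. The function name `summarise_info`, the bin edges, and the 0.5 cutoff are just illustrative choices.

```python
import csv
import gzip


def summarise_info(path, maf_edges=(0.0, 0.001, 0.01, 0.05, 0.5), rsq_cutoff=0.5):
    """Stream an info file and summarise Rsq per MAF bin.

    Assumes a tab-separated file with header columns 'MAF' and 'Rsq'
    (an assumption about the info file layout; adjust to your header).
    Only a few counters are kept in memory, so file size is not an issue.
    """
    n_bins = len(maf_edges) - 1
    counts = [0] * n_bins        # variants per MAF bin
    rsq_sums = [0.0] * n_bins    # running sum of Rsq per bin
    passed = [0] * n_bins        # variants with Rsq > rsq_cutoff per bin

    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt", newline="") as handle:
        reader = csv.DictReader(handle, delimiter="\t")
        header = reader.fieldnames or []
        if "MAF" not in header or "Rsq" not in header:
            raise ValueError("expected 'MAF' and 'Rsq' columns, got: %r" % header)
        for row in reader:
            try:
                maf = float(row["MAF"])
                rsq = float(row["Rsq"])
            except (TypeError, ValueError):
                continue  # skip rows with missing or non-numeric values
            for i in range(n_bins):
                is_last = i == n_bins - 1
                below_upper = maf < maf_edges[i + 1] or (is_last and maf <= maf_edges[i + 1])
                if maf_edges[i] <= maf and below_upper:
                    counts[i] += 1
                    rsq_sums[i] += rsq
                    if rsq > rsq_cutoff:
                        passed[i] += 1
                    break

    summary = []
    for i in range(n_bins):
        summary.append({
            "maf_bin": (maf_edges[i], maf_edges[i + 1]),
            "n": counts[i],
            "mean_rsq": rsq_sums[i] / counts[i] if counts[i] else None,
            "n_pass": passed[i],
        })
    return summary
```

The result is a handful of rows (one per MAF bin), which is small enough to plot in R or anywhere else. The `n_pass` column is also a quick sanity check on how many variants you would expect to survive an Rsq > 0.5 filter, independent of PLINK.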