The standard program for performing statistical analyses on genetic data is PLINK. To run PLINK, your starting data should generally be in VCF, BCF, or some other standard format. Can you clarify the format of your datasets?
It is possible to run statistical analyses in R, too. The most basic association test is, after all, just a Chi-squared test, which I show in R code in this answer: "A: SNP dataset and Z Score".
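As a minimal sketch of that idea (not the code from the linked answer), the following builds a 2x2 table of allele counts in cases and controls for a single SNP and passes it to `chisq.test`. The counts and the object name `allele_counts` are made up purely for illustration.

```r
# Hypothetical allele counts for one SNP (illustrative numbers only)
allele_counts <- matrix(
  c(120,  80,    # cases:    alt allele, ref allele
     90, 110),   # controls: alt allele, ref allele
  nrow = 2, byrow = TRUE,
  dimnames = list(status = c("case", "control"), allele = c("alt", "ref")))

# Allelic association test; correct = FALSE disables Yates' continuity correction
res <- chisq.test(allele_counts, correct = FALSE)
res$statistic   # Chi-squared statistic (1 degree of freedom)
res$p.value     # p-value for this SNP
```

Repeating this per SNP gives one p-value per variant, which is the P column that a Manhattan plot needs.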
I also have a Bioconductor R package that I originally developed for the purposes of running statistical tests over large GWAS cohorts in R: RegParallel.
Another program that can run a test directly on your VCF data is SnpSift CaseControl.
For a Manhattan plot, you will require:
- SNP (variant ID, e.g. rs ID)
- CHR (chromosome)
- BP (base position)
- P (p-value)
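To make that layout concrete, here is a small sketch that simulates a data frame with these four columns. The values are random and only illustrate the structure; the name `temp` is reused from the plotting code in this answer, and coding chromosomes as the numbers 1 to 25 (23 = X, 24 = Y, 25 = MT) is an assumption made so that a 25-element `chrlabs` vector lines up with the data.

```r
# Simulated summary statistics, for illustration only (not real GWAS results)
set.seed(1)
n_per_chr <- 100
temp <- data.frame(
  SNP = paste0("rs", seq_len(25 * n_per_chr)),   # made-up IDs
  CHR = rep(1:25, each = n_per_chr),             # assumed coding: 23 = X, 24 = Y, 25 = MT
  BP  = rep(seq(1e4, by = 1e4, length.out = n_per_chr), times = 25),
  P   = runif(25 * n_per_chr))

# Basic checks on the layout: all four columns present, CHR/BP/P numeric
stopifnot(all(c("SNP", "CHR", "BP", "P") %in% names(temp)))
stopifnot(is.numeric(temp$CHR), is.numeric(temp$BP), is.numeric(temp$P))
```

With real data, X, Y and MT usually need recoding to numbers first, because the plotting function expects a numeric CHR column.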
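You then just need the qqman package (CHR must be numeric, so recode X, Y, and MT as 23, 24, and 25 beforehand):
library(qqman)
manhattan(
  subset(temp, select=c(SNP, CHR, BP, P)),
  chrlabs=c(1:22, "X", "Y", "MT"),
  # line thresholds assumed from the legend text below: blue = suggestive, red = genome-wide
  suggestiveline=-log10(1e-04), genomewideline=-log10(5.2e-08))
legend("topright", cex=0.8, title="Significances", c("P<0.0001", "FDR (P<5.2E-08)"), fill=c("blue", "red"))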