I have 4 fairly old GWAS files from 2010 and I need to use one of them as a negative control in my analyses.
The format of the files is as follows:
MarkerName  Allele1  Allele2  Weight     GC.Zscore  GC.Pvalue  Overall  Direction
rs629301    t        g        100184.00  24.350     5.77e-131  +        +++++++++++++++++++++++++
rs599839    a        g        100122.00  24.269     4.12e-130  +        +++++++++++++++++++++++++
rs646776    t        c        100184.00  24.247     7.11e-130  +        +++++++++++++++++++++++++
rs12740374  t        g        100184.00  -24.203    2.06e-129  -        -------------------------
rs660240    t        c        100184.00  -24.156    6.51e-129  -        -------------------------
The information I need from these files is the chromosome, the base position and the p-value, but only the p-value is included in the current format (under GC.Pvalue).
In my head, the best way to do this is in two stages: use the rsIDs to get the chromosome and base position on hg18, then do a liftover from hg18 to hg19. The liftover is not an issue, as I have used the package before. The issue is finding the quickest method to get the genome coordinates for the 4 files, given that they contain 3M SNPs.
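A minimal sketch of the first stage, to make the intended join concrete. It assumes a local tab-delimited lookup table in a BED-like layout (chrom, start, end, rsID) built from an hg18 SNP annotation. That file, its layout, and the function names are hypothetical, and the coordinates in the demo are placeholders rather than real positions. The GWAS p-values are held in a dictionary and the much larger lookup table is streamed once, so memory use scales with the 3M SNPs rather than with the annotation.

```python
import io


def load_pvalues(gwas):
    """Map rsID -> GC.Pvalue from a whitespace-delimited GWAS file object."""
    header = gwas.readline().split()
    id_col = header.index("MarkerName")
    p_col = header.index("GC.Pvalue")
    pvalues = {}
    for line in gwas:
        fields = line.split()
        if fields:  # skip blank lines
            pvalues[fields[id_col]] = fields[p_col]
    return pvalues


def annotate(pvalues, lookup):
    """Stream the lookup table once and yield (rsid, chrom, pos, pvalue).

    Assumes BED-like rows: chrom, 0-based start, end, rsID (hypothetical layout).
    For a single-base SNP the BED end equals the 1-based position.
    """
    for line in lookup:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 4:
            continue  # skip malformed or empty rows
        chrom, _start, end, rsid = fields[:4]
        if rsid in pvalues:
            yield rsid, chrom, int(end), pvalues[rsid]


# Tiny in-memory demo; real use would pass open file handles instead.
gwas = io.StringIO(
    "MarkerName Allele1 Allele2 Weight GC.Zscore GC.Pvalue Overall Direction\n"
    "rs629301 t g 100184.00 24.350 5.77e-131 + +++++++++++++++++++++++++\n"
    "rs599839 a g 100122.00 24.269 4.12e-130 + +++++++++++++++++++++++++\n"
)
# Placeholder coordinates, NOT real hg18 positions.
lookup = io.StringIO(
    "chr1\t999\t1000\trs629301\n"
    "chr1\t1999\t2000\trs_not_in_gwas\n"
    "chr1\t2999\t3000\trs599839\n"
)

rows = list(annotate(load_pvalues(gwas), lookup))
for row in rows:
    print(*row, sep="\t")
```

The resulting chromosome/position/p-value rows could then be written out in whatever format the liftover step expects. SNPs whose rsIDs are missing from the lookup table (for example, merged or retired IDs) are silently dropped here and would need separate handling.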
I have seen Python and database solutions for this on Biostars, but ideally I'm looking for a Linux command-line solution (or, even better, a single tool that does all of this), though I can use R. I also suspect those solutions are suited to smaller numbers of SNPs.
Can this be done by one package, or is there a best practice method?