Post-imputation QC GWAS analysis
4.2 years ago
AR • 0

Hi,

I am performing GWAS analysis on human samples. My workflow:

  1. I performed pre-imputation QC using plink, and imputation using the TopMed Michigan server.
  2. After imputation, I got 22 vcf files + 22 info files. I ran post-imputation QC checks on the imputed data (vcf).
  3. After QC, I had a separate vcf file for every chromosome (22 files).
  4. I merged them using the bcftools merge option.

Right now I have a merged vcf for all 22 chromosomes, which is around 4.2GB. I also have the 22 info files merged together, around 12GB. I want to check the accuracy of the imputation; basically, I want to analyse plots of MAF and Rsquare values, both of which I will get from the 12GB merged info file.

Although I am not very good with R plots, I managed to plot MAF against the frequency of variants. But I am stuck on the histogram/scatter plot of MAF against Rsquare in RStudio. I have been trying since last week, but my file is so big that my system hangs. I am also not sure my R script is correct.

Can anyone please help me, or refer me to a good resource for R scripts specifically for such GWAS analysis plots? I have also tried online tutorials for R plots.

Thanks AR

GWAS Rstudio Imputation TopMed • 3.1k views

Thanks, curious. I am trying to chip away at my large files and will then go for plots in R. Thanks for your suggestion. But right now I am stuck with the R2 values. I have analysed a single dataset containing 96 samples. After imputation, there were around 300 million variants. But after the post-imputation QC step (R2>0.5), the number dropped drastically to 10 million.

My command:

./plink --bfile s2_chr1 --qual-scores chr1.info 7 1 1 --qual-threshold 0.5 --make-bed --out plinkout_chr1

Any help would be highly appreciated.

AR
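One way to sanity-check a drop like this, independent of plink, is to count directly from the info file how many variants clear the threshold. Below is a minimal sketch. It assumes the info file is tab-delimited with a header row containing a column named `Rsq` (as in Minimac-style info files), and that passing means Rsq at or above the threshold. The function name `count_rsq_pass` and its parameters are made up for illustration.

```python
import csv


def count_rsq_pass(path, threshold=0.5, rsq_col="Rsq"):
    """Stream an info file and return (n_pass, n_fail, n_unparsed).

    Assumption: tab-delimited text with a header row that has a column
    named `rsq_col`. A variant passes when its Rsq is >= threshold.
    """
    n_pass = n_fail = n_unparsed = 0
    with open(path, newline="") as handle:
        for row in csv.DictReader(handle, delimiter="\t"):
            try:
                rsq = float(row[rsq_col])
            except (TypeError, ValueError):
                # Missing or non-numeric value (for example "-").
                n_unparsed += 1
                continue
            if rsq >= threshold:
                n_pass += 1
            else:
                n_fail += 1
    return n_pass, n_fail, n_unparsed
```

Calling `count_rsq_pass("chr1.info")` and comparing the first number with the variant count plink reports should show whether the filter is doing what is expected. The file is read one row at a time, so memory use stays small even for large info files.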


I don't know if it's fine without seeing the data; that's up to you. But overall, it's pretty normal to thin out to a few dozen or so well imputed variants.


Thanks for the reply, curious. After some struggle, I was able to solve it today. The problem was with plink's --qual-threshold: it should be --qual-max-threshold instead. Now I am getting 292 million variants out of 300 million after post-imputation QC. I figured it out when I plotted the Rsquare values: before, all values were below 0.5. A silly mistake, I would say.

But I have found another suspicious thing in my Rsquare filter output. It gives me the filtered variants (Rsquare>0.5), but it also reports that around 90,000 IDs are missing. I am not able to figure out this missing-ID problem. It comes up only with plink --qual-threshold, not with any other plink module I have used during my post-imputation QC.

--qual-scores: 22796091 variants remaining, 90816 IDs missing.

Thanks. AR
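A quick way to investigate a message like this is to compare the variant IDs in the plink .bim file with those in the info file and list the ones found on only one side. The sketch below does that for a single chromosome. It assumes the .bim file has the variant ID in its second column and no header, and that the info file has the ID in its first column after one header line (matching the `7 1 1` arguments in the command above). The helper names `read_ids` and `compare_ids` are hypothetical. It does not assume which side plink counts as "missing"; it reports both.

```python
def read_ids(path, column, skip_header=False):
    """Collect the whitespace-delimited field at 0-based `column` from each line."""
    ids = set()
    with open(path) as handle:
        if skip_header:
            next(handle, None)  # drop the header line if present
        for line in handle:
            fields = line.split()
            if len(fields) > column:
                ids.add(fields[column])
    return ids


def compare_ids(bim_path, info_path):
    """Return the IDs found only in the .bim file and only in the info file."""
    bim_ids = read_ids(bim_path, column=1)  # .bim: ID is the 2nd column
    info_ids = read_ids(info_path, column=0, skip_header=True)
    return {
        "only_in_bim": bim_ids - info_ids,
        "only_in_info": info_ids - bim_ids,
    }
```

For example, `compare_ids("s2_chr1.bim", "chr1.info")` returns two sets whose sizes can be compared with the number plink prints. Both ID sets are held in memory, so this is best run per chromosome rather than on the merged files.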


Might want to look at the qqman package

4.2 years ago
curious ▴ 810

I don't think there are specific resources, you just need to keep chipping.

The common thing is to do line plots or box plots that summarize the average Rsq within each minor allele frequency bin, kind of like this:

https://imgur.com/a/IIIwZ4n

People do scatterplots sometimes too, but only for "smaller" imputations. For TopMed that's going to end up being something like 300M points, which is kind of a lot to draw.

Try to cut down the info file so it has just one column for MAF and one for Rsq. Maybe try to split it further into a file for each MAF bin, then use that. Basically, split it up any way you can.
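To make the binning idea concrete, here is a rough sketch that streams the info file once and keeps only a running count and sum of Rsq per MAF bin, so the full 12GB never has to sit in memory. It assumes a tab-delimited file (optionally gzipped) with header columns named `MAF` and `Rsq`. The bin edges and the function names are arbitrary choices for illustration, not anything prescribed by the imputation server.

```python
import csv
import gzip
from bisect import bisect_left

# Upper edges of the MAF bins (hypothetical choice, adjust as needed).
MAF_EDGES = [0.0005, 0.001, 0.005, 0.01, 0.05, 0.1, 0.2, 0.5]


def open_text(path):
    """Open a plain or gzip-compressed text file for reading."""
    if path.endswith(".gz"):
        return gzip.open(path, "rt", newline="")
    return open(path, "r", newline="")


def mean_rsq_by_maf_bin(path, maf_col="MAF", rsq_col="Rsq", edges=MAF_EDGES):
    """Return a list of (upper_edge, n_variants, mean_rsq) tuples, one per bin.

    Bin i holds variants with edges[i-1] < MAF <= edges[i]; anything above
    the last edge is counted in the last bin. mean_rsq is None for empty bins.
    """
    counts = [0] * len(edges)
    sums = [0.0] * len(edges)
    with open_text(path) as handle:
        for row in csv.DictReader(handle, delimiter="\t"):
            try:
                maf = float(row[maf_col])
                rsq = float(row[rsq_col])
            except (TypeError, ValueError):
                continue  # skip rows with missing or non-numeric values
            idx = min(bisect_left(edges, maf), len(edges) - 1)
            counts[idx] += 1
            sums[idx] += rsq
    return [
        (edge, n, (total / n) if n else None)
        for edge, n, total in zip(edges, counts, sums)
    ]
```

The result is only a handful of rows (one per bin), which can be written to a small text file and plotted as a line plot in R or anywhere else without loading hundreds of millions of points.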
