I am currently preprocessing UK Biobank genetic data in order to run a GWAS. I am working with the imputed data, which I transformed to .pgen files and then merged into a single file containing all chromosomes. I have run into a problem with excluding variants with poor imputation quality. The PLINK2 documentation appears to provide several options:
- exclude-if-info. I am not totally sure if this function does what I intend, because it mentions needing a 'key' to evaluate against, and I do not know what 'key' is intended here.
- mach-r2-filter. This definitely allows you to select variants based on imputation quality. However, the UK Biobank data is imputed using IMPUTE, which uses a different imputation quality metric (although both appear to be called 'INFO'?). The imputation quality in the mfi files provided by the UK Biobank ranges from 0 to 1 while the MaCH R2 appears to run from 0 to 2 based on the PLINK2 documentation ("mach-r2-filter excludes variants where the MaCH Rsq imputation quality metric (frequently labeled as 'INFO') is outside [0.1, 2.0]"), so I assume the measure in the mfi file does is not the same as the MaCH-R2 metric?
- exclude. This is the one I have been trying for a while now. I have selected SNPs from the mfi files based on the imputation quality threshold, saved the SNP IDs to a .txt file and used that as input for the --exclude command. However, for some reason it keeps saying that the number of variants after using the exclude command is the same as before. I have tried several different method of identifying the SNPs, including rs ID, the method that the data is in the .pvar file (1:10586 for example), and the method I used to save the SNP IDs when transforming the .bgen to .pgen files (chr1:10486A,G for example). (EDIT: IGNORE THIS CONFUSING DESCRIPTION, IT'S INACCURATE; SEE COMMENTS.) This is the full command:
plink2 --pfile ukb22828_merged_only --exclude ukb_info_exclusions_3_v4.txt --memory 18000 --make-pgen --out ukb22828_merged_keep_exclude
This is the relevant part of the output in slurm (the logfiles are empty for some reason):
PLINK v2.00a4LM 64-bit Intel (9 Jan 2023) www.cog-genomics.org/plink/2.0/ (C) 2005-2023 Shaun Purcell, Christopher Chang GNU General Public License v3 Logging to ukb22828_merged_keep_exclude.log. Options in effect: --exclude ukb_info_exclusions_3_v4.txt --make-pgen --memory 18000 --out ukb22828_merged_keep_exclude --pfile ukb22828_merged_only
Start time: Fri Feb 3 17:15:48 2023 128549 MiB RAM detected; reserving 18000 MiB for main workspace. Using up to 3 compute threads. 408090 samples (220606 females, 187484 males; 408090 founders) loaded from ukb22828_merged_only.psam. 93095623 variants loaded from ukb22828_merged_only.pvar. Note: No phenotype data present. --exclude: 93095623 variants remaining. 93095623 variants remaining after main filters.
This is the start of the ukb_info_exclusions_3_v4.txt file:
It's not failing, it's working fine, but it doesn't exclude any variants. Does anyone know of a solution? I mostly just want to be able to exclude the SNPs with poor imputation quality, but I would also really appreciate knowing what I'm doing wrong with my tries to exclude certain SNPs, in case I need it in the future.
Thank you in advance and kind regards.
The --exclude file must use the same variant IDs as your .pvar file. You should elaborate on what seemed to go wrong when you tried to arrange that.
I have tried that (the .pvar file contains the "1:10586" form as ID which I mentioned above) which leads to the same output as above, 0 variants excluded, 93095623 variants remaining after exclude filter.
Edit: I should mention that it also contains the ref and alt alleles, but those were also included in the .txt file behind the "1:10586".
If you aren't willing to provide more precise information than this, sufficient for a reader to reproduce what you're seeing, I will delete your question.
I feel like I reproduced all the information I could, as I do not know the inner workings of PLINK2 and this is all I know of the problem. However, the issue has been resolved and the question can be deleted. I noticed our cluster was experiencing upload issues and after deleting and reuploading the file through another route the "1:10585[REF],[ALT]" IDs now worked. Thank you for your thoughts.
This isn't a matter of "know[ing] the inner workings of PLINK2". It's about clear communication and debugging practice.
You never posted a direct excerpt from the .pvar file that is actually in the failing command, which should obviously be the one that matters; this doesn't require knowledge of PLINK2's internals. And from your resolution, it's clear that you MISREPORTED what the ".pvar file contains", at least for the .pvar file that matters. At minimum, you should have caught that inconsistency after my first comment, instead of just saying "I have tried that" when I explicitly asked you to elaborate.
I'm sorry. I was not aware that the .pvar file was where I had to look for the correct IDs. The .pvar file really does contain the IDs as "1:10586A,G", I do not know why this is not possible. I figured showing you the same slurm output that I had already posted in my original question would not add to the clarity of my question.
Edit: Ah, I see now that my last response was incomplete indeed (I wasn't as careful as I figured the question would now be deleted), but the .txt file was the one I had to reupload, which had correct IDs on my computer and I had no reason to expect was uploaded wrongly as the issues only started on the day I asked this question. On the cluster, that one must have had incorrect IDs.
I cannot find a way to delete this question myself, so feel free to do it as I doubt this issue and its resolution will be relevant for anyone else.