Edit: See answer. tl;dr: the command should have been:
plink2 --vcf [input-name] dosage=HDS --exclude-if-info "R2<=0.3" --export vcf --out [output-name]
I'm struggling with post-imputation processing of some data, and I would be very grateful for some guidance.
I have a data set that has been imputed through Michigan Imputation Server. I now need to perform post-imputation processing. I've attempted to run the data through Plink with the following command:
plink2 --vcf [vcffilename] dosage=DS --exclude-if-info "R2<=3" --score [scorefilename]
For example:
plink2 --vcf file1.DOSAGE.vcf dosage=DS --exclude-if-info "R2<=3" --score file1.INFO
I want to process this data per superpopulation (AFR, ALL, AMR, EUR, EAS, or SAS), but I am working with sample sizes less than n=50 per superpopulation. As a result, when I try to run the data through Plink, it reports that I need frequency files from larger, similar populations. I figured that the .freq files for the 1000G superpopulations would work for this, but I cannot for the life of me find any such files. I tried to create my own, but the files located at ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/ just have "." in the ID column, which seems to be a problem for Plink.
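For the "create my own .freq files" attempt, here is a minimal sketch of how the "." ID problem can be worked around: plink2's --set-all-var-ids can assign chr:pos:ref:alt IDs in place of the missing ones, and --freq (restricted with --keep to one superpopulation's samples) then writes an .afreq file that --read-freq can consume. The bracketed VCF name is a placeholder like the ones above, eur_samples.txt is an assumed file of EUR sample IDs (one per line; the release's sample panel file maps samples to superpopulations), and the output names are made up for illustration:

```shell
#!/bin/sh
# Sketch only, not a verified pipeline. Assumes plink2 is on the PATH.
VCF="[1000G-chromosome-vcf]"   # one per-chromosome VCF from the 20130502 release
KEEP="eur_samples.txt"         # assumed: EUR sample IDs, one per line

if command -v plink2 >/dev/null 2>&1; then
  # Step 1: replace the "." variant IDs with chr:pos:ref:alt IDs
  plink2 --vcf "$VCF" --set-all-var-ids '@:#:$r:$a' --make-pgen --out g1k_ids
  # Step 2: subset to one superpopulation and write allele frequencies (.afreq),
  # which --read-freq can then supply to the small-sample scoring run
  plink2 --pfile g1k_ids --keep "$KEEP" --freq --out eur_1000G
else
  echo "plink2 not found; install it before running this sketch" >&2
fi
```

The same two steps, repeated per superpopulation sample list (AFR, AMR, EAS, SAS), would give one frequency file per group.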
- Do .freq files per superpopulation already exist for the 1000G data? If not, is there a simple method for creating them?
- Is this even the correct way to perform post-imputation QC? I'm very new to working with this kind of data, so I'm honestly just flying blind and hodgepodging steps together from tutorials and methods I've found around the web.
Thanks in advance for any and all help. It is greatly appreciated.