Post-Imputation QC Problem
10 days ago
Jesse • 0

I'm struggling with post-imputation processing of some data, and I would be very grateful for some guidance.

I have a data set that has been imputed through the Michigan Imputation Server, and I now need to perform post-imputation processing. I attempted to run the data through Plink with the following command:

plink2 --vcf [vcffilename] dosage=DS --exclude-if-info "R2<=0.3" --score [scorefilename]

E.g.:

plink2 --vcf file1.DOSAGE.vcf dosage=DS --exclude-if-info "R2<=0.3" --score file1.INFO


I want to process these data per superpopulation (AFR, ALL, AMR, EUR, EAS, or SAS), but I have fewer than 50 samples per superpopulation. As a result, when I run the data through Plink, it reports that I need frequency files from a larger, similar population. I figured that .freq files for the 1000G superpopulations would work for this, but I cannot find any such files. I tried to create my own, but the files located at ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/ just have "." in the ID column, which seems to be a problem for Plink.
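One possible route to per-superpopulation frequency files is sketched below. It assumes the 1000G release directory also provides a sample panel file (believed to be integrated_call_samples_v3.20130502.ALL.panel) with the columns sample, pop, super_pop, and gender. The function name `split_by_superpop` and the output file names are hypothetical, chosen only for illustration.

```shell
# Hypothetical helper: split a 1000G-style panel file into one sample list
# per superpopulation. Assumed columns: sample, pop, super_pop, gender.
# Usage: split_by_superpop <panel-file> [output-dir]
split_by_superpop() {
  panel=$1
  outdir=${2:-.}
  # Skip the header line, then append each sample ID to <super_pop>.samples.txt
  awk -v dir="$outdir" '
    NR > 1 && $3 != "" { print $1 > (dir "/" $3 ".samples.txt") }
  ' "$panel"
}
```

With one list per superpopulation, a command along the lines of `plink2 --vcf [1000G vcf] --keep EUR.samples.txt --set-all-var-ids '@:#:$r:$a' --freq --out EUR` should replace the "." IDs with chr:pos:ref:alt IDs and write an allele frequency file, which `--read-freq` can then supply to the scoring run. This has not been run against the actual release files. Long indel alleles may need `--new-id-max-allele-len`, and the generated IDs have to match the ID style in the imputed data.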

Questions:

1. Do .freq files per superpopulation already exist for the 1000G data? If not, is there a simple method for creating them?
2. Is this even the correct way to perform post-imputation QC? I'm very new to working with this kind of data, so I'm honestly just flying blind and hodgepodging steps together from tutorials and methods I've found around the web.

Thanks in advance for any and all help. It is greatly appreciated.

Imputation • 329 views

What are you planning to do with the data when you finish the QC? I think plink calculates R^2 using the data, hence why it needs more than 50 samples. Does your data have an imputation INFO score? You can filter on that instead since it's already calculated by the imputation server (I think).
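A quick way to check which INFO fields a VCF declares is to read its header. The sketch below reads a VCF (already decompressed) from standard input or a file. The function name `list_info_ids` is hypothetical.

```shell
# Hypothetical helper: print the ID of every INFO field declared in a VCF header.
# Usage: gzip -dc [vcffilename] | list_info_ids
list_info_ids() {
  awk '
    # Header lines look like: ##INFO=<ID=R2,Number=1,...>
    /^##INFO=<ID=/ {
      id = $0
      sub(/^##INFO=<ID=/, "", id)   # drop everything before the ID
      sub(/,.*/, "", id)            # drop everything after the ID
      print id
    }
    # The column header line ends the VCF header, so stop reading there
    /^#CHROM/ { exit }
  ' "$@"
}
```

If the output includes an imputation quality field, that field can be used in an `--exclude-if-info` or `--extract-if-info` expression without recomputing anything from the samples.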


Thank you very much for the reply!

I know that ultimately my PI's intent is to run analyses to search for correlations between variants and participant diagnoses, but I otherwise don't know what we're doing with the data post-QC.

I received two files per chromosome from the imputation server, e.g., chr1.dose.vcf.gz and chr1.info.gz. An example of the data contained in those files:

Dose file:

##fileformat=VCFv4.1
##filedate=2023.3.2
##contig=<ID=1>
##INFO=<ID=AF,Number=1,Type=Float,Description="Estimated Alternate Allele Frequency">
##INFO=<ID=MAF,Number=1,Type=Float,Description="Estimated Minor Allele Frequency">
##INFO=<ID=R2,Number=1,Type=Float,Description="Estimated Imputation Accuracy (R-square)">
##INFO=<ID=ER2,Number=1,Type=Float,Description="Empirical (Leave-One-Out) R-square (available only for genotyped variants)">
##INFO=<ID=IMPUTED,Number=0,Type=Flag,Description="Marker was imputed but NOT genotyped">
##INFO=<ID=TYPED,Number=0,Type=Flag,Description="Marker was genotyped AND imputed">
##INFO=<ID=TYPED_ONLY,Number=0,Type=Flag,Description="Marker was genotyped but NOT imputed">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=DS,Number=1,Type=Float,Description="Estimated Alternate Allele Dosage : [P(0/1)+2*P(1/1)]">
##FORMAT=<ID=HDS,Number=2,Type=Float,Description="Estimated Haploid Alternate Allele Dosage ">
##FORMAT=<ID=GP,Number=3,Type=Float,Description="Estimated Posterior Probabilities for Genotypes 0/0, 0/1 and 1/1 ">
##pipeline=michigan-imputationserver-1.7.1
##imputation=minimac4-1.0.2
##phasing=eagle-2.4
##panel=apps@1000g-phase-3-v5
##r2Filter=0.0
#CHROM  POS ID         REF ALT  QUAL FILTER INFO                                    FORMAT          SAMPLE_#1
1  10177   1:10177:A:AC   A   AC   .    PASS   AF=0.37179;MAF=0.37179;R2=0.08507;IMPUTED   GT:DS:HDS:GP    0|1:0.947:0.278,0.669:0.239,0.575,0.186


Info file:

SNP         REF(0)  ALT(1)  ALT_Frq MAF AvgCall Rsq Genotyped   LooRsq  EmpR    EmpRsq  Dose0   Dose1
1:10177:A:AC    A   AC  0.37179 0.37179 0.66850 0.08507 Imputed         -   -   -   -   -


I started by trying to filter on Rsq because it was the only post-imputation QC recommendation I could find in the Michigan Imputation Server documentation. Do you know how I might use the INFO score to filter instead?
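One way to filter on the Rsq column of the info file directly is sketched below. The function name `rsq_keep_list`, the `MIN_RSQ` variable, and the 0.3 default are assumptions made for illustration, not server recommendations. Rows whose Rsq is not numeric are skipped.

```shell
# Hypothetical helper: list variant IDs whose Rsq meets a minimum threshold.
# The threshold comes from MIN_RSQ (assumed default: 0.3).
# Usage: gzip -dc chr1.info.gz | rsq_keep_list > chr1.rsq_keep.txt
rsq_keep_list() {
  awk -v min="${MIN_RSQ:-0.3}" '
    # Locate the Rsq column by name on the header line
    NR == 1 { for (i = 1; i <= NF; i++) if ($i == "Rsq") col = i; next }
    # Keep rows with a numeric Rsq at or above the threshold; print the SNP ID
    col && $col ~ /^[0-9.]+$/ && $col + 0 >= min + 0 { print $1 }
  ' "$@"
}
```

The resulting list could then be passed to plink2 with `--extract chr1.rsq_keep.txt`. Because the example record shows the same value in the VCF INFO field (R2=0.08507) and in the info file (Rsq 0.08507), filtering the VCF with `--extract-if-info "R2>=0.3"` should select the same variants, assuming the two files agree throughout.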

Thank you again!