Question

Post-Imputation QC Problem

2.3 years ago

Jesse ▴ 10

Edit: See answer. tl;dr: code should have been:

plink2 --vcf [input-name] dosage=HDS --exclude-if-info "R2<=0.3" --export vcf --out [output-name]

Original post

I'm struggling with post-imputation processing of some data, and I would be very grateful for some guidance.

I have a data set that was imputed through the Michigan Imputation Server, and I now need to perform post-imputation processing. I tried running the data through PLINK 2 with the following command:

plink2 --vcf [vcffilename] dosage=DS --exclude-if-info "R2<=3" --score [scorefilename]

E.g.:

plink2 --vcf file1.DOSAGE.vcf dosage=DS --exclude-if-info "R2<=3" --score file1.INFO

I want to process this data per superpopulation (AFR, ALL, AMR, EUR, EAS, or SAS), but I have fewer than 50 samples per superpopulation. As a result, when I run the data through PLINK, it reports that I need frequency files from larger, similar populations. I figured that .freq files for the 1000G superpopulations would work for this, but I cannot for the life of me find any such files. I tried to create my own, but the files located at ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/ just have "." in the ID column, which seems to be a problem for PLINK.

Questions:

1. Do .freq files per superpopulation already exist for the 1000G data? If not, is there a simple method for creating them?
2. Is this even the correct way to perform post-imputation QC? I'm very new to working with this kind of data, so I'm honestly just flying blind and hodgepodging steps together from tutorials and methods I've found around the web.

Thanks in advance for any and all help. It is greatly appreciated.

Imputation • 2.8k views

What are you planning to do with the data when you finish the QC? I think plink calculates R^2 from the data, which would explain why it needs more than 50 samples. Does your data have an imputation INFO score? You can filter on that instead, since it's already calculated by the imputation server (I think).
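
To make the suggestion concrete, here is a minimal sketch of filtering on a precomputed score. It assumes the score lives in a whitespace-delimited table with a header row that names an `SNP` column and an `Rsq` column; the function name `rsq_pass_ids` and the 0.3 default are illustrative choices, not part of any tool.

```shell
# List variant IDs whose imputation Rsq exceeds a threshold.
# Assumption: whitespace-delimited table whose header row names
# the columns "SNP" and "Rsq". rsq_pass_ids is a hypothetical helper.
rsq_pass_ids() {
    info_file=$1
    threshold=${2:-0.3}
    awk -v thr="$threshold" '
        NR == 1 {
            # Locate the ID and Rsq columns by header name
            for (i = 1; i <= NF; i++) {
                if ($i == "SNP") id_col = i
                if ($i == "Rsq") rsq_col = i
            }
            if (!id_col || !rsq_col) {
                print "missing SNP or Rsq column" > "/dev/stderr"
                exit 1
            }
            next
        }
        # Skip non-numeric Rsq entries such as "-", keep rows above the threshold
        $rsq_col ~ /^[0-9.]/ && $rsq_col + 0 > thr + 0 { print $id_col }
    ' "$info_file"
}
```

A compressed file could be piped in with something like `gzip -dc chr1.info.gz | rsq_pass_ids - 0.3 > keep_ids.txt` (the output name is arbitrary), and the resulting ID list could then be handed to plink2's `--extract`.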

Reply · 2.3 years ago by 4galaxy77

Thank you very much for the reply!

I know that ultimately my PI's intent is to run analyses to search for correlations between variants and participant diagnoses, but I otherwise don't know what we're doing with the data post-QC.

I received two files per chromosome from the imputation server, e.g., chr1.dose.vcf.gz and chr1.info.gz. An example of the data contained in those files:

Dose file:

##fileformat=VCFv4.1
##filedate=2023.3.2
##contig=<ID=1>
##INFO=<ID=AF,Number=1,Type=Float,Description="Estimated Alternate Allele Frequency">
##INFO=<ID=MAF,Number=1,Type=Float,Description="Estimated Minor Allele Frequency">
##INFO=<ID=R2,Number=1,Type=Float,Description="Estimated Imputation Accuracy (R-square)">
##INFO=<ID=ER2,Number=1,Type=Float,Description="Empirical (Leave-One-Out) R-square (available only for genotyped variants)">
##INFO=<ID=IMPUTED,Number=0,Type=Flag,Description="Marker was imputed but NOT genotyped">
##INFO=<ID=TYPED,Number=0,Type=Flag,Description="Marker was genotyped AND imputed">
##INFO=<ID=TYPED_ONLY,Number=0,Type=Flag,Description="Marker was genotyped but NOT imputed">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=DS,Number=1,Type=Float,Description="Estimated Alternate Allele Dosage : [P(0/1)+2*P(1/1)]">
##FORMAT=<ID=HDS,Number=2,Type=Float,Description="Estimated Haploid Alternate Allele Dosage ">
##FORMAT=<ID=GP,Number=3,Type=Float,Description="Estimated Posterior Probabilities for Genotypes 0/0, 0/1 and 1/1 ">
##pipeline=michigan-imputationserver-1.7.1
##imputation=minimac4-1.0.2
##phasing=eagle-2.4
##panel=apps@1000g-phase-3-v5
##r2Filter=0.0
#CHROM  POS ID         REF ALT  QUAL FILTER INFO                                    FORMAT          SAMPLE_#1
 1  10177   1:10177:A:AC   A   AC   .    PASS   AF=0.37179;MAF=0.37179;R2=0.08507;IMPUTED   GT:DS:HDS:GP    0|1:0.947:0.278,0.669:0.239,0.575,0.186

Info file:

SNP         REF(0)  ALT(1)  ALT_Frq MAF AvgCall Rsq Genotyped   LooRsq  EmpR    EmpRsq  Dose0   Dose1
1:10177:A:AC    A   AC  0.37179 0.37179 0.66850 0.08507 Imputed         -   -   -   -   -

I started by trying to filter on Rsq because it was the only post-imputation QC recommendation I could find in the Michigan Imputation Server documentation. Do you know how I might use the INFO score to filter instead?

Thank you again!

Reply · 2.3 years ago by Jesse

score 1 · Accepted Answer · 2023-03-29

This was an operator error.

I realized that the error I was receiving came from my use of the "--score" flag, which is not needed for Rsq filtering. I also later found that I had written my "--exclude-if-info" expression incorrectly (I had typed 3 rather than 0.3). I wanted to keep variants with an R2 greater than 0.3, so that flag should have read --exclude-if-info "R2<=0.3", which excludes all variants with an R2 less than or equal to 0.3. Additionally, I realized that I was using "DS" with the dosage modifier for --vcf, which is for data that was imputed with minimac3. MIS uses minimac4, so "HDS" should have been used.

For anyone who may find this in the future, the code should have read:

plink2 --vcf [input-name] dosage=HDS --exclude-if-info "R2<=0.3" --export vcf --out [output-name]
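
Since the server returns one dose file per chromosome, the same command can be repeated in a loop. The sketch below is a dry run by default: it only prints the 22 commands, and runs them when `RUN=1` is set. Input names follow the chrN.dose.vcf.gz pattern mentioned earlier in the thread; the `chrN.r2filtered` output prefix and the `filter_cmd` helper are hypothetical names chosen for illustration.

```shell
# Build the plink2 filtering command for one chromosome.
# Assumption: inputs are named chrN.dose.vcf.gz; the output prefix
# chrN.r2filtered is a hypothetical choice.
filter_cmd() {
    chr=$1
    printf 'plink2 --vcf chr%s.dose.vcf.gz dosage=HDS --exclude-if-info "R2<=0.3" --export vcf --out chr%s.r2filtered\n' "$chr" "$chr"
}

# Print each command (dry run) unless RUN=1 is set, in which case execute it.
chr=1
while [ "$chr" -le 22 ]; do
    cmd=$(filter_cmd "$chr")
    if [ "${RUN:-0}" = "1" ]; then
        eval "$cmd"
    else
        echo "$cmd"
    fi
    chr=$((chr + 1))
done
```

One thing worth checking before relying on the output: as far as I know, a plain `--export vcf` writes hard-called genotypes, so if downstream steps need dosages, plink2's `vcf-dosage=` export modifier is probably relevant.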