Question: Cannot pass QC when imputing with the Michigan Imputation Server and the Haplotype Reference Consortium panel
23 months ago, archie.w.lee wrote:

Hello All.

I have 88 samples in 23andMe format. The population is EUR, and I would like to impute them with the Michigan Imputation Server using the Haplotype Reference Consortium (HRC) panel.

The first thing I did was convert the 23andMe files to VCF with bcftools, using the following command. The reference is hg18.
    bcftools convert -c ID,CHROM,POS,AA -s sampleID -f hg18.fa --tsv2vcf sample.txt -Oz -o sampleID.vcf.gz

Then I merged the VCFs of all 88 samples into one VCF file and uploaded it to the Michigan Imputation Server, choosing the HRC panel for imputation.

But I got the following errors.

Input Validation
    1 valid VCF file(s) found.
    Samples: 88
    Chromosomes: 1
    SNPs: 164386
    Chunks: 24
    Datatype: unphased
    Reference Panel: hrc
Quality Control
    Execution successful
    Statistics: 
        Alternative allele frequency > 0.5 sites: 60,835
        Reference Overlap: 1.42% 
        Match: 214
        Allele switch: 191
        Strand flip: 191
        Strand flip and allele switch: 215
        A/T, C/G genotypes: 4
    Filtered sites: 
        Filter flag set: 0
        Invalid alleles: 62,461
        Duplicated sites: 0
        NonSNP sites: 0
        Monomorphic sites: 0
        Allele mismatch: 630
        SNPs call rate < 90%: 172
    Excluded sites in total: 63,669
    Remaining sites in total: 100,717
    Warning: 24 Chunks excluded: reference overlap < 50% (see statistics.txt for details).
    Remaining chunk(s): 0
    Error: No chunks passed the QC step. Imputation cannot be started!
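
The totals in this report can be cross-checked with a few lines of arithmetic. The sketch below uses only the numbers printed above. Counting the strand-flipped sites among the excluded ones is an inference from the fact that the sums then match, not something the log states.

```python
# Counts copied from the QC report above.
total_snps = 164_386
filtered = {
    "invalid_alleles": 62_461,
    "allele_mismatch": 630,
    "low_call_rate": 172,
}
# Assumption: both strand-flip categories are also excluded; with them
# included, the sum matches the reported "Excluded sites in total".
strand_flipped = {"strand_flip": 191, "strand_flip_and_allele_switch": 215}

excluded = sum(filtered.values()) + sum(strand_flipped.values())
remaining = total_snps - excluded
invalid_share = filtered["invalid_alleles"] / total_snps

print(f"excluded:  {excluded:,}")
print(f"remaining: {remaining:,}")
print(f"invalid alleles: {invalid_share:.1%} of uploaded sites")
```

The computed totals agree with the report (63,669 excluded, 100,717 remaining), and nearly all of the loss comes from the invalid-allele sites.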

I also got some statistics output like this:
"
......
Invalid Alleles: 1 (C/.)
Invalid Alleles: 1 (G/.)
Invalid Alleles: 1 (G/.)
Invalid Alleles: 1 (C/.)
Invalid Alleles: 1 (G/.)
Invalid Alleles: 1 (G/.)
Invalid Alleles: 1 (C/.)
Invalid Alleles: 1 (G/.)
Invalid Alleles: 1 (G/.)
Invalid Alleles: 1 (G/.)
Invalid Alleles: 1 (T/.)
Invalid Alleles: 1 (A/.)
Invalid Alleles: 1 (G/C,T)
Invalid Alleles: 1 (A/.)
Invalid Alleles: 1 (G/.)
Invalid Alleles: 1 (C/.)
Invalid Alleles: 1 (C/.)
Invalid Alleles: 1 (C/.)
......
INFO - Allele switch: rs4970362 - pos: 1084601 (ref: G/A, data: A/G)
INFO - Allele switch: rs6697886 - pos: 1163474 (ref: A/G, data: G/A)
FILTER - Low call rate: rs6697886 - pos: 1163474 (0.25)
FILTER - Allele mismatch: rs12563338 - pos: 1188481 (ref: G/A, data: T/A)
......
chunk_1_0000000001_0010000000 (Snps: 4158, Reference overlap: 0.017069701280227598, low sample call rates: false)
chunk_1_0010000001_0020000000 (Snps: 4547, Reference overlap: 0.018189692507579038, low sample call rates: false)
chunk_1_0020000001_0030000000 (Snps: 4094, Reference overlap: 0.01352657004830918, low sample call rates: false)
chunk_1_0030000001_0040000000 Sample NA06985: call rate: 0.49807037457434733
chunk_1_0030000001_0040000000 (Snps: 4405, Reference overlap: 0.012578616352201259, low sample call rates: true)
chunk_1_0040000001_0050000000 (Snps: 3991, Reference overlap: 0.016057312252964428, low sample call rates: false)
chunk_1_0050000001_0060000000 Sample NA06985: call rate: 0.4794905008635579
chunk_1_0050000001_0060000000 (Snps: 4632, Reference overlap: 0.01344717182497332, low sample call rates: true)
chunk_1_0060000001_0070000000 (Snps: 4885, Reference overlap: 0.017154389505549948, low sample call rates: false)
chunk_1_0070000001_0080000000 (Snps: 3948, Reference overlap: 0.014024542950162784, low sample call rates: false)
chunk_1_0080000001_0090000000 (Snps: 4674, Reference overlap: 0.013550709294939657, low sample call rates: false)
chunk_1_0090000001_0100000000 (Snps: 4589, Reference overlap: 0.01058543961978829, low sample call rates: false)
chunk_1_0100000001_0110000000 (Snps: 3942, Reference overlap: 0.016504126031507877, low sample call rates: false)
chunk_1_0110000001_0120000000 (Snps: 4830, Reference overlap: 0.014309076042518397, low sample call rates: false)
chunk_1_0120000001_0130000000 Sample NA06985: call rate: 0.4512820512820513
chunk_1_0120000001_0130000000 Sample NA07346: call rate: 0.47692307692307695
chunk_1_0120000001_0130000000 Sample NA12145: call rate: 0.48205128205128206
chunk_1_0120000001_0130000000 Sample NA12287: call rate: 0.47692307692307695
chunk_1_0120000001_0130000000 Sample NA12751: call rate: 0.49230769230769234
chunk_1_0120000001_0130000000 Sample NA12843: call rate: 0.4717948717948718
chunk_1_0120000001_0130000000 (Snps: 195, Reference overlap: 0.01015228426395939, low sample call rates: true)
chunk_1_0140000001_0150000000 (Snps: 1360, Reference overlap: 0.002932551319648094, low sample call rates: false)
chunk_1_0150000001_0160000000 (Snps: 4526, Reference overlap: 0.01504907306434024, low sample call rates: false)
chunk_1_0160000001_0170000000 (Snps: 5688, Reference overlap: 0.013368055555555555, low sample call rates: false)
chunk_1_0170000001_0180000000 (Snps: 4290, Reference overlap: 0.011305952930318412, low sample call rates: false)
chunk_1_0180000001_0190000000 (Snps: 4107, Reference overlap: 0.01516610495907559, low sample call rates: false)
chunk_1_0190000001_0200000000 (Snps: 4062, Reference overlap: 0.014111922141119221, low sample call rates: false)
chunk_1_0200000001_0210000000 (Snps: 5175, Reference overlap: 0.01111963190184049, low sample call rates: false)
chunk_1_0210000001_0220000000 (Snps: 4956, Reference overlap: 0.012385137834598482, low sample call rates: false)
chunk_1_0220000001_0230000000 Sample NA06985: call rate: 0.4813989752728893
chunk_1_0220000001_0230000000 (Snps: 4489, Reference overlap: 0.011032656663724626, low sample call rates: true)
chunk_1_0230000001_0240000000 (Snps: 5860, Reference overlap: 0.01815126050420168, low sample call rates: false)
chunk_1_0240000001_0250000000 (Snps: 3314, Reference overlap: 0.017533432392273403, low sample call rates: false)

"
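
The per-chunk lines above can be summarised programmatically. This is a minimal sketch that assumes the line layout shown in this excerpt; the 0.5 threshold is taken from the warning "reference overlap < 50%", and the function name is made up for illustration.

```python
import re

# Pattern for the per-chunk summary lines shown in the excerpt above.
CHUNK_RE = re.compile(
    r"^(?P<chunk>chunk_\S+) \(Snps: (?P<snps>\d+), "
    r"Reference overlap: (?P<overlap>[0-9.eE+-]+), "
    r"low sample call rates: (?P<low>true|false)\)$"
)

MIN_OVERLAP = 0.5  # from the warning "reference overlap < 50%"


def parse_chunks(lines):
    """Yield (chunk, snps, overlap, low_call_rate) for each chunk summary line."""
    for line in lines:
        m = CHUNK_RE.match(line.strip())
        if m:  # per-sample call-rate lines do not match and are skipped
            yield (m["chunk"], int(m["snps"]), float(m["overlap"]), m["low"] == "true")


sample = [
    "chunk_1_0000000001_0010000000 (Snps: 4158, Reference overlap: 0.017069701280227598, low sample call rates: false)",
    "chunk_1_0030000001_0040000000 Sample NA06985: call rate: 0.49807037457434733",
    "chunk_1_0030000001_0040000000 (Snps: 4405, Reference overlap: 0.012578616352201259, low sample call rates: true)",
]

chunks = list(parse_chunks(sample))
failing = [c for c in chunks if c[2] < MIN_OVERLAP]
print(f"{len(failing)} of {len(chunks)} chunks below {MIN_OVERLAP:.0%} overlap")
```

Every chunk listed above has an overlap of roughly 1 to 2%, far below the threshold, so the problem affects the whole chromosome rather than a few regions.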

I do not quite understand "Invalid alleles: 62,461". It seems that I need to clean up the raw data, but I think that would lose 62,461 of the 164,386 SNPs.
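
One plausible reading of the `Invalid Alleles` lines, offered as an assumption rather than the server's documented rule: each entry appears to print REF/ALT, so `C/.` would be a site with no alternate allele recorded (for example, because every sample is homozygous for the reference base), and `G/C,T` a site with more than one alternate allele. The sketch below sorts REF/ALT pairs into those cases; the function name and category labels are hypothetical.

```python
VALID_BASES = {"A", "C", "G", "T"}


def classify_site(ref: str, alt: str) -> str:
    """Label a REF/ALT pair the way the entries above appear to be grouped."""
    if alt in (".", ""):
        return "no ALT allele"      # e.g. C/.
    if "," in alt:
        return "multiallelic"       # e.g. G/C,T
    if ref in VALID_BASES and alt in VALID_BASES and ref != alt:
        return "biallelic SNP"
    return "other"                  # indels, symbolic alleles, ...


for ref, alt in [("C", "."), ("G", "C,T"), ("G", "A")]:
    print(f"{ref}/{alt}: {classify_site(ref, alt)}")
```

If this reading is right, most of the 62,461 sites show no variation among the 88 samples, so losing them would cost little information for imputation.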

What should I do now? Any help would be greatly appreciated.

21 months ago, Vince (Montreal, Quebec, Canada) wrote:

There is a nice tool for this:

http://www.well.ox.ac.uk/~wrayner/tools/HRC-check-bim.zip
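
The categories in the QC report (match, allele switch, strand flip, strand flip and allele switch, allele mismatch) can be illustrated by comparing a data allele pair against a reference allele pair. This is a sketch of the general idea only, not the code of the linked tool or of the imputation server. The function name is hypothetical, and it assumes biallelic SNPs with two distinct bases.

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def compare_alleles(ref_pair, data_pair):
    """Compare (REF, ALT) from the data against (REF, ALT) from the panel."""
    ref_pair, data_pair = tuple(ref_pair), tuple(data_pair)
    flipped = tuple(COMPLEMENT[b] for b in data_pair)  # opposite strand
    if set(ref_pair) not in (set(data_pair), set(flipped)):
        return "allele mismatch"
    # A/T and C/G SNPs read the same on both strands, so a flip is undetectable.
    if COMPLEMENT[data_pair[0]] == data_pair[1]:
        return "ambiguous (A/T or C/G)"
    if data_pair == ref_pair:
        return "match"
    if data_pair[::-1] == ref_pair:
        return "allele switch"
    if flipped == ref_pair:
        return "strand flip"
    return "strand flip and allele switch"


# Examples taken from the statistics output in the question.
print(compare_alleles(("G", "A"), ("A", "G")))  # rs4970362
print(compare_alleles(("G", "A"), ("T", "A")))  # rs12563338
```

In the report in the question, the four comparable categories are almost equally sized (214, 191, 191, 215). That is roughly what one might expect if positions coincide with panel sites mostly by chance, which would be consistent with a genome-build mismatch, though the log alone does not prove it.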

10 months ago, shengchao.li wrote:

A possible reason is that your data are not in GRCh37/hg19 coordinates, as the Michigan Imputation Server requires. Your data may be in hg18 or hg38 instead. You may want to use a tool such as UCSC liftOver to convert the positions to GRCh37.
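
UCSC liftOver works on BED intervals, so one way to prepare the data is to write each 23andMe position as a one-base BED record before lifting. The sketch below assumes the usual 23andMe raw layout (tab-separated rsid, chromosome, position, genotype, with `#` comment lines) and UCSC-style `chr` names. The function name and the file names in the comment are placeholders.

```python
def tsv_to_bed(lines):
    """Convert 23andMe-style rows to BED rows (0-based start, end-exclusive)."""
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and header comments
        rsid, chrom, pos, _genotype = line.rstrip("\n").split("\t")[:4]
        chrom = "chrM" if chrom == "MT" else f"chr{chrom}"
        pos = int(pos)
        # A 1-based position p becomes the BED interval [p-1, p).
        yield f"{chrom}\t{pos - 1}\t{pos}\t{rsid}"


# Illustrative input: the rsid and position appear in the question's log,
# the genotype is made up.
rows = [
    "# rsid\tchromosome\tposition\tgenotype",
    "rs4970362\t1\t1084601\tAG",
]
for bed_line in tsv_to_bed(rows):
    print(bed_line)
# The BED file could then be lifted with, for example:
#   liftOver input.bed hg18ToHg19.over.chain.gz lifted.bed unmapped.bed
```

After lifting, the new positions would need to be written back to the genotype files and the VCF conversion repeated against an hg19 FASTA, so that the REF alleles come from the matching build.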
