Cannot pass QC when doing imputation with the Michigan Imputation Server and the Haplotype Reference Consortium panel
8.2 years ago
FAST_GENOME ▴ 50

Hello All.

I have 88 samples in 23andMe format. The population is EUR. I would like to impute them with the Michigan Imputation Server, using the Haplotype Reference Consortium (HRC) panel.

The first thing I did was convert the 23andMe files to VCF format with bcftools, using the following command. The reference is hg18.

bcftools convert -c ID,CHROM,POS,AA -s sampleID -f hg18.fa --tsv2vcf sample.txt -Oz -o sampleID.vcf.gz

Then I merged the VCFs of all 88 samples into one VCF file and uploaded it to the Michigan Imputation Server. I chose the HRC panel for imputation.

But I got the following errors.

Input Validation
    1 valid VCF file(s) found.
    Samples: 88
    Chromosomes: 1
    SNPs: 164386
    Chunks: 24
    Datatype: unphased
    Reference Panel: hrc
Quality Control
    Execution successful
    Statistics:
        Alternative allele frequency > 0.5 sites: 60,835
        Reference Overlap: 1.42%
        Match: 214
        Allele switch: 191
        Strand flip: 191
        Strand flip and allele switch: 215
        A/T, C/G genotypes: 4
    Filtered sites:
        Filter flag set: 0
        Invalid alleles: 62,461
        Duplicated sites: 0
        NonSNP sites: 0
        Monomorphic sites: 0
        Allele mismatch: 630
        SNPs call rate < 90%: 172
    Excluded sites in total: 63,669
    Remaining sites in total: 100,717
    Warning: 24 Chunks excluded: reference overlap < 50% (see statistics.txt for details).
    Remaining chunk(s): 0
    Error: No chunks passed the QC step. Imputation cannot be started!

I also got some statistics like this:

......
Invalid Alleles: 1 (C/.)
Invalid Alleles: 1 (G/.)
Invalid Alleles: 1 (G/.)
Invalid Alleles: 1 (C/.)
Invalid Alleles: 1 (G/.)
Invalid Alleles: 1 (G/.)
Invalid Alleles: 1 (C/.)
Invalid Alleles: 1 (G/.)
Invalid Alleles: 1 (G/.)
Invalid Alleles: 1 (G/.)
Invalid Alleles: 1 (T/.)
Invalid Alleles: 1 (A/.)
Invalid Alleles: 1 (G/C,T)
Invalid Alleles: 1 (A/.)
Invalid Alleles: 1 (G/.)
Invalid Alleles: 1 (C/.)
Invalid Alleles: 1 (C/.)
Invalid Alleles: 1 (C/.)
......
INFO - Allele switch: rs4970362 - pos: 1084601 (ref: G/A, data: A/G)
INFO - Allele switch: rs6697886 - pos: 1163474 (ref: A/G, data: G/A)
FILTER - Low call rate: rs6697886 - pos: 1163474 (0.25)
FILTER - Allele mismatch: rs12563338 - pos: 1188481 (ref: G/A, data: T/A)
......
chunk_1_0000000001_0010000000 (Snps: 4158, Reference overlap: 0.017069701280227598, low sample call rates: false)
chunk_1_0010000001_0020000000 (Snps: 4547, Reference overlap: 0.018189692507579038, low sample call rates: false)
chunk_1_0020000001_0030000000 (Snps: 4094, Reference overlap: 0.01352657004830918, low sample call rates: false)
chunk_1_0030000001_0040000000 Sample NA06985: call rate: 0.49807037457434733
chunk_1_0030000001_0040000000 (Snps: 4405, Reference overlap: 0.012578616352201259, low sample call rates: true)
chunk_1_0040000001_0050000000 (Snps: 3991, Reference overlap: 0.016057312252964428, low sample call rates: false)
chunk_1_0050000001_0060000000 Sample NA06985: call rate: 0.4794905008635579
chunk_1_0050000001_0060000000 (Snps: 4632, Reference overlap: 0.01344717182497332, low sample call rates: true)
chunk_1_0060000001_0070000000 (Snps: 4885, Reference overlap: 0.017154389505549948, low sample call rates: false)
chunk_1_0070000001_0080000000 (Snps: 3948, Reference overlap: 0.014024542950162784, low sample call rates: false)
chunk_1_0080000001_0090000000 (Snps: 4674, Reference overlap: 0.013550709294939657, low sample call rates: false)
chunk_1_0090000001_0100000000 (Snps: 4589, Reference overlap: 0.01058543961978829, low sample call rates: false)
chunk_1_0100000001_0110000000 (Snps: 3942, Reference overlap: 0.016504126031507877, low sample call rates: false)
chunk_1_0110000001_0120000000 (Snps: 4830, Reference overlap: 0.014309076042518397, low sample call rates: false)
chunk_1_0120000001_0130000000 Sample NA06985: call rate: 0.4512820512820513
chunk_1_0120000001_0130000000 Sample NA07346: call rate: 0.47692307692307695
chunk_1_0120000001_0130000000 Sample NA12145: call rate: 0.48205128205128206
chunk_1_0120000001_0130000000 Sample NA12287: call rate: 0.47692307692307695
chunk_1_0120000001_0130000000 Sample NA12751: call rate: 0.49230769230769234
chunk_1_0120000001_0130000000 Sample NA12843: call rate: 0.4717948717948718
chunk_1_0120000001_0130000000 (Snps: 195, Reference overlap: 0.01015228426395939, low sample call rates: true)
chunk_1_0140000001_0150000000 (Snps: 1360, Reference overlap: 0.002932551319648094, low sample call rates: false)
chunk_1_0150000001_0160000000 (Snps: 4526, Reference overlap: 0.01504907306434024, low sample call rates: false)
chunk_1_0160000001_0170000000 (Snps: 5688, Reference overlap: 0.013368055555555555, low sample call rates: false)
chunk_1_0170000001_0180000000 (Snps: 4290, Reference overlap: 0.011305952930318412, low sample call rates: false)
chunk_1_0180000001_0190000000 (Snps: 4107, Reference overlap: 0.01516610495907559, low sample call rates: false)
chunk_1_0190000001_0200000000 (Snps: 4062, Reference overlap: 0.014111922141119221, low sample call rates: false)
chunk_1_0200000001_0210000000 (Snps: 5175, Reference overlap: 0.01111963190184049, low sample call rates: false)
chunk_1_0210000001_0220000000 (Snps: 4956, Reference overlap: 0.012385137834598482, low sample call rates: false)
chunk_1_0220000001_0230000000 Sample NA06985: call rate: 0.4813989752728893
chunk_1_0220000001_0230000000 (Snps: 4489, Reference overlap: 0.011032656663724626, low sample call rates: true)
chunk_1_0230000001_0240000000 (Snps: 5860, Reference overlap: 0.01815126050420168, low sample call rates: false)
chunk_1_0240000001_0250000000 (Snps: 3314, Reference overlap: 0.017533432392273403, low sample call rates: false)

I don't quite understand the "Invalid alleles: 62,461" line. It seems that I need to clean up the raw data, but I think that would lose 62,461 of the 164,386 SNPs.

What should I do now? Any help would be greatly appreciated.
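
The "Invalid Alleles: 1 (C/.)" lines in the statistics above have "." as the alternate allele, which in a VCF means no alternate allele was recorded at that site, and "G/C,T" is a site with more than one alternate allele. As a rough way to see how many records fall into each group before uploading, here is a minimal Python sketch. It assumes an uncompressed, tab-separated VCF; the function names `classify_alt` and `tally_alt_classes` and the example records are made up for illustration.

```python
from collections import Counter

def classify_alt(alt):
    """Classify a VCF ALT field as 'no_alt', 'multiallelic', or 'biallelic'."""
    if alt == ".":
        return "no_alt"
    if "," in alt:
        return "multiallelic"
    return "biallelic"

def tally_alt_classes(vcf_lines):
    """Count records per ALT class, skipping header and empty lines."""
    counts = Counter()
    for line in vcf_lines:
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        counts[classify_alt(fields[4])] += 1  # column 5 of a VCF record is ALT
    return counts

# Made-up records for illustration only (not taken from the real data).
example = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "1\t1000\trsA\tC\t.\t.\t.\t.",
    "1\t2000\trsB\tG\tC,T\t.\t.\t.",
    "1\t3000\trsC\tA\tG\t.\t.\t.",
]
print(tally_alt_classes(example))
```

For a real file, the same function can be fed an open file handle, for example `tally_alt_classes(open("merged.vcf"))`, where `merged.vcf` is a placeholder name.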

HRC 23andme imputation genotype • 7.4k views

Hello,

I had the same problem today. The reference overlap is very low, around 1.7%. There is no hint of what happened. Does anyone have any tips? Did you try the script shared by Vince? I'm sure that the data is hg19.

I'm new to imputation.

Thanks in advance


Hi,

I had the same problem today. My reference overlap is 1.74%. How did you solve the problem? Could you please give me some advice?

Thank you very much.


Hi, I had the same problem right now! Did you manage to get it to work?


Hi, I have the same problem now. Did you solve it?

8.1 years ago
Vince ▴ 150

There is a nice tool for this:

http://www.well.ox.ac.uk/~wrayner/tools/HRC-check-bim.zip

7.1 years ago
shengchao.li ▴ 30

A possible reason is that your data is not in GRCh37/hg19, as required by the Michigan Imputation Server. Your data may be in hg18 or hg38. You may want to use a tool such as UCSC liftOver to convert your data to GRCh37.
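
One way to check the build before lifting over is to compare the positions of a handful of rsIDs in your file with their GRCh37 positions from a source you trust. Below is a minimal Python sketch of that comparison. It assumes both sets of positions are already loaded into dictionaries keyed by rsID; the function name `fraction_matching`, the rsIDs, and the coordinates are all hypothetical.

```python
def fraction_matching(data_positions, reference_positions):
    """Fraction of shared rsIDs whose (chrom, pos) agrees between the two maps.

    Returns None when the two maps share no rsIDs.
    """
    shared = set(data_positions) & set(reference_positions)
    if not shared:
        return None
    same = sum(1 for rs in shared if data_positions[rs] == reference_positions[rs])
    return same / len(shared)

# Hypothetical rsIDs and coordinates, for illustration only.
data = {"rs0001": ("1", 1000), "rs0002": ("1", 2500), "rs0003": ("1", 9000)}
grch37 = {
    "rs0001": ("1", 1100),
    "rs0002": ("1", 2500),
    "rs0003": ("1", 9400),
    "rs0004": ("1", 50),
}
print(fraction_matching(data, grch37))
```

If most shared rsIDs disagree on position, the file is probably on a different build than the lookup table; if nearly all agree, the build is less likely to be the cause.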

3.1 years ago
binodregmi30 ▴ 10

A reference overlap as low as 2% or even less is a typical sign that the wrong genome build was used.
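
To see this per chunk, the chunk lines from the statistics file in the question can be parsed and compared with the 50% cutoff named in the QC warning. Here is a minimal Python sketch; it assumes the line format shown in the question, and the function name `low_overlap_chunks` is made up for illustration.

```python
import re

# Matches summary lines such as:
# chunk_1_0000000001_0010000000 (Snps: 4158, Reference overlap: 0.017..., low sample call rates: false)
CHUNK_RE = re.compile(
    r"^(chunk_\S+) \(Snps: (\d+), Reference overlap: ([0-9.eE+-]+),"
)

def low_overlap_chunks(log_lines, threshold=0.5):
    """Return (chunk, snps, overlap) for chunks whose overlap is below threshold."""
    flagged = []
    for line in log_lines:
        m = CHUNK_RE.match(line.strip())
        if m is None:
            continue  # skips other lines, e.g. per-sample call-rate lines
        name, snps, overlap = m.group(1), int(m.group(2)), float(m.group(3))
        if overlap < threshold:
            flagged.append((name, snps, overlap))
    return flagged

# Three lines copied from the statistics output in the question.
log = [
    "chunk_1_0000000001_0010000000 (Snps: 4158, Reference overlap: 0.017069701280227598, low sample call rates: false)",
    "chunk_1_0030000001_0040000000 Sample NA06985: call rate: 0.49807037457434733",
    "chunk_1_0120000001_0130000000 (Snps: 195, Reference overlap: 0.01015228426395939, low sample call rates: true)",
]
for name, snps, overlap in low_overlap_chunks(log):
    print(f"{name}: {snps} SNPs, overlap {overlap:.2%}")
```

When every chunk is flagged with an overlap of only a few percent, as in the question, the cause is more likely something that affects the whole file, such as the build, than a few bad regions.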


How do you change the genomic build of the VCF? I am stuck on this now.
