Question: CrossMap liftover: REF alleles in VCF files overwritten by reference genome file.
1
gravatar for freezingsleet97
12 months ago by
Scripps Research
freezingsleet9710 wrote:

Hi all,

I have tried to liftover VCF files from Affymetrix SNP array (NCBI36) to GRCh37 by CrossMap (v0.3.1, working on python3.7), but several fatal cases caused fail or false conversion. It only happened to VCF files when CrossMap incorporating reference genome file for conversion.

To repeat the following code, the chain file (NCBI36_to_GRCh37.chain.gz) was from Crossmap documentation and reference genome (human_g1k_v37.fasta.gz) was from 1000 Genomes Project. Sample input (181119_sample_0.vcf) and output (181119_sample_0.vcf.hg19) files were also shared for debugging.

$ Crossmap.py vcf NCBI36_to_GRCh37.chain.gz 181119_sample_0.vcf human_g1k_v37.fasta 181119_sample_0.vcf.hg19

For the sample input files with alleles information as:

##fileformat=VCFv4.2
##fileDate=20180912
##source=PLINKv1.90
##INFO=<ID=PR,Number=0,Type=Flag,Description="Provisional reference allele, may not be based on real reference genome">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
#CHROM    POS    ID    REF    ALT    QUAL    FILTER    INFO    FORMAT    Sample_1    Sample_2
1    554484    SNP_A-8575125    C    T    .    .    .    GT    0/0    0/0
1    711153    SNP_A-8709646    C    G    .    .    .    GT    0/0    0/0
1    20314220    SNP_A-2063564    T    C    .    .    .    GT    0/1    0/0
1    34890869    SNP_A-1802822    A    G    .    .    .    GT    0/1    0/1

I got the stdout as :

@ 2018-11-19 10:24:57: Read chain_file:  NCBI36_to_GRCh37.chain.gz
@ 2018-11-19 10:24:57: Updating contig field ...
@ 2018-11-19 10:24:57: Total entries: 4
@ 2018-11-19 10:24:57: Failed to map: 2

So as the output file:

##fileformat=VCFv4.2
##INFO=<ID=PR,Number=0,Type=Flag,Description="Provisional reference="" allele,="" may="" not="" be="" based="" on="" real="" reference="" genome"="">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
—- Skip contig info --
##liftOverProgram=CrossMap(https://sourceforge.net/projects/crossmap/)
##liftOverFile=NCBI36_to_GRCh37.chain.gz
##new_reference_genome=human_g1k_v37.fasta
##liftOverTime=November19,2018
#CHROM    POS    ID    REF    ALT    QUAL    FILTER    INFO    FORMAT    Sample_1    Sample_2
1    564621    SNP_A-8575125    C    T    .    .    .    GT    0/0    0/0
1    20441633    SNP_A-2063564    A    C    .    .    .    GT    0/1    0/0

with an unmap file:

##fileformat=VCFv4.2
##INFO=<ID=PR,Number=0,Type=Flag,Description="Provisional reference allele, may not be based on real reference genome">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
#CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  Sample_1    Sample_2
1   711153  SNP_A-8709646   C   G   .   .   .   GT  0/0 0/0
1   34890869    SNP_A-1802822   A   G   .   .   .   GT  0/1 0/1

As the summary of this issue, 2 type of the SNPs failed and 1 type of them was misinterpreted. When CrossMap converted VCF files, it seems to map the input REF allele with the reference fasta file, but not the ALT allele, and it generated an artifact or wrong genotype. Because the data generated by Affymetrix does not always stick to fwd/B strand, some of them are rev/ or fwd/T, CrossMap then would overwrite the different REF allele by the fasta file, causing mono-allelic failure or even false pair.

Is there any flag able to allow the divergence in CrossMap? This issue might have been observed before but not further elaborated. This would happen to any Affymetrix-based genotyping chip technology...


Table 1. Examples and brief summary of affected alleles.
rsID REF/ALT_on_dbSNP AFFY_ID NCBI36 NCBI_Assay_ID ss_to_rs_Orientation/Strand Input GRCh37 Allele_in_fasta Result
rs10458597 C/T SNP_A-8575125 1:554484 ss76713738 fwd/B C/T 1:564621 C C/T (Correct)
rs12565286 G/C SNP_A-8709646 1:711153 ss76848066 rev/ C/G 1:721290 G G/G (Failed: swap or reverse complement)
rs10916711 A/G SNP_A-2063564 1:20314220 ss66127892 rev/B T/C 1:20441633 A A/C (Wrong: ONLY reverse complement)
rs9787098 G/A SNP_A-1802822 1:34890869 ss66282998 fwd/T A/G 1:35118282 G/A G/G (Failed: ONLY swap)


ADD COMMENTlink modified 11 months ago by Biostar ♦♦ 20 • written 12 months ago by freezingsleet9710
1

Hello freezingsleet97 ,

I'm not sure if this helps you. But I think you first have to make sure, that all your variants in the vcf are described on the same strand. You can do this using bcftools:

$ bcftools +fixref input.vcf -Ov -o output.vcf -- -d -f ref.fa -m flip

fin swimmer

ADD REPLYlink written 12 months ago by finswimmer13k
1

If Crossmap expects us to have already fixed the REF alleles, why it tries to fix swaps/flips instead of reporting an error, or just ignoring them (like liftover tool)? What I'm saying is that Crossmap should either remove this REF allele "fix" function that creates artifacts, or do the fix correctly, fixing both REF and ALT alleles. This is a bug, not a feature in the software. Plus, this generates a complication and waste of disk space for high throughput fixing of hundreds of vcf files, because if the VCF isn't mapped to hg19 build, then you need to have one reference fasta file per genome build for running fixref (NCBI34, NCBI35, NCBI36, hg17, hg18, hg19, hg38, GRCh37, etc... ).

ADD REPLYlink written 12 months ago by rdlady30
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 2022 users visited in the last hour