Question: How do I interpret my 'bcftools +fixref' output and troubleshoot it.
gravatar for sandKings
2.1 years ago by
sandKings10 wrote:

Using GATK method to call variants from RNASeq data, I have VCF files from 30 patients. There's a lot of missing genotype data and I need to impute missing information from Sanger imputation service. When I uploaded my vcf file to Sanger, I received this error:

--- Aborted Job --- The input file sanity check failed, "bcftools norm -ce" exited with the following message: Reference allele mismatch at MT:150 .. REF_SEQ:'C' vs VCF:'T'

The error report further recommended that I try ref allele mismatch correction using bcftools.

Based on documentation on this page I tried a ref allele mismatch correction using : 'bcftools +fixref report.vcf -- -f ref.fa'

My output looks like this:

# SC, guessed strand convention
SC  TOP-compatible  0
SC  BOT-compatible  0
# ST, substitution types
A>C 1058    3.3%
ST  A>G 5927    18.5%
ST  A>T 726 2.3%
ST  C>A 1069    3.3%
ST  C>G 1512    4.7%
ST  C>T 5513    17.3%
ST  G>A 5551    17.4%
ST  G>C 1509    4.7%
ST  G>T 1107    3.5%
ST  T>A 744 2.3%
ST  T>C 6174    19.3%
ST  T>G 1064    3.3%
# NS, Number of sites:
NS  total           33609
NS  ref match       31954   100.0%
NS  ref mismatch    0   0.0%
NS  skipped         1655
NS  non-ACGT        0
NS  non-SNP         1647
NS  non-biallelic   8

In the vcf file itself, I manually found chrM 150 and it shows:

chrM    150 .   T   C   1536.77 SnpCluster  AC=2;AF=1.00;AN=2;DP=29;ExcessHet=3.0103;FS=0.000;MLEAC=2;MLEAF=1.00;MQ=60.00;QD=34.76;SOR=1.460

Could anyone please help me troubleshoot this step. Thanks!

rna-seq matrix eqtl • 1.3k views
ADD COMMENTlink modified 2.1 years ago • written 2.1 years ago by sandKings10

are you using the very same REFerence genome that was used to map he reads ? MT sequences are not always the same.

ADD REPLYlink written 2.1 years ago by Pierre Lindenbaum131k

Hi Pierre! Thanks so much for responding. Yes, in fact, to avoid the liftover procedure, I did the variant calling from scratch using GRCh37 since Sanger Imputation Service works on GRCh37.

In my very first impute attempt on Sanger, I got an error: faidx_fetch_seq failed at chr1:567242 and the error report further suggested that I change the UCSC-style chromosome names to Ensembl-style chromosome names which I did using:

bcftools annotate -Oz --rename-chrs ucsc2ensembl.txt ucsc.vcf.gz > ensembl.vcf.gz

as described on this page

EDIT: I'm wondering if it's relevant to point out that the variant calling was done on RNASeq data via GATk and the RNASeq is a 50bp SR dataset.

ADD REPLYlink modified 2.1 years ago • written 2.1 years ago by sandKings10
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1863 users visited in the last hour