Question: How do I interpret my 'bcftools +fixref' output and troubleshoot it.
0
gravatar for sandKings
5 months ago by
sandKings10
sandKings10 wrote:

Using GATK method to call variants from RNASeq data, I have VCF files from 30 patients. There's a lot of missing genotype data and I need to impute missing information from Sanger imputation service. When I uploaded my vcf file to Sanger, I received this error:

--- Aborted Job --- The input file sanity check failed, "bcftools norm -ce" exited with the following message: Reference allele mismatch at MT:150 .. REF_SEQ:'C' vs VCF:'T'

The error report further recommended that I try ref allele mismatch correction using bcftools.

Based on documentation on this page https://samtools.github.io/bcftools/howtos/plugin.fixref.html I tried a ref allele mismatch correction using : 'bcftools +fixref report.vcf -- -f ref.fa'

My output looks like this:

# SC, guessed strand convention
SC  TOP-compatible  0
SC  BOT-compatible  0
# ST, substitution types
A>C 1058    3.3%
ST  A>G 5927    18.5%
ST  A>T 726 2.3%
ST  C>A 1069    3.3%
ST  C>G 1512    4.7%
ST  C>T 5513    17.3%
ST  G>A 5551    17.4%
ST  G>C 1509    4.7%
ST  G>T 1107    3.5%
ST  T>A 744 2.3%
ST  T>C 6174    19.3%
ST  T>G 1064    3.3%
# NS, Number of sites:
NS  total           33609
NS  ref match       31954   100.0%
NS  ref mismatch    0   0.0%
NS  skipped         1655
NS  non-ACGT        0
NS  non-SNP         1647
NS  non-biallelic   8

In the vcf file itself, I manually found chrM 150 and it shows:

#CHROM  POS ID  REF ALT QUAL    FILTER  INFO
chrM    150 .   T   C   1536.77 SnpCluster  AC=2;AF=1.00;AN=2;DP=29;ExcessHet=3.0103;FS=0.000;MLEAC=2;MLEAF=1.00;MQ=60.00;QD=34.76;SOR=1.460

Could anyone please help me troubleshoot this step. Thanks!

rna-seq matrix eqtl • 361 views
ADD COMMENTlink modified 5 months ago • written 5 months ago by sandKings10

are you using the very same REFerence genome that was used to map he reads ? MT sequences are not always the same.

ADD REPLYlink written 5 months ago by Pierre Lindenbaum116k

Hi Pierre! Thanks so much for responding. Yes, in fact, to avoid the liftover procedure, I did the variant calling from scratch using GRCh37 since Sanger Imputation Service works on GRCh37.

In my very first impute attempt on Sanger, I got an error: faidx_fetch_seq failed at chr1:567242 and the error report further suggested that I change the UCSC-style chromosome names to Ensembl-style chromosome names which I did using:

bcftools annotate -Oz --rename-chrs ucsc2ensembl.txt ucsc.vcf.gz > ensembl.vcf.gz

as described on this page

EDIT: I'm wondering if it's relevant to point out that the variant calling was done on RNASeq data via GATk and the RNASeq is a 50bp SR dataset.

ADD REPLYlink modified 5 months ago • written 5 months ago by sandKings10
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1790 users visited in the last hour