100% Missmatch when run with +fixref of BCFtools
1
0
Entering edit mode
5.9 years ago
David_emir ▴ 490

Hi All,

I have ~ 200.Idat files from Illumina MEGAEx Genotyping array. I have done QCing using Plink and removed ambiguous samples. Next, i converted the .bam,.bim and .fim files into VCF using plink recode option.

When I try to determine and fix strand orientation using a +fixref plugin of BCFtools, it's showing 100% ref missmatch (I am using ref fasta from here)

Please let me know where I am going wrong or how to interpret this. My commands are:

bcftools +fixref GArray_qcd.vcf -- -f hg38.fa

And the output is as follows:

bcftools +fixref Genotypearray_qcd.vcf -- -f hg38.fa
[E::faidx_fetch_seq] The sequence "1" not found
Ignoring sequence "1"
[E::faidx_fetch_seq] The sequence "2" not found
Ignoring sequence "2"
[E::faidx_fetch_seq] The sequence "3" not found
Ignoring sequence "3"
[E::faidx_fetch_seq] The sequence "4" not found
Ignoring sequence "4"
[E::faidx_fetch_seq] The sequence "5" not found
Ignoring sequence "5"
[E::faidx_fetch_seq] The sequence "6" not found
Ignoring sequence "6"
[E::faidx_fetch_seq] The sequence "7" not found
Ignoring sequence "7"
[E::faidx_fetch_seq] The sequence "8" not found
Ignoring sequence "8"
[E::faidx_fetch_seq] The sequence "9" not found
Ignoring sequence "9"
[E::faidx_fetch_seq] The sequence "10" not found
Ignoring sequence "10"
[E::faidx_fetch_seq] The sequence "11" not found
Ignoring sequence "11"
[E::faidx_fetch_seq] The sequence "12" not found
Ignoring sequence "12"
[E::faidx_fetch_seq] The sequence "13" not found
Ignoring sequence "13"
[E::faidx_fetch_seq] The sequence "14" not found
Ignoring sequence "14"
[E::faidx_fetch_seq] The sequence "15" not found
Ignoring sequence "15"
[E::faidx_fetch_seq] The sequence "16" not found
Ignoring sequence "16"
[E::faidx_fetch_seq] The sequence "17" not found
Ignoring sequence "17"
[E::faidx_fetch_seq] The sequence "18" not found
Ignoring sequence "18"
[E::faidx_fetch_seq] The sequence "19" not found
Ignoring sequence "19"
[E::faidx_fetch_seq] The sequence "20" not found
Ignoring sequence "20"
[E::faidx_fetch_seq] The sequence "21" not found
Ignoring sequence "21"
[E::faidx_fetch_seq] The sequence "22" not found
Ignoring sequence "22"
[E::faidx_fetch_seq] The sequence "23" not found
Ignoring sequence "23"
[E::faidx_fetch_seq] The sequence "24" not found
Ignoring sequence "24"
[E::faidx_fetch_seq] The sequence "25" not found
Ignoring sequence "25"
[E::faidx_fetch_seq] The sequence "26" not found
Ignoring sequence "26"

# SC, guessed strand convention
SC  TOP-compatible  1
SC  BOT-compatible  1
# ST, substitution types
ST  A>C 0   -nan%
ST  A>G 0   -nan%
ST  A>T 0   -nan%
ST  C>A 0   -nan%
ST  C>G 0   -nan%
ST  C>T 0   -nan%
ST  G>A 0   -nan%
ST  G>C 0   -nan%
ST  G>T 0   -nan%
ST  T>A 0   -nan%
ST  T>C 0   -nan%
ST  T>G 0   -nan%
# NS, Number of sites:
NS  total           26
NS  ref match       0   0.0%
NS  ref mismatch    26  100.0%
NS  skipped         0
NS  non-ACGT        0
NS  non-SNP         0
NS  non-biallelic   0

Thanks a lot,

Dave!

bcftools fixref mismatch • 3.3k views
ADD COMMENT
2
Entering edit mode
5.9 years ago

Hello David_emir,

compare the chromosome names in your vcf file with those in the reference. I would guess in one of them the chromosome names are prefixed with "chr" and in the other not.

fin swimmer

ADD COMMENT
0
Entering edit mode

Thanks a lot for your help. My vcf file does not have chr prefixed whereas my fasta file has chr prefixed. I will add chr to my vcf file , and will re-run the command, hoping it will solve the error. Thanks a lot.

ADD REPLY
0
Entering edit mode

HI Finswimmer, you are right, it was due to Chr prefix error. I got the following, can you help me in understanding the output? I am not sure how to interpret this, thanks a lot for your help.

# SC, guessed strand convention
SC  TOP-compatible  0
SC  BOT-compatible  0
# ST, substitution types
ST  A>C 24291   8.3%
ST  A>G 98033   33.5%
ST  A>T 6519    2.2%
ST  C>A 25842   8.8%
ST  C>G 7934    2.7%
ST  C>T 0   0.0%
ST  G>A 112911  38.6%
ST  G>C 9020    3.1%
ST  G>T 0   0.0%
ST  T>A 7742    2.6%
ST  T>C 0   0.0%
ST  T>G 0   0.0%
# NS, Number of sites:
NS  total           294346
NS  ref match       73305   25.1%
NS  ref mismatch    218991  74.9%
NS  skipped         2050
NS  non-ACGT        2050
NS  non-SNP         0
NS  non-biallelic   0
ADD REPLY
0
Entering edit mode

Hello David,

i moved my comment to an answer, so you can mark it as accepted if this solved your problem.

About how to interpret the output: It's a simple statistic about what bcftools have done with your data. So e.g. there where 294346 variants in total (NS total), in 25% the ref in the vcf matches the ref base in the reference sequence, ... In 8.3% of the case in A must be switched to an C (ST A>C) and so on.

I've never done this type of analyses you described in your first post. But is this high number of mismatches reasonable? Which reference was used to produce the vcf? Also hg38 or was it hg19?

fin swimmer

ADD REPLY
0
Entering edit mode

Thanks a Lot, finswimmer, I will take a look at what ref was used to produce vcf file. I was trying to get this info from my lab, the people who produced this file has left long back and i don't know how to validate the vcf . Is there any way to identify what ref was used to produce the vcf file?

ADD REPLY
0
Entering edit mode

Hello David,

if you are lucky there is a hint in the header of your vcf file. Something like:

##reference=

Or if they are in the names and the length of the contigs, e.g.

##contig=<ID=chr1,length=249250621>

There could also be the information which command was used for variant calling. Maybe the name of the files can give you a hint.

##commandline=

Otherwise, just try it out. Get the ref of hg19 and repeat your bcftools command.

fin swimmer

ADD REPLY
0
Entering edit mode

Thanks Again for prompt reply, my vcf header has got the following,

##source=PLINKv1.90
##contig=<ID=1,length=249202675,IDX=0>
##contig=<ID=2,length=243007369,IDX=1>
##contig=<ID=3,length=197874725,IDX=2>
##contig=<ID=4,length=190986258,IDX=3>
##contig=<ID=5,length=180705540,IDX=4>
##contig=<ID=6,length=171012298,IDX=5>
##contig=<ID=7,length=159122660,IDX=6>
##contig=<ID=8,length=146284568,IDX=7>
##contig=<ID=9,length=141099664,IDX=8>
##contig=<ID=10,length=135506308,IDX=9>
##contig=<ID=11,length=134945710,IDX=10>
##contig=<ID=12,length=133816188,IDX=11>
##contig=<ID=13,length=115106997,IDX=12>
##contig=<ID=14,length=107287664,IDX=13>
##contig=<ID=15,length=102406529,IDX=14>
##contig=<ID=16,length=90162709,IDX=15>
##contig=<ID=17,length=81149901,IDX=16>
##contig=<ID=18,length=77985741,IDX=17>
##contig=<ID=19,length=59050357,IDX=18>
##contig=<ID=20,length=62960230,IDX=19>
##contig=<ID=21,length=48099932,IDX=20>
##contig=<ID=22,length=51198907,IDX=21>
##contig=<ID=23,length=154854339,IDX=22>
##contig=<ID=24,length=24411933,IDX=23>
##contig=<ID=25,length=155227608,IDX=24>
##contig=<ID=26,length=16520,IDX=25>
##INFO=<ID=PR,Number=0,Type=Flag,Description="Provisional reference allele, may not be based on real reference genome",IDX=1>
ADD REPLY
0
Entering edit mode

Hmm, this is looking a little bit strange.

The first 22 contig looks very similar to Chromosome 1-22 from GRCh37 (hg19). Contig 23 might by Chromosome X and contig 26 the mitochondrial genome. Beside that in the human genomes we have no contigs with the names 23 to 26, I have no idea what regions the contig 24 and 25 are representing and the Y chromosome is missing.

fin swimmer

ADD REPLY

Login before adding your answer.

Traffic: 2959 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6