Bcftools Mpileup bias for alternative alleles?
Entering edit mode
17 months ago
cetin.m ▴ 50

I have called snps from a bam file using the below commands:

/usr/local/sw/bcftools-1.9/bcftools mpileup -f /mnt/NEOGENE2/share/ref/genomes/hsa/hs37d5.fa -I /mnt/NEOGENE2/share/dna/hsa/comparative_seqs/lazaridis2014/Loschbour.hg19_1000g.dq.bam.hs37d5.fa.cons.90perc.bam -R /mnt/NAS/projects/2018_MCetin_Selection/imputation/Genomic_Hist_SE_Europe/new225_fixed_chr21.bed -q 30 -Q 30 -o Loschbour_ori_chr21_bcftools.bcf

/usr/local/sw/bcftools-1.9/bcftools call -mV indels Loschbour_ori_chr21_bcftools.bcf > Loschbour_ori_chr21_bcftools.vcf

Then I calculated the percentage of reference and alternative alleles that are in the vcf file. I got

0/0 10019

0/1 3412

1/1 3281

% ref=  70.1591670656

% alt=  29.8408329344

The ratio of alt alleles seem too high to me, since when I do the same thing with GATK Unified genotyper or haplotype caller, I get 5% alternative, which is I think more in line with the expectations from population genetics (Most positions should be reference)

Has Anyone seen this before?

mpileup bcftools SNP gatk • 507 views
Entering edit mode
17 months ago


I do not believe there is anything overly suspicious here, but there's no way to be sure due to the fact that we cannot see other pertinent snippets of information, such as position read depths and strand bias p-values. Also, we cannot see the steps that have been conducted upstream of your two listed commands.

For your bcftools call command, you may want to add --redo-BAQ -q 60 -Q 60, and also post-filter the data for position read depth:

bcftools view Loschbour_ori_chr21_bcftools.vcf |\
  "${BCFtools_root}"/misc/vcfutils.pl varFilter -d 30

We have to remember that NGS data is very messy and that base mismatches to a reference genome are frequent, even in reads with high mapping qualities. The use of a reference genome itself is an issue in its own right - see A: Alternate nucleotide is more frequent than reference nucleotide. OMG I'm dizzy.

BCFtools is more flexible and 'lower level' than GATK, which is a program more tailoured for beginners that takes much away from experienced users. This is why I prefer BCFtools for calling germline variants.



Login before adding your answer.

Traffic: 1009 users visited in the last hour
Help About
Access RSS

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6