For this (sorted) bam and reference file
(these will not be here forever, so don't expect the links to work if you come to this thread years from now) there are columns which freebayes, bcftools, and Platyplus all refuse to call which obviously contain variants. For instance, column 178 is all C or T, about 50:50. Doing it this way:
samtools index scf7180000354095.bam samtools faindex scf_7180000354095.fasta samtools mpileup -ugf scf_7180000354095.fasta scf7180000354095.bam \ | bcftools call -vmO z -o bcftools.vcf.gz Platypus.py callVariants --refFile=scf_7180000354095.fasta \ --bamFiles=scf7180000354095.bam \ --nCPU=20 --minReads=1 --mergeClusteredVariants=1 --getVariantsFromBAMs=1 \ --output=platypus.vcf freebayes -f scf_7180000354095.fasta -C 5 -p 2 -F 0.04 \ scf7180000354095.bam >scf7180000354095_freebayes.vcf
None of them call column 178. However....
cp scf_7180000354095.fasta scf_7180000354095_test.fasta nedit scf_7180000354095_test.fasta #change C->T at 178 samtools faidx scf_7180000354095_test.fasta samtools mpileup -ugf scf_7180000354095_test.fasta scf7180000354095.bam \ | bcftools call -vmO z -o bcftoolsB.vcf.gz Platypus.py callVariants --refFile=scf_7180000354095_test.fasta \ --bamFiles=scf7180000354095.bam \ --nCPU=20 --minReads=1 --mergeClusteredVariants=1 --getVariantsFromBAMs=1\ --output=platypusB.vcf freebayes -f scf_7180000354095_test.fasta -C 5 -p 2 -F 0.04 \ scf7180000354095.bam >scf7180000354095_freebayesB.vcf
and now they all call column 178.
There must be some common logic about what is and isn't considered a variant that these share, and it somehow depends critically on the base in the column in the reference sequence. What is this logic, and why is it the default?
More to the point, how does one force each of these to emit all the columns which have this issue, without having to physically change the reference sequence?
This may or may not be relevant, but the sample is a bit peculiar in that it represent a region with what appears to be ~3 copies of a small family of ~4 similar sequences. The organism is diploid and all ~8 copies are about equal edit distance away from each other. The assembly is suspect and the reference sequence is definitely a mix, with each position being "most like" the sequence in that position, but not exactly like it. This makes the read positions suspect as well. The end result is that there are up to ~8 different types of reads mapped in various positions onto the reference. The VCF would help in sorting this out, but to do so all the variants need to be in it, and that isn't what is happening.