bcftools missing most indels in variant calling
2
0
Entering edit mode
3.4 years ago

I am getting started with bioinformatics and experimenting with assembly tools.

I generated a random reference sequence of 10000bp, and then created a subject sequence from it by introducing 500 deletions. From the subject sequence, 5000 simulated reads with read length 50-200 was obtained in-silico, with all bases having quality of 30. The aim was to assemble these reads to re-create the subject sequence.

bowtie2-build reference.fasta.gz index
bowtie2 -x index -U reads.fastq -S alignment.sam
samtools sort alignment.sam -o alignment.bam
bcftools mpileup --fasta-ref reference.fasta.gz alignment.bam > subject.mpileup
bcftools call -mv --ploidy 1 subject.mpileup > subject.vcf
cat subject.vcf

The resultant VCF file only picked up 30 or so indels, out of the total 500. Part of the output as shown:

#CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  alignment.bam
reference   581 .   AC  A   56  .   INDEL;IDV=56;IMF=0.949153;DP=59;VDB=0.482463;SGB=-0.693147;MQ0F=0.0338983;AC=1;AN=1;DP4=13,0,46,0;MQ=16 GT:PL   1:83,0
reference   717 .   CG  C   52  .   INDEL;IDV=61;IMF=0.924242;DP=66;VDB=0.967282;SGB=-0.693147;MQ0F=0;AC=1;AN=1;DP4=14,0,52,0;MQ=21 GT:PL   1:79,0
reference   1647    .   CGG CG  43.3557 .   INDEL;IDV=73;IMF=0.986486;DP=74;VDB=0.142911;SGB=-0.692717;MQ0F=0.135135;AC=1;AN=1;DP4=51,0,23,0;MQ=12  GT:PL   1:82,11
reference   1869    .   TAG T   10.5041 .   INDEL;IDV=3;IMF=0.046875;DP=64;VDB=0.963375;SGB=-0.676189;MQ0F=0.109375;AC=1;AN=1;DP4=53,0,11,0;MQ=9    GT:PL   1:38,0

Inspecting the alignment.bam file showed that besides indels at position 581 picked up, indels position 549, 634, 638 etc are all missed but they are very obvious from the reads. igv output

On the other hand, de-novo assembly with spades.py perfectly reconstructed the exact subject seqeunce. Did I misconfigure bcftools somehow? Or is this an inherent limitation of variant calling from reference? Any help would be greatly appreciated.

Other intermediate files are found here: subject.mpileup

Assembly samtools bcftools • 2.5k views
ADD COMMENT
0
Entering edit mode

Thanks for the insight. Though picking up less than 10% of indels in simulated data despite coverage of 50x sounds really worrying. Is there an alternative tool that may work better for general purpose variant calling?

ADD REPLY
1
Entering edit mode
3.4 years ago

Interesting, I don't know why bcftools is missing that much variants here. I was able to reproduce it.

In my experience freebayes is the better choice if you are mainly interested in indels. In a short test it finds variants in 400 positions.

ADD COMMENT
0
Entering edit mode
3.4 years ago
wulj2 ▴ 50

bcftools is not so sensitive for calling variants with very low frequency or with very few supports, and all the indels at one position will be made into only one consensus if I have got the source code working right~

ADD COMMENT

Login before adding your answer.

Traffic: 1865 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6