Recently, I have been using bcftools consensus to generate consensus sequence as the following commands:
bcftools mpileup -Ou -f ref.fa in.bam | bcftools call -Ou -mv --ploidy 1 | bcftools norm -f ref.fa -Oz -o norm.vcf.gz bcftools index norm.vcf.gz bcftools consensus -f ref.fa -o consensus.fa norm.vcf.gz
However, the generated fasta file has several different bases from the vcf file (I've already double-checked with bam file):
REF ALT vcf: A T fasta: A G bam: A T
I've tried different approaches but only vcfutils.pl vcf2fq generated the correct consensus sequence:
bcftools mpileup -f ref.fa in.bam | bcftools call -c --ploidy 1 | vcfutils.pl vcf2fq > cns.fastq seqtk seq -aQ64 –q20 -n N cns.fastq > cns.fasta
But again, I've ran into an issue which is the lowercase bases in the fasta file. From what I've read, the lowercase bases can be putative indel bases or low-quality bases. Instead of only soft-masking indel base, several bases around the indel site also get masked. For example (the indel site is at 567th base) :
562 580 AagacaccccccacagttT
For future work, I need to distinguish between indel bases and low-quality bases. Is there anyway I can convert lowercase bases to something else but only at indel site?
Also, it'd be nice if someone could tell me why bcftools consensus and vcfutils.pl vcf2fq generated different fasta file. If you must know, I'm working with mtDNA samples.