I want to extract consensus sequences for a number of bam files with the intent to align sequences across individuals and build trees.
I am following the advice here https://github.com/samtools/bcftools/wiki/HOWTOs via vcf2fq: samtools mpileup -uf ref.fa aln.bam | bcftools call -c | vcfutils.pl vcf2fq > cns.fq
via bcftools consensus: samtools mpileup -uf ref.fa aln.bam | bcftools call -mv -Oz -o calls.vcf.gz tabix calls.vcf.gz cat ref.fa | bcftools consensus calls.vcf.gz > cns.fa
But I have some questions I cannot find the answer to:
How does bcftools consensus work differently from vcfutils.pl vcf2fq?
How does either deal with missing data? For example if a site is not covered by reads in the bam file, ideally I would like the consensus sequence to contain an N. Is this the case or is the allele in the reference sequence used?
Does the consensus callers filter the SNPs? And if not, how do you do this? is this achieved by running bcftools filter after bcftools calls?