I'd like to extract consensus of reads from a bam/sam file which would follow particular criteria:
- The portions of reference which have 0 coverage are represented in the consensus as N
- If there is at least single read mapping to a particular nucleotide, this nucleotide is retrieved in a consensus.
- If there is conflicting evidence from reads, the nucleotide with highest frequency (lets say 75%) is retrieved in consensus. If there is no nucleotide with >= 75% frequency, N is retrieved in the consensus.
Is there a tool which would enable this or similar level of fine-tuning of the consensus extraction from bam/sam?
What I tried: I used samtools, bcftools and vcfutils to get consensus:
samtools mpileup -uf reference.nt mapping.bam | bcftools call -c | vcfutils.pl vcf2fq > consensus.fasta
However, the consensus I'm getting this way contains a substantial proportion of the reference sequence and I can't find parametres which could modify how the consensus is calculated.
Yet another approach - extracting consensus using IGV viewer (option Copy consensus sequence) gives me basically the consensus I'm looking for but there is apparently no way how to automatize this for hundreds of reference sequences which I have.
Note: I posted this question also as a comment to my original question Map eukaryotic genomic reads to transcript reference of closely related organism however I'm posting it here separately as it diverged from my original question.