I am trying to get a consensus sequence for each block of reads aligned at two different locus from bam file. The reads at locus looks like:
TTCAATGGATCTATAAATCTCCCCCTGGGATTCATCATAGTAACTGTAACGGAACCTTTGACCTTGAAACAAATATCATG
TTCAATGGATCTATAAATCTCCCCCTGGGATTCATCATAGTAACTGTAACGGAACCTTTGACCTTGAAACAAATATCATG
TTCAATGGATCTATAAATCTCCCCCTGGGATTCATCATAGTAACTGTAACGGAACCTTTGACCTTGAAACAAATATCATG
TTCAATGGATCTATAAATCTCCCCCTGGGATTCATCATAGTAACTGTAACGGAACCTTTGACCTTGAAACAAATATCATG
TTCAATGGATCTATAAATCTCCCCCAGGGATTCATCATAGTAACTGTAACGGAACCTTTGACCTTGAAACAAATATCATG
TTCAATGGATCTATAAATCTCCCCCAGGGATTCATCATAGTAACTGTAACGGAACCTTTGACCTTGAAACAAATATCATG
TTCAATGGATCTATAAATCTCCCCCAGGGATTCATCATAGTAACTGTAACGGAACCTTTGACCTTGAAACAAATATCATG
TTCAATGGATCTATAAATCTCCCCCAGGGATTCATCATAGTAACTGTAACGGAACCTTTGACCTTGAAACAAATATCATG
TTCAATGGATCTATAAATCTCCCCCAGGGATTCATCATAGTAACTGTAACGGAACCTTTGACCTTGAAACAAATATCATG
TTCAATGGATCTATAAATCTCCCCCAGGGATTCATCATAGTAACTGTAACGGAACCTTTGACCTTGAAACAAATATCATG
# ^
GGGGACCAGTCACAGCAGCCTTTTTAACTTTGGGAAAACCCAATGCTCACTTCACTCAAACGTCGCAAAGTGGTGCCATG
GGGGACCAGTCACAGCAGCCTTTTTAACTTTGGGAAAACCCAATGCTCACTTCACTCAAACGTCGCAAAGTGGTGCCATG
GGGGACCAGTCACAGCAGCCTTTTTAACTTTGGGAAAACCCAATGCTCACTTCACTCAAACGTCGCAAAGTGGTGCCATG
GGGGACCAGTCACAGCAGCCTTTTTAACTTTGGGAAAACCCAATGCTCACTTCACTCAAACGTCGCAAAGTGGTGCCATG
GGGGACCAGTCACAGCAGCCTTTTTAACTTTGGGAAAACCCAATGCTCACTTCACTCAAACGTCGCAAAGTGGTGCCATG
GGGGACCAGTCACAGCAGCCTTTTTAACTTTAGGAAAACCCAATGCTCACTTCACTCAAACGTCGCAAAGTGGTGCCATG
GGGGACCAGTCACAGCAGCCTTTTTAACTTTAGGAAAACCCAATGCTCACTTCACTCAAACGTCGCAAAGTGGTGCCATG
GGGGACCAGTCACAGCAGCCTTTTTAACTTTAGGAAAACCCAATGCTCACTTCACTCAAACGTCGCAAAGTGGTGCCATG
GGGGACCAGTCACAGCAGCCTTTTTAACTTTAGGAAAACCCAATGCTCACTTCACTCAAACGTCGCAAAGTGGTGCCATG
GGGGACCAGTCACAGCAGCCTTTTTAACTTTAGGAAAACCCAATGCTCACTTCACTCAAACGTCGCAAAGTGGTGCCATG
# ^
The one position highlighted using the lines beginning with #
is a SNP. There I would like to keep a IUPAC code in the consensus as shown below.
TTCAATGGATCTATAAATCTCCCCCWGGGATTCATCATAGTAACTGTAACGGAACCTTTGACCTTGAAACAAATATCATG
# ^
GGGGACCAGTCACAGCAGCCTTTTTAACTTTRGGAAAACCCAATGCTCACTTCACTCAAACGTCGCAAAGTGGTGCCATG
# ^
I could do it in python to make consensus, but I would like to do it through an SNP calling program like mpileup
, so that it takes the statistics in to account to call confident SNP. I am using a restricted enzyme digested sequence data, there is no random fragmentation, hence reads mapping at any given locus will have 100% overlap.
I have seen other posts How To Generate A Consensus Fasta Sequence From Sam Tools Pileup? but I did not find what I need.
I don't understand why you cannot use the output of mpileup.
I see lot of 'nnnn' in the output i.e contain sequence for the whole reference. I am looking for the consensus sequences in a fasta file only from the mapped regions. I tried