I am currently writing some cancer genomics software for crunching on NGS data. I have extracted groups of reads where each read from each group shares a 30bp exact string (sequence) match, with every other read within that group. I also know the position of this match for each read. As such, with these two pieces of information, I already posses the information to align all the reads. I now want to build a consensus sequence, aligned at this position. Does anyone know of software (that I can call from a C++ program, either from inside the program, or externally via command line call) that will allow me to do this? I get the feeling that SamTools is commonly used, namely the program mpileup. However, from their example
samtools mpileup -uf ref.fa aln.bam | bcftools view -cg - | vcfutils.pl vcf2fq > cns.fq
It looks like the tool requires me to provide a reference genome to align the reads
ref.fa. However, the
aln.bam file should already contain the alignment information. So, I'm a little confused as to why
ref.fa is needed. Does anyone know if this software can be used without a reference? I.e just provide the reads, align against themselves and output the consensus?
It looks like I may have to generate a
.bam from the reads to do this if I wish to use
mpileup, but seeming as I already know the alignment, I may have to generate this artificially.
If anyone could shed some light about the best software to use, when trying to build consensus seq. with data where alignment information is readily acquirable that would be of great help!