Question

Generating consensus sequences without reference genome

0

Entering edit mode

7.7 years ago

thenameisizaak • 0

Hello All,

I am currently writing some cancer genomics software for crunching on NGS data. I have extracted groups of reads where each read from each group shares a 30bp exact string (sequence) match, with every other read within that group. I also know the position of this match for each read. As such, with these two pieces of information, I already posses the information to align all the reads. I now want to build a consensus sequence, aligned at this position. Does anyone know of software (that I can call from a C++ program, either from inside the program, or externally via command line call) that will allow me to do this? I get the feeling that SamTools is commonly used, namely the program mpileup. However, from their example

samtools mpileup -uf ref.fa aln.bam | bcftools view -cg - | vcfutils.pl vcf2fq > cns.fq

It looks like the tool requires me to provide a reference genome to align the reads ref.fa. However, the aln.bam file should already contain the alignment information. So, I'm a little confused as to why ref.fa is needed. Does anyone know if this software can be used without a reference? I.e just provide the reads, align against themselves and output the consensus? It looks like I may have to generate a .bam from the reads to do this if I wish to use mpileup, but seeming as I already know the alignment, I may have to generate this artificially.

If anyone could shed some light about the best software to use, when trying to build consensus seq. with data where alignment information is readily acquirable that would be of great help!

Best, Izaak

conensus samtools alignment • 3.4k views

ADD COMMENT • link updated 7.7 years ago by guillaume.rbt ★ 1.0k • written 7.7 years ago by thenameisizaak • 0

1

Entering edit mode

Samtools by design is a reference-guided consensus tool. What you are looking for is a de novo assembler. Try spades, velvet, abyss, sga or fermi-lite. As a side note, you are really talking about two distinct types of alignments: read-to-reference alignment and read-to-read alignments. Make sure you understand their differences.

ADD REPLY • link 7.7 years ago by lh3 33k

score 0 · Answer 1 · 2016-08-01

HI Izaak,

I had the same problematic.

As you did I first tried to get the consensus from the alignments. Unfortunately I haven't managed to find a tool that does this kind of thing. The problem is that sometimes it's difficult to tell which base is the alternate consensus (for example for heterozygous position).

The solution I found is to compute local assembly on subsample of reads in the regions I was interested in. For this I used minia assembler, which is easy to use. (~/soft/minia-2.0.3-Linux/bin/minia -in ../sample_86012.fq -kmer-size 31 -abundance-min 3 -out minia_assembly_k31_m3)

Hope this help