Question: Get the consensus sequence ... (I know, should be easy)
gravatar for Marvin
5.7 years ago by
Marvin190 wrote:


I have been trying to generate a consensus sequence with samtools/bcftools for a couple of days now and I'm stuck.

First my test data:

Marvin$ cat reference.fa

Marvin$ cat reads.fastq

I have copied parts of the reference to generate these reads. They span a contig across region
0-130 in the reference.

Goal: Reconstruct the consensus sequence of that region (length 130).

My solution/try:

bwa index reference.fa #index the reference

bwa mem reference.fa reads.fastq  > aln.sam #map the reads

samtools view -bh aln.sam > aln.bam #convert sam to bam

samtools sort aln.bam aln_sorted #sort the bam, output = aln_sorted.bam

samtools index aln_sorted.bam #don't remember which downstream program needed this

bedtools merge -i aln_sorted.bam #shows me the contigs in the mapping. only 1 present at myseq:0-130

Then I create the vcf file:
samtools view -h aln_sorted.bam "myseq:0-130" | samtools mpileup -v - > my_vcf_file.bgz

Now I index that file with tabix, because earlier I've had an error message which told me that the index was missing:

tabix -p vcf my_vcf_file.bgz

And now I try to finally get the consensus:

bcftools consensus -i my_vcf_file.bgz #option -i to get iupac codes at ambiguous sites

But it doesn't work, it throws the Usage page at me.
And it says:
"Usage:   bcftools consensus [OPTIONS] <file.vcf>"
Acutally all it should need is the input file (which I have provided), because OPTIONS are optional right?
However if I try it with the parameter -f, I get:

Marvin$ bcftools consensus -i -f reference.fa my_vcf_file.bgz
The fasta sequence does not match the REF allele at myseq:1:
   .vcf: [N]
   .vcf: [N] <- (ALT)

I don't understand this error and I am not sure if I am moving into the right direction, I find no way out of this maze, please help ...

ADD COMMENTlink modified 5.7 years ago by Lesley Sitter550 • written 5.7 years ago by Marvin190

You'll want to use bcftools call on the output of mpileup (assuming you're using a recent version of samtools). The -f option is only optional if you want useless output (the documentation should make this clearer).

ADD REPLYlink written 5.7 years ago by Devon Ryan98k

I managed to get a consensus by now via:

samtools view -h aln_sorted.bam "myseq:0-130" | samtools mpileup -vf reference.fa - | bcftools call -m -Oz -o calls.vcf.gz -

tabix calls.vcf.gz

bcftools consensus -f reference.fa -i calls.vcf.gz


However the output is not what I need. I have a site in the alignment (it's not in the sequences of my original post because I have modified the test reads by now) where I have the following situation:

The reference says T.

2 reads say C

and one read says G.

In the consensus I get the iupac code Y, which means 'T or C'.

Side question: Why doesn't he say B (iupac for 'C or G or T')?

Main question: So he says it's either the reference or the base with the highest support. BUT: Since I am mapping aDNA reads to a modern reference, I trust the signal in the reads more than the reference. I do not want the reference base to be considered when building the consensus. How do I achieve this?

ADD REPLYlink modified 5.7 years ago • written 5.7 years ago by Marvin190
gravatar for Lesley Sitter
5.7 years ago by
Lesley Sitter550
Lesley Sitter550 wrote:

I'm not familiar with this method of using bcftools to create a consensus sequence. What i did when i needed a consensus was first building a blast database from fasta A, then using blastn to blast fasta B to it. Take output format 6 so that you get a nice tabled format.

Then using some sort of filtering method (i used R, but excel could also work) filter out alignments that have no gaps, an x-amount of mismatches (if you want them to be identical you only take alignments with score 0 in that column) and a very high ID match.

You should then end up with a list that contains contigs names, region that aligns and a several scoring columns like bit-score, P value, etc. You can then easily write a python script that looks for a specific header in one of the fastas and extract the location from the sequence based on the filtered blastn output. 

ADD COMMENTlink written 5.7 years ago by Lesley Sitter550

That's going to take forever for NGS-sized datasets.

ADD REPLYlink written 5.7 years ago by Devon Ryan98k

I did it on demultiplexed reads (so about 2 million reads) against an assembly of 280mb with 70000 contigs and it only takes a couple of hours on 8 threads... and seeing that he has working on this for days, a couple of hours does not seem so bad :P

ADD REPLYlink written 5.7 years ago by Lesley Sitter550

A small dataset like yours would take maybe 10 or 15 minutes with bwa/samtools if one uses the commands properly.

ADD REPLYlink written 5.7 years ago by Devon Ryan98k
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 954 users visited in the last hour