Questions Regarding Consensus Sequence Calling With Samtools / Bcftools / Vcfutils.Pl
3
5
Entering edit mode
10.0 years ago

I have genomic reads of a resequencing project of a bacterial genome and want to determine the differences between the reference genome and our genome (which should be very similar) and get the corrected sequence of the genome. To generate a VCF file I did something like this:

samtools \
mpileup \
-B \
-u \
-f example_reference_seq.fa \
| bcftools view -cv - \
> my_snps.vcf


There are some variations listed in the VCF file with a high quality score (some over 100) and a visual inspection of the BAM file in a genome browser confirms these variations. Some are in nearly every read covering the considered position so would be expected candidates for a changed nucleotide. The next step was to generate a consensus sequence based on the reference sequence and the mappings in which the differences are taken into account.

samtools \
mpileup \
-B \
-u \
-f example_reference_seq.fa\
bcftools view -cg - | \
vcfutils.pl vcf2fq \
> example_consensus.fq
seqtk fq2fa -l 70 example_consensus.fq > example_consensus.fa


The outcome of this call contained some surprises. I cannot completely follow how vcfutils.pl vcf2fq performs the consensus calling and the my Perl knowledge is a little bit rusty (plus the variable names are not really helpful). There are locations in the resulting consensus sequence where the nucleotides are changed to lower case (I think to show lower confidence). There is a change from "C" in the reference genome to "S" in the new consensus sequence. Looking at the position in the a genome browser show that 65 % C and 35% G (rest - other). I found in the code of vcfutils.pl in the function vcf2fq the following hash but I don't know why this "S" is introduced.

my %het = (AC=>'M', AG=>'R', AT=>'W', CA=>'M', CG=>'S', CT=>'Y',
GA=>'R', GC=>'S', GT=>'K', TA=>'W', TC=>'Y', TG=>'K');


Additionally I don't understand which entries are used to perform changes in the references sequence. Judging from the code the "AF1" and the "FQ" flag values are critical. FQ ("Phred probability of all samples being the same") makes sense to me but I don't understand it completely (it seem make a difference if is positive of negative).

So what is my question/my problem? I would appreciate if somebody could shed light on the processes of the consensus generation. Which entries in the VCF file would correspond to changes in the consensus sequences in comparison to the reference sequence. Why is the "S" introduced and what would be the usual procedure (e.g. replacing by C or G) to make the sequence usable further steps.

samtools bcftools snp • 19k views
4
Entering edit mode
10.0 years ago

First of all the consensus making part of that software isn't perfect. Lowercase means there is a putative indel there. But the vcfutils will NOT try and put an indel into the consensus for you, no matter how nice it looks.

Usually positive FQ indicates a mixed SNPs. That's why it looked up the ambiguous letter code when that is true.

Don't discount the reality of mixed calls just because you have a haploid organism. The submitters might not have given you a clonal sample.

0
Entering edit mode

Thanks, it looks like I can not pure rely on vcfutils to generate a finished sequence based on default cut-offs but have to decide some cases myself. I wanted to avoid this as this is rather subjective but good to know that this seems to be generally required.

4
Entering edit mode
10.0 years ago

The bcftools view command accepts a file with the list of samples to use (-s) which can be annotated with their ploidy (1 or 2), this might help with your variant calling.

Also note that samtools mpileup has a standard maximum read depth (not sure what it is by default) which might bite you if you have a well-covered genome (change with -d).

And as swbarnes2 said, vcf2fq will not incorporate indels. As an alternative, you could just call the variants (-v) and use GATK's FastaAlternateReferenceMaker.

As an example, I use something like the following for consensus calling on mtDNA genomes (also haploid):

samtools mpileup -F 0.2 -E -L 4000 -d 4000 -uf ref.fasta sample.bam \
| bcftools view -s <(echo sample.bam 1) -vcg - > sample.vcf
java -Xmx2g -jar GenomeAnalysisTK.jar -T FastaAlternateReferenceMaker -l ERROR \
-R ref.fasta -o sample.fasta --variant sample.vcf


See the samtools reference and help screens for more info on the different parameters.

Keep in mind that FastaAlternateReferenceMaker is also not very sophisitcated, so I ended up writing some code to do the consensus calling myself, but it is a start.

0
Entering edit mode

Thanks, Martjin. I was aware of the minimal (and maximal) depth parameter but not that I can use the -s parameter for the purpose of defining the ploidy. I will give it a try.

1
Entering edit mode
10.0 years ago
Christof Winter ★ 1.0k

S is one of the IUPAC ambiguity base symbols and stands for the standard bases C and G. If you think of a diploid organism such as human, where you can have two different bases at a genomic position (one from your mother, and one from your father -- not considering any somatic mutations), these ambiguity symbols are helpful to encode a heterozygous base in a single letter. That's what happens in that Perl line that you quoted.

Now, to decide whether it's called homozygous (standard consensus base) or heterozygous (ambiguous consensus base) is a bit arbitrary, and I don't know how vcfutils does it exactly. A simple approach would be to say that everything where the two base frequencies are between 25% and 75% is called heterozygous (which would explain your 65% / 35% example), and everything else homozygous (either reference or variant).

Think about how to best interpret your heterozygous base calls in the context of the ploidy of your organism.

0
Entering edit mode

Thanks, Christof. In the end I need to have an unambiguous sequence. I personally would would interpret the S in this situation (haploid organism) as C as this is the dominating variant. But maybe there is already a common way of how to deal with this in bacterial genome re-sequencing.