Question: Questions Regarding Consensus Sequence Calling With Samtools / Bcftools / Vcfutils.Pl
gravatar for Konrad
7.9 years ago by
Konrad690 wrote:

I have genomic reads of a resequencing project of a bacterial genome and want to determine the differences between the reference genome and our genome (which should be very similar) and get the corrected sequence of the genome. To generate a VCF file I did something like this:

samtools \
    mpileup \
    -B \
    -u \
    -f example_reference_seq.fa \
    example_read_mappings.bam \
    | bcftools view -cv - \
    > my_snps.vcf

There are some variations listed in the VCF file with a high quality score (some over 100) and a visual inspection of the BAM file in a genome browser confirms these variations. Some are in nearly every read covering the considered position so would be expected candidates for a changed nucleotide. The next step was to generate a consensus sequence based on the reference sequence and the mappings in which the differences are taken into account.

samtools \
    mpileup \
    -B \
    -u \
    -f example_reference_seq.fa\
    example_read_mappings.bam | \
    bcftools view -cg - | \ vcf2fq \
    > example_consensus.fq
seqtk fq2fa -l 70 example_consensus.fq > example_consensus.fa

The outcome of this call contained some surprises. I cannot completely follow how vcf2fq performs the consensus calling and the my Perl knowledge is a little bit rusty (plus the variable names are not really helpful). There are locations in the resulting consensus sequence where the nucleotides are changed to lower case (I think to show lower confidence). There is a change from "C" in the reference genome to "S" in the new consensus sequence. Looking at the position in the a genome browser show that 65 % C and 35% G (rest - other). I found in the code of in the function vcf2fq the following hash but I don't know why this "S" is introduced.

my %het = (AC=>'M', AG=>'R', AT=>'W', CA=>'M', CG=>'S', CT=>'Y',
                         GA=>'R', GC=>'S', GT=>'K', TA=>'W', TC=>'Y', TG=>'K');

Additionally I don't understand which entries are used to perform changes in the references sequence. Judging from the code the "AF1" and the "FQ" flag values are critical. FQ ("Phred probability of all samples being the same") makes sense to me but I don't understand it completely (it seem make a difference if is positive of negative).

So what is my question/my problem? I would appreciate if somebody could shed light on the processes of the consensus generation. Which entries in the VCF file would correspond to changes in the consensus sequences in comparison to the reference sequence. Why is the "S" introduced and what would be the usual procedure (e.g. replacing by C or G) to make the sequence usable further steps.

samtools bcftools snp • 17k views
ADD COMMENTlink modified 5.6 years ago by Biostar ♦♦ 20 • written 7.9 years ago by Konrad690
gravatar for swbarnes2
7.9 years ago by
United States
swbarnes29.4k wrote:

First of all the consensus making part of that software isn't perfect. Lowercase means there is a putative indel there. But the vcfutils will NOT try and put an indel into the consensus for you, no matter how nice it looks.

Usually positive FQ indicates a mixed SNPs. That's why it looked up the ambiguous letter code when that is true.

Don't discount the reality of mixed calls just because you have a haploid organism. The submitters might not have given you a clonal sample.

ADD COMMENTlink written 7.9 years ago by swbarnes29.4k

Thanks, it looks like I can not pure rely on vcfutils to generate a finished sequence based on default cut-offs but have to decide some cases myself. I wanted to avoid this as this is rather subjective but good to know that this seems to be generally required.

ADD REPLYlink written 7.9 years ago by Konrad690
gravatar for Martijn Vermaat
7.9 years ago by
Martijn Vermaat180 wrote:

The bcftools view command accepts a file with the list of samples to use (-s) which can be annotated with their ploidy (1 or 2), this might help with your variant calling.

Also note that samtools mpileup has a standard maximum read depth (not sure what it is by default) which might bite you if you have a well-covered genome (change with -d).

And as swbarnes2 said, vcf2fq will not incorporate indels. As an alternative, you could just call the variants (-v) and use GATK's FastaAlternateReferenceMaker.

As an example, I use something like the following for consensus calling on mtDNA genomes (also haploid):

samtools mpileup -F 0.2 -E -L 4000 -d 4000 -uf ref.fasta sample.bam \
    | bcftools view -s <(echo sample.bam 1) -vcg - > sample.vcf
java -Xmx2g -jar GenomeAnalysisTK.jar -T FastaAlternateReferenceMaker -l ERROR \
    -R ref.fasta -o sample.fasta --variant sample.vcf

See the samtools reference and help screens for more info on the different parameters.

Keep in mind that FastaAlternateReferenceMaker is also not very sophisitcated, so I ended up writing some code to do the consensus calling myself, but it is a start.

ADD COMMENTlink modified 7.9 years ago • written 7.9 years ago by Martijn Vermaat180

Thanks, Martjin. I was aware of the minimal (and maximal) depth parameter but not that I can use the -s parameter for the purpose of defining the ploidy. I will give it a try.

ADD REPLYlink written 7.9 years ago by Konrad690
gravatar for Christof Winter
7.9 years ago by
Lund, Sweden
Christof Winter990 wrote:

S is one of the IUPAC ambiguity base symbols and stands for the standard bases C and G. If you think of a diploid organism such as human, where you can have two different bases at a genomic position (one from your mother, and one from your father -- not considering any somatic mutations), these ambiguity symbols are helpful to encode a heterozygous base in a single letter. That's what happens in that Perl line that you quoted.

Now, to decide whether it's called homozygous (standard consensus base) or heterozygous (ambiguous consensus base) is a bit arbitrary, and I don't know how vcfutils does it exactly. A simple approach would be to say that everything where the two base frequencies are between 25% and 75% is called heterozygous (which would explain your 65% / 35% example), and everything else homozygous (either reference or variant).

Think about how to best interpret your heterozygous base calls in the context of the ploidy of your organism.

ADD COMMENTlink written 7.9 years ago by Christof Winter990

Thanks, Christof. In the end I need to have an unambiguous sequence. I personally would would interpret the S in this situation (haploid organism) as C as this is the dominating variant. But maybe there is already a common way of how to deal with this in bacterial genome re-sequencing.

ADD REPLYlink written 7.9 years ago by Konrad690
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 2583 users visited in the last hour