Question: using bcftools/samtools to generate consensus using reference + variants; output is a truncated reference
0
gravatar for nirvan
2.8 years ago by
nirvan10
United States
nirvan10 wrote:

I am using samtools/bcftools to generate a 'de novo' consensus assembly for use with rc454 to clean my sequencing data. The outputted consensus sequence is identical to the reference (see the BLAST alignment graphic at the bottom of the post) till it is truncated at ~5000 of ~7000 bp.

I don't understand why it's happening or what to do about it. 

I've used

1) samtools mpileup to generate .vcf

2) bcftools consensus to (try to) generate a consensus assembly. 

According to A: Questions Regarding Consensus Sequence Calling With Samtools / Bcftools / Vcfuti, the consensus function is buggy. Does anyone else have similar issues? Any ideas of what could be going wrong or a viable workaround? I have to process ~150 files w/ 10,000-60,000 reads of lengths 450-720 bp. 

A: Questions Regarding Consensus Sequence Calling With Samtools / Bcftools / Vcfuti

Details of commands, options, and outputs 


###command to generate Variant calls

> samtools mpileup -vf mnv1.fa -r mnv1:5473-5922 APL000000034.34.sorted.bam > APL000000034.34.amp2.vcf

[fai_load] build FASTA index.

[mpileup] 1 samples in 1 input files

<mpileup> Set max per-file depth to 8000


###consensus command and error message

> bcftools consensus -f mnv1.fa APL000000034.34.amp2.vcf > scratch.fa

Symbolic alleles other than <DEL> are currently not supported: <X> at mnv1:5474


###first three lines of purported "consensus" sequence

> head -3 scratch.fa 

>mnv1 gi|113478394|ref|NC_008311.1| Murine norovirus 1, complete genome

GTGAAATGAGGATGGCAACGCCATCTTCTGCGCCCTCTGTGCGCAACACAGAGAAACGCA

AAAACAAGAAGGCTTCGTCTAAAGCTAGTGTCTCCTTTGGAGCACCTAGCCCCCTCTCTT


###first three lines of reference sequence

> head -3 mnv1.fa

>mnv1 gi|113478394|ref|NC_008311.1| Murine norovirus 1, complete genome

GTGAAATGAGGATGGCAACGCCATCTTCTGCGCCCTCTGTGCGCAACACAGAGAAACGCAAAAACAAGAA

GGCTTCGTCTAAAGCTAGTGTCTCCTTTGGAGCACCTAGCCCCCTCTCTTCGGAGAGCGAAGACGAAATT

 

ADD COMMENTlink written 2.8 years ago by nirvan10

Can you show the variant at mnv1:5474? I assume it's some sort of duplication or other large structural variant. The consensus command can't handle them, so you might need to filter them out. However, I think GATK has a similar tool and perhaps it can handle this better.

ADD REPLYlink written 2.8 years ago by Devon Ryan74k
1

Also, I just tried generating a consensus sequence using bcftools call--A: error with vcfutils when building consensus sequence with samtools and bcftools , which you suggested to another user about a month ago. The result was basically the same. The output of the consensus function was the EXACT reference sequence, not truncated as in the original workflow

ADD REPLYlink written 2.8 years ago by nirvan10

Ah, I'd forgotten that I'd written that, good catch.

ADD REPLYlink written 2.8 years ago by Devon Ryan74k

Variant at mnv1:5474


> grep 5474 APL000000034.34.amp2.vcf

mnv1    5474    .    A    <X>    0    .   

\

DP=1321;I16=671,645,0,0,39501,1.24369e+06,0,0,44237,1.49124e+06,0,0,1358,1456,0,0;QS=1,0;

\

MQSB=2.39844e-05;MQ0F=0    PL    0,255,255

ADD REPLYlink written 2.8 years ago by nirvan10

Interesting, what version of bcftools is this? I would have expected the most recent ones to hand gVCFs.

ADD REPLYlink written 2.8 years ago by Devon Ryan74k

This is bcftools1.2. I don't believe the earlier versions have "consensus" function. 

 

ADD REPLYlink written 2.8 years ago by nirvan10
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 874 users visited in the last hour