bcftools consensus sequence shorter than reference
0
0
Entering edit mode
5.7 years ago
ieie ▴ 10

Dear all, I am trying to get the consensus sequence of a genome using:

~/samtools-1.3/samtools mpileup -Ou -f REF alnsorted.bam | bcftools call -mv -Oz -o 1aln_paired.vcf.gz 

tabix aln_paired.vcf.gz

cat REF | bcftools consensus aln_paired.vcf.gz > consensus_paired.fa

I actually get a consensus sequence but when I perform the last line I get several times this message:

the site x overlaps with another variant, skipping...

I also get shorter consensus sequences, my reference is 159.295 bp long and for each constructed genome I get around 158.700 158.900 bp.

Does anybody know if these results are wrong or it can happen that I get a warning message and the consensus sequences are shorter than the reference? thanks a lot!

genome bcftools • 1.9k views
ADD COMMENT
1
Entering edit mode

Just a guess: Are there lot's of deletions in your vcf?

ADD REPLY
0
Entering edit mode

If I understand the format correctly probably yes, as I have a lot of positions where there is this

INDEL;IDV=19;IMF=0.904762;DP=21;VDB=0.209342;SGB=-0.69168;MQSB=1;MQ0F=0;AC=2;AN=2;DP4=0,0,13,6;MQ=60 GT:PL 1/1:255,57,0

ADD REPLY
0
Entering edit mode

Not sure if still on time, but you can try to workaround it by retaining only biallelic sites. Also, you can run a script to remove lines with more than one character in your sequence. This will remove indels. You can do it manually with awk 'length($REF_column))==1' {print $0}' input.file | awk 'length($ALT_column))==1' {print $0}'

(Substitute the REF and ALT columns for your corresponding column numbers)

ADD REPLY

Login before adding your answer.

Traffic: 2559 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6