Question

IUAPC codes in consensus sequence ? samtools mpileup / bcftools call / bcftools consensus

1

Entering edit mode

7.1 years ago

Kevin D ▴ 30

Hello everyone!

I mapped Forward and Reverse reads from one plant genome against a reference fasta sequence. As expected, there are varaints (SNP and indels). I am interested in retrieving the consensus sequence for downstream analysis. The reads come from a diploid plant genome and the gene of interest is likely in single copy. However, my plant genome is clearly not 100% homozygous. So, in the consensus sequence I should see homozygous variants and heterozygous variants coded with IUAPC letters. I ran this codes and everything worked well:

samtools mpileup -d8000 -A -uf ref.fasta sample.sorted.bam > sample.mpileup
bcftools call -mv -Oz -o sample.calls.vcf.gz sample.mpileup
tabix sample.calls.vcf.gz
cat ref.fasta | bcftools consensus sample.calls.vcf.gz > sample.consensus.fasta

The only problem is that I have no heterozygous site in the consensus sequence despite I clearly see some in the raw mapping alignment. How to get heterozygous site codes with IUAPC in the consensus sequence ?

Kevin

Ps: I tried the -i option in bcftools consensus but then all variant sites were codes heterozygous. Also the coverage is quite OK (20x)

bcftools consensus IUAPC • 2.6k views

ADD COMMENT • link updated 7.1 years ago by Pierre Lindenbaum 161k • written 7.1 years ago by Kevin D ▴ 30

score 0 · Answer 1 · 2017-03-17

0

Entering edit mode

7.1 years ago

Pierre Lindenbaum 161k

as far as I can see there is a --iupac-codes option: in https://github.com/samtools/bcftools/blob/master/consensus.c#L626

 fprintf(stderr, "    -i, --iupac-codes          output variants in the form of IUPAC ambiguity codes\n");

ADD COMMENT • link 7.1 years ago by Pierre Lindenbaum 161k

1

Entering edit mode

Yes, that is this option I mentionned I my post scriptum but when I turn it on, then all the variant positions appear heterozygous. In fact that is not true: I checked the mapping output and some variant are homozygous and others are heterozygous between reads, that is because they reflect two different alleles... I think that the reason why I got all variant heterozygous is beacause the comparison is made between reads from my plant genome and the reference and not just between my reads despite they sometimes carry two alleles (diploid plant, highly heterozygous, single copy gene) in a 1:1 proportion