Question: Expression to tell bcftools consensus to treat phased heterozygotes differently from unphased?
gravatar for m.holly.elmore
5 months ago by
m.holly.elmore10 wrote:

tl;dr: I want help writing an expression for bcftools consensus to tell it to put Ns in my haplotype alternate references when it comes across unphased heterozygous sites. Ideally, someone would just give me an expression for "when allele 1 and 2 are not the same and they are separated by a /, put an N in the consensus sequence" or something else that will do the job.

I have 86 non-model genomes from the same species, one PacBio+Ilumina reference assembly, 86 Illumina-only individual sample alignments to that reference, and a vcf of variants called by GATK which I then phased (as much as I could) with WhatsHap.

Now I want sequences for the two alleles in each individual over a ~2 kb stretch of the mating locus. This region is highly heterozygous, and it looks as though 50-80% of it phased (varying greatly by individual). I'm attempting to make short "alternate reference" sequences of each haplotype by sticking the variants for each sample onto the reference sequence using bcftools consenus. The problem is that, if I use the program with default settings, for unphased heterozygous sites it always puts the REF allele in haplotype 1 and ALT in haplotype 2. I want to do phylogenetics on these sequences, so I can't have them biased in this way. I want unphased heterozygous genotypes to be entered as Ns (or possiby with the right IUPAC ambiguity code), but I can't figure out how to tell the program that.

bcftools consensus allows "expressions" to make more complicated requests, but I am not seeing how to tell it to distinguish between phased (put in consensus sequence) and unphased (IUPAC or N instead) heterozygotes. None of the canned expressions make any reference to phasing status. I'm hoping there's some way to tell the consensus program to note the difference between a | and a / but there are no examples of expressions that parse the text in the fields of the vcf. Surely there must be some way to logically express "when allele 1 and 2 are not the same and they are separated by a /, put an N in the consensus sequence," but the examples all use "het" for heterozygotes instead of something that uses the / character.

Please and thank you!

snp sequence genome • 185 views
ADD COMMENTlink modified 4 months ago • written 5 months ago by m.holly.elmore10
gravatar for m.holly.elmore
4 months ago by
m.holly.elmore10 wrote:

Update: Petr Danecek added a new option to address this.

ADD COMMENTlink written 4 months ago by m.holly.elmore10
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 2046 users visited in the last hour