Question

Can discoSnp distinguish between homeologous loci?


7.4 years ago

Yahan ▴ 400

In polyploid genomes with limited variation between subgenomes, read mapping is generally a challenge: reads are frequently mismapped, which leads to calling homeologous variants, i.e. variations that are actually differences between the subgenomes.

When using discoSnp, you can imagine that these loci could also collapse, so that variants would appear heterozygous in all samples if the locus is non-branching. One could then use read frequency to decide whether the variant is homeologous. In a tetraploid, one would expect a 50/50 distribution of the two alleles for a homeologous variant, and a 25/75 distribution for a true variant, since the non-variant locus would contribute relatively more reads to the stack.
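To make the read-frequency idea concrete, here is a minimal sketch that compares how well the two expected ratios explain a pair of allele read counts for one sample. The function names and labels are hypothetical, the read counts are assumed to come from whatever per-sample coverage values you extract yourself, and a plain binomial model is assumed (no sequencing error, no overdispersion), so treat the result as a rough indication only.

```python
from math import comb, log


def log_binom(k, n, p):
    """Log-probability of observing k reads of one allele out of n under Binomial(n, p)."""
    return log(comb(n, k)) + k * log(p) + (n - k) * log(1 - p)


def classify_allele_ratio(alt_reads, ref_reads):
    """Return which expected ratio fits the counts better (hypothetical labels).

    50/50 -> pattern expected for a collapsed homeologous variant in a tetraploid
    25/75 -> pattern expected for a true variant (either allele may be the minor one)
    """
    n = alt_reads + ref_reads
    if n == 0:
        return "no_coverage"
    ll_half = log_binom(alt_reads, n, 0.5)
    # The minor allele can be either one, so test both 0.25 and 0.75.
    ll_quarter = max(log_binom(alt_reads, n, 0.25),
                     log_binom(alt_reads, n, 0.75))
    return "homeologous_like" if ll_half >= ll_quarter else "true_variant_like"


if __name__ == "__main__":
    for counts in [(50, 50), (25, 75), (75, 25)]:
        print(counts, classify_allele_ratio(*counts))
```

A homeologous variant should also show the 50/50 pattern in every sample, so in practice one would probably apply this per sample and look at the agreement across samples.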

However, when branching becomes more complex, such an approach may become difficult, and variants might still end up in the same locus while actually originating from different loci. Are there any strategies that could be applied to recognize these cases and discern between the different loci?

Related to this, when does discoSnp decide that a graph has become too complex and split it into separate graphs?

Thanks for the reply.

discosnp SNP homeologous loci • 1.5k views

updated 7.3 years ago by pierre.peterlongo ▴ 900 • written 7.4 years ago by Yahan ▴ 400

score 1 · Accepted Answer · 2018-03-05

Hello Yahan,

Thanks for your question here.

Generally speaking, discoSnp on its own can confuse true variants with inter-repetition or inter-genome variations.

There are then several ways of sorting out the true and false positive calls.

True variants are expected to discriminate individuals, and thus read sets. This is why we proposed the rank value associated with each variant. In our experiments, variants with rank < 0.2 were likely false positives, while most (>95%) of the others were true positives (see Fig. 7 of the paper).
If variants are mapped to a reference genome (for instance using the VCFcreator tool integrated into discoSnp), uniquely mapped variants are likely to be true positives (marked as PASS), while the others (marked as MULTIPLE) are more likely due to inter-repetition or inter-genome variations.
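The two criteria above can be combined in a small filter over the VCF data lines. This is only a sketch: the function names are hypothetical, and it assumes that the PASS/MULTIPLE mark sits in the standard FILTER column and that the rank is stored in the INFO column under a key such as `Rk` (an assumption; check the header of your own file and adjust `rank_key`).

```python
def parse_info(info_field):
    """Turn a VCF INFO string such as 'Ty=SNP;Rk=0.85' into a dict."""
    info = {}
    for item in info_field.split(";"):
        key, sep, value = item.partition("=")
        info[key] = value if sep else True  # flag entries carry no value
    return info


def keep_variant(vcf_line, min_rank=0.2, rank_key="Rk"):
    """Keep a variant only if it is uniquely mapped (PASS) and its rank is >= min_rank.

    rank_key is an assumption about how the rank is named in INFO.
    """
    fields = vcf_line.rstrip("\n").split("\t")
    if len(fields) < 8:
        return False  # not a complete VCF data line
    filter_value = fields[6]  # FILTER column
    info = parse_info(fields[7])  # INFO column
    try:
        rank = float(info.get(rank_key))
    except (TypeError, ValueError):
        return False  # rank missing or not numeric
    return filter_value == "PASS" and rank >= min_rank


if __name__ == "__main__":
    example = "chr1\t100\tid1\tA\tG\t.\tPASS\tTy=SNP;Rk=0.85"
    print(keep_variant(example))
```

The 0.2 threshold is the one quoted above; it may be worth tuning on your own data, especially for polyploids.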

Hoping this helps,

Pierre