Question: Can discoSnp distinguish between homeologous loci?
24 days ago by
Yahan370 wrote:

In polyploid genomes with limited variation between subgenomes, read mapping is generally a challenge: reads are frequently mismapped, which leads to calling homeologous variants, i.e. variations that are actually differences between the subgenomes.

When using discoSnp, you can imagine that these loci could also collapse, so that variants would appear heterozygous in all samples if the locus is non-branching. One could then use read frequency to decide whether the variant is homeologous. In a tetraploid, one would expect to observe a 50/50 distribution of the two alleles for a homeologous variant, or a 25/75 distribution for a true variant, as the non-variant locus would contribute relatively more to the stack.
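To make the read-frequency idea concrete, here is a minimal sketch that compares the two expectations (50/50 versus 25/75) for one variant in one sample using a binomial log-likelihood ratio. The function name, the labels and the `min_llr` threshold are hypothetical choices for illustration and are not part of discoSnp. The sketch also assumes a tetraploid with two subgenomes and ignores sequencing error and coverage bias, so it should be read as a starting point rather than a validated classifier.

```python
import math


def classify_allele_ratio(count_a, count_b, min_llr=2.0):
    """Compare a 50/50 and a 25/75 allele-ratio hypothesis for one variant.

    count_a, count_b: read counts supporting each allele in one sample.
    min_llr: hypothetical threshold on the absolute log-likelihood ratio
             below which the call is reported as ambiguous.
    """
    total = count_a + count_b
    if total == 0:
        return "ambiguous"
    minor = min(count_a, count_b)

    def log_likelihood(p_minor):
        # Binomial log-likelihood without the binomial coefficient,
        # which is identical under both hypotheses and cancels out.
        return minor * math.log(p_minor) + (total - minor) * math.log(1.0 - p_minor)

    llr = log_likelihood(0.5) - log_likelihood(0.25)
    if abs(llr) < min_llr:
        return "ambiguous"
    # Positive ratio: counts fit 50/50 better (homeologous-like pattern).
    return "homeologous_like" if llr > 0 else "true_variant_like"


if __name__ == "__main__":
    for counts in [(50, 50), (25, 75), (1, 3)]:
        print(counts, classify_allele_ratio(*counts))
```

With low coverage the two hypotheses are hard to separate, which is why the sketch returns "ambiguous" instead of forcing a call.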

However, when branching becomes more complex, such an approach may become difficult, and such variants might still end up in the same locus while actually originating from different loci. Are there any strategies that could be applied to recognize these cases and discern between the different loci?

Related to this, when does discoSnp decide that a graph has become too complex and split it into separate graphs?

Thanks for the reply.

17 days ago by
pierre.peterlongo710 wrote:

Hello Yahan,

Thanks for your question here.

Generally speaking, discoSnp does confuse true variants with inter-repetition or inter-genome variations: on its own, it cannot tell them apart.

There are, however, several ways of sorting out true and false positive calls.

  • One expects true variants to discriminate individuals and thus to discriminate read sets. This is why we proposed the rank value associated with each variant. In our experiments, we have shown that variants with rank < 0.2 are likely false positives, while most (>95%) of the other ones are true positives (see Fig. 7 of the paper).
  • If variants are mapped on a reference genome (for instance using the VCFcreator tool integrated into discoSnp), uniquely mapped variants (marked as PASS) are likely to be true positives, while the other ones (marked as MULTIPLE) are more likely due to inter-repetition or inter-genome variations.
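The two criteria above can be sketched as a small filter over VCF records. This is only an illustration under assumptions: it assumes the rank is stored in the INFO column under a key named `Rk` (check the header of your own VCF and adjust `rank_key` if needed), and the helper names `parse_info` and `keep_variant` are hypothetical, not part of discoSnp.

```python
def parse_info(info_field):
    """Turn a VCF INFO string such as 'Ty=SNP;Rk=0.8' into a dict."""
    info = {}
    for item in info_field.split(";"):
        if "=" in item:
            key, value = item.split("=", 1)
            info[key] = value
    return info


def keep_variant(vcf_line, min_rank=0.2, require_pass=True, rank_key="Rk"):
    """Return True if a VCF data line passes the rank and FILTER criteria.

    min_rank: variants with a rank below this value are dropped (0.2 is the
              threshold quoted above).
    require_pass: if True, keep only records whose FILTER column is PASS,
                  i.e. uniquely mapped variants.
    rank_key: INFO key assumed to hold the rank (an assumption, see above).
    """
    fields = vcf_line.rstrip("\n").split("\t")
    if len(fields) < 8:
        return False  # not a complete VCF data line
    filter_value = fields[6]         # 7th VCF column: FILTER
    info = parse_info(fields[7])     # 8th VCF column: INFO
    if require_pass and filter_value != "PASS":
        return False
    try:
        rank = float(info[rank_key])
    except (KeyError, ValueError):
        return False  # rank missing or not numeric
    return rank >= min_rank


def filter_vcf(lines, **kwargs):
    """Yield header lines unchanged and data lines that pass keep_variant."""
    for line in lines:
        if line.startswith("#") or keep_variant(line, **kwargs):
            yield line
```

A typical use would be `filter_vcf(open("variants.vcf"))`, where `variants.vcf` is a placeholder file name. Setting `require_pass=False` applies the rank criterion alone, which is the only option when the variants were not mapped on a reference genome.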

Hoping this helps,



Thanks for the clear explanation, Pierre.

Reply written 16 days ago by Yahan370