Can discoSnp distinguish between homeologous loci?
5.6 years ago
Yahan ▴ 400

In polyploid genomes with limited variation between subgenomes, read mapping is generally a challenge: reads are frequently mismapped, which leads to calling homeologous variants, i.e. variations that are actually differences between the subgenomes.

When using discoSnp, one can imagine that these loci also collapse, so that the variant appears heterozygous in all samples if the locus is non-branching. One could then use read frequencies to decide whether the variant is homeologous. In a tetraploid, one would expect a 50/50 distribution of the two alleles for a homeologous variant, or a 25/75 distribution for a true variant, as the non-variant locus would contribute relatively more to the stack.
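To make the read-frequency idea concrete, here is a minimal sketch in Python. It is not part of discoSnp: the function names and labels are hypothetical, and it assumes the per-sample read counts of the two alleles have already been extracted. It simply compares how well the minor-allele count fits an expected fraction of 0.5 versus 0.25.

```python
import math


def log_lik(minor, total, p):
    """Binomial log-likelihood of the minor-allele count, up to a constant.

    The binomial coefficient is omitted because it is identical for both
    hypotheses being compared.
    """
    return minor * math.log(p) + (total - minor) * math.log(1 - p)


def classify_counts(count_a, count_b):
    """Label one sample by the allele ratio it fits best (hypothetical labels)."""
    total = count_a + count_b
    if total == 0:
        return "no-data"
    minor = min(count_a, count_b)
    ll_half = log_lik(minor, total, 0.5)      # expected for a collapsed homeologous site
    ll_quarter = log_lik(minor, total, 0.25)  # expected for a true variant in one subgenome
    # Samples showing a single allele also fit 0.25 better than 0.5,
    # so they end up in the "true-variant-like" group.
    return "homeologous-like" if ll_half >= ll_quarter else "true-variant-like"


def looks_homeologous(samples):
    """True if every sample with coverage shows a roughly 50/50 ratio."""
    labels = [classify_counts(a, b) for a, b in samples]
    informative = [label for label in labels if label != "no-data"]
    return bool(informative) and all(
        label == "homeologous-like" for label in informative
    )
```

For example, `looks_homeologous([(48, 52), (30, 29)])` flags the variant, while a single sample with counts such as `(25, 75)` is enough to leave it unflagged. With low coverage the two ratios are hard to tell apart, so a minimum depth threshold would probably be needed in practice.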

However, when branching becomes more complex, such an approach may become difficult, and variants that actually originate from different loci might still end up in the same locus. Are there any strategies that could be applied to recognize these cases and discern between the different loci?

Related to this, when does discoSnp decide that a graph has become too complex and split it into separate graphs?

Thanks for the reply.

discosnp SNP homeologous loci • 1.1k views
5.6 years ago

Hello Yahan,

Thanks for your question here.

Generally speaking, discoSnp does confuse true variants with variations between repeat copies or between subgenomes.

There are, however, several ways to separate true positive calls from false positive ones.

  • One expects true variants to discriminate individuals and thus read sets. This is why we proposed the rank value associated with each variant. In our experiments, variants with rank < 0.2 were likely false positives, while most (>95%) of the others were true positives (see Fig. 7 of the paper).
  • If variants are mapped on a reference genome (for instance using the VCFcreator tool integrated into discoSnp), uniquely mapped variants (marked as PASS) are likely to be true positives, while the others (marked as MULTIPLE) are more likely due to inter-repeat or inter-genome variations.
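The two criteria above can be combined in a small post-filter. The following Python sketch is only an illustration, not a discoSnp tool: it assumes a tab-separated VCF in which the FILTER column holds PASS or MULTIPLE and the rank is stored in the INFO column under the key `Rk`. Please check the header of your own VCF, as that key name is an assumption here, and `keep_variant` and `parse_info` are hypothetical helper names.

```python
def parse_info(info_field):
    """Turn a VCF INFO string such as 'Ty=SNP;Rk=0.8' into a dict."""
    pairs = (item.split("=", 1) for item in info_field.split(";") if "=" in item)
    return {key: value for key, value in pairs}


def keep_variant(vcf_line, min_rank=0.2):
    """Keep a record only if it is uniquely mapped and its rank is high enough.

    Assumptions (verify against your VCF header):
      - column 7 (FILTER) is 'PASS' for uniquely mapped variants,
      - the rank is stored as 'Rk=<float>' in column 8 (INFO).
    """
    if vcf_line.startswith("#"):
        return False  # header lines are not variant records
    fields = vcf_line.rstrip("\n").split("\t")
    if len(fields) < 8:
        return False  # malformed record
    filter_value = fields[6]
    info = parse_info(fields[7])
    try:
        rank = float(info["Rk"])
    except (KeyError, ValueError):
        return False  # no usable rank, so be conservative
    return filter_value == "PASS" and rank >= min_rank
```

Applied line by line to a VCF file, this keeps the variants that satisfy both criteria. Requiring both at once is a conservative choice made for this sketch; either criterion can also be used on its own, and the 0.2 threshold can be adjusted through `min_rank`.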

Hoping this helps,



Thanks for the clear explanation Pierre.

