I'm helping a colleague analyze their RAD/GBS data from a flowering plant. They believed the genome was diploid (2n), based on flow cytometry of the samples I'm analyzing and a karyotype or two from the literature. After genotyping, however, there's excess heterozygosity and a lot of loci are not in HWE.
When I make a histogram of allele balance (AB, the proportion of alternate reads in heterozygotes) for each SNP (pulled from the VCF), the distribution is multimodal, with several peaks. For simplicity, I folded the values: if AB was greater than 0.5, I subtracted it from one. The peaks, in order of prevalence, are at 0.25, 0.125, 0.5, and 0.33. For comparison, a known diploid only had peaks at 0.5, 0.25, and maybe 0.33.
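To make the folding step concrete, here is a minimal sketch of how the folded AB can be computed from per-genotype read counts. The function name and the example counts are made up for illustration, and it assumes the reference and alternate read depths have already been extracted for heterozygous calls.

```python
def folded_allele_balance(ref_reads: int, alt_reads: int) -> float:
    """Return the alternate-read proportion, folded onto the range [0, 0.5]."""
    total = ref_reads + alt_reads
    if total == 0:
        raise ValueError("no reads at this genotype call")
    ab = alt_reads / total
    # Fold: values above 0.5 are reflected, so 0.75 and 0.25 land in the same bin.
    return min(ab, 1.0 - ab)


# Hypothetical (ref, alt) read counts for five heterozygous calls.
example_counts = [(30, 10), (10, 30), (20, 20), (35, 5), (20, 10)]
folded = [round(folded_allele_balance(r, a), 3) for r, a in example_counts]
print(folded)
```

Folding loses the information about which allele is the minor one, which is fine for spotting peaks but means 1:3 and 3:1 loci are counted together.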
My initial interpretation is that this plant is tetraploid. The AB peak at 0.25 would be due to 1 alternate and 3 reference alleles, or vice versa. The 0.5 peak can be caused by equal numbers of alt and ref alleles. The 0.125 peak can be caused by duplicated copies of the homologs (8 copies in total), allowing for 1 alt and 7 refs, or vice versa. The 0.33 peak, though, is harder to explain. The fruits of all samples had seeds, so triploidy is not a viable hypothesis.
An alternative to the "tetraploid" hypothesis is that this genome has a high proportion of duplicated sections but is still diploid. In this scenario, one homolog could carry one copy of a locus while the other homolog carries 2 copies, giving 3 copies in total and allowing for an AB of 0.33. The other AB values could be explained by copy number variation among loci. Copy number does not vary among individuals. I'm just not sure whether this level of duplication without a change in ploidy is possible.
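The arithmetic behind both hypotheses can be sketched by listing which folded AB values are possible for a given total number of copies of a locus. This is only an idealized illustration: the function name is hypothetical, and it assumes every copy is sequenced and mapped equally well, which real RAD/GBS data will not satisfy exactly.

```python
from fractions import Fraction


def expected_folded_ab(copies: int) -> list[Fraction]:
    """Folded AB values possible when a locus is present in `copies` copies.

    Assumes reads are sampled equally from every copy (no mapping or PCR bias).
    """
    values = set()
    for alt in range(1, copies):  # at least one alt and one ref copy
        ab = Fraction(alt, copies)
        values.add(min(ab, 1 - ab))
    return sorted(values)


# Copy numbers relevant to the peaks described above.
for copies in (2, 3, 4, 8):
    print(copies, [str(v) for v in expected_folded_ab(copies)])
```

Under these assumptions, 2 copies give only 1/2, 3 copies give only 1/3, 4 copies give 1/4 and 1/2, and 8 copies add 1/8 and 3/8, so the 0.33 peak needs a locus with a copy number divisible by 3 under either hypothesis.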
Can anybody shed some light here?