The general question is this: what do platanus and redundans (or any other similar programs) do when they encounter a sequence segment with copy number greater than 2X?
The methods suggested in the preceding discussion all address the reduction of both copies of a diploid genome to one copy. However, in the echinoderm genomes I have been examining (~1% polymorphic, lots of large TE polymorphisms) there are other cases to handle, such as 4X and 3X regions, and it seems that the methods in those two programs will try to reduce these to 1X as well. That isn't right: 4X should reduce to 2X, and 3X to 1X (or 2X, see below).
A genome has a small direct duplication in one haplotype, resulting in 3X copy number for that region. These can be seen in the Sanger reads where we have those, or in the megareads from MaSuRCA where we don't. They are definitely real. Won't Platanus -u 1.0 collapse the repeat to a single copy? If it doesn't, what will redundans do with it? In other words, the case is ABC/ABBC, where A, B, and C are short sequences, each longer than a tuple (k-mer). I guess reducing ABBC to ABC is OK in this instance, since that is what the other allele is. However, if both alleles have the small duplication, ABBC/ABBC should reduce to ABBC, not ABC. Does it?
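To make the expectation in this case concrete, here is a minimal sketch of the rule I have in mind. It is not how Platanus or redundans work internally; it only checks whether a proposed collapsed sequence keeps, for every segment, a copy number that actually occurs in one of the two haplotypes. The function names and the single-letter segment encoding are my own, invented for illustration.

```python
from collections import Counter

def acceptable_copies(hap1, hap2):
    """For each segment, the copy numbers a haploid reduction could keep:
    the count seen in haplotype 1 or the count seen in haplotype 2."""
    c1, c2 = Counter(hap1), Counter(hap2)
    return {seg: {c1[seg], c2[seg]} for seg in c1.keys() | c2.keys()}

def is_faithful_reduction(collapsed, hap1, hap2):
    """True if every segment of the two haplotypes is kept in `collapsed`
    at a copy number found in at least one haplotype.
    Segment order is not checked; this only compares counts."""
    kept = Counter(collapsed)
    allowed = acceptable_copies(hap1, hap2)
    return all(kept[seg] in counts for seg, counts in allowed.items())

# Each letter stands for one segment (hypothetical toy encoding).
print(is_faithful_reduction("ABC", "ABC", "ABBC"))    # True: matches one allele
print(is_faithful_reduction("ABBC", "ABC", "ABBC"))   # True: matches the other
print(is_faithful_reduction("ABC", "ABBC", "ABBC"))   # False: the duplication is lost
```

Under this rule a 3X region may come out as 1X or 2X, but a 4X region that is 2X in both haplotypes must stay 2X.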
Consider a gene duplication: ABC/abc and A'B'C'/a'b'c'. If the tuple count plot across this region is examined, one finds 1X, 2X, 3X, and 4X tuples, with the proportions depending on how much change has accumulated. Ideally the two alleles in each set would be closer to each other than to the other set, and those pairs would reduce to ABC and A'B'C'. In practice it might not be possible to determine the allele pairs.
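For anyone who wants to see what I mean by the mix of 1X to 4X tuples, here is a small sketch that counts tuples across the four copies of a duplicated locus and tallies how many distinct tuples fall into each copy class. The four sequences are made-up toy strings, not real data, and the function names are mine. It also ignores reverse complements, which a real k-mer counter would normally merge.

```python
from collections import Counter

def tuple_counts(seqs, k):
    """Count every k-long tuple across all sequences (forward strand only)."""
    counts = Counter()
    for seq in seqs:
        for i in range(len(seq) - k + 1):
            counts[seq[i:i + k]] += 1
    return counts

def copy_spectrum(seqs, k):
    """Map copy number -> number of distinct tuples seen that many times."""
    return Counter(tuple_counts(seqs, k).values())

# Hypothetical toy locus: two paralogs, two alleles each, differing by SNPs.
locus = [
    "ACGTTGCAAGGCTTAC",  # ABC
    "ACGTTGCAAGGCTAAC",  # abc     (allele of ABC)
    "ACGATGCAAGGCTTAC",  # A'B'C'  (paralog)
    "ACGATGCATGGCTTAC",  # a'b'c'  (allele of the paralog)
]

for copies, n_tuples in sorted(copy_spectrum(locus, k=5).items()):
    print(f"{copies}X: {n_tuples} distinct tuples")
```

The more the copies have diverged, the more tuples shift from the 4X class down toward 1X, which is why the plot alone does not tell you which copies are allele pairs.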
A genome has a low copy number TE, say 6X. The copies are not guaranteed to line up in pairs at the same places on the two haplotypes, although in many cases they will. In other instances there might be 4 copies in one haplotype and 2 in the other. What would happen?