I am working with a de novo genome assembly made up of thousands of scaffolds. The assembly is not close to having even some of those scaffolds localized to chromosomes. One problem I have noticed is that two SNPs that sit next to each other on a scaffold, for example, may not actually come from the same chromosome at all. How do I know? Based on BLAST searches, this stretch of the assembly maps to two different chromosomes in the genome of a closely related species. This is not an isolated occurrence.
I know that one way this happens is through overly aggressive joining of contigs during scaffolding. However, the average person using these assemblies doesn't know that such errors occur and isn't really looking for them either. This could easily have been me two years ago.
My questions are: 1. Is there a systematic approach to finding these types of mis-assembly errors and then breaking the affected scaffolds apart? 2. If the scaffolds are not broken up, how reliable (or even worthwhile) is it to perform variant calling, de novo assembly of haplotypes, or any sort of linkage analysis with the resulting variant dataset? I know the short answer is that it may not be reliable, but we do this kind of work in order to use the results for population studies. So there has to be a way either to use the assembly as it is, or to manually exclude regions of it from further consideration.
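
To make the BLAST-based check I described concrete, it could be sketched roughly as below. This is only an illustration, not an established tool: it assumes default tabular BLAST output (`-outfmt 6`) with my scaffolds as queries and the related species' chromosomes as subjects, and the function names, the bitscore cutoff, and the `min_run` threshold are placeholders I made up. Rearrangements between the two species would also show up as "breakpoints" here, so the output is a list of candidates to inspect, not confirmed mis-joins.

```python
from collections import defaultdict
from itertools import groupby


def parse_blast_tab(lines):
    """Parse default BLAST tabular (outfmt 6) lines.

    Returns tuples of (scaffold, q_start, q_end, chromosome, bitscore).
    Assumes the subject ID (column 2) is the chromosome name.
    """
    hits = []
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue
        f = line.rstrip("\n").split("\t")
        q_start, q_end = sorted((int(f[6]), int(f[7])))
        hits.append((f[0], q_start, q_end, f[1], float(f[11])))
    return hits


def candidate_breakpoints(hits, min_bitscore=200.0, min_run=2):
    """Flag scaffold intervals where consecutive hits switch chromosome.

    Hits are ordered along each scaffold and collapsed into runs that map
    to the same chromosome. Runs with fewer than `min_run` hits are dropped
    as likely spurious (hypothetical threshold). A candidate breakpoint is
    reported between two surviving neighbouring runs on different chromosomes.
    """
    by_scaffold = defaultdict(list)
    for scaffold, start, end, chrom, score in hits:
        if score >= min_bitscore:
            by_scaffold[scaffold].append((start, end, chrom))

    breaks = []
    for scaffold, rows in by_scaffold.items():
        rows.sort()  # order hits by position along the scaffold
        runs = []
        for chrom, group in groupby(rows, key=lambda r: r[2]):
            group = list(group)
            run_start = group[0][0]
            run_end = max(g[1] for g in group)
            runs.append((chrom, run_start, run_end, len(group)))
        runs = [r for r in runs if r[3] >= min_run]
        for left, right in zip(runs, runs[1:]):
            if left[0] != right[0]:
                # (scaffold, end of left run, start of right run, chroms)
                breaks.append((scaffold, left[2], right[1], left[0], right[0]))
    return breaks
```

Each reported tuple gives a scaffold and the interval within which the join would have to be broken, or which could be masked out before variant calling.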