Hi everyone,
I am working on a phylogeny based on BUSCOs extracted from low-coverage genome assemblies. The coverage of my genomes ranges from 2X to 10X. As expected, the recovery of complete single-copy BUSCOs is very variable and rather low in most cases (~5-68%).
I generated a list of complete single-copy BUSCOs for each terminal based on the .tsv output files and extracted the corresponding sequences directly from the single_copy_busco_sequences output folder. When I checked the individual locus alignments, I found that ~30% of them contain more than one set of clearly different sequences. In some cases the alignment contains only a couple of "weird" sequences. In other cases the alignment consists of two or more different sets of sequences. I attach here a couple of alignments as examples. I am sure the sequences are wrong because they affect a random set of taxa that are not closely related.
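To make the list-building step concrete, here is a minimal Python sketch of that kind of filtering. It assumes the usual BUSCO full-table layout (tab-separated, comment lines starting with "#", BUSCO ID in the first column and status in the second). The function names are only illustrative and not part of BUSCO.

```python
from collections import Counter

def complete_single_copy_ids(table_text):
    """Return the BUSCO IDs whose status is 'Complete' in one full table.

    Assumption: tab-separated text, '#' comment lines, ID in column 1,
    status in column 2 (Complete / Duplicated / Fragmented / Missing).
    """
    ids = set()
    for line in table_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and header comments
        fields = line.split("\t")
        if len(fields) >= 2 and fields[1] == "Complete":
            ids.add(fields[0])
    return ids

def locus_occupancy(tables):
    """Count, for each BUSCO ID, in how many taxa it is complete.

    tables: dict mapping a taxon name to the text of its full table.
    """
    counts = Counter()
    for text in tables.values():
        counts.update(complete_single_copy_ids(text))
    return counts
```

With the text of each taxon's table read into a dict, locus_occupancy gives a quick view of how many terminals each locus was recovered in, which helps to decide which alignments are worth building at all.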
I wanted to ask if anyone has experienced this issue before, and what the reason could be. The only explanation I can imagine is that, because the coverage is low, when the proper gene is not present in the assembly I may be getting a wrongly assigned sequence as the best hit. But even in that case, I wouldn't expect to get so many misassigned sequences, or sequences so different from each other for the same BUSCO.
Finally, I tried to find an automatic strategy to clean the alignments, i.e. to remove the "weird" sequences from problematic alignments or to discard the problematic alignments altogether. Nothing I tried worked, and the only solution I have found so far is to remove the bad alignments manually.
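As an illustration of the kind of automatic filter I have in mind, here is a rough Python sketch that flags sequences whose median uncorrected distance (p-distance) to the rest of the alignment is high. The 0.5 threshold, the minimum overlap of 20 columns, and the function names are arbitrary assumptions for the example, not tested recommendations.

```python
from statistics import median

# Characters treated as missing data (assumption: protein alignments,
# where X is an unknown residue).
GAP_CHARS = set("-?Xx")

def p_distance(a, b, min_overlap=20):
    """Proportion of differing sites over columns where both sequences
    have a residue. Returns None if too few columns can be compared."""
    compared = diffs = 0
    for x, y in zip(a, b):
        if x in GAP_CHARS or y in GAP_CHARS:
            continue
        compared += 1
        if x.upper() != y.upper():
            diffs += 1
    if compared < min_overlap:
        return None
    return diffs / compared

def flag_outliers(alignment, max_median_dist=0.5, min_overlap=20):
    """Return names of sequences whose median p-distance to all other
    sequences exceeds max_median_dist.

    alignment: dict mapping sequence name to aligned sequence string.
    """
    flagged = []
    for name, seq in alignment.items():
        dists = []
        for other, other_seq in alignment.items():
            if other == name:
                continue
            d = p_distance(seq, other_seq, min_overlap)
            if d is not None:
                dists.append(d)
        if dists and median(dists) > max_median_dist:
            flagged.append(name)
    return flagged
```

A filter like this can only catch the first case I described (a couple of "weird" sequences among many good ones). When the alignment splits into two or more groups of similar size, every sequence has a high median distance, so it cannot tell which group is the correct one.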
I would appreciate any insight or suggestion about this problem and how I could solve it.
Thank you in advance.