Hi, all. Can anyone tell me why I might have very low duplicate BUSCOs after assembly but quite high after annotation?
After assembly: C:98.1%[S:97.0%,D:1.1%],F:0.6%,M:1.3%,n:2124
After annotation: C:97.0%[S:59.7%,D:37.3%],F:1.0%,M:2.0%,n:2124
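To put those percentages on a common scale, they can be converted into approximate BUSCO counts. This is a minimal Python sketch, assuming the one-line summary notation shown above; the function name `summary_to_counts` is only illustrative and is not part of BUSCO.

```python
import re

# Matches the BUSCO one-line summary notation, e.g.
# C:98.1%[S:97.0%,D:1.1%],F:0.6%,M:1.3%,n:2124
SUMMARY_RE = re.compile(
    r"C:(?P<C>[\d.]+)%\[S:(?P<S>[\d.]+)%,D:(?P<D>[\d.]+)%\],"
    r"F:(?P<F>[\d.]+)%,M:(?P<M>[\d.]+)%,n:(?P<n>\d+)"
)


def summary_to_counts(line):
    """Convert a BUSCO one-line summary into approximate BUSCO counts."""
    m = SUMMARY_RE.search(line)
    if m is None:
        raise ValueError(f"not a BUSCO summary line: {line!r}")
    n = int(m.group("n"))
    # The percentages are rounded to one decimal, so the counts are approximate.
    return {key: round(float(m.group(key)) / 100 * n) for key in "CSDFM"}


assembly = summary_to_counts("C:98.1%[S:97.0%,D:1.1%],F:0.6%,M:1.3%,n:2124")
annotation = summary_to_counts("C:97.0%[S:59.7%,D:37.3%],F:1.0%,M:2.0%,n:2124")
print("duplicated after assembly:  ", assembly["D"])
print("duplicated after annotation:", annotation["D"])
```

With n:2124 this works out to roughly 23 duplicated BUSCOs after assembly versus roughly 792 after annotation.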
By the way, this is an insect sequenced with 10x Genomics and assembled as "long reads" with Supernova 2.0.1. I annotated it using a close relative as a reference, with MAKER set for Eukaryota, and ran BUSCO 5.2.2 with the Endopterygota odb10 lineage.
According to the recent BUSCO paper, duplicated BUSCOs can arise from poor assembly of haplotypes, but the level of phasing I got with Supernova should be decent.
Thanks for any advice!
<strike>Sounds like a fragmented annotation.</strike>
You could check what happens to the BUSCO genes between your assembly and your annotation using agat_sp_compare_two_BUSCOs.pl from AGAT.
Thanks for the quick response, Juke. I'll look into AGAT, although I think I've had problems with both Singularity and Docker.
What exactly do you mean by fragmented annotation? I masked with my custom repeat library and ran 3 rounds of Maker using a close relative as protein evidence, along with the Swiss-Prot omnibus. Each round used Augustus and SNAP trained on the prior round. So I'm not sure where the problem would occur, or what I should do differently. (I've also done downstream GO work, but may need to redo it depending on what the issue is.)
My first thought was a fragmented annotation (i.e. a gene that is complete in the assembly but annotated in several pieces). But that can't be right, because those genes would end up in BUSCO's Fragmented category, which is not what you see. Using
agat_sp_compare_two_BUSCOs.pl
you should be able to work out what is happening. Load the resulting tracks into a genome browser and look at which BUSCOs are duplicated in the annotation: were they already found by BUSCO in the assembly, or are they new genes? Your annotation BUSCO score is really good. I think the annotation went well and you annotated "new" genes. Whether they are real duplicates or artifacts of assembly/phasing issues is what you should investigate.
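If AGAT is awkward to install, a rough version of the same check can be done directly on the two BUSCO outputs. The sketch below is not the AGAT script and does not produce browser tracks. It assumes each run wrote the usual tab-separated `full_table.tsv` with the BUSCO id in column 1, the status (Complete, Duplicated, Fragmented, Missing) in column 2 and the matching sequence id in column 3. The function names are hypothetical.

```python
import csv
from collections import defaultdict


def load_busco_table(path):
    """Read a BUSCO full_table.tsv into (status by id, sequence ids by id)."""
    status = {}
    hits = defaultdict(list)
    with open(path, newline="") as handle:
        for row in csv.reader(handle, delimiter="\t"):
            # Skip blank lines and the '#' comment/header lines.
            if not row or row[0].startswith("#"):
                continue
            busco_id, state = row[0], row[1]
            status[busco_id] = state
            # Missing BUSCOs have no sequence column; Duplicated ones
            # appear on several rows, one per matching sequence.
            if len(row) > 2 and row[2]:
                hits[busco_id].append(row[2])
    return status, hits


def newly_duplicated(assembly_table, annotation_table):
    """BUSCOs that are single-copy in the assembly run but Duplicated
    in the annotation run, with the sequences they matched in each."""
    asm_status, asm_hits = load_busco_table(assembly_table)
    ann_status, ann_hits = load_busco_table(annotation_table)
    result = {}
    for busco_id, state in ann_status.items():
        if state == "Duplicated" and asm_status.get(busco_id) == "Complete":
            result[busco_id] = {
                "assembly": asm_hits[busco_id],
                "annotation": ann_hits[busco_id],
            }
    return result
```

Calling `newly_duplicated()` with the paths to the assembly and annotation `full_table.tsv` files returns, for each affected BUSCO, the scaffold it matched in the assembly and the gene models it matched in the annotation. Those ids can then be looked up in the genome browser to see whether the copies sit at distinct loci or overlap at the same one.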
Thanks, Juke. I think you're right. I've been using JBrowse to visualize, so I'll take another look. Yes, I'm concerned that the duplicates may be legitimate, so I want to be certain before filtering anything. I will look into AGAT again to see whether it highlights any issues.