Hello everyone!
I need some advice regarding the final step of my D. suzukii assembly from long PacBio reads: the polishing step. First, let me explain how I obtained the file I am working on.
I made two different assemblies using different assemblers: Falcon and Canu. I assessed and compared these assemblies using quast for the classic assembly metrics, and busco2 to assess gene content (using the Arthropoda and Diptera sets). I also evaluated gene content using some custom scripts that look for particular genes of interest.
The two assemblies were really different in terms of metrics and gene content, and I was not satisfied with either one. I then used Mahul Chakraborty's tool, quickmerge (on GitHub: https://github.com/mahulchak/quickmerge ). This tool created a merged assembly that combines the advantages of both assemblies. My BUSCO results were much better than the previous ones. The assembly was also more contiguous, with a greater N50 and far fewer contigs.
As a reminder, or for those who don't know BUSCO: it is a tool that searches your assembly for genes that are shared across species (for Arthropoda, genes that are orthologous across the Arthropoda clade, and so on) and are expected to be present in single copy. The genes are then categorized as follows: S: Single, D: Duplicated, F: Fragmented, M: Missing. There are 800 genes assessed for Arthropoda, and 2800 genes assessed for Diptera.
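For anyone who wants to compare such results programmatically, here is a minimal sketch of how a BUSCO one-line summary could be parsed. It assumes the usual `C:..%[S:..%,D:..%],F:..%,M:..%,n:..` notation; the function name `parse_busco_summary` is hypothetical, and the example line is reconstructed from the numbers in this post (with the set size I quoted above), not copied from an actual BUSCO output file.

```python
import re

# Assumed layout of a BUSCO one-line summary (C = complete = S + D).
SUMMARY_RE = re.compile(
    r"C:(?P<C>[\d.]+)%\[S:(?P<S>[\d.]+)%,D:(?P<D>[\d.]+)%\],"
    r"F:(?P<F>[\d.]+)%,M:(?P<M>[\d.]+)%,n:(?P<n>\d+)"
)


def parse_busco_summary(line):
    """Return the percentages and gene count found in a summary line."""
    match = SUMMARY_RE.search(line)
    if match is None:
        raise ValueError(f"not a BUSCO summary line: {line!r}")
    result = {key: float(match.group(key)) for key in "CSDFM"}
    result["n"] = int(match.group("n"))
    return result


# Illustrative line only, rebuilt from the unpolished Arthropoda figures.
example = "C:96.9%[S:91.0%,D:5.9%],F:2.2%,M:0.9%,n:800"
print(parse_busco_summary(example))
```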
My data come from a highly polymorphic species, and I always tend to get high Duplicated scores. I'm not really worried about that. What I absolutely want to reduce is the number of fragmented or missing genes.
Then I tried polishing with two different coverages, 40X and 80X, and assessed each result with BUSCO. The results are somewhat confusing to me, and I need an expert eye on them. Here are my BUSCO results depending on coverage:
Non polished assembly :
Arthropoda : S : 91% , D : 5.9% , F : 2.2% , M : 0.9%
Diptera : S : 87.1%, D : 5.1%, F : 5.1%, M : 2.7%
40X polished assembly :
Arthropoda : S : 89.1% , D : 9.4%, F : 0.9%, M : 0.9%
Diptera : S : 86.9% , D : 8.4%, F : 2.8%, M : 1.9%
80X polished assembly :
Arthropoda : S : 86.5%, D : 11.4%, F : 0.9%, M : 1.2%
Diptera : S : 84.6%, D : 11%, F : 2.6%, M : 1.8%
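To make the comparison easier, the percentages above can be tabulated and reduced to two figures per run: complete (S + D) and fragmented plus missing (F + M). This is only a small sketch; the dictionary layout and the `summarize` helper are my own naming, and the values are simply copied from the results above.

```python
# BUSCO percentages copied from the results above, as (S, D, F, M).
RESULTS = {
    "Arthropoda": {
        "unpolished": (91.0, 5.9, 2.2, 0.9),
        "40X": (89.1, 9.4, 0.9, 0.9),
        "80X": (86.5, 11.4, 0.9, 1.2),
    },
    "Diptera": {
        "unpolished": (87.1, 5.1, 5.1, 2.7),
        "40X": (86.9, 8.4, 2.8, 1.9),
        "80X": (84.6, 11.0, 2.6, 1.8),
    },
}


def summarize(single, duplicated, fragmented, missing):
    """Collapse the four categories into complete and fragmented-or-missing."""
    return {
        "complete": round(single + duplicated, 1),
        "fragmented_or_missing": round(fragmented + missing, 1),
    }


for lineage, runs in RESULTS.items():
    for run, values in runs.items():
        summary = summarize(*values)
        print(
            f"{lineage:<10} {run:<10} "
            f"complete={summary['complete']:.1f}%  "
            f"F+M={summary['fragmented_or_missing']:.1f}%"
        )
```

With these numbers, F + M for Arthropoda goes from 3.1% (unpolished) to 1.8% (40X) to 2.1% (80X), while for Diptera it goes from 7.8% to 4.7% to 4.4%.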
So, I am not that surprised that the more we polish, the more duplicated genes we get. My final assembly size is 280 Mb, but the genome size estimated by flow cytometry is 250 Mb, so I was expecting some polymorphic regions to be duplicated. What surprises me, and what I don't understand, is the variation in fragmented and missing genes. I was expecting that the more reads I used, the fewer fragmented and missing genes I would get. This holds for the Diptera set, but not for Arthropoda. Doubling the coverage slightly increased the number of missing genes for Arthropoda (0.9% to 1.2%) but not for Diptera, while the duplicated genes kept increasing markedly in both sets.
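To put these figures in perspective, here is a rough back-of-the-envelope sketch of the assembly size excess and of what the change in Arthropoda missing genes means in absolute terms. It takes the set size I quoted above (800 genes) at face value, so the gene counts are approximate; the helper names are hypothetical.

```python
ASSEMBLY_SIZE_MB = 280      # final merged assembly
ESTIMATED_GENOME_MB = 250   # flow cytometry estimate
N_ARTHROPODA = 800          # set size as quoted in this post (approximate)


def excess_percent(assembly_mb, estimated_mb):
    """Assembly size in excess of the estimate, as a percentage of the estimate."""
    return (assembly_mb - estimated_mb) / estimated_mb * 100


def percent_to_genes(percent, n_genes):
    """Convert a BUSCO percentage into an approximate number of genes."""
    return percent / 100 * n_genes


excess = excess_percent(ASSEMBLY_SIZE_MB, ESTIMATED_GENOME_MB)
missing_40x = percent_to_genes(0.9, N_ARTHROPODA)
missing_80x = percent_to_genes(1.2, N_ARTHROPODA)

print(f"Assembly excess over estimate: {excess:.1f}%")
print(f"Arthropoda missing, 40X: ~{missing_40x:.1f} genes")
print(f"Arthropoda missing, 80X: ~{missing_80x:.1f} genes")
print(f"Difference: ~{missing_80x - missing_40x:.1f} genes")
```

Under these assumptions the assembly is about 12% larger than the estimate, and the increase in missing Arthropoda genes between 40X and 80X corresponds to only two or three genes.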
I am confused now, because I find the BUSCO results from the 40X polishing better for Arthropoda, but those from the 80X polishing better for Diptera. My interpretation is that the polishing "revealed" our true level of duplication, which reflects a high level of polymorphism. I think we lose a few genes in the Arthropoda set because the sequences may have diverged a lot, so BUSCO can no longer recognize some of them.
I know this is a bit long to read, but I really need an outside point of view. Has anyone already assembled a highly polymorphic species? Should I keep the 40X polishing or the 80X polishing? Or maybe continue polishing with even higher coverage? Any recommendations or criticism about the pipeline I used (merging two different assemblies, for example)?
Thanks for reading this far!