3 months ago
Maksim
Hey everyone! I was using Prokka recently to find unique genes for several different species from the same genus. I extracted sequences from the NCBI database regardless of the assembly level. Now I've started to think that I'd be better off using complete genomes instead of scaffolds and contigs, since for each incomplete assembly I'll get fewer genes and may not be able to compare gene presence/absence tables to extract the core/pan genome. Does it really matter what assembly level I use?
Complete genomes will be the 'cleanest' but this is often not an option for many organisms.
For decent quality genomes that are incomplete, you should still have a number of long contigs such that for core/pangenome work it likely makes very little difference.
Where it will probably crop up is if a gene happens to span a contig break (perhaps because it is highly repetitive, for example). Such genes may not appear in the final counts, causing you to slightly under- or overestimate the 'core-ness' or 'accessory-ness' of some genes, but if you're testing many genomes it likely isn't going to hugely alter your conclusions.
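To make that concrete, here is a minimal sketch of how a presence/absence table might be turned into core/accessory labels while tolerating a gene that is missing from the odd draft assembly. The function name `classify_genes`, the `core_fraction` threshold and the toy genome/gene names are all made up for illustration; this is not what Prokka or any particular pangenome tool does internally.

```python
from collections import Counter

def classify_genes(presence, core_fraction=0.95):
    """Label gene families by how many genomes carry them.

    presence: dict mapping genome name -> set of gene family names.
    core_fraction: hypothetical cutoff for calling a gene 'soft-core'.
    """
    n_genomes = len(presence)
    # Count in how many genomes each gene family was annotated.
    counts = Counter(gene for genes in presence.values() for gene in genes)
    labels = {}
    for gene, count in counts.items():
        if count == n_genomes:
            labels[gene] = "core"
        elif count / n_genomes >= core_fraction:
            # Present in nearly all genomes; may simply be lost at a
            # contig break in a draft assembly.
            labels[gene] = "soft-core"
        elif count == 1:
            labels[gene] = "unique"
        else:
            labels[gene] = "accessory"
    return labels

# Toy data: geneB is missing only from the draft assembly.
presence = {
    "complete_1": {"geneA", "geneB", "geneC", "geneD"},
    "complete_2": {"geneA", "geneB", "geneD"},
    "complete_3": {"geneA", "geneB"},
    "draft_1": {"geneA"},
}
labels = classify_genes(presence, core_fraction=0.75)
```

With a strict 100% definition, geneB would fall out of the core because of one draft genome; a relaxed threshold keeps it as 'soft-core', which is one common way to soften the effect of fragmented assemblies.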
If you have a particular interest in certain genes where it could matter more, you can interrogate this further by hand, or stick to only complete genomes for certain types of question.
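As a starting point for checking by hand, you could flag annotated genes that sit right at the end of a contig, since those are the ones most likely to be truncated. The sketch below assumes you have already pulled gene coordinates (1-based, inclusive, as in GFF3) and contig lengths out of your annotation; the function name `genes_near_contig_edges`, the `margin` value and the example records are hypothetical.

```python
def genes_near_contig_edges(genes, contig_lengths, margin=100):
    """Return IDs of genes within `margin` bases of either contig end.

    genes: iterable of (gene_id, contig, start, end), 1-based inclusive.
    contig_lengths: dict mapping contig name -> length in bases.
    """
    flagged = []
    for gene_id, contig, start, end in genes:
        length = contig_lengths[contig]
        # Gene starts in the first `margin` bases or ends in the last ones.
        if start <= margin or end > length - margin:
            flagged.append(gene_id)
    return flagged

# Hypothetical example records for a single 5 kb contig.
contig_lengths = {"contig_1": 5000}
genes = [
    ("gene1", "contig_1", 50, 900),     # close to the start
    ("gene2", "contig_1", 2000, 3000),  # well inside the contig
    ("gene3", "contig_1", 4200, 4990),  # close to the end
]
flagged = genes_near_contig_edges(genes, contig_lengths)
```

Genes flagged this way are only candidates for a closer look, not proof that the annotation is wrong.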
It really all depends on what you hope to achieve/answer.