I have tried out purge_dups and found the results puzzling.
First, there seems to be a silent bug in how contig FASTA headers are handled: contigs with complex headers are ignored (symptom: the output BED files are empty).
Solution: use simple contig FASTA headers, like sample_contig1 and not sample-complex-name_x1-_contig-1.
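For anyone hitting the same symptom, here is a minimal sketch of how the headers could be simplified before running purge_dups. I do not know exactly which characters trigger the problem, so this assumes that restricting names to letters, digits and underscores is enough. The function name simplify_fasta_headers is made up for illustration and is not part of purge_dups.

```python
import re


def simplify_fasta_headers(lines):
    """Rewrite FASTA headers to simple names (letters, digits, underscores).

    Assumption: any run of other characters is safe to replace with "_".
    Returns (new_lines, mapping), where mapping is old name -> new name.
    """
    mapping = {}
    seen = set()
    out = []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            # Keep only the first whitespace-delimited token of the header.
            fields = line[1:].split()
            old = fields[0] if fields else ""
            # Collapse every run of non-alphanumeric characters into one "_".
            new = re.sub(r"[^A-Za-z0-9]+", "_", old).strip("_") or "contig"
            # Make sure two different old names never collapse to the same new name.
            base, n = new, 1
            while new in seen:
                n += 1
                new = f"{base}_{n}"
            seen.add(new)
            mapping[old] = new
            out.append(">" + new)
        else:
            # Sequence lines are passed through unchanged.
            out.append(line)
    return out, mapping
```

To use it, read the assembly with open(path) and pass the file object as lines, write the returned lines to a new FASTA, and keep the mapping so the original contig names can be restored afterwards.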
I analyzed plant genomes, which are repetitive, and found that >80% of the contigs would be removed (700 Mb reduced to 60 Mb). This is surely too much.
I am using ONT rather than PacBio HiFi reads for my assemblies, so this could be part of the problem.
Has anyone optimized purge_dups for either plant genomes or Nanopore reads? It has over 850 citations and has been widely used on plant genomes before, yet there is no recommended parameter set for plants.
Thanks