I created two genome assemblies of a plant from short-read data (Illumina PE). I used MEGAHIT in both cases with the same configuration. The only difference was that in one case I used the full 50x sequencing depth, and in the other I subsampled the reads to 20x.
To my surprise, the 20x assembly ended up with slightly better stats: N50, total assembly size, and BUSCO score.
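To be clear about what is being compared, here is a minimal sketch of how N50 and total assembly size can be computed from a contig FASTA file. The function names `contig_lengths` and `n50` and the file paths in the usage comment are hypothetical, chosen for illustration only; this is not the exact tool I used to get the stats.

```python
def contig_lengths(fasta_path):
    """Return the length of every sequence in a FASTA file."""
    lengths = []
    current = None  # length of the record being read, None before the first header
    with open(fasta_path) as handle:
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                # A header closes the previous record and opens a new one
                if current is not None:
                    lengths.append(current)
                current = 0
            elif line and current is not None:
                current += len(line)
    if current is not None:
        lengths.append(current)
    return lengths


def n50(lengths):
    """Return the N50: the largest contig length L such that contigs of
    length >= L together cover at least half of the total assembly size."""
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length
    return 0  # empty input


# Hypothetical usage; the paths are placeholders:
# for path in ("assembly_50x/contigs.fa", "assembly_20x/contigs.fa"):
#     lengths = contig_lengths(path)
#     print(path, "total size:", sum(lengths), "N50:", n50(lengths))
```

Note that N50 depends on which contigs are included, so a minimum contig length cutoff should be the same for both assemblies when comparing them.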
Can anyone suggest reasons why this could happen? My understanding had always been that additional sequencing data can't harm the results, especially at relatively low depth. I've found this paper, in which a similar phenomenon was observed for bacterial genomes, but the reasons are not explained or discussed. Any ideas or relevant literature?