Arbitrary sequence-identity cutoffs are used to group short reads together and assign them to different taxonomic units, e.g. 97% for species, 95% for genus, and so on. There are different tools for such clustering: usearch (for open-reference clustering) and CD-HIT (for closed-reference clustering).
I wonder why this step is important. Why not map these reads directly to a database and see what they are (given that those fragments are not too rare in the cohort)?
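
To make the two options concrete, here is a toy sketch of what I mean by each. It is not how usearch or CD-HIT actually work internally: I am assuming equal-length, already-aligned reads and a simple position-wise identity, and the names `identity`, `greedy_cluster`, `map_to_reference`, and `taxon_A` are made up for illustration.

```python
from typing import Dict, List, Optional


def identity(a: str, b: str) -> float:
    """Fraction of matching positions (assumes equal-length, pre-aligned reads)."""
    if len(a) != len(b):
        raise ValueError("this toy identity only handles equal-length sequences")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def greedy_cluster(reads: List[str], threshold: float = 0.97) -> List[List[str]]:
    """Option 1: cluster reads against each other, no database involved.

    Each read joins the first existing centroid it matches at or above the
    threshold; otherwise it becomes the centroid of a new cluster.
    """
    centroids: List[str] = []
    clusters: List[List[str]] = []
    for read in reads:
        for i, centroid in enumerate(centroids):
            if identity(read, centroid) >= threshold:
                clusters[i].append(read)
                break
        else:
            centroids.append(read)
            clusters.append([read])
    return clusters


def map_to_reference(
    reads: List[str], reference: Dict[str, str], threshold: float = 0.97
) -> Dict[str, Optional[str]]:
    """Option 2: map each read directly to its best hit in a reference database.

    Reads whose best hit falls below the threshold are left unassigned (None).
    """
    assignments: Dict[str, Optional[str]] = {}
    for read in reads:
        best_name: Optional[str] = None
        best_score = 0.0
        for name, ref_seq in reference.items():
            score = identity(read, ref_seq)
            if score > best_score:
                best_name, best_score = name, score
        assignments[read] = best_name if best_score >= threshold else None
    return assignments


def mutate(seq: str, positions) -> str:
    """Substitute a different base at each given position (toy data helper)."""
    swap = {"A": "C", "C": "G", "G": "T", "T": "A"}
    chars = list(seq)
    for p in positions:
        chars[p] = swap[chars[p]]
    return "".join(chars)


# Toy data: a 100-nt reference, a read 2 mismatches away from it,
# and a read 20 mismatches away that has no close reference.
base = "ACGT" * 25
near = mutate(base, [0, 50])
novel = mutate(base, range(0, 100, 5))

reads = [base, near, novel]
reference = {"taxon_A": base}  # hypothetical one-entry database

clusters = greedy_cluster(reads)
assignments = map_to_reference(reads, reference)

print([len(c) for c in clusters])
print([assignments[r] for r in reads])
```

In this toy case the read without a close reference still ends up in its own cluster under option 1, while under option 2 it is simply left unassigned.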