AFAIK, CD-HIT requires sorting by length before performing the clustering step. Am I right? If yes, please read on. If not, then there is no question :)
I used CD-HIT clustering to remove 100% identical sequences, retaining just one representative of each, and then proceeded with an all-by-all BLASTp. The final file is large (~170 GB).
But I just remembered I did not sort by length initially before the CD-HIT 100% nr step!
With that as context, I have a few questions:
1. A while back, I remember sorting my input by length before the CD-HIT step itself. But now I can't remember whether it was a Perl script, some other executable inside CD-HIT, or something from elsewhere. Can someone help?
2. What happens to the validity of my results if my clustering (at 100% identity) was performed without sorting by length?
I do not mind having a few additional sequences that should not have been there for the BLASTp step, BUT it would be a problem if sequences were removed that should have been retained in the results file (used for BLASTp).
3. Does anyone have advice based on theory or practice (apart from repeating it afresh, ha)? Thanks!
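To make question 1 concrete, here is a minimal Python sketch of the kind of length sort I mean. It is not the script I used before (I can't recall that one) and it is not part of CD-HIT. The function names `read_fasta`, `sort_by_length`, and `write_fasta` are my own, and it assumes the whole FASTA file fits in memory.

```python
import io


def read_fasta(handle):
    """Yield (header, sequence) tuples from an open FASTA file handle."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            # A new record starts: emit the previous one, if any.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif line and header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def sort_by_length(records):
    """Return records sorted longest first; ties keep their input order."""
    return sorted(records, key=lambda rec: len(rec[1]), reverse=True)


def write_fasta(records, handle, width=60):
    """Write (header, sequence) tuples as FASTA, wrapping lines at `width`."""
    for header, seq in records:
        handle.write(f">{header}\n")
        for i in range(0, len(seq), width):
            handle.write(seq[i:i + width] + "\n")


# Tiny in-memory demo (hypothetical sequences, not my real data).
demo = ">a\nMKT\n>b\nMKTAYIAK\nQR\n>c\nMKTAY\n"
sorted_records = sort_by_length(read_fasta(io.StringIO(demo)))
out = io.StringIO()
write_fasta(sorted_records, out)
print(out.getvalue())
```

For a real file, the same functions could be called with `open("input.fasta")` and `open("sorted.fasta", "w")` in place of the `io.StringIO` objects (both file names are placeholders).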
Happy New Year 2019! :)