Question: CD-HIT results without sorting by length
4 months ago, Anand Rao (United States) wrote:


AFAIK, CD-HIT requires sorting by length before performing the clustering step. Am I right? If yes, please read on. If not, then there is no question :)

I used CD-HIT clustering to collapse 100% identical sequences, retaining just one representative of each, and then proceeded with an all-by-all BLASTp. The final file is large, ~170 GB.

But I just remembered I did not sort by length initially before the CD-HIT 100% nr step!

With that as context, I have a few questions:

1. A while back, I remember sorting my input by length before the CD-HIT step itself. But now I can't remember whether it was a Perl script, some other executable inside CD-HIT, or something from elsewhere. Can someone help?

2. What happens to the validity of my results if my clustering (at 100% identity) was performed without sorting by length?

I do not mind having a few additional sequences that should not have been there for the BLASTp step, BUT it would be a problem if sequences were removed that should have been retained in the results file (used for BLASTp).

3. Does anyone have advice based on theory or practice (apart from repeating it afresh, ha)? Thanks!

Happy New Year 2019! :)


AFAIK, cd-hit will sort the input itself, so no need to do it yourself prior to running cd-hit.
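
For anyone who still wants a length-sorted FASTA for some other tool, here is a minimal sketch in plain Python. It is not a CD-HIT utility and is not needed before running cd-hit; the function names `read_fasta` and `sort_by_length` are made up for illustration, and the sketch assumes all records fit in memory.

```python
from typing import Iterable, Iterator, List, Tuple

Record = Tuple[str, str]  # (header without ">", sequence)


def read_fasta(lines: Iterable[str]) -> Iterator[Record]:
    """Yield (header, sequence) pairs from FASTA-formatted lines."""
    header = None
    chunks: List[str] = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)  # sequence may span several lines
    if header is not None:
        yield header, "".join(chunks)


def sort_by_length(records: Iterable[Record]) -> List[Record]:
    """Return records longest first; ties keep their input order."""
    return sorted(records, key=lambda rec: len(rec[1]), reverse=True)


if __name__ == "__main__":
    example = [">a", "MKT", ">b", "MKTAYIAK", "QR", ">c", "MKTAY"]
    for header, seq in sort_by_length(read_fasta(example)):
        print(f">{header}\n{seq}")
```

With a real file, the same functions can be fed an open file handle, e.g. `sort_by_length(read_fasta(open("input.fasta")))`, where `input.fasta` is a placeholder name.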

Unless there are other reasons to cluster them first, I would not go to that effort myself and would proceed directly to running the BLAST. In that context I also advise you to request the tabular output (unless you are already doing so), which saves quite some space in the output file.
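
In BLAST+, tabular output is requested with `-outfmt 6`, and its 12 default columns are qseqid, sseqid, pident, length, mismatch, gapopen, qstart, qend, sstart, send, evalue and bitscore. A rough sketch of reading such a file in Python, assuming those default columns and dropping the self-hits that an all-by-all search produces, could look like this (`BlastHit` and `parse_hits` are hypothetical names, not part of BLAST):

```python
from typing import Iterable, Iterator, NamedTuple


class BlastHit(NamedTuple):
    qseqid: str
    sseqid: str
    pident: float
    length: int
    evalue: float
    bitscore: float


def parse_hits(lines: Iterable[str], skip_self: bool = True) -> Iterator[BlastHit]:
    """Parse default 12-column tabular BLAST lines into BlastHit tuples."""
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blanks and comment lines
        fields = line.split("\t")
        if len(fields) < 12:
            raise ValueError(f"expected 12 tab-separated columns, got {len(fields)}")
        hit = BlastHit(
            qseqid=fields[0],
            sseqid=fields[1],
            pident=float(fields[2]),
            length=int(fields[3]),
            evalue=float(fields[10]),
            bitscore=float(fields[11]),
        )
        if skip_self and hit.qseqid == hit.sseqid:
            continue  # a sequence matching itself carries no information here
        yield hit
```

Because the function is a generator, it can stream a very large results file line by line instead of loading it all at once.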

written 4 months ago by lieven.sterck

Thank you! It explains why I could not remember the sorting step clearly, or find any utility that performs it explicitly. HNY! Cheers!

written 4 months ago by Anand Rao

lieven.sterck: Apologies for hijacking this thread for a minute. You have been promoted to moderator on Biostars. Please join the Biostars Slack channel as described here: Inviting NEW Biostars moderators to join Biostars slack channel

written 4 months ago by genomax

CD-HIT requires no manual sorting.

written 4 months ago by jrj.healey

Thanks for confirming. Cheers and HNY!

written 4 months ago by Anand Rao