Question

CD-HIT: representative sequence vs consensus sequence

0

3.3 years ago

rob_DNA ▴ 20

Hi,

With the CD-HIT command cd-hit-est it is possible to cluster sequences. For each cluster, one "representative" sequence is reported, as stated on the CD-HIT website:

... and produces a set of 'non-redundant' (nr) representative sequences as output.

Is such a nr representative sequence the same as a consensus sequence in CD-HIT? I want to use cd-hit-est to cluster Nanopore amplicon sequence data.

The NCBI website "https://www.ncbi.nlm.nih.gov/mesh?Db=mesh&Cmd=DetailsSearch&Term=%22Consensus+Sequence%22%5BMeSH+Terms%5D" calls a consensus sequence a representative sequence. However, I'd like to know if CD-HIT also defines a representative sequence as a consensus sequence.

Any ideas? Thank you.

CD-HIT clustering • 2.7k views

ADD COMMENT • link 3.3 years ago by rob_DNA ▴ 20

2

To the best of my knowledge, a representative sequence is not the same as a consensus (in the world of CD-HIT at least).

A representative sequence is an actual sequence from that cluster, and every other sequence in the cluster is within the chosen similarity threshold of it. I forget how CD-HIT chooses its representatives (it might just be the longest sequence, or the first one in input order).
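To make the distinction concrete, here is a minimal sketch rather than anything CD-HIT itself runs. It assumes three made-up, already aligned reads of equal length; `majority_consensus` and `pick_representative` are hypothetical helper names. The consensus is built column by column and need not match any real read, while the representative is always one of the input reads.

```python
from collections import Counter

def majority_consensus(aligned):
    """Per-column majority vote over equal-length, pre-aligned sequences."""
    return "".join(Counter(col).most_common(1)[0][0] for col in zip(*aligned))

def pick_representative(seqs):
    """Longest sequence; on ties, max() keeps the earliest one in input order."""
    return max(seqs, key=len)

# Hypothetical reads, each carrying one private mismatch.
reads = ["AAGTAC", "ACCTAC", "ACGTTC"]

consensus = majority_consensus(reads)
representative = pick_representative(reads)

print("consensus:     ", consensus, "(in input?", consensus in reads, ")")
print("representative:", representative, "(in input?", representative in reads, ")")
```

With these toy reads the consensus is a sequence that was never observed, which is the practical difference to keep in mind for noisy Nanopore amplicons.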

ADD REPLY • link 3.3 years ago by Joe 21k

4

As CD-HIT sorts and then processes the sequences from longest to shortest, it is both: each cluster's representative sequence is the longest in that cluster, and also the first to enter it.

ADD REPLY • link 3.3 years ago by h.mon 35k

0

@Joe and @h.mon. Thank you for the valuable information. So if I understand it correctly: CD-HIT-EST first sorts all sequences by length. Then the first (and thus longest) sequence in that sorted list becomes the representative sequence of cluster 1. Then the 2nd sequence is evaluated. If this 2nd sequence meets the specified sequence identity threshold (parameter "-c") when compared to the representative, it is assigned to cluster 1. If it falls below that threshold, it becomes the representative sequence of cluster 2? And so on for the third, fourth, ..., n-th sequence?
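The procedure described above can be sketched in a few lines. This is only an illustration of the greedy idea, not CD-HIT's implementation: `toy_identity` is a hypothetical stand-in that counts matching positions without any alignment or gaps, whereas CD-HIT uses word filtering and real alignments. The sequences and the 0.9 threshold are made up.

```python
def toy_identity(a, b):
    """Stand-in identity: matching positions divided by the shorter length (no gaps)."""
    n = min(len(a), len(b))
    if n == 0:
        return 0.0
    return sum(x == y for x, y in zip(a, b)) / n

def greedy_cluster(seqs, threshold=0.9):
    """Greedy incremental clustering: longest first, join the first matching cluster."""
    # sorted() is stable, so equal-length sequences keep their input order.
    ordered = sorted(seqs, key=len, reverse=True)
    clusters = []  # list of (representative, members)
    for seq in ordered:
        for rep, members in clusters:
            if toy_identity(rep, seq) >= threshold:
                members.append(seq)
                break
        else:
            # No representative was similar enough: seq starts a new cluster.
            clusters.append((seq, [seq]))
    return clusters

seqs = ["ACGTACGTAC", "ACGTACGTAA", "TTTTGGGGCC", "ACGTACGT"]
for i, (rep, members) in enumerate(greedy_cluster(seqs, threshold=0.9)):
    print(f"cluster {i}: representative={rep} members={members}")
```

Note that each representative is both the longest member of its cluster and the first sequence that entered it, simply because of the processing order.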

ADD REPLY • link 3.2 years ago by rob_DNA ▴ 20

0

That would be my assumption, yep :)

ADD REPLY • link 3.2 years ago by Joe 21k

0

In the fast mode - which is the default - yes, that is precisely what is being done. In accurate mode, a sequence is compared to all representative sequences, and is added to the most similar one. From the wiki:

In default manner (fast mode), a query is grouped into the first representative without comparing to other representatives. In accurate mode, a query is compared to all representatives and grouped to the most similar one.
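As a rough illustration of the difference between the two modes (which, as far as I recall, CD-HIT switches with its `-g` option), here is a small sketch. `toy_identity` and `assign` are hypothetical helpers, the identity measure is a gap-free stand-in rather than CD-HIT's alignment, and the sequences and the 0.8 threshold are invented for the example.

```python
def toy_identity(a, b):
    """Stand-in identity: matching positions divided by the shorter length (no gaps)."""
    n = min(len(a), len(b))
    if n == 0:
        return 0.0
    return sum(x == y for x, y in zip(a, b)) / n

def assign(query, reps, threshold, accurate=False):
    """Return the index of the representative the query joins, or None."""
    best_idx, best_ident = None, 0.0
    for i, rep in enumerate(reps):
        ident = toy_identity(rep, query)
        if ident < threshold:
            continue
        if not accurate:
            return i  # fast mode: stop at the first representative that passes
        if ident > best_ident:
            best_idx, best_ident = i, ident  # accurate mode: keep the best match
    return best_idx

# Two representatives that are only 70% identical to each other,
# and a query that passes the 0.8 threshold against both.
reps = ["ACGTACGTAC", "ACGTACGGTT"]
query = "ACGTACGGTC"

print("fast mode     ->", assign(query, reps, 0.8))
print("accurate mode ->", assign(query, reps, 0.8, accurate=True))
```

In this toy case the fast mode places the query with the first representative (80% identity), while the accurate mode places it with the second one (90% identity), so the cluster membership can differ between modes even though the set of rules for picking representatives is the same.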

You may check the wiki if you have further questions; it has a lot of information.

ADD REPLY • link 3.2 years ago by h.mon 35k