Question: Cluster plasmids of (very) different lengths
predeus (Russia) wrote, 13 months ago:

Hello all,

I need to cluster a large compendium of plasmids. They differ considerably in size and are not necessarily rotated the same way (that is, many sequences are identical but do not start at the same point).

What I want is to cluster them according to CUMULATIVE sequence homology, i.e. not the best blast hit, but rather the overall coverage of homologous regions on both sequences.

It clearly has to be a local-type algorithm (because of the substantial difference in lengths), but the local mode of cd-hit-est does not perform well because of the rotation issue (see above). Blastclust did even worse (I didn't dig too deeply into why, which is hard to do given its very terse output).

Thus, I would appreciate any suggestions. Thank you in advance!

blast sequence clustering • 281 views

Try Mash distances? That might give you the resolution you need to cluster them, and AFAIK Mash is insensitive to reverse complementation, since a k-mer and its reverse complement are treated as the same k-mer.

That might be good enough for a first pass?
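To illustrate why a k-mer approach copes with both rotation and strand, here is a minimal Python sketch. It is not Mash itself: it compares full k-mer sets rather than MinHash sketches, and the `circular` flag, the `containment` helper, and the toy sequence are assumptions added for illustration. The distance formula is the one Mash uses to convert a Jaccard index into a distance. Containment is included because plain Jaccard drops quickly when one plasmid is much shorter than the other.

```python
import math

COMPLEMENT = str.maketrans("ACGT", "TGCA")

def revcomp(seq):
    """Reverse complement of an uppercase DNA string."""
    return seq.translate(COMPLEMENT)[::-1]

def canonical_kmers(seq, k, circular=True):
    """Set of canonical k-mers: each k-mer is replaced by the
    lexicographically smaller of itself and its reverse complement.
    circular=True wraps around the end (an assumption of this sketch;
    Mash treats input sequences as linear)."""
    seq = seq.upper()
    if circular:
        seq = seq + seq[:k - 1]
    kmers = set()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        kmers.add(min(kmer, revcomp(kmer)))
    return kmers

def jaccard(a, b):
    """Shared k-mers divided by all k-mers seen in either set."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def containment(a, b):
    """Shared k-mers divided by the size of the smaller set."""
    smaller = min(len(a), len(b))
    return len(a & b) / smaller if smaller else 0.0

def mash_distance(j, k):
    """Mash distance from a Jaccard index j and k-mer size k."""
    if j <= 0.0:
        return 1.0
    if j >= 1.0:
        return 0.0
    return -math.log(2 * j / (1 + j)) / k

# Toy example (made-up sequence, unrealistically small k).
k = 5
plasmid = "ATGCGTACGTTAGCCGATAGGCTTAACG"
rotated = plasmid[10:] + plasmid[:10]   # same circle, different start
flipped = revcomp(plasmid)              # same circle, other strand
fragment = plasmid[:12]                 # much shorter sequence

ref = canonical_kmers(plasmid, k)
j_rot = jaccard(ref, canonical_kmers(rotated, k))
j_rc = jaccard(ref, canonical_kmers(flipped, k))
frag = canonical_kmers(fragment, k, circular=False)
print(j_rot, j_rc, mash_distance(j_rot, k))
print(jaccard(ref, frag), containment(ref, frag))
```

With real plasmids you would use a much larger k and let Mash do the sketching; the point here is only that the k-mer set does not depend on where the sequence starts or which strand was deposited.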

Reply written 13 months ago by Joe

Excellent, thank you. Definitely worth a try.

Reply written 13 months ago by predeus

Guess what: the newest NAR issue has a paper about a plasmid database that uses a Mash-based strategy to cluster them. Talk about timing :)
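For anyone wanting to go from pairwise distances to clusters, one simple option is single-linkage grouping under a distance cutoff. The sketch below is an assumption for illustration and not necessarily what that paper does: the plasmid names, distance values, and the 0.05 threshold are all made up, and the input is a plain dict of pairwise distances such as you could build from a distance table.

```python
def cluster_by_threshold(names, distances, threshold):
    """Single-linkage clustering: two sequences end up in the same
    cluster if they are connected by a chain of pairs whose distance
    is at most `threshold`. Implemented with a small union-find."""
    parent = {name: name for name in names}

    def find(x):
        # Follow parent links to the root, shortening the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for (a, b), dist in distances.items():
        if dist <= threshold:
            root_a, root_b = find(a), find(b)
            if root_a != root_b:
                parent[root_b] = root_a

    clusters = {}
    for name in names:
        clusters.setdefault(find(name), []).append(name)
    return sorted(sorted(members) for members in clusters.values())

# Hypothetical names and distances, for illustration only.
names = ["pA", "pB", "pC", "pD"]
distances = {
    ("pA", "pB"): 0.01,
    ("pA", "pC"): 0.30,
    ("pB", "pC"): 0.28,
    ("pA", "pD"): 0.40,
    ("pB", "pD"): 0.40,
    ("pC", "pD"): 0.02,
}
clusters = cluster_by_threshold(names, distances, threshold=0.05)
print(clusters)
```

Single linkage can chain loosely related plasmids together, so the cutoff needs checking against a few known groups before trusting the result.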

Reply written 13 months ago by predeus