Question: Clustering FASTA sequences (GSS, EST, transcripts)
7 weeks ago, Annie (India, ICGEB) wrote:

I want to cluster (k-means) and remove redundancy from my FASTA sequences, which are mainly GSS, EST and assembled transcripts, to create a reference set for my short query sequences. The short query sequences can target either DNA or RNA, so I need some expert guidance. Also, should I convert lower-case bases to upper case before doing this? Any suggestions would be highly appreciated.

Modified 4 weeks ago by Biostar ♦♦ 20 • written 7 weeks ago by Annie

You should also look at CD-HIT, which is tailored to this kind of application and has dedicated subprograms (for example, cd-hit-est for nucleotide sequences).

Written 7 weeks ago by genomax (71k)

Thanks for your answer, genomax, but I have found uclust to be better than CD-HIT.

Written 7 weeks ago by Annie

You might look at dedupe.sh from BBTools. https://jgi.doe.gov/data-and-tools/bbtools/bb-tools-user-guide/dedupe-guide/

Written 7 weeks ago by jean.elbers (1.3k)
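
On the upper/lower case part of the question, a minimal sketch may help show why case matters for redundancy removal: a plain string comparison treats "acgt" and "ACGT" as different sequences, so normalizing case first is the safe default for exact-duplicate removal. The Python below is only an illustration and is not how CD-HIT, uclust or dedupe.sh work internally; the function names (parse_fasta, reverse_complement, dedupe_exact) are hypothetical, and it removes exact duplicates only, not near-identical sequences. Some tools may interpret lower case as soft-masking, so check the documentation of whichever clustering tool you choose.

```python
from typing import Iterable, Iterator, List, Tuple


def parse_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


# Only A, C, G, T and N are complemented; other IUPAC codes pass through unchanged.
_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq: str) -> str:
    """Reverse-complement an upper-case nucleotide sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def dedupe_exact(
    records: Iterable[Tuple[str, str]], both_strands: bool = True
) -> List[Tuple[str, str]]:
    """Upper-case every sequence and keep the first copy of each exact duplicate.

    With both_strands=True, a sequence and its reverse complement count as
    the same sequence (an assumption that suits mixed GSS/EST/transcript data).
    """
    seen = set()
    kept = []
    for header, seq in records:
        seq = seq.upper()
        # Use a strand-independent key when both strands are considered.
        key = min(seq, reverse_complement(seq)) if both_strands else seq
        if key in seen:
            continue
        seen.add(key)
        kept.append((header, seq))
    return kept


if __name__ == "__main__":
    example = [">a", "acgt", "AA", ">b", "ACGTAA", ">c", "TTACGT", ">d", "GGGG"]
    for header, seq in dedupe_exact(parse_fasta(example)):
        print(f">{header}\n{seq}")
```

For real data, pass an open file handle to parse_fasta in place of the example list. Similarity-based clustering (for instance at 95% identity) still needs a dedicated tool such as those suggested in the replies above.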