Clustering FASTA sequences (GSS, EST, transcripts)

4.7 years ago

Annie • 0

I want to cluster (k-means) my FASTA sequences and remove redundancy from them. The sequences are mainly GSS, EST and assembled transcripts, and the goal is to create a reference set for my short query sequences, which can target either DNA or RNA. I would appreciate some expert guidance. Also, should I convert lower-case bases to upper case before doing this? Any suggestions would be highly appreciated.
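To make the upper/lower-case part of the question concrete, here is a minimal sketch of case normalisation in Python. It assumes the lower-case letters carry no information you want to keep (in many files they mark soft-masked regions) and that the input is plain FASTA; the function name `uppercase_fasta` is made up for this illustration. Whether a given clustering tool treats case specially is something to check in that tool's documentation.

```python
from typing import Iterable, Iterator


def uppercase_fasta(lines: Iterable[str]) -> Iterator[str]:
    """Yield FASTA lines with sequence lines upper-cased and headers unchanged."""
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            yield line          # header line: keep exactly as written
        else:
            yield line.upper()  # sequence line: acgtn -> ACGTN


if __name__ == "__main__":
    # Hypothetical records, only for demonstration.
    example = [">est1 soft-masked example", "ACGTacgtnnNN", ">gss1", "ttgaCC"]
    for out in uppercase_fasta(example):
        print(out)
```

For a real file, the same generator can be fed an open file handle and its output written line by line to a new file.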

genome assembly sequence next-gen alignment • 959 views

updated 4.7 years ago by Biostar 20 • written 4.7 years ago by Annie • 0

You should also look at CD-HIT, which is specifically tailored for this type of application and has dedicated subprograms for different sequence types (for example, cd-hit-est for nucleotide sequences such as ESTs).
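For intuition about what this kind of redundancy clustering does, here is a rough sketch of greedy, longest-first clustering in Python. It is not CD-HIT's code: k-mer Jaccard similarity is swapped in for a real sequence identity, and the function names, the `threshold` of 0.8 and `k` of 4 are arbitrary choices for the illustration. It also ignores strand, so reverse-complemented copies would not be grouped.

```python
from typing import List, Set, Tuple


def kmers(seq: str, k: int = 4) -> Set[str]:
    """Return the set of k-mers in a sequence (case-insensitive)."""
    seq = seq.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Jaccard similarity of two k-mer sets; 0.0 if either set is empty."""
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def greedy_cluster(
    records: List[Tuple[str, str]], threshold: float = 0.8, k: int = 4
) -> List[Tuple[str, List[str]]]:
    """Cluster (name, sequence) records greedily, longest sequence first.

    Each sequence joins the first cluster whose representative is similar
    enough; otherwise it becomes the representative of a new cluster.
    """
    ordered = sorted(records, key=lambda r: len(r[1]), reverse=True)
    clusters = []  # each entry: (representative name, its k-mers, member names)
    for name, seq in ordered:
        ks = kmers(seq, k)
        for _rep_name, rep_kmers, members in clusters:
            if jaccard(ks, rep_kmers) >= threshold:
                members.append(name)
                break
        else:
            clusters.append((name, ks, [name]))
    return [(rep, members) for rep, _, members in clusters]


if __name__ == "__main__":
    # Hypothetical sequences: t2 is t1 with the last base removed.
    demo = [
        ("t1", "ACGTACGTGGCCTTAA"),
        ("t2", "ACGTACGTGGCCTTA"),
        ("t3", "TTTTTTTTGGGGGGGG"),
    ]
    print(greedy_cluster(demo))
```

The representatives (the first name in each pair) would form the non-redundant reference set; for real data a dedicated tool is the better choice, since the all-against-representatives comparison here does not scale.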

4.7 years ago by GenoMax 141k

Thanks for your answer, GenoMax, but I have found UCLUST to be better than CD-HIT.

4.7 years ago by Annie • 0

You might look at dedupe.sh from BBTools: https://jgi.doe.gov/data-and-tools/bbtools/bb-tools-user-guide/dedupe-guide/
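To illustrate the simplest form of redundancy removal, here is a small Python sketch that drops exact duplicates, treating a sequence and its reverse complement as the same. This is not how dedupe.sh is implemented and covers far less than that tool; the function names are invented for the example. It assumes strand does not matter, which may not hold for strand-specific transcripts, and it only complements A, C, G, T and N (other IUPAC codes pass through unchanged).

```python
from typing import Iterable, Iterator, List, Tuple

# Complement table for plain DNA bases; other symbols are left untouched.
_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Parse FASTA lines into (header, sequence) pairs."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def canonical(seq: str) -> str:
    """Return the smaller of a sequence and its reverse complement."""
    seq = seq.upper()
    revcomp = seq.translate(_COMPLEMENT)[::-1]
    return min(seq, revcomp)


def remove_exact_duplicates(
    records: Iterable[Tuple[str, str]]
) -> List[Tuple[str, str]]:
    """Keep the first record seen for each distinct canonical sequence."""
    seen = set()
    kept = []
    for header, seq in records:
        key = canonical(seq)
        if key not in seen:
            seen.add(key)
            kept.append((header, seq.upper()))
    return kept


if __name__ == "__main__":
    # Hypothetical input: b repeats a in lower case, c is a's reverse complement.
    demo = [">a", "ACGTT", ">b", "acgtt", ">c", "AACGT", ">d", "GGGCC"]
    for header, seq in remove_exact_duplicates(read_fasta(demo)):
        print(f">{header}\n{seq}")
```

In this toy input only records a and d survive. Near-duplicates and contained sequences are not handled here, which is where the dedicated tools mentioned in this thread come in.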

4.7 years ago by jean.elbers ★ 1.7k
