How To Cluster Nucleotide Sequences
10.4 years ago
bambus0725 ▴ 50

Hello everyone,

I am working with a few RNA-seq libraries sequenced on the Illumina Genome Analyzer, each containing millions of reads (approximately 2-10 million reads, 20-100 nt long). I want to group sequences that are 97% identical into one cluster, to reduce the redundancy of each library before further analysis.

For this, I tried both CD-HIT-EST and UCLUST, clustering tools that can meet my criteria: 97% identity and a minimum alignment coverage of >= 40 for both the longer and the shorter sequence.

With CD-HIT-EST, I couldn't get the full ID description from the input file into the output "cluster file", even though there is a "-d" option for the description length. The ID is cut at the first space (tab-delimited), but I need the entire ID description up to the end of the line. As I am not a good programmer, I couldn't change the source code (written in C++).

For example, sample input

HWI-ST365:262:C0RY7ACXX:6:2312:6978:74690 1:N:0:GTGAAA size|1
CCAACCAATGAACAGGGCTTTGGCGACGACGAACTCACTCCTCTCTGTTGACGAT

HWI-ST365:262:C0RY7ACXX:6:1305:5522:5869 1:N:0:GTGAAA size|5
TGAAATGCTGCGCGGTAGAGGAGCGTTCTGTAAGTCGCTGAAGCTGAGTCGCGAGGCTTGGTGGAGACATCAGAAGTGCGAATGCTGACATGAGCAACGA

sample output

Cluster 0
0 100nt, >HWI-ST365:262:C0RY7ACXX:6:1305:5522:5869... *

Cluster 1
0 100nt, >HWI-ST365:262:C0RY7ACXX:6:1208:7633:77605... *

To work around this problem I tried UCLUST, which does print the entire ID description, but unlike what the published papers describe, it runs very slowly on large data sets. Is there an option to set the number of threads, or is there an alternative solution?

It would be very helpful if anyone could help me solve this problem.

Thank you in advance.


Couldn't you circumvent this issue by replacing spaces in the header with something (e.g. _)?

sed 's/ /_/g' seq.fa > seq2.fa 
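
A slightly more defensive variant of the same idea, as a rough sketch: it only touches header lines (those starting with ">") and also replaces tabs, so the sequence lines are left alone. The function name rename_headers and the file names seq.fa, seq2.fa and clustered.fa are just placeholders. The optional cd-hit-est call at the end is an untested suggestion based on my reading of the CD-HIT usage text, where -d 0 keeps the header up to the first whitespace, -T 0 uses all available CPUs and -M 0 lifts the memory limit; please check these against the help output of your installed version. No coverage options are set here, so add your own.

```shell
#!/bin/sh
# Replace every space or tab in FASTA header lines with "_".
# Reads FASTA on stdin, writes the result to stdout.
# (rename_headers is a hypothetical helper name.)
rename_headers() {
    sed '/^>/ s/[[:space:]]/_/g'
}

# Placeholder file names; adjust to your data.
if [ -f seq.fa ]; then
    rename_headers < seq.fa > seq2.fa
fi

# Optional clustering step, run only if cd-hit-est is installed.
# Assumed flags: -c identity threshold, -d 0 full header up to the
# first whitespace, -T 0 all CPUs, -M 0 no memory limit.
if [ -f seq2.fa ] && command -v cd-hit-est >/dev/null 2>&1; then
    cd-hit-est -i seq2.fa -o clustered.fa -c 0.97 -d 0 -T 0 -M 0
fi
```

Since the renamed headers no longer contain whitespace, the full ID should then show up in the .clstr file. If you need the original headers back afterwards, note that this substitution is only reversible when the IDs had no underscores to begin with.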

Hi Manu,

That was a really good and simple idea, and I am very thankful to you.


If reducing the redundancy of the data library is your goal, I would also suggest tools like PRINSEQ.
