Question

Clustering using CD-HIT and redundancy removal

0

4.2 years ago

vishalchanda364 ▴ 20

Hello, can anyone help me with the command to run CD-HIT for clustering assembled metagenomic data? I also need to know how I can remove redundant sequences from the assembly using CD-HIT.

Assembly alignment • 3.7k views

1

What have you tried so far (e.g. reading the manual or the paper)?

On the redundancy part: CD-HIT will automatically merge (and thus remove) redundant sequences, so you don't need to do anything special for that.

Ah, and do follow up on your earlier questions as well (apparently quite similar to this one): Removing Contigs and Redundant Sequences.

4.2 years ago by lieven.sterck 15k

1

I have read the User's Guide, but there are so many options that it confuses me. Thanks. And sorry, I will follow up on my previous question.

4.2 years ago by vishalchanda364 ▴ 20

score 1 · Answer 1 · 2020-01-28

1

4.2 years ago

gb ★ 2.2k

To cluster (put similar reads "together") you can start with this:

cd-hit-est -i reads.fa -o output.fa -c 0.95 -n 10 -d 999 -M 0 -T 0

For more info see https://github.com/weizhongli/cdhit/wiki/3.-User's-Guide#CDHITEST

The -c option sets the global sequence identity threshold, so in this example all reads that are at least 95% identical are put in the same cluster. For redundancy removal I guess you need to set this to -c 1.
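To see which reads ended up together, it can help to look at the cluster listing that cd-hit-est writes next to the representative file (output.fa.clstr for the command above). The snippet below is only a minimal sketch: cluster_sizes is a made-up helper name, and it assumes the listing consists of ">Cluster N" header lines, each followed by one line per member sequence.

```shell
# Print "cluster_number<TAB>member_count" for every cluster in a .clstr listing.
# Assumption: each cluster starts with a ">Cluster N" line and every
# non-empty line after it (until the next header) is one member.
cluster_sizes() {
    awk '
        /^>Cluster/ { if (name != "") print name "\t" n; name = $2; n = 0; next }
        NF          { n++ }
        END         { if (name != "") print name "\t" n }
    ' "$1"
}
```

For example, `cluster_sizes output.fa.clstr | sort -k2,2nr | head` would list the largest clusters first.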

But keep in mind that this is a global alignment, so for example the following reads:

>read1
AAAA
>read2
AAAAA

are not 100% the same. So what does redundancy mean in your case?

The output (output.fa) will contain the representative sequences. In practice (sort of), cd-hit first sorts the reads in your input FASTA by length, longest first. After that it goes through the sorted reads from top to bottom. At the very first read there are no clusters yet, so it becomes the representative read of the first cluster. If the second read is at least 95% similar to it, it becomes part of that first cluster; if it is not, it starts a new cluster. Let's say those two reads are similar: then your output file will contain only one sequence, so the redundancy is removed.
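A quick way to check how much was collapsed is to count the FASTA headers before and after. This is just a rough sketch; count_seqs and report_reduction are made-up helper names, and the file names are the ones from the command above.

```shell
# Count FASTA records (header lines starting with ">") in a file.
count_seqs() {
    grep -c '^>' "$1"
}

# Report how many sequences were collapsed into representatives.
# $1 = FASTA given to cd-hit-est, $2 = representative FASTA it wrote.
report_reduction() {
    before=$(count_seqs "$1")
    after=$(count_seqs "$2")
    echo "input=$before representatives=$after removed=$((before - after))"
}
```

Usage would be `report_reduction reads.fa output.fa`. If "removed" is 0, no two reads met the -c threshold.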

0

Thank you so much for the help. I had tried running exactly the same command but could not interpret what it meant; now it is clear.

4.2 years ago by vishalchanda364 ▴ 20

1

Even though it works now, try to read through that list of parameters once. Maybe you will see more useful options.