Question: Clustering using CD-HIT and redundancy removal
23 days ago by
vishalchanda36410 wrote:

Hello, can anyone help me with the command to run CD-HIT for clustering assembled metagenomic data? I also need to know how I can remove redundant sequences from the assembly using CD-HIT.

alignment assembly • 103 views
ADD COMMENTlink written 23 days ago by vishalchanda36410

What have you tried so far (e.g. reading the manual or the paper)?

On the redundancy part: CD-HIT automatically merges redundant sequences into clusters and writes only one representative per cluster, so you don't need to do anything special for that.

Ah, and do follow up on your earlier questions as well (apparently quite similar to this one): Removing Contigs and Redundant Sequences.

ADD REPLYlink modified 23 days ago • written 23 days ago by lieven.sterck6.9k

I have read the User's Guide, but there are so many options that it confuses me. Thanks. And sorry, I will follow up on my previous question.

ADD REPLYlink written 23 days ago by vishalchanda36410
23 days ago by
gb1.5k wrote:

To cluster (put similar reads "together") you can start with this:

cd-hit-est -i reads.fa -o output.fa -c 0.95 -n 10 -d 999 -M 0 -T 0

For more info see https://github.com/weizhongli/cdhit/wiki/3.-User's-Guide#CDHITEST

The -c option sets the sequence identity threshold (a global identity by default), so in this example all reads that are at least 95% identical to a cluster's representative will be put together. For redundancy removal I guess you need to set this to -c 1.
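If you want to try several thresholds, a small helper can build the command for you. This is only a sketch: the function name and the lookup table are my own, and the word sizes (-n) per identity range are what I recall from the CD-HIT User's Guide, so please check them against the guide for your version.

```python
import shlex

# Suggested cd-hit-est word sizes (-n) per identity range.
# Assumption: values as recalled from the CD-HIT User's Guide; verify them.
WORD_SIZE_BY_MIN_IDENTITY = [
    (0.95, 10),
    (0.90, 8),
    (0.88, 7),
    (0.85, 6),
    (0.80, 5),
    (0.75, 4),
]


def build_cd_hit_est_command(fasta_in, fasta_out, identity=0.95):
    """Return the cd-hit-est argument list for a given identity threshold."""
    if not 0.75 <= identity <= 1.0:
        raise ValueError("no word size listed for this identity threshold")
    word_size = next(n for floor, n in WORD_SIZE_BY_MIN_IDENTITY if identity >= floor)
    return [
        "cd-hit-est",
        "-i", fasta_in,
        "-o", fasta_out,
        "-c", str(identity),
        "-n", str(word_size),
        "-d", "999",  # same value as in the command above
        "-M", "0",    # no memory limit
        "-T", "0",    # use all available threads
    ]


cmd = build_cd_hit_est_command("reads.fa", "output.fa", identity=0.95)
print(shlex.join(cmd))
# To actually run it (needs cd-hit-est on your PATH):
# import subprocess; subprocess.run(cmd, check=True)
```

With identity=0.95 this prints the same command as above; with a lower threshold the helper also lowers -n, which is easy to forget when changing -c by hand.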

BUT! Keep in mind that the identity behind -c is a global one, so take for example the following reads:

>read1
AAAA
>read2
AAAAA

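These are not identical over their full length. Note, though, that as far as I understand the User's Guide, CD-HIT's global identity is the number of identical bases in the alignment divided by the length of the shorter sequence, so a short read that is fully contained in a longer one can still score 100% unless you also restrict the length difference or alignment coverage (options such as -s, -aL and -aS). So what does redundancy mean in your case?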
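To make the point concrete, here is a toy calculation for the two reads above. It is a sketch under assumptions: the helper name is made up, it only slides the shorter read along the longer one without gaps (CD-HIT does a real alignment), and the "divide by the shorter sequence" definition is how I read the User's Guide.

```python
def best_ungapped_matches(a, b):
    """Count identical positions at the best ungapped placement of the
    shorter sequence on the longer one (toy stand-in for an alignment)."""
    short, long_ = sorted((a, b), key=len)
    best = 0
    for offset in range(len(long_) - len(short) + 1):
        matches = sum(s == l for s, l in zip(short, long_[offset:]))
        best = max(best, matches)
    return best, len(short), len(long_)


matches, short_len, long_len = best_ungapped_matches("AAAA", "AAAAA")

# Identity relative to the shorter read (assumed CD-HIT default definition).
identity_vs_shorter = matches / short_len
# Identity relative to the longer read, i.e. "the same over the full length".
identity_vs_longer = matches / long_len

print(identity_vs_shorter, identity_vs_longer)  # 1.0 0.8
```

So whether read1 counts as redundant depends on which of the two numbers you care about.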

The output (output.fa) will contain the representative sequences. In practice (sort of) cd-hit first sorts your input FASTA by read length, longest first. After that it goes through the sorted reads from top to bottom. At the very first read there are no clusters yet, so this read becomes the representative of the first cluster. If the second read is at least 95% similar to it, it becomes part of that first cluster; if it is not, it starts a new cluster. Let's say those two reads are similar: then your output file will contain only one sequence, so the redundancy is removed.
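The procedure described above can be sketched in a few lines of Python. This is not CD-HIT's code: the function names and example reads are made up, and the identity function is a toy (best ungapped placement, divided by the shorter length), whereas the real program uses short-word filtering and proper alignment to be fast.

```python
def toy_identity(a, b):
    """Fraction of the shorter sequence matching at its best ungapped
    placement on the longer one (simplified stand-in for an alignment)."""
    short, long_ = sorted((a, b), key=len)
    best = max(
        sum(s == l for s, l in zip(short, long_[offset:]))
        for offset in range(len(long_) - len(short) + 1)
    )
    return best / len(short)


def greedy_cluster(reads, threshold=0.95):
    """reads: dict of name -> sequence.
    Returns a dict of representative name -> list of member names."""
    clusters = {}
    # Longest reads first; ties keep their input order.
    by_length = sorted(reads.items(), key=lambda item: len(item[1]), reverse=True)
    for name, seq in by_length:
        for rep in clusters:
            if toy_identity(seq, reads[rep]) >= threshold:
                clusters[rep].append(name)  # joins the first matching cluster
                break
        else:
            clusters[name] = [name]  # no match: becomes a new representative
    return clusters


reads = {
    "read1": "ACGTACGTACGTACGTACGT",
    "read2": "ACGTACGTACGTACGTACGA",  # one mismatch versus read1 (19/20)
    "read3": "TTTTTTTTTTGGGGGGGGGG",  # unrelated
    "read4": "ACGTACGTACGT",          # shorter, contained in read1
}
clusters = greedy_cluster(reads, threshold=0.95)
print(clusters)
# {'read1': ['read1', 'read2', 'read4'], 'read3': ['read3']}
```

Only the keys (read1 and read3 here) would end up in output.fa; the other reads are the "removed" redundancy.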

ADD COMMENTlink modified 22 days ago • written 23 days ago by gb1.5k

Thank you so much for the help. I had tried running exactly the same command but could not interpret its meaning; now it is clear.

ADD REPLYlink written 22 days ago by vishalchanda36410

Even though it works now, try to read through that list of parameters once. Maybe you will spot more useful options.

ADD REPLYlink written 22 days ago by gb1.5k

Yes, I was doing the same. Thank you so much!

ADD REPLYlink written 21 days ago by vishalchanda36410