Clustering using CD-HIT and redundancy removal
4.2 years ago

Hello, can anyone help me with the command to run CD-HIT for clustering assembled metagenomic data? I also need to know how to remove redundant sequences from the assembly using CD-HIT.

Assembly alignment • 3.7k views

What have you tried so far (e.g. reading the manual or the paper)?

On the redundancy part: CD-HIT will automatically merge (and thus remove) redundant sequences, so you don't need to do anything special for that.

Ah, and do follow up on your earlier questions as well (apparently quite similar to this one): Removing Contigs and Redundant Sequences.


I have read the User's Guide, but there are so many options that it confuses me, thanks. And sorry, I will follow up on my previous question.

4.2 years ago
gb ★ 2.2k

To cluster (put similar reads "together") you can start with this:

cd-hit-est -i reads.fa -o output.fa -c 0.95 -n 10 -d 999 -M 0 -T 0

For more info see https://github.com/weizhongli/cdhit/wiki/3.-User's-Guide#CDHITEST

The option -c sets the global sequence identity threshold, so in this example all reads that are at least 95% similar will be put together. For redundancy removal I guess you need to set -c to 1.0.
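The two runs described here differ only in the value passed to -c, which can be made explicit with a small sketch in Python. The function name cdhit_est_command and the output file name nonredundant.fa are made up for illustration, and the sketch assumes cd-hit-est is installed and on your PATH; all flags are the ones from the command above.

```python
import os
import shutil
import subprocess


def cdhit_est_command(fasta_in, fasta_out, identity=0.95, word_length=10):
    """Build the cd-hit-est argument list used in this answer.

    Hypothetical helper: only -c (identity threshold) and -n (word length)
    are exposed; -d, -M and -T are kept as in the original command.
    """
    return [
        "cd-hit-est",
        "-i", fasta_in,
        "-o", fasta_out,
        "-c", str(identity),
        "-n", str(word_length),
        "-d", "999",
        "-M", "0",
        "-T", "0",
    ]


# Clustering at 95% identity (same as the command above).
cluster_cmd = cdhit_est_command("reads.fa", "output.fa", identity=0.95)

# Redundancy removal: same flags, but -c set to 1.0.
dedup_cmd = cdhit_est_command("reads.fa", "nonredundant.fa", identity=1.0)


def run_if_available(cmd):
    """Run the command only if cd-hit-est and the input file are present."""
    if shutil.which(cmd[0]) and os.path.exists(cmd[2]):
        subprocess.run(cmd, check=True)
        return True
    return False


print(" ".join(cluster_cmd))
print(" ".join(dedup_cmd))
```

Call run_if_available(cluster_cmd) once reads.fa is in place. The word length -n 10 is within the range the user's guide recommends for thresholds between 0.95 and 1.0, so it does not need to change between the two runs.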

BUT! Keep in mind that -c is a global sequence identity, so for example the following reads:

>read1
AAAA
>read2
AAAAA

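are not identical strings, even though the shorter read is fully contained in the longer one. The user's guide defines the -c identity as the number of identical bases in the alignment divided by the full length of the shorter sequence, so a pair like this can still end up in the same cluster unless you also set a length cutoff such as -s or -aL. So what does redundancy mean in your case?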

The output (output.fa) will contain the representative sequences. In practice (sort of), cd-hit first sorts your input FASTA by read length, longest first. After that it goes through the sorted reads from top to bottom. At the very first read there are no clusters yet, so this read becomes the representative of the first cluster. If the second read is at least 95% similar to it, it becomes part of that first cluster; if it is not, it starts a new cluster. Let's say those two reads are similar: then your output file will contain only one sequence, so the redundancy is removed.
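The procedure described above can be sketched in a few lines of Python. This is only an illustration of the greedy idea, not CD-HIT's actual implementation: toy_identity is a made-up stand-in that slides the shorter read along the longer one without gaps and divides the best match count by the length of the shorter read, and greedy_cluster is a hypothetical name.

```python
def toy_identity(a, b):
    """Toy similarity score (NOT CD-HIT's aligner).

    Best ungapped placement of the shorter sequence inside the longer one,
    as matches divided by the length of the shorter sequence.
    """
    short, long_ = (a, b) if len(a) <= len(b) else (b, a)
    if not short:
        return 0.0
    best = 0
    for offset in range(len(long_) - len(short) + 1):
        matches = sum(1 for x, y in zip(short, long_[offset:]) if x == y)
        best = max(best, matches)
    return best / len(short)


def greedy_cluster(seqs, threshold=0.95):
    """Greedy incremental clustering of a dict {name: sequence}.

    Returns a list of [representative_name, [member_names]] entries.
    """
    # Longest first; sorted() is stable, so ties keep their input order.
    ordered = sorted(seqs.items(), key=lambda kv: len(kv[1]), reverse=True)
    clusters = []
    for name, seq in ordered:
        for cluster in clusters:
            rep_name = cluster[0]
            if toy_identity(seq, seqs[rep_name]) >= threshold:
                cluster[1].append(name)  # similar enough: join this cluster
                break
        else:
            # No existing representative matched: start a new cluster.
            clusters.append([name, [name]])
    return clusters


reads = {"read1": "AAAA", "read2": "AAAAA", "read3": "CCCCC"}
clusters = greedy_cluster(reads, threshold=1.0)
representatives = [rep for rep, _ in clusters]
print(clusters)
```

Only the names in representatives would be written to the output FASTA in this sketch. Because the toy score is computed relative to the shorter read, read1 joins the cluster of read2 even at a threshold of 1.0; a stricter definition of redundancy would need an extra length check.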


Thank you so much for the help. I tried running exactly the same command but could not interpret what it meant; now it is clear.


Even though it works now, try to read through that list of parameters once. Maybe you will see more useful options.


Yes, I was doing just that. Thank you so much!
