Question: Help with CD-hit commands for DNA sequences
frcamacho (United States) wrote, 2.9 years ago:

Hi,

I want to use cd-hit to stringently cluster and remove redundant DNA sequences (~14,500 sequences). Until now I have been running an all-vs-all blastn, filtering for 98% qcovhsp and 98% percent identity, and then running a script that keeps the longer sequence of each match. I then found cd-hit, which lets me do the same thing but also keeps track of the clusters for me. Going through its options, I found some that should reproduce what I have been doing: 1) cluster at 98% query coverage and 98% percent identity, and 2) keep the longer sequence in each match.

Here is what I came up with to replicate the all-vs-all blastn (please correct me if I am wrong!):

    cdhit -i input.fa -o output.fa -n 11 -g 1 -G 0 -aL .98

    -n   word size
    -g   accurate mode
    -G   local sequence identity
    -aL  number of bases of the longer sequence in the alignment / length of the longer sequence

However, I can't seem to find an option for percent identity. I want 98% of the bases in the alignment to match. Any help will be appreciated!


If you are dealing with DNA, you probably want to use cd-hit-est.

written 2.9 years ago by h.mon

I thought cd-hit-est was not suited to long sequences. My longest sequence is 71 kb. Is cd-hit-est OK for that?

written 2.9 years ago by frcamacho
abascalfederico (Spain) wrote, 2.9 years ago:

Hi, in the version I have, the sequence identity threshold is set with "-c":

    -c  sequence identity threshold, default 0.9
        this is the default cd-hit's "global sequence identity" calculated as:
        number of identical amino acids in alignment
        divided by the full length of the shorter sequence
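
To make the quoted definition concrete, here is a small worked sketch in Python. It assumes the pairwise alignment is already available as two gapped strings of equal length; cd-hit computes its own alignment internally, so the function name `global_identity` and the input format are illustrative only, not part of cd-hit.

```python
def global_identity(aligned_a: str, aligned_b: str) -> float:
    """Identical aligned positions divided by the full length of the shorter sequence.

    Both inputs are rows of one pairwise alignment, with '-' marking gaps
    (an assumed input format, chosen for illustration).
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")

    # Count columns where both rows carry the same base (gap columns never count).
    identical = sum(1 for x, y in zip(aligned_a, aligned_b) if x == y and x != "-")

    # Ungapped lengths of the two original sequences.
    len_a = len(aligned_a.replace("-", ""))
    len_b = len(aligned_b.replace("-", ""))
    shorter = min(len_a, len_b)
    if shorter == 0:
        raise ValueError("sequences must not be empty")

    return identical / shorter


# 7 identical columns, shorter sequence is 8 bases long: 7 / 8
print(global_identity("ACGTACGTAC", "ACGTTCGT--"))  # 0.875
```

With this pair the value is 0.875, so the two sequences would not be grouped at a threshold of 0.98.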

Which version are you running? I am running 4.6 (built on Jul 29 2016)

written 2.9 years ago by frcamacho

Mine is version 4.5.4, but I've checked that more recent versions still use -c.

written 2.9 years ago by abascalfederico
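
Putting the thread's suggestions together (cd-hit-est for DNA, -c for the identity threshold, plus the flags from the question), the call might be assembled as in the sketch below. The helper name `build_cdhit_est_cmd` is hypothetical, the file names are the ones from the question, and nothing here has been tested on the 71 kb sequences mentioned above. As far as I understand the help text, with -G 0 the -c threshold applies to local identity (identical bases divided by alignment length), which is why it is paired with the -aL coverage cutoff.

```python
import shlex


def build_cdhit_est_cmd(fasta_in, fasta_out, identity=0.98,
                        coverage_longer=0.98, word_size=11):
    """Return the argument list for a cd-hit-est run (hypothetical helper).

    Only flags mentioned in the thread are used; the command is built,
    not executed.
    """
    return [
        "cd-hit-est",
        "-i", fasta_in,               # input FASTA
        "-o", fasta_out,              # output FASTA of representative sequences
        "-c", str(identity),          # sequence identity threshold
        "-n", str(word_size),         # word size
        "-g", "1",                    # accurate mode
        "-G", "0",                    # local sequence identity
        "-aL", str(coverage_longer),  # alignment coverage of the longer sequence
    ]


cmd = build_cdhit_est_cmd("input.fa", "output.fa")
print(shlex.join(cmd))
# cd-hit-est -i input.fa -o output.fa -c 0.98 -n 11 -g 1 -G 0 -aL 0.98
```

The printed line can be pasted into a shell, or the list can be passed to `subprocess.run(cmd, check=True)` once cd-hit-est is confirmed to be on the PATH.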