Question: cd-hit-est total seq value doesn't match number of sequences being compared?
10 months ago by c.e.chong

Hi,

I am trying to use cd-hit-est to cluster merged contig files (containing contigs from 22 different metagenome samples) and remove any contigs that are 99% similar to any others, so that I am left with a contig.fasta file containing no duplicates.

I am running cd-hit-est on 4 merged contig files:

1. reads filtered for human sequences using bbmap and then assembled with metaspades
2. reads filtered for human sequences using bbmap and then assembled with spades
3. reads filtered for human sequences using bowtie & samtools and then assembled with metaspades
4. reads filtered for human sequences using bowtie & samtools and then assembled with spades

This is the code I used:

cd-hit-est -M 100000 -i mergedcontigs.fasta -o merged_cd.fasta -c 0.99 -n 8 -A 0.90
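For anyone wanting to repeat the same call across all four merged files, here is a minimal Python sketch that only builds and prints the command lines. The file names in `MERGED_FILES` and the helper `build_cd_hit_est_command` are hypothetical placeholders, not the names used in the post; the option values are copied from the command above.

```python
import shlex

# Hypothetical file names standing in for the four merged contig files.
MERGED_FILES = {
    "bbmap_metaspades": "bbmap_metaspades_merged.fasta",
    "bbmap_spades": "bbmap_spades_merged.fasta",
    "bowtie_metaspades": "bowtie_metaspades_merged.fasta",
    "bowtie_spades": "bowtie_spades_merged.fasta",
}


def build_cd_hit_est_command(input_fasta, output_fasta):
    """Return the cd-hit-est argument list with the options from the post."""
    return [
        "cd-hit-est",
        "-M", "100000",   # memory limit in MB
        "-i", input_fasta,
        "-o", output_fasta,
        "-c", "0.99",     # sequence identity threshold
        "-n", "8",        # word length
        "-A", "0.90",     # minimal alignment coverage for both sequences
    ]


# Print one command per merged file; nothing is executed here.
for label, fasta in MERGED_FILES.items():
    print(shlex.join(build_cd_hit_est_command(fasta, f"{label}_cd.fasta")))
```

Each printed line can be pasted into a shell, or the list can be handed to `subprocess.run` once cd-hit-est is on the PATH.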

When I run this command on files 1 & 2, everything seems to work fine. But when I run it on files 3 & 4, the total seq value and the number of sequences being compared are different, with the latter capped at 40000, whereas these two numbers were the same for files 1 & 2.

      Output
----------------------------------------------------------------
total seq: 502084
longest and shortest : 898003 and 11
Total letters: 418459779
Sequences have been sorted

Approximated minimal memory consumption:
Sequence        : 485M
Buffer          : 1 X 358M = 358M
Table           : 1 X 9M = 9M
Miscellaneous   : 6M
Total           : 860M

Table limit with the given memory limit:
Max number of representatives: 40000
Max number of word counting entries: 12392460836

comparing sequences from          0  to      40000
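
To compare these numbers across runs without reading each log by eye, a small Python sketch like the one below can pull out the relevant values. The function name `parse_cd_hit_log` is made up for illustration, and the regular expressions assume the log lines look exactly like the output above.

```python
import re

# Excerpt of the log shown above.
LOG = """\
total seq: 502084
longest and shortest : 898003 and 11
Total letters: 418459779
Max number of representatives: 40000
Max number of word counting entries: 12392460836

comparing sequences from          0  to      40000
"""


def parse_cd_hit_log(text):
    """Extract the sequence count, table limit and comparison ranges."""
    total = re.search(r"total seq:\s*(\d+)", text)
    max_reps = re.search(r"Max number of representatives:\s*(\d+)", text)
    ranges = re.findall(r"comparing sequences from\s+(\d+)\s+to\s+(\d+)", text)
    return {
        "total_seq": int(total.group(1)) if total else None,
        "max_representatives": int(max_reps.group(1)) if max_reps else None,
        "compared_ranges": [(int(a), int(b)) for a, b in ranges],
    }


parsed = parse_cd_hit_log(LOG)
first_end = parsed["compared_ranges"][0][1]
print(first_end == parsed["max_representatives"])  # True
print(parsed["total_seq"] - first_end)             # 462084
```

In this log the upper bound of the comparison range equals the "Max number of representatives" value rather than the total seq value. That is only a reading of the numbers printed here, not a statement about how cd-hit-est chooses the range.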

If anyone has any idea why this might be, I would be grateful!

Thanks in advance!

written 10 months ago by c.e.chong

Have you considered the -s parameter?

From the cd-hit-est help for `-s`:
 -s   length difference cutoff, default 0.0
      if set to 0.9, the shorter sequences need to be
      at least 90% length of the representative of the cluster
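
As a rough illustration of what that help text describes, here is a minimal Python sketch of the length check. The function `passes_length_cutoff` is hypothetical, and the exact comparison cd-hit-est performs internally (rounding, `>=` versus `>`) is an assumption; only the "at least 90% length of the representative" rule comes from the help text.

```python
def passes_length_cutoff(seq_len, rep_len, s=0.0):
    """Return True if a sequence is long enough to join a representative's cluster.

    Assumed rule: the shorter sequence must be at least s * rep_len long.
    With the default s=0.0 every sequence passes.
    """
    return seq_len >= s * rep_len


# With -s 0.9 and a 1000 bp representative, 900 bp passes and 899 bp does not.
print(passes_length_cutoff(900, 1000, s=0.9))   # True
print(passes_length_cutoff(899, 1000, s=0.9))   # False

# Longest and shortest contigs from the log above (898003 and 11).
print(passes_length_cutoff(11, 898003, s=0.9))  # False
print(passes_length_cutoff(11, 898003))         # True
```

With the default of 0.0 the 11 bp contig is still eligible to cluster with the 898003 bp one, provided the other thresholds are met.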
modified 10 months ago • written 10 months ago by Nitin Narwade