Question: CD-HIT Clustering output
0
gravatar for BSP
5.3 years ago by
BSP0
European Union
BSP0 wrote:

Hello,

 

currently I am analysing several metagenome and transcriptome datasets. For downstream analysis I am clustering these datasets with CD-HIT. Unfortunately CD-HIT produces no default output with an overview of sequence abundance within each cluster (simple tab delimited file). I found an old script which grabs this information but it uses the .bak.clstr output which was only available till version 4.3. Is there a way to produce this output in the newest version or an alternative way to produce a simple sequence abundance to cluster name overview?

 

Sincerely,

 

Felix

cd-hit clustering • 5.0k views
ADD COMMENTlink modified 5.3 years ago • written 5.3 years ago by BSP0

By sequence abundance do you mean how many times the duplicate sequence is found in the cluster?

ADD REPLYlink written 5.3 years ago by Prakki Rama2.3k

Yes, exactly!

Zitat von Prakki Rama on Biostar <notifications@biostars.org>:

ADD REPLYlink written 5.3 years ago by BSP0

Check if this useful. (assuming no empty lines between Cluster 0 and Cluster 1)

My input:

>Cluster 0
0       15679nt, >SB1234_Contig35475... at +/99.99%
1       15436nt, >SB1234_Contig35476... at +/99.62%
2       15764nt, >Contig18540... *
3       15438nt, >Contig39392... at +/99.69%
4       15679nt, >comp263440_c8_seq4... at -/99.99%
5       15667nt, >comp263440_c8_seq6... at -/99.99%
>Cluster 1
0       15684nt, >SB1234_Contig35474... at +/99.98%
1       15685nt, >Contig11682... *
2       15684nt, >comp263440_c8_seq3... at -/99.98%
3       15672nt, >comp263440_c8_seq5... at -/99.97%

My script

$TotalClusters=`grep -c '>Cluster' cd-hit.test.txt`;
for($i=0;$i<$TotalClusters;$i++)
{
$j=$i+1;
$lines=`perl -e 'while (<>){print if (/^>Cluster $i/../^>Cluster $j/);}' cd-hit.test.txt | wc -l`;
$linesExcludingPattern=$lines-2;
print "Cluster $i has $linesExcludingPattern sequences\n";
}

OUTPUT

Cluster 0 has 6 sequences
Cluster 1 has 4 sequences

Is this what you wanted?

ADD REPLYlink modified 5.3 years ago • written 5.3 years ago by Prakki Rama2.3k

Dear Prakki, I wonder if you have a solution if within each Cluster are different individuals like this:

>Cluster 0
0       15679nt, >SpecA_Contig35475... at +/99.99%
1       15436nt, >SpecA_Contig35476... at +/99.62%
2       15764nt, >SpecB_Contig18540... *
3       15438nt, >SpecA_Contig39392... at +/99.69%
4       15679nt, >SpecC_comp263440_c8_seq4... at -/99.99%
>Cluster 1
0       15684nt, >SpecC_SB1234_Contig35474... at +/99.98%
1       15685nt, >SpecC_Contig11682... *
>Cluster 2
0       15684nt, >SpecA_comp263440_c8_seq3... at -/99.98%
1       15672nt, >SpecB_comp263440_c8_seq5... at -/99.97%

and I would like to find out how many Clusters own only one, two or all three species as it is the case in Cluster 0?

Thanks a lot!

ADD REPLYlink modified 5 hours ago by RamRS24k • written 4.1 years ago by Elli0

the above script should serve your purpose!

ADD REPLYlink written 4.1 years ago by Prakki Rama2.3k

Is this supposed to run in bash?

I get the following output...

./cdhit_parser.sh: line 2: =291: command not found
./cdhit_parser.sh: line 3: syntax error near unexpected token `('
./cdhit_parser.sh: line 3: `for($i=0;$i<$TotalClusters;$i++)'
ADD REPLYlink modified 5 hours ago by RamRS24k • written 2.5 years ago by germanbr0

Dear Prakki, I find your script super useful. This is something I exactly need. More specifically I need this script to split my cd-hit output for clusters having 1 sequence and more than 1 sequences. What I do is I rename my output.clstr file as cd-hit.test.txt. Then I copy your script and save in the same folder as perl clstr_num.pl. Then I simply run it by

$ perl clstr_num.pl

My output.clstr or cd-hit.test.txt file looks like this

 >Cluster 0
0   22431nt, wt1_800... *
>Cluster 1
0   21560nt, mock2_6320... *
>Cluster 2
0   2668nt, wt2_2859... at 99.06%
1   4470nt, wt2_5059... at 98.95%
2   17970nt, h-stress3_1002... *
>Cluster 3
0   17850nt, wt2_422... *
1   30nt, mock1_3879... at 100.00%
2   5548nt, c-stress2_3811... at 99.33%
>Cluster 4
0   17798nt, wt2_704... *
1   24nt, wt2_2505... at 100.00%
2   3064nt, h-stress2_3822... at 99.58%
3   3000nt, inf3_1050... at 99.67%
4   2727nt, c-stress2_2857... at 98.57%
>Cluster 5
0   4826nt, wt2_2557... at 99.11%
1   17143nt, h-stress1_643... *
>Cluster 6
0   16920nt, c-stress3_3308... *

but the script outputs

Cluster 0 has 1 sequences
Cluster 1 has 25423 sequences
Cluster 2 has 24537 sequences
Cluster 3 has 25804 sequences
Cluster 4 has 23779 sequences
Cluster 5 has 23693 sequences
Cluster 6 has 8353 sequences

Where I expect the output

Cluster 0 has 1 sequences
Cluster 1 has 1 sequences
Cluster 2 has 3 sequences
Cluster 3 has 3 sequences
Cluster 4 has 4 sequences
Cluster 5 has 2 sequences
Cluster 6 has 1 sequences

Can you please guide me how to get the expected output? Please excuse my childish language, I have 0 coding knowledge. It's actually my 1st comment on biostar. Your help would be a huge favor and highly appreciated. Thanks much, Shan

ADD REPLYlink modified 4 months ago • written 4 months ago by Syed Shan-e-Ali Zaidi0

Hi @syed. I just saw your message. The above script has actually worked for me. Let me show you, how I did.

$ cat test.fasta
>Cluster 0
0   22431nt, wt1_800... *
>Cluster 1
0   21560nt, mock2_6320... *
>Cluster 2
0   2668nt, wt2_2859... at 99.06%
1   4470nt, wt2_5059... at 98.95%
2   17970nt, h-stress3_1002... *
>Cluster 3
0   17850nt, wt2_422... *
1   30nt, mock1_3879... at 100.00%
2   5548nt, c-stress2_3811... at 99.33%
>Cluster 4
0   17798nt, wt2_704... *
1   24nt, wt2_2505... at 100.00%
2   3064nt, h-stress2_3822... at 99.58%
3   3000nt, inf3_1050... at 99.67%
4   2727nt, c-stress2_2857... at 98.57%
>Cluster 5
0   4826nt, wt2_2557... at 99.11%
1   17143nt, h-stress1_643... *
>Cluster 6
0   16920nt, c-stress3_3308... *

$ cat test.pl
$TotalClusters=`grep -c '>Cluster' test.fasta`;
for($i=0;$i<$TotalClusters;$i++)
{
$j=$i+1;
$lines=`perl -e 'while (<>){print if (/^>Cluster $i/../^>Cluster $j/);}' test.fasta | wc -l`;
$linesExcludingPattern=$lines-2;
print "Cluster $i has $linesExcludingPattern sequences\n";
}

$ perl test.pl
Cluster 0 has 1 sequences
Cluster 1 has 1 sequences
Cluster 2 has 3 sequences
Cluster 3 has 3 sequences
Cluster 4 has 5 sequences
Cluster 5 has 2 sequences
Cluster 6 has 1 sequences

I just had to change the filenames. I did not change any code. Make sure you input is properly formatted as the example used above. Then, I guess the script should work.

ADD REPLYlink written 4 months ago by Prakki Rama2.3k

Hi there,

This script does not work very well - wrong count. Particulary for large output..

ADD REPLYlink written 4 weeks ago by geizetomazetto0
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1418 users visited in the last hour