CD-HIT Clustering output
1
0
Entering edit mode
9.7 years ago
BSP • 0

Hello,

Currently I am analysing several metagenome and transcriptome datasets. For downstream analysis I am clustering these datasets with CD-HIT. Unfortunately CD-HIT produces no default output with an overview of sequence abundance within each cluster (simple tab delimited file). I found an old script which grabs this information but it uses the .bak.clstr output which was only available till version 4.3. Is there a way to produce this output in the newest version or an alternative way to produce a simple sequence abundance to cluster name overview?

Sincerely,
Felix

Clustering CD-HIT • 11k views
ADD COMMENT
0
Entering edit mode

By sequence abundance do you mean how many times the duplicate sequence is found in the cluster?

ADD REPLY
0
Entering edit mode

Yes, exactly!

ADD REPLY
0
Entering edit mode

Check if this useful. (assuming no empty lines between Cluster 0 and Cluster 1)

My input:

>Cluster 0
0       15679nt, >SB1234_Contig35475... at +/99.99%
1       15436nt, >SB1234_Contig35476... at +/99.62%
2       15764nt, >Contig18540... *
3       15438nt, >Contig39392... at +/99.69%
4       15679nt, >comp263440_c8_seq4... at -/99.99%
5       15667nt, >comp263440_c8_seq6... at -/99.99%
>Cluster 1
0       15684nt, >SB1234_Contig35474... at +/99.98%
1       15685nt, >Contig11682... *
2       15684nt, >comp263440_c8_seq3... at -/99.98%
3       15672nt, >comp263440_c8_seq5... at -/99.97%

My script

$TotalClusters=`grep -c '>Cluster' cd-hit.test.txt`;
for($i=0;$i<$TotalClusters;$i++)
{
$j=$i+1;
$lines=`perl -e 'while (<>){print if (/^>Cluster $i/../^>Cluster $j/);}' cd-hit.test.txt | wc -l`;
$linesExcludingPattern=$lines-2;
print "Cluster $i has $linesExcludingPattern sequences\n";
}

OUTPUT

Cluster 0 has 6 sequences
Cluster 1 has 4 sequences

Is this what you wanted?

ADD REPLY
0
Entering edit mode

Dear Prakki, I wonder if you have a solution if within each Cluster are different individuals like this:

>Cluster 0
0       15679nt, >SpecA_Contig35475... at +/99.99%
1       15436nt, >SpecA_Contig35476... at +/99.62%
2       15764nt, >SpecB_Contig18540... *
3       15438nt, >SpecA_Contig39392... at +/99.69%
4       15679nt, >SpecC_comp263440_c8_seq4... at -/99.99%
>Cluster 1
0       15684nt, >SpecC_SB1234_Contig35474... at +/99.98%
1       15685nt, >SpecC_Contig11682... *
>Cluster 2
0       15684nt, >SpecA_comp263440_c8_seq3... at -/99.98%
1       15672nt, >SpecB_comp263440_c8_seq5... at -/99.97%

and I would like to find out how many Clusters own only one, two or all three species as it is the case in Cluster 0?

Thanks a lot!

ADD REPLY
0
Entering edit mode

the above script should serve your purpose!

ADD REPLY
0
Entering edit mode

Is this supposed to run in bash?

I get the following output...

./cdhit_parser.sh: line 2: =291: command not found
./cdhit_parser.sh: line 3: syntax error near unexpected token `('
./cdhit_parser.sh: line 3: `for($i=0;$i<$TotalClusters;$i++)'
ADD REPLY
0
Entering edit mode

Dear Prakki, I find your script super useful. This is something I exactly need. More specifically I need this script to split my cd-hit output for clusters having 1 sequence and more than 1 sequences. What I do is I rename my output.clstr file as cd-hit.test.txt. Then I copy your script and save in the same folder as perl clstr_num.pl. Then I simply run it by

$ perl clstr_num.pl

My output.clstr or cd-hit.test.txt file looks like this

 >Cluster 0
0   22431nt, wt1_800... *
>Cluster 1
0   21560nt, mock2_6320... *
>Cluster 2
0   2668nt, wt2_2859... at 99.06%
1   4470nt, wt2_5059... at 98.95%
2   17970nt, h-stress3_1002... *
>Cluster 3
0   17850nt, wt2_422... *
1   30nt, mock1_3879... at 100.00%
2   5548nt, c-stress2_3811... at 99.33%
>Cluster 4
0   17798nt, wt2_704... *
1   24nt, wt2_2505... at 100.00%
2   3064nt, h-stress2_3822... at 99.58%
3   3000nt, inf3_1050... at 99.67%
4   2727nt, c-stress2_2857... at 98.57%
>Cluster 5
0   4826nt, wt2_2557... at 99.11%
1   17143nt, h-stress1_643... *
>Cluster 6
0   16920nt, c-stress3_3308... *

but the script outputs

Cluster 0 has 1 sequences
Cluster 1 has 25423 sequences
Cluster 2 has 24537 sequences
Cluster 3 has 25804 sequences
Cluster 4 has 23779 sequences
Cluster 5 has 23693 sequences
Cluster 6 has 8353 sequences

Where I expect the output

Cluster 0 has 1 sequences
Cluster 1 has 1 sequences
Cluster 2 has 3 sequences
Cluster 3 has 3 sequences
Cluster 4 has 4 sequences
Cluster 5 has 2 sequences
Cluster 6 has 1 sequences

Can you please guide me how to get the expected output? Please excuse my childish language, I have 0 coding knowledge. It's actually my 1st comment on biostar. Your help would be a huge favor and highly appreciated. Thanks much, Shan

ADD REPLY
0
Entering edit mode

Hi @syed. I just saw your message. The above script has actually worked for me. Let me show you, how I did.

$ cat test.fasta.clstr
>Cluster 0
0   22431nt, wt1_800... *
>Cluster 1
0   21560nt, mock2_6320... *
>Cluster 2
0   2668nt, wt2_2859... at 99.06%
1   4470nt, wt2_5059... at 98.95%
2   17970nt, h-stress3_1002... *
>Cluster 3
0   17850nt, wt2_422... *
1   30nt, mock1_3879... at 100.00%
2   5548nt, c-stress2_3811... at 99.33%
>Cluster 4
0   17798nt, wt2_704... *
1   24nt, wt2_2505... at 100.00%
2   3064nt, h-stress2_3822... at 99.58%
3   3000nt, inf3_1050... at 99.67%
4   2727nt, c-stress2_2857... at 98.57%
>Cluster 5
0   4826nt, wt2_2557... at 99.11%
1   17143nt, h-stress1_643... *
>Cluster 6
0   16920nt, c-stress3_3308... *

$ cat test.pl
$TotalClusters=`grep -c '>Cluster' test.fasta.clstr`;
for($i=0;$i<$TotalClusters;$i++)
{
$j=$i+1;
$lines=`perl -e 'while (<>){print if (/^>Cluster $i\n/../^>Cluster $j\n/);}' test.fasta.clstr | wc -l`;
$linesExcludingPattern=$lines-2;
print "Cluster $i has $linesExcludingPattern sequences\n";
}

$ perl test.pl
Cluster 0 has 1 sequences
Cluster 1 has 1 sequences
Cluster 2 has 3 sequences
Cluster 3 has 3 sequences
Cluster 4 has 5 sequences
Cluster 5 has 2 sequences
Cluster 6 has 1 sequences

I just had to change the filenames. I did not change any code. Make sure you input is properly formatted as the example used above. Then, I guess the script should work.

ADD REPLY
0
Entering edit mode

Hi there,

This script does not work very well - wrong count. Particulary for large output..

ADD REPLY
2
Entering edit mode
3.4 years ago

In this code, when Cluster 2 is expected to be picked Cluster 20000 will be picked if there are 20000 number of clusters, if there are 27 clusters, Cluster 27 will be picked. In the examples it looks as if it worked fine because there are less than 10 Clusters in the examples.

add \n to Cluster $i and Cluster $j as follows and the script would work fine.

$lines=perl -e 'while (<>){print if (/^>Cluster $i\n/../^>Cluster $j\n/);}' clusters_0.5n2.clstr | wc -l;

ADD COMMENT
0
Entering edit mode

Thanks @bagdevi.mishra. updated!

ADD REPLY

Login before adding your answer.

Traffic: 2948 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6