Sorting out CD HiT output
2
0
Entering edit mode
2.7 years ago
gbl1 ▴ 80

Hi,

I have a CD-hit output looking like this:

## 73 MIXED Sb-40 4 6 66.66666666666667 cluster_stats(length_max=319, length_min=205, length_mean=296.6666666666667, length_variance=2042.2666666666667, length_stdev=45.19144461805428, length_members_max=317, length_members_min=205, length_members_mean=292.2, length_members_variance=2403.2000000000003, length_members_stdev=49.0224438395313, ident_perc_max=97.48, ident_perc_min=85.85, ident_perc_mean=93.986, ident_perc_variance=22.420530000000017, ident_perc_stdev=4.735032206859845, counter=Counter({'Sb-40': 4, 'Sj-A': 2}))

73 0 319 Sj-A_M02764:115:000000000-C3GKK:1:1118:4143:8248 1

73 1 317 Sj-A_M02764:115:000000000-C3GKK:1:2107:8743:9281 1 317 5 319 + 96.85 0

73 3 317 Sb-40_M02764:115:000000000-C3GKK:1:2104:16139:22698 1 317 5 319 + 97.48 0

73 5 317 Sb-40_M02764:115:000000000-C3GKK:1:2115:7096:7098 1 317 5 319 + 94.01 0

73 2 305 Sb-40_M02764:115:000000000-C3GKK:1:1113:14798:13772 1 305 1 319 + 95.74 0

73 4 205 Sb-40_M02764:115:000000000-C3GKK:1:2106:18903:18118 1 205 1 217 + 85.85 0

SJ-A and Sb-40 are two samples, and I want to know how many fragment of each sample I have in each cluster. For exemple for this "73" cluster

# Sj-A Sb-40

73 2 4

How to proceed?

Thanks in advance

cd-hit • 1.1k views
ADD COMMENT
0
Entering edit mode
2.7 years ago
5heikki 10.0k

cd-hit ships with many helper scripts, e.g. clusters2txt (or something like that) turns a cluster file into a far more parsing friendly format..

ADD COMMENT
0
Entering edit mode
2.7 years ago
ahaswer ▴ 150

Assuming that the first line of file is the comment line you can simply use awk command:

awk -F'[ _]' 'NF>0 && NR>1 {print $1" "$4}' cdhit.txt | sort | uniq -c | awk '{print $2, $3, $1}' > counts.txt

I'm not sure if your file contains any blank lines. Running the command you will obtain file with structure like this (with counts in the last column):

73 Sb-40 4
73 Sj-A 2

Edit2: assuming multiple comment-like lines; remove multiple whitespaces.

tr -s " " < cdhit.txt | sed '/^#/ d' | awk -F'[ _]' 'NF>0 {print $1" "$4}' | sort | uniq -c | awk '{print $2, $3, $1}' > counts.txt
ADD COMMENT
0
Entering edit mode

Hmmm… I tried and:

## Sb-40 290

## Sj-A 60

0 0 1

0 1 1

0 10 1

0 100 1

0 1000 1

0 1001 1

0 1002 1

0 1003 1

0 1004 1

0 1005 1

etc…

ADD REPLY
0
Entering edit mode

Can you provide the file in the code window? I'm not sure about your field separators (specify them). It seems like single whitespaces.

ADD REPLY
0
Entering edit mode

input file:

## 0 MIXED Sj-A 4267 7229 59.026144694978555 cluster_stats(length_max=587, length_min=63, length_mean=100.38373218979112, length_variance=962.8900839658992, length_stdev=31.030470250479596, length_members_max=582, length_members_min=63, length_members_mean=100.31640841173216, length_members_variance=930.2534072177071, length_members_stdev=30.50005585597684, ident_perc_max=98.96, ident_perc_min=80.42, ident_perc_mean=94.99588682899834, ident_perc_variance=5.127216634208676, ident_perc_stdev=2.264335804205877, counter=Counter({'Sj-A': 4267, 'Sb-40': 2962}))
0   5052    587 Sb-40_M02764:115:000000000-C3GKK:1:1109:23652:14682                         1
0   6391    582 Sb-40_M02764:115:000000000-C3GKK:1:2110:25762:12872 1   582 1   587 +   80.76   0
0   270 290 Sj-A_M02764:115:000000000-C3GKK:1:1102:19199:23933  1   290 303 587 +   82.76   0
0   5426    290 Sb-40_M02764:115:000000000-C3GKK:1:1113:15242:21658 1   290 303 587 +   82.41   0
0   5533    290 Sb-40_M02764:115:000000000-C3GKK:1:1117:26788:14121 1   290 303 587 +   83.45   0
0   5564    290 Sb-40_M02764:115:000000000-C3GKK:1:1117:13232:21408 1   290 303 587 +   82.76   0
0   1453    288 Sj-A_M02764:115:000000000-C3GKK:1:1113:3739:15200   1   288 303 587 +   85.76   0
0   2839    288 Sj-A_M02764:115:000000000-C3GKK:1:2106:9831:10273   1   288 303 587 +   86.11   0
0   2895    288 Sj-A_M02764:115:000000000-C3GKK:1:2110:13364:5014   1   288 303 587 +   84.72   0
0   4750    288 Sb-40_M02764:115:000000000-C3GKK:1:1106:28205:8430  1   288 303 587 +   86.11   0
0   6472    288 Sb-40_M02764:115:000000000-C3GKK:1:2111:8362:20071  1   288 303 587 +   85.76   0
0   6938    288 Sb-40_M02764:115:000000000-C3GKK:1:2117:7499:19237  1   288 303 587 +   85.07   0
0   7149    288 Sb-40_M02764:115:000000000-C3GKK:1:2117:8321:6441   1   288 303 587 +   80.9    0
0   82  287 Sj-A_M02764:115:000000000-C3GKK:1:1101:17157:13826  1   287 303 587 +   84.67   0
0   290 287 Sj-A_M02764:115:000000000-C3GKK:1:1103:7587:4546    1   287 303 587 +   86.76   0    
0   740 287 Sj-A_M02764:115:000000000-C3GKK:1:1106:27250:11889  1   287 302 587 +   82.93   0

Output:

## Sb-40 290
## Sj-A 60
0 0 1
0 1 1
0 10 1
0 100 1
0 1000 1
0 1001 1
0 1002 1
ADD REPLY
0
Entering edit mode

Try this:

awk 'BEGIN{FS=" "}{if(/^[1-9]/){print $1,$4}}' file.txt \
    | awk 'BEGIN{FS="_"}{print $1}' \
    | sort \
    | uniq -c \
    | sed 's/ \+//' \
    | awk 'BEGIN{FS=" "}{print $2,$3,$1}'
ADD REPLY
0
Entering edit mode

Quiet ok

1 Sb-40 1
1 Sj-A 1
10 Sb-40 1
100 Sj-A 1
101 Sb-40 1
102 Sj-A 1
103 Sb-40 1
104 Sj-A 1
105 Sb-40 1
106 Sb-40 1
107 Sb-40 4
107 Sj-A 1

I'll sort that out in a spreadsheet … thanks

ADD REPLY
0
Entering edit mode

Issue: it forgets the 0 values…

1   Sj-A    1   1   Sb-40   1
2   Sj-A    2   2   Sb-40   7
7   Sj-A    240 3   Sb-40   1
9   Sj-A    30443   4   Sb-40   1
17  Sj-A    2   5   Sb-40   1
19  Sj-A    1   6   Sb-40   4
23  Sj-A    1   7   Sb-40   363
ADD REPLY
0
Entering edit mode

Replace [1-9] with [0-9] in awk command provided by 5heikki.

ADD REPLY
0
Entering edit mode

It still omit to say when there is 0

ADD REPLY
0
Entering edit mode

Basically, the final output I need is something like:

    id  Sj-A    Sb-40
    0   4267    2962
    1   1   1
    2   2   7
    3   0   1
    4   0   1
    5   0   1
    6   0   4
    7   240 363
    8   0   1
    9   30443   20499
    10  0   1

I need to know when there is 0 for one plant in a cluster

ADD REPLY
0
Entering edit mode

I recommend you take this to stackoverflow (be sure to describe your problem well)

ADD REPLY

Login before adding your answer.

Traffic: 2055 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6