How to parse cd-hit output (given in the description) to create a mapping file of clusters and representative sequence names?
2
1
Entering edit mode
3.0 years ago
mrj ▴ 180

cd-hit has given me an output that looks just like the example below.

>Cluster 0
0   496aa, >SRR5892231.2396932... *
>Cluster 1
0   496aa, >SRR5892231.3763255... *
1   390aa, >SRR5892231.1558909... at 91.03%
>Cluster 2
0   496aa, >SRR5892231.1710795... *
>Cluster 3
0   496aa, >SRR5892231.2083014... *
1   464aa, >SRR5892231.14158... at 91.59%
2   423aa, >SRR5892231.1116524... at 94.56%
3   314aa, >SRR5892231.1717279... at 95.86%
4   268aa, >SRR5892231.2309241... at 99.63%
5   371aa, >SRR5892231.480233... at 99.46%
>Cluster 4
0   496aa, >SRR5892231.3954388... *
1   319aa, >SRR5892231.1752373... at 99.69%
>Cluster 5
0   496aa, >SRR5892231.14746... *
>Cluster 6
0   496aa, >SRR5892231.2340653... *
1   407aa, >SRR5892231.2608197... at 100.00%
2   340aa, >SRR5892231.1216749... at 100.00%
3   345aa, >SRR5892231.3205930... at 92.46%

Each line that starts with a > shows the cluster label. The next line that ends with a ... * shows the representative sequence for the particular cluster.

How do I create a csv file containing the cluster label and its representative sequence?

(note: a cluster that appeared once in this file can re-appear somewhere in the middle of the file even though it is not shown in this case)

Thank you in advance for the help.

clusters sequences mapping parse cd-hit • 5.1k views
ADD COMMENT
2
Entering edit mode
$ awk -v OFS="," '/^>/ {getline seq; print $0, seq}' test.txt | awk -v OFS="," -F ">|,|\.\." '{print $2,$5}'


Cluster 0,SRR5892231.2396932
Cluster 1,SRR5892231.3763255
Cluster 2,SRR5892231.1710795
Cluster 3,SRR5892231.2083014
Cluster 4,SRR5892231.3954388
Cluster 5,SRR5892231.14746
Cluster 6,SRR5892231.2340653

You can try using scripts provided by CD-HIT devs: https://github.com/weizhongli/cdhit/blob/master/clstr2txt.pl and https://github.com/weizhongli/cdhit/blob/master/clstr_select_rep.pl.

ADD REPLY
0
Entering edit mode

Thanks for this solution. This solution works if the representative sequence is exactly the next line after line starting with >Cluster. I noticed that some lines may be present as the third or fifth etc line. Therefore, it is better if we capture the line ending with *. I created my own solution (posted here) which does this using python.

If you can modify your solution using awk, that would be great.

ADD REPLY
2
Entering edit mode
3.0 years ago
mrj ▴ 180

I created my own answer for this question.

# this program takes in cltr file and create a mapping file
#cluster,representative sequence name

import re
# Using readlines()
file1 = open('sub_SRR5892231_90.clstr', 'r')
file2 = open('map.sub_SRR5892231_90.clstr.csv', 'w')
Lines = file1.readlines()

# since I am using a while loop I need the following index
LineIndex = 0
while LineIndex < len(Lines):
    # check if >Cluster lines are there
    FF = re.search(r'^(>Cluster.*)', Lines[LineIndex])
    # if the lines re found, then prep it for first column
    if FF != None:
        #print(FF.group())
        firstcol = FF.group().replace('>','')
        file2.write(firstcol+',')
        print(firstcol, end =",")
        LineIndex = LineIndex + 1
        while LineIndex < len(Lines):
            KK = re.search(r'^(.*\*)', Lines[LineIndex])
            if KK != None:

                seccol = KK.group()
                seccol = re.sub(r'.*>','',seccol)
                seccol = re.sub(r'\.\.\. \*','',seccol)
                print(seccol)
                file2.write(seccol+'\n')
                break
            LineIndex = LineIndex + 1
    LineIndex = LineIndex + 1

file1.close()
file2.close()
ADD COMMENT
1
Entering edit mode
3.0 years ago
Mensur Dlakic ★ 28k

There is a set of tools by Joe that may help you here.

https://github.com/jrjhealey/bioinfo-tools

Look for ParseCDHIT.py. It doesn't do exactly what you want, but close enough. You should be able to modify the script with minimal Python knowledge.

ADD COMMENT
0
Entering edit mode

Thank you for these resources.

ADD REPLY

Login before adding your answer.

Traffic: 2024 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6