Modify the code to take the most abundant read from each cluster and process it.
7 months ago
Mo ▴ 40

I have an awk script that processes a cd-hit-est cluster file. The script looks like this:

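#!/usr/bin/awk -f
# For each cluster header, read the next line (the first member) and process it.
/>Cluster/{
    getline
    a=$3    # header field, e.g. >35067_10_CCAATTCACTTGTCCCGCCCCC...
    b=$3
    gsub(/[.]/,"",a)         # strip the trailing dots from the header
    gsub(/[>0-9_.]/,"",b)    # keep only the nucleotide sequence
    print a "\n" b
}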

One of the clusters in the cluster file looks like this:

>Cluster 9      
0       22nt, >35067_10_CCAATTCACTTGTCCCGCCCCC... *    
1       21nt, >2636_236_CCACCACTTGTCCCGCCCCCC... at +/85.71%
2       19nt, >55159_6_CCGCACTTGTCCCGCCCCC... at +/84.21%
3       21nt, >42880_8_CCACTCACTTGTCCTGCCCCC... at +/85.71%
4       21nt, >7315_60_CACACCACTTGTCCCGCCCCC... at +/85.71%
5       19nt, >134546_2_TAATTCTCTGTCCGCCCCG... at +/84.21%
6       18nt, >27435_13_CCTACTTGTCCCGCCCCC... at +/83.33%

The code takes the first line from each cluster (in this case, 0 22nt, >35067_10_CCAATTCACTTGTCCCGCCCCC... *) and processes it.

I want to modify it to take the line with the highest number of reads from each cluster. The read count is the value just before the nucleotide sequence, between the two underscores. The line currently taken, >35067_10_CCAATTCACTTGTCCCGCCCCC..., has 10 reads, but I want the code to take the second line, which has 236 reads. The most abundant read can appear at any position within a cluster.

This would be a great help. Thanks a lot.

cd-hit-est clustering • 493 views

Probably there are people here for whom this is a trivial task, and maybe you will luck out and one of them will write a solution for you. To me this would not be a trivial task, and given that you seem to know enough about awk programming I think you should be putting more effort into it rather than asking for a complete solution.

I will give you an idea: if you split the header string using _ as a separator, the line with the largest second field is the one you want.
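As a rough sketch of that idea, and not a tested drop-in replacement: the awk program below reads every member line of a cluster, splits the third field on _, remembers the header with the largest read count, and applies the same two gsub clean-ups as the original script once the cluster ends. The shell wrapper name pick_most_abundant is made up for illustration. The sketch assumes every header has the form >ID_COUNT_SEQUENCE... as in the example cluster, and it keeps the first line when two read counts tie.

```shell
#!/bin/sh
# Sketch: for each cluster in a cd-hit-est cluster file, keep the member with
# the highest read count (second "_"-separated part of the header) and apply
# the same clean-up as the original script.
# pick_most_abundant is a hypothetical helper name, not part of cd-hit.
pick_most_abundant() {
    awk '
    # Print the best member of the cluster just finished, then reset.
    function emit(   a, b) {
        if (best == "") return
        a = best; b = best
        gsub(/[.]/, "", a)          # header without the trailing dots
        gsub(/[>0-9_.]/, "", b)     # bare nucleotide sequence
        print a "\n" b
        best = ""; max = -1
    }
    BEGIN { max = -1 }
    /^>Cluster/ { emit(); next }    # a new cluster starts
    NF >= 3 {
        split($3, part, "_")        # part[2] is the read count
        if (part[2] + 0 > max) { max = part[2] + 0; best = $3 }
    }
    END { emit() }                  # the last cluster has no following header
    ' "$@"
}
```

After defining the function, it can be called on a cluster file (the file name here is only a placeholder) as pick_most_abundant clusters.clstr, or fed through a pipe. Unlike the original, this version does not use getline, so it sees every member line rather than only the first one after each >Cluster header.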


Hi, thanks for the help. I am just a beginner, and the code was written by someone else; I am trying to modify it for my own use. I will try what you have suggested. Thank you so much.
