CD-HIT for removing sequence redundancy
7.0 years ago

Hi all

I want to use CD-HIT to remove redundancy from a file of miRNA sequences collected from miRBase. What command line should I use for that?

Thanks

sequence • 2.5k views

What have you tried?


If you only want to deduplicate the sequences then dedupe.sh from BBMap may be much simpler to use.
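If exact duplicates are all that needs to go, plain awk may already be enough. This is not dedupe.sh itself, only a minimal sketch of the same idea, assuming each record is one header line followed by one sequence line; the function name dedup_fasta and the file names are made up for illustration.

```shell
# dedup_fasta: hypothetical helper, keeps the first record for each distinct sequence.
# Assumes two-line FASTA records (one header line, then one sequence line).
dedup_fasta() {
    awk '
        /^>/ { header = $0; next }      # remember the header of the current record
        {
            seq = toupper($0)           # compare sequences case-insensitively
            if (!(seq in seen)) {       # first time this sequence appears
                seen[seq] = 1
                print header
                print $0
            }
        }
    ' "$1"
}

# Usage (file names are placeholders):
# dedup_fasta mature.fa > mature_unique.fa
```

This only catches identical sequences; near-identical ones still need a clustering tool.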

7.0 years ago
stolarek.ir ▴ 700
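# Convert FASTQ to FASTA: join each 4-line record, turn the leading '@' into '>', keep header and sequence
paste - - - - < all_mapped.fastq | sed 's/^@/>/' | cut -f1-2 | tr '\t' '\n' > file_out
# Cluster with an 8000 MB memory limit and 3 threads (for nucleotide input such as miRNA, the CD-HIT suite provides cd-hit-est)
time ./cd-hit -i file_out -o otput_cd_hit -M 8000 -T 3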

Then examine each cluster file with something like:

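for i in *.clstr; do
    # label each output line with the third '_'-separated field of the file name
    echo -n "$(echo "$i" | cut -f 3 -d '_') "
    # count how often each first-column value occurs, then summarise the counts
    cut -f 1 "$i" | sort | uniq -c | awk '{ val += $1; count += 1; if ($1 == 1) sing += 1 }
        END { printf("cov: %.2f\tsingletons: %d\tuniq: %d\ttotal: %d\n", val/count, sing, count, val) }'
done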
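
If what you are after is the number of members per cluster, the .clstr file can also be walked directly: every cluster starts with a ">Cluster N" line and each following line is one member. A rough sketch, assuming that standard layout; clstr_sizes is a made-up helper name, not part of CD-HIT.

```shell
# clstr_sizes: hypothetical helper, prints "<cluster id><TAB><member count>" per cluster.
clstr_sizes() {
    awk '
        /^>Cluster/ {                       # a new cluster starts here
            if (id != "") print id "\t" n   # flush the previous cluster
            id = $2; n = 0
            next
        }
        NF { n++ }                          # every other non-empty line is one member
        END { if (id != "") print id "\t" n }
    ' "$1"
}

# Usage: list cluster sizes, then count the singleton clusters
# clstr_sizes otput_cd_hit.clstr
# clstr_sizes otput_cd_hit.clstr | awk '$2 == 1' | wc -l
```

Clusters with a single member are sequences that had no redundant partner at the chosen identity threshold.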