Hi. I am trying to use CD-HIT to remove duplicates from a file that combines all the nucleotide FASTA output from Prodigal. I used the following parameters:
cd-hit-est -i nuc_sum.fa -o cd-hit_sum -c 0.95 -s 0.8 -M 0 -T 0 -n 8
The representative sequences show up in the output FASTA file, but there are many small or fragmented sequences with the same header ID.
Does anyone know how to set the cd-hit-est parameters so that there is only one sequence per header ID?
I also tried
cd-hit-est -i nuc_sum.fa -o cd-hit_sum -c 1 -t 1 -d 0
which someone recommended, but it did not solve the problem.

Please don't post data/code examples as images. Post actual data and format it using the 101 button in the editor. Images make it difficult for people to grab the example (not in this case) to provide suggestions.

Thanks very much for your suggestion. It is my first time posting a question. I will use the editor next time.

Are you certain that is the case? Perhaps there is more stuff beyond _1 than what we see in the image?

I am sure it is the case. I used bowtie2 to align the clean raw reads to an index built from this FASTA file, cd-hit_sum, and got the file cd-hit_sum.sam. But when I ran
samtools view -b -S samfile.sam -o bamfile.bam
it failed because of a duplicate entry in the SAM header. I then figured out that the reason must be that cd-hit_sum.fa has many sequences with the same header but different lengths.

Check your CD-HIT output FASTA file, and check whether you have more than one representative sequence for each of your headers. So something like
grep -oP "^>k[0-9]+_[0-9]+_[0-9]+" cd-hit_sum | sort | uniq -c
The information depicted in your screenshot here is the cluster output, and it just indicates which sequences belong to each cluster.
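
A stricter variant of this check is to count whole IDs rather than pattern matches, because a substring pattern can also match longer IDs (a pattern ending in `_1` also matches `_10`, for example). The following is only a sketch: the function name `count_duplicate_ids` is made up for illustration, and it assumes the ID is the header text after `>` up to the first whitespace.

```shell
# Count how often each full sequence ID occurs in a FASTA file and print
# only the IDs that occur more than once, as "<count> <id>".
# Assumption: the ID is the text after '>' up to the first whitespace.
count_duplicate_ids() {
    awk '/^>/ { id = substr($1, 2); n[id]++ }
         END  { for (id in n) if (n[id] > 1) print n[id], id }' "$1" | sort -k2,2
}

# Example call, using the output file name from this thread:
# count_duplicate_ids cd-hit_sum
```

If this prints nothing, every full ID in the file is unique, and any repeated matches from a shorter pattern come from IDs that merely share a prefix.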

Hi. Thanks for the suggestions. I tried
grep -oP 'k141_158430_1' cd-hit_sum | sort | uniq -c
and the output is 9 k141_158430_1.

So you have nine sequences that share the identifier k141_158430_1 (or at least that particular portion of it) and that weren't similar enough to be clustered together under the settings you called CD-HIT with. As @Mensur Dlakic mentioned, CD-HIT does not cluster on the basis of sequence identifiers but on the basis of the sequences themselves.

I am now rerunning cd-hit-est with the added options -G 0 -aS 0.9.
-aS: alignment coverage for the shorter sequence, default 0.0; if set to 0.9, the alignment must cover 90% of the sequence.
I hope it will help, but I am not sure whether this ensures that the output has no repeated headers for each gene or cluster.
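
To make the -aS rule quoted above concrete, here is a toy arithmetic check. It is not how CD-HIT is implemented; the helper name `meets_aS` and all the lengths are invented for illustration.

```shell
# Toy check of the -aS rule: the alignment must cover at least the given
# fraction of the shorter sequence. All numbers below are made up.
meets_aS() {
    # $1 = aligned length, $2 = length of the shorter sequence, $3 = -aS value
    awk -v aln="$1" -v len="$2" -v thr="$3" 'BEGIN { exit !(aln / len >= thr) }'
}

meets_aS 280 300 0.9 && echo "280 of 300 aligned: coverage requirement met"
meets_aS 150 300 0.9 || echo "150 of 300 aligned: coverage too low"
```

Note that -aS adds a requirement on top of the identity cutoff, so raising it makes clustering stricter. On its own it would not be expected to merge more sequences or to make header IDs unique.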

Could you perhaps elaborate on the provenance of your data? Why is it that you wish to have one representative sequence per identifier? If it is strictly one representative sequence per identifier that you want, with some clustering on top of that, I suggest you group your sequences by whatever happens to be the common identifier string, select the longest sequence in each group as the representative (or something to that effect), and follow that up with a light clustering step.
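
The grouping step suggested above could be sketched roughly as follows. This is a minimal sketch rather than a tested pipeline: the function name `longest_per_id` and the output file name in the example call are hypothetical, the "common identifier string" is assumed to be the header text after `>` up to the first whitespace, and wrapped sequence lines are joined into a single line on output.

```shell
# Keep only the longest record for each sequence ID in a FASTA file.
# Assumption: the ID is the header text after '>' up to the first whitespace.
longest_per_id() {
    awk '
        # Store the record just read if it is the longest seen so far for its ID.
        function keep_if_longest() {
            if (id != "" && length(seq) > best_len[id]) {
                best_len[id] = length(seq)
                best_hdr[id] = hdr
                best_seq[id] = seq
            }
        }
        /^>/ { keep_if_longest(); hdr = $0; id = substr($1, 2); seq = ""; next }
             { seq = seq $0 }   # join wrapped sequence lines
        END  {
            keep_if_longest()
            for (k in best_hdr) print best_hdr[k] "\n" best_seq[k]
        }
    ' "$1"
}

# Example call (nuc_sum.fa is from this thread; the output name is made up):
# longest_per_id nuc_sum.fa > nuc_sum.longest.fa
```

The records come out in no particular order. The deduplicated file could then be passed to cd-hit-est for the light clustering step, with whatever identity cutoff suits the data.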