duplicate gene IDs with different length after cd-hit-est dereplicate and cluster
2
0
Entering edit mode
5 months ago
phxu6780 • 0

Hi. I am trying to use CD-HIT to remove duplicates from a file that is the concatenation of all the nucleic acid FASTA outputs from Prodigal. I used the following parameters:

cd-hit-est -i nuc_sum.fa -o cd-hit_sum -c 0.95 -s 0.8 -M 0 -T 0 -n 8

The representative sequences show up in the FASTA file, but there are many small or fragmented sequences with the same header ID. Does anyone know how to set the parameters in cd-hit-est so that there will be only one sequence per header ID? I tried cd-hit-est -i nuc_sum.fa -o cd-hit_sum -c 1 -t 1 -d 0, which someone recommended, but it does not solve the problem.

[image: screenshot of the CD-HIT .clstr cluster output]

Tags: replicate, duplicate, cd-hit, cluster
ADD COMMENT
0
Entering edit mode

Please don't post data/code examples as images. Post the actual data and format it using the 101 button in the editor. Images make it difficult for people to grab the example (not in this case) to provide suggestions.

ADD REPLY
0
Entering edit mode

Thanks very much for your suggestion. This is my first time posting a question; I will use the editor next time.

ADD REPLY
0
Entering edit mode

But there are many small or fragmented sequence with the same header ID

Are you certain that is the case? Perhaps there is more to the header beyond _1 than what we can see in the image?

ADD REPLY
0
Entering edit mode

I am sure it is the case. I used bowtie2 to align clean reads against an index built from this FASTA file (cd-hit_sum) and got cd-hit_sum.sam. But when I ran samtools view -b -S samfile.sam -o bamfile.bam, it failed because of duplicate entries in the SAM header. I then figured out the reason: cd-hit_sum.fa has many sequences with the same header but different lengths.

ADD REPLY
0
Entering edit mode

Check your CD-HIT output FASTA file to see whether you have more than one representative sequence per header. Something like: grep -oP "^>k[0-9]+_[0-9]+_[0-9]+" cd-hit_sum | sort | uniq -c.

The information depicted in your screenshot here is the cluster output, and it just indicates what sequences belong to each cluster.
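As a self-contained illustration (toy sequences invented here; the header pattern mirrors the one in the screenshot), the same check can be narrowed to print only the IDs that occur more than once:

```shell
# Build a toy FASTA with one repeated header (made-up data).
cat > toy.fa <<'EOF'
>k141_158430_1
ATGC
>k141_158430_1
ATGCATGC
>k141_000001_1
GGGG
EOF

# Count each header; keep only those seen more than once.
grep -oP '^>k[0-9]+_[0-9]+_[0-9]+' toy.fa | sort | uniq -c | awk '$1 > 1 {print $2, $1}'
# prints: >k141_158430_1 2
```

Any header that shows up in this output will trigger the duplicate-entry error in the SAM header later on.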

ADD REPLY
0
Entering edit mode

Hi. Thanks for the suggestion. I tried grep -oP 'k141_158430_1' cd-hit_sum | sort | uniq -c, and the output is 9 k141_158430_1.

ADD REPLY
1
Entering edit mode

So you have nine sequences that share the identifier k141_158430_1 (or at least that particular portion of it) that aren't/weren't similar enough to be clustered away under the settings you called CD-HIT with. As @Mensur Dlakic mentioned, CD-HIT does not cluster on the basis of sequence identifiers but on the basis of the sequences themselves.

ADD REPLY
0
Entering edit mode

Now I am running cd-hit-est again with -G 0 -aS 0.9 added:

cd-hit-est -i nuc_sum.fa -o cd-hit_G -c 0.95 -G 0 -aS 0.9 -M 0 -T 64 -n 9 -r 1 -t 1

-aS: alignment coverage for the shorter sequence (default 0.0). If set to 0.9, the alignment must cover 90% of the shorter sequence.

Hope it will help, but I am not sure whether this guarantees that the output will have only one sequence per header or cluster.

ADD REPLY
0
Entering edit mode

Could you perhaps elaborate on the provenance of your data? Why is it that you want one representative sequence per identifier? If one representative sequence per identifier is what you strictly want, with some clustering on top of that, I suggest you group your sequences by the common identifier string, select the longest sequence as the representative (or something to that effect), and follow that up with a light clustering step.

ADD REPLY
3
Entering edit mode
5 months ago
Mensur Dlakic ★ 14k

CD-HIT removes sequence redundancy, not ID redundancy. If there are repeated IDs after clustering, it is because the sequences behind those IDs are actually different, at least under the clustering parameters you used.

First, I suggest you use the -d 0 switch, which will not truncate FASTA header descriptions. That way you can be sure the names are truly identical. If multiple identical names still persist, I don't think you will get CD-HIT to fix that, because it doesn't look at IDs at all. If you start with all unique names before clustering, you will not have this problem.
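A minimal sketch of making headers unique beforehand with awk (toy input invented here; it appends a counter to repeated IDs, similar to what seqkit's rename command does):

```shell
# Toy FASTA with a duplicated header (made-up data).
cat > dup.fa <<'EOF'
>seq1
ATGC
>seq1
GCTC
EOF

# Append "_2", "_3", ... to the second and later occurrences of each header.
awk '/^>/ {c[$0]++; if (c[$0] > 1) {print $0 "_" c[$0]; next}} {print}' dup.fa
# prints: >seq1, ATGC, >seq1_2, GCTC (one per line)
```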

ADD COMMENT
0
Entering edit mode

Thanks very much. I will get all unique names first then.

ADD REPLY
0
Entering edit mode

Do you have any suggestions to get unique names before clustering?

ADD REPLY
1
Entering edit mode

Use seqkit rename -n. It works the following way:

Example input fasta:

$ cat file.fa         

>seq1
atgc
>seq1
gctc

output:

$ seqkit rename file.fa      

>seq1
atgc
>seq1_2 seq1
gctc
ADD REPLY
0
Entering edit mode

Great. If possible, I would like to keep all duplicate sequences. I will try this suggestion first and see how the output behaves in the downstream analysis.

ADD REPLY
1
Entering edit mode
5 months ago

There are two other ways to handle it:

  1. Use seqkit rmdup -n. This removes sequences with identical names. However, I don't know which copy it keeps when it finds identical IDs (the first encountered or the one with the longest sequence). Then feed the resulting FASTA to CD-HIT.

  2. Use awk or seqkit to convert the FASTA to a tab file, group by column 1 (ID), pick the longest entry in column 2 from each group, then convert back to FASTA. Then feed that FASTA to CD-HIT.
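If the FASTA happens to be single-line already, option 2 reduces to a short awk sketch (toy data invented here; note that the order of the groups in the output is not guaranteed):

```shell
# Toy single-line FASTA with a duplicated ID (made-up data).
cat > toy2.fa <<'EOF'
>seq1
ATGC
>seq1
ATGCATGCATGC
>seq2
GGGG
EOF

# Keep the longest sequence seen for each header line.
awk '/^>/ {id = $0; next}
     length($0) > length(best[id]) {best[id] = $0}
     END {for (id in best) print id "\n" best[id]}' toy2.fa > longest.fa
```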

ADD COMMENT
0
Entering edit mode

That sounds like a promising way to get unique IDs before CD-HIT. Do you know the exact awk or seqkit command for this?

ADD REPLY
1
Entering edit mode

for seqkit:

$ seqkit rmdup  -ni <input.fa> -o <output.fa>

If you want to save duplicated sequences in a separate file:

$ seqkit rmdup -ni -o <new.fa> -d <duplicated.fa> -D <duplicated_details.txt>

You need to use new.fa for further analysis. duplicated.fa will contain the duplicated sequences; this is for your reference only. Note that seqkit retains the first encountered sequence, not the longest one.

I will post the awk code after some time.

ADD REPLY
0
Entering edit mode

Awesome, thanks very much. I would like to get the longest one. Looking forward to your post.

ADD REPLY
1
Entering edit mode

with datamash and awk (datamash is available in most Debian/*buntu, brew, and conda repos):

input data

$ cat test.fa                                                                                                                                                      
>seq1
atgc
catg
>seq1
gtac
cagtgacg
>seq1
atgcgctgacgacgagcg
>seq11
atgc
>seq11
gatc

output:

$ awk  '{if(NR==1) {print $0} else {if($0 ~ /^>/) {print "\n"$0} else {printf $0}}} END {print "\r"}' test.fa | awk -v OFS="\t" -v RS=">" 'NR>1 {print $1,$2, length($2)}' | datamash -sg1 max 3 -f | awk -v OFS="\n" '{print ">"$1,$2}'

>seq1
atgcgctgacgacgagcg
>seq11
gatc

with awk only (not tested):

$ awk  '{if(NR==1) {print $0} else {if($0 ~ /^>/) {print "\n"$0} else {printf $0}}} END {print "\r"}' test.fa| awk -v OFS="\n" -v RS=">" ' NR>1 {if (length(a[$1]) < length($2))a[$1]=$2;}END{for(i in a){print ">"i,a[i];}}'

If the sequences in your FASTA files are already on single lines, you can drop the first awk command (the '{if(NR==1) {print $0} else {if($0 ~ /^>/) {print "\n"$0} else {printf $0}}} END {print "\r"}' part) from the pipeline; its only purpose is to linearize the FASTA file.

ADD REPLY
