Substring dereplication of protein sequences
4.0 years ago
smiller ▴ 70

I would like to dereplicate a 3 GB fasta file of amino acid sequences, including the removal of shorter sequences that are contained in longer ones (substring dereplication). The purpose is twofold: to build a smaller database against which to search peptide mass spectra, and to identify abundant sequences in the file.

So far, I have explored prefix dereplication in vsearch (slightly different from what I ideally want) and substring dereplication in usearch, but neither is satisfactory. Prefix dereplication in vsearch does not support protein sequences. Substring dereplication in usearch requires v5.2, and the freely available version of usearch 5.2 has a memory limit of 2 GB, which is insufficient for this file.

Does anyone know of a tool that will suit my needs? Thanks in advance.

dereplication proteomics • 1.5k views

I am not sure if the substring de-replication is part of it but you can take a look at CD-HIT for this purpose.


cd-hit-dup fails with the message

cd-hit-dup: cdhit-dup.cxx:193: int HashingDepth(int, int): Assertion `len >= min' failed.


This may be due to the fact that I have sequences as short as length 9. Fundamentally, this command is geared toward longer nucleotide sequences. It also does not do any form of substring dereplication.


User genomax's comment led me to exactly the solution I wanted. Instead of cd-hit-dup, use the cd-hit tool from the same CD-HIT package. It clusters sequences, including subsequences, and the sequence identity threshold can be set to 100%. The following command writes two files: a dereplicated fasta file and a file identifying the sequences in each cluster.

./cd-hit -i <input fasta> -o <output fasta> -c 1 -t 1 -d 0
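
To sanity-check the result on a small subset of the data, a naive substring dereplication can be sketched in Python as follows. This is only an illustration of what "substring dereplication" means, not how cd-hit works internally; the function name `substring_dereplicate` and the example records are made up for this sketch, and the quadratic comparison is far too slow for a 3 GB file.

```python
def substring_dereplicate(records):
    """Drop every sequence that is identical to, or contained in, a kept one.

    records: iterable of (header, sequence) pairs (hypothetical input shape).
    Returns the kept (header, sequence) pairs, longest sequence first.
    """
    # Sort longest first so a containing sequence is always seen before
    # any shorter sequence it might contain. sorted() is stable, so among
    # exact duplicates the first one in the input is the one kept.
    ordered = sorted(records, key=lambda rec: len(rec[1]), reverse=True)

    kept = []
    for header, seq in ordered:
        # Plain substring test against everything kept so far (O(n^2)).
        if not any(seq in kept_seq for _, kept_seq in kept):
            kept.append((header, seq))
    return kept


if __name__ == "__main__":
    example = [
        ("a", "MKTAYIAKQR"),
        ("b", "TAYIAK"),      # contained in "a"
        ("c", "MKTAYIAKQR"),  # exact duplicate of "a"
        ("d", "GGGSSS"),
    ]
    for header, seq in substring_dereplicate(example):
        print(f">{header}\n{seq}")
```

Running the same subset through cd-hit and comparing the surviving sequences against this sketch is one way to check that the chosen options behave as expected on short peptides.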
