Question: Dereplication of FASTA file with VSEARCH
lvogel20 wrote:

Regarding sequence dereplication with vsearch, I have seen the following statement:

"During dereplication, strictly identical sequences are grouped and receive the name of the first sequence of the group."

Now, I'm not exactly an expert on hash tables, so how do I know which sequence is the first of the group? Is it the one that occurs first in the input FASTA file? If so, that would make things easy for me, because some of the sequences have important designations in their headers, which must not get lost, so that they show up in BLAST results. Or is it more complicated? I ask because I am creating a custom database composed of FASTA files originating from different sources.

vsearch • 914 views
modified 6 days ago by Biostar ♦♦ 20 • written 17 months ago by lvogel20

If you want to retain the descriptions in the headers (whether the sequences are duplicates or not), you will have to keep them. It sounds like you need to merge the headers of identical sequences, so that only one copy of each sequence is kept but with all of its headers?
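
A minimal sketch of that header-merging idea, done outside of vsearch in plain Python. The function names and the "|||" separator are made up for illustration, and the comparison is a simple case-insensitive string match (it does not treat T and U as equivalent), so it may not group sequences exactly the way a dedicated dereplication tool does:

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif line:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def merge_identical(records, sep="|||"):
    """Group records by exact sequence and join the headers of each group.

    Groups are kept in order of first appearance (dicts preserve insertion
    order), and headers inside a group are kept in input order.
    """
    groups = {}
    for header, seq in records:
        groups.setdefault(seq.upper(), []).append(header)
    return [(sep.join(headers), seq) for seq, headers in groups.items()]


# Hand-written toy input, not real data.
example = [
    ">important_ref species=X",
    "ACGT",
    ">other_1",
    "acgt",
    ">other_2",
    "GGCC",
]

for header, seq in merge_identical(read_fasta(example)):
    print(">" + header)
    print(seq)
```

Each output record carries every header of its group, so none of the special designations is dropped. Merged headers can get long, so it is worth checking how your downstream tools handle them.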

modified 17 months ago • written 17 months ago by genomax60k

Yes, that would be a solution. But if it is not (easily) possible to achieve, I can arrange the input FASTA so that all the headers I need to not lose are at the top.

written 17 months ago by lvogel20

all the headers I need to not lose are at the top

Are those headers for unique sequences, though? A deduplicating program is not going to pay attention to the headers, so you may still lose some of them.

I'm not sure how many duplicates there are, but perhaps you could just do the search first and then handle the duplicates in post-processing/parsing?

modified 17 months ago • written 17 months ago by genomax60k

The sequences with these special designations in the headers could be duplicates of other sequences that don't have them, so no, they're not necessarily for unique sequences. So then, how do I implement your original suggestion? (And if it's not possible with VSEARCH, please feel free to suggest another dereplicating program with which it is.)

written 17 months ago by lvogel20

perhaps you could just do the search first and then handle the duplicates in post-processing/parsing?

If you mean to deal with the duplicates after BLASTing, I'll consider it if I can't find an easier way. Some BLAST programs only keep the top hit.

written 17 months ago by lvogel20

Some BLAST programs only keep the top hit.

You should be able to keep as many as you want.

written 17 months ago by genomax60k
lelle780 wrote:

You can use the --uc filename option to get a detailed table that tells you which input sequences are represented by which output sequence. Of course, mapping that back to your BLAST results will take some work...
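
To illustrate, here is a rough Python sketch that turns such a table into a representative-to-duplicates mapping. It assumes the tab-separated, 10-column UCLUST-style layout (record type in column 1, query label in column 9, target label in column 10, with S lines for representatives and H lines for the duplicates merged into them). The function name and the example rows are hand-written, not real vsearch output, so check them against your own file:

```python
import io


def parse_uc(handle):
    """Map each representative label to the list of labels merged into it.

    Assumed layout: 10 tab-separated columns, record type in column 1,
    query label in column 9, target label in column 10.
    """
    members = {}
    for line in handle:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 10:
            continue  # skip blank or malformed lines
        rec_type, query, target = fields[0], fields[8], fields[9]
        if rec_type == "S":
            # A sequence that starts a new group; it may have no duplicates.
            members.setdefault(query, [])
        elif rec_type == "H":
            # A duplicate that was merged into an existing representative.
            members.setdefault(target, []).append(query)
        # Other record types (such as C summary lines) are ignored here.
    return members


# Hand-written example rows for illustration only.
example_uc = (
    "S\t0\t4\t*\t*\t*\t*\t*\timportant_ref\t*\n"
    "H\t0\t4\t100.0\t*\t0\t0\t*\tother_1\timportant_ref\n"
    "S\t1\t4\t*\t*\t*\t*\t*\tother_2\t*\n"
    "C\t0\t2\t*\t*\t*\t*\t*\timportant_ref\t*\n"
)

for representative, duplicates in parse_uc(io.StringIO(example_uc)).items():
    print(representative, "represents", duplicates)
```

With a mapping like this, a BLAST hit on a representative can be expanded to the labels of every sequence it stands for. The labels are whatever vsearch wrote to the table, which may be shorter than the full FASTA headers.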

EDIT: removed the suggestion to use --relabel_keep, because it actually has nothing to do with the problem.

modified 16 months ago • written 17 months ago by lelle780

Thanks. But the --relabel_keep option does not appear to be available in version 2.4.3; it's not letting me use it.

written 17 months ago by lvogel20

So now it allows "--relabel keep" (with a space instead of an underscore), but it doesn't actually keep any labels besides the first one. I'm still trying to figure out how to get it to do what I want.

written 16 months ago by lvogel20

The --relabel_keep option is working fine for me in version 2.4.3. "--relabel keep" should rename all your sequences to keep1, keep2, keep3 and so on.

Anyway, I think I completely misinterpreted your original question. I don't know why I was thinking you were using hashes. Sorry.

If you just run "vsearch --derep_fulllength", the sequences should not be renamed. But you will only have one name in each header (as you described) and will not know which sequences it represents. If you want to know which input sequences each sequence in your output represents, you will have to use the file you get from the --uc option (as far as I know).

written 16 months ago by lelle780

Thanks for the reply. You're right on all counts, as far as I now know, too. Before, I was confused as well, and using the wrong version. I'll keep putting all the sequences with headers I want to keep at the top of the FASTA file, and I'll be able to tell from the UC table if it's not using the names I expected. I'll accept your answer now. :)
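
A quick way to run that check, sketched in Python. It assumes the UC table is tab-separated with 10 columns, and that H lines carry the label of a merged sequence in column 9 and the label of its representative in column 10. The function name and all labels below are hypothetical:

```python
def absorbed_labels(uc_lines, important):
    """Return {label: representative} for important labels that were merged
    into another sequence instead of being kept as the representative."""
    lost = {}
    for line in uc_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= 10 and fields[0] == "H" and fields[8] in important:
            lost[fields[8]] = fields[9]
    return lost


# Hand-written example rows, not real vsearch output.
example_uc = [
    "S\t0\t4\t*\t*\t*\t*\t*\timportant_ref\t*",
    "H\t0\t4\t100.0\t*\t0\t0\t*\tother_1\timportant_ref",
    "S\t1\t4\t*\t*\t*\t*\t*\tother_3\t*",
    "H\t1\t4\t100.0\t*\t0\t0\t*\timportant_2\tother_3",
]

print(absorbed_labels(example_uc, {"important_ref", "important_2"}))
# {'important_2': 'other_3'}
```

An empty result would mean that every important label ended up as a representative. Anything listed was replaced by the name shown next to it and needs to be moved earlier in the input or handled separately.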

written 16 months ago by lvogel20