Question: How to remove the same sequences in the FASTA files?
 
10
 
 

Some FASTA files (e.g., ESTs) have sequences with different IDs that nonetheless have the same sequence. I want to remove duplicate sequences based on the nucleotide sequence, rather than ID.HOW to do?Thankyou to all!

 
 
 

What is your understanding of "the same" are your including approximate stringmatching?

log in to reply • written 12 weeks ago by Peri4N  9419
 

In fact you might not want to remove all the duplicated sequences but collapse them into a single sequence. I guess you meant to say it like this but your question is not clearly stating it.

log in to reply • written 12 weeks ago by Michael Dondrup ♦♦ 14601826

12 answers

 
24
 
 

you can do this with fastx-toolkit

usage is like:

fastx_collapser < some.fasta > some.unique.fasta
 
 
 

always love toolkits

log in to reply • written 19 months ago by Will  334316
 
 
6
 
 

There are quite a lot of utilities that will do this; usually they are found as part of larger software packages (often aimed at motif discovery). For example, RSA-tools contains the utility purge-sequence, MEME contains purge. So those are some terms for your web search.

A roll-your own solution is quite easy if you store the sequences as hash keys (which have to be unique). For example using Bioperl SeqIO you could try something like:

use strict;
use Bio::SeqIO;
my %unique;

my $file   = "myseqs.fa";
my $seqio  = Bio::SeqIO->new(-file => $file, -format => "fasta");
my $outseq = Bio::SeqIO->new(-file => ">$file.uniq", -format => "fasta");

while(my $seqs = $seqio->next_seq) {
  my $id  = $seqs->display_id;
  my $seq = $seqs->seq;
  unless(exists($unique{$seq})) {
    $outseq->write_seq($seqs);
    $unique{$seq} +=1;
  }
}

This will write out a new FASTA file, myseqs.fa.uniq, with only unique sequences (but no record of the other IDs with that sequence).

 
 
 
 
5
 
 

Depends on you application, but you may also try to filter out entries with the out same sequence but i.e. just 1bp shorter.

uclust --sort seqs.fasta --output seqs_sorted.fasta
uclust --input seqs_sorted.fasta --uc results.uc --id 1.00
uclust --uc2fasta results.uc --input seqs.fasta --output results.fasta --types S

This should give you results.fasta purged from identical but equal in size / shorter sequences.

 
 
 

good to know! uclust looks to be a useful tool.

log in to reply • written 18 months ago by brentp  12151135
 
 
4
 
 

Encode the sequences in hash using SHA-1 or MD5, and then check for collision. If you are familiar with Python, here is a useful recipe to start.

 
 
 
 
4
 
 

According to SeqAnswers: WU-blast, EBI-Exonerate, and bioperl all have stand-alone programs to make an "nrdb" = non-redundant database of sequences.

Note: WU-blast is now being distibuted commercially as AB-Blast - get a free personal license. Older archived versions can be found here.

 
 
 

Strictly the nrdb program generates a non-identical database. The source for nrdb can be found in http://blast.advbiocomp.com/pub/nrdb/, no license required. This was a newer version than that bundled in WU-BLAST, no idea if this is still the case for AB-BLAST. Originally nrdb was used to produce NCBI's 'nr' and 'nt' databases. Of course 'nt' is no longer non-identical and 'nr' is produced using a slightly different process today, but nrdb is still a popular way to generate non-identical databases for use with BLAST.

log in to reply • written 12 weeks ago by Hamish  6416
 
 
3
 
 

linearize the sequences: surround the fasta headers by '@' and '#' , remove the CR, replace '#' by CR and '@' by 't'

sort this tab delimited file on the second column (the sequence) , with case-insensible option, only the uniq columns

restore the fasta header and sequence

sed -e '/^>/s/$/@/' -e 's/^>/#/' file.fasta  |\
tr -d '\n' | tr "#" "\n" | tr "@" "\t" |\
sort -u -t '  ' -f -k 2,2  |\
sed -e 's/^/>/' -e 's/\t/\n/'
 
 
 
 
2
 
 

Here's a python script that should be able to do it:

from itertools import groupby

if __name__ == '__main__':

    ishead = lambda x: x.startswith('>')
    all_seqs = set()
    with open(inname) as handle:
        with open(oname, 'w') as outhandle:
            head = None
            for h, lines in groupby(handle, ishead):
                if h:
                    head = lines.next()
                else:
                    seq = ''.join(lines)
                    if seq not in all_seqs:
                        all_seqs.add(seq)
                        outhandle.write('%s\n%s\n' % (head, seq))
 
 
 

Does this script only collapse sequences that are completely overlapping or will it collapse sequences where there is only partial overlap but the overlapping region is redundant?

log in to reply • written 9 weeks ago by Rduncan  514
 

For this one it only checks for complete identity ... the if seq not in all_seqs only does at membership test for the whole sequence. However, with a little work you could probably modify this to look at sequence regions.

log in to reply • written 7 weeks ago by Will  334316
 
 
2
 
 

Already nice answers:

I would recommend to try the CD-HIT version designed for EST / nucleotides (CD-HIT-EST and CD-HIT-EST-2D) for this purpose.

 
 
 
 
2
 
 

I have to say that the remove of duplicate sequences is not at al trivial, especialy when you consider real data with sequencing/alignment errors. Also problematic is that sequences are identical, but do not sully overlap. I can say with certainty that CD-HIT-EST will filter some duplicates, but far from all.

I am working on a pipeline that uses BLAST, filtering out those contigs that have significant similarity to more that just themselves.

 
 
 
 
2
 
 

uclust is about the fastest and one of the most flexible. If your sequences are not exactly the same length, or you also want to cluster some sequences that are almost, but not exactly, the same then look carefully at the flexibility and options of the thing you choose.

 
 
 
 
2
 
 

Works with big sets, aa/prot seq files (at the opposite to fastx-toolkit), very fast and simple to use: genometools:

gt sequniq -o out.fasta in.fasta
 
 
 
 
0
 
 

for smaller sets try this:

perl -ne 'BEGIN{$/=">";$"=";"}($d,$_)=/(.*?)\n(.+?)>?$/s;push @{$h{lc()}},$d if$_;END{for(keys%h){print">@{$h{$_}}$_"}}' multi.seq.fasta
 
 
 

and so readable too. ;) now which has "... all the visual appeal of oatmeal with fingernail clippings mixed in." ?? http://en.wikiquote.org/wiki/Larry_Wall

log in to reply • written 18 months ago by brentp  12151135
 
Log in to add a post