Question: How To Remove The Same Sequences In The Fasta Files?
26
gravatar for Zhangleisdau
7.1 years ago by
Zhangleisdau260
Zhangleisdau260 wrote:

Some FASTA files (e.g., ESTs) have sequences with different IDs that nonetheless have the same sequence. I want to remove duplicate sequences based on the nucleotide sequence, rather than ID.HOW to do?Thankyou to all!

fasta duplicates sequence • 40k views
ADD COMMENTlink modified 9 months ago by Eslam Samir90 • written 7.1 years ago by Zhangleisdau260

What is your understanding of "the same" are your including approximate stringmatching?

ADD REPLYlink written 5.8 years ago by Fabian Bull1.2k

In fact you might not want to remove all the duplicated sequences but collapse them into a single sequence. I guess you meant to say it like this but your question is not clearly stating it.

ADD REPLYlink written 5.8 years ago by Michael Dondrup43k
38
gravatar for brentp
7.1 years ago by
brentp22k
Salt Lake City, UT
brentp22k wrote:

you can do this with fastx-toolkit

usage is like:

fastx_collapser < some.fasta > some.unique.fasta
ADD COMMENTlink written 7.1 years ago by brentp22k
1

always love toolkits

ADD REPLYlink written 7.1 years ago by Will4.4k
1

bummer, apparently only works for nucleotide sequences. Where's the love for protein fasta files?

ADD REPLYlink written 5.3 years ago by Andrew Su4.8k

In my example (nucleotides), it removed well the duplicated but renamed the sequences...

ADD REPLYlink written 4 months ago by Ludo Cottret0

Try Dedupe; it won't change the headers.

ADD REPLYlink written 4 months ago by Brian Bushnell14k
15
gravatar for Pierre Lindenbaum
7.1 years ago by
France/Nantes/Institut du Thorax - INSERM UMR1087
Pierre Lindenbaum101k wrote:

linearize the sequences: surround the fasta headers by '@' and '#' , remove the CR, replace '#' by CR and '@' by '\t'

sort this tab delimited file on the second column (the sequence) , with case-insensible option, only the uniq columns

restore the fasta header and sequence

sed -e '/^>/s/$/@/' -e 's/^>/#/' file.fasta  |\
tr -d '\n' | tr "#" "\n" | tr "@" "\t" |\
sort -u -t '  ' -f -k 2,2  |\
sed -e 's/^/>/' -e 's/\t/\n/'
ADD COMMENTlink written 7.1 years ago by Pierre Lindenbaum101k

I think this is the only answer that actually works if your fasta file contains a multiple sequence alignment.

Both fastx_collapser and gt sequniq fail because they consider '–' to be an invalid character in a fasta sequence.  I didn't ask those tools to validate the sequences, and that's not their job, but they did it anyway making them useless for their actual purpose.  Have these authors never heard of the Robustness Principle?

ADD REPLYlink written 2.2 years ago by Chris Warth90
12
gravatar for Neilfws
7.1 years ago by
Neilfws47k
Sydney, Australia
Neilfws47k wrote:

There are quite a lot of utilities that will do this; usually they are found as part of larger software packages (often aimed at motif discovery). For example, RSA-tools contains the utility purge-sequence, MEME contains purge. So those are some terms for your web search.

A roll-your own solution is quite easy if you store the sequences as hash keys (which have to be unique). For example using Bioperl SeqIO you could try something like:

use strict;
use Bio::SeqIO;
my %unique;

my $file   = "myseqs.fa";
my $seqio  = Bio::SeqIO->new(-file => $file, -format => "fasta");
my $outseq = Bio::SeqIO->new(-file => ">$file.uniq", -format => "fasta");

while(my $seqs = $seqio->next_seq) {
  my $id  = $seqs->display_id;
  my $seq = $seqs->seq;
  unless(exists($unique{$seq})) {
    $outseq->write_seq($seqs);
    $unique{$seq} +=1;
  }
}

This will write out a new FASTA file, myseqs.fa.uniq, with only unique sequences (but no record of the other IDs with that sequence).

ADD COMMENTlink written 7.1 years ago by Neilfws47k
10
gravatar for Darked89
7.1 years ago by
Darked894.1k
Barcelona, Spain
Darked894.1k wrote:

Depends on you application, but you may also try to filter out entries with the out same sequence but i.e. just 1bp shorter.

uclust --sort seqs.fasta --output seqs_sorted.fasta
uclust --input seqs_sorted.fasta --uc results.uc --id 1.00
uclust --uc2fasta results.uc --input seqs.fasta --output results.fasta --types S

This should give you results.fasta purged from identical but equal in size / shorter sequences.

ADD COMMENTlink written 7.1 years ago by Darked894.1k

good to know! uclust looks to be a useful tool.

ADD REPLYlink written 7.0 years ago by brentp22k
9
gravatar for Haibao Tang
7.1 years ago by
Haibao Tang2.9k
Richmond, CA
Haibao Tang2.9k wrote:

Encode the sequences in hash using SHA-1 or MD5, and then check for collision. If you are familiar with Python, here is a useful recipe to start.

ADD COMMENTlink written 7.1 years ago by Haibao Tang2.9k

Sorry i was thinking to use the same approaches to refine big sequence file contains ~ 14.000 sequence of influenza virus isn't that will consume the memory "ram" is there other way?

ADD REPLYlink written 4.4 years ago by Medhat7.1k
7
gravatar for Will
7.1 years ago by
Will4.4k
United States
Will4.4k wrote:

Here's a python script that should be able to do it:

from itertools import groupby

if __name__ == '__main__':

    ishead = lambda x: x.startswith('>')
    all_seqs = set()
    with open(inname) as handle:
        with open(oname, 'w') as outhandle:
            head = None
            for h, lines in groupby(handle, ishead):
                if h:
                    head = lines.next()
                else:
                    seq = ''.join(lines)
                    if seq not in all_seqs:
                        all_seqs.add(seq)
                        outhandle.write('%s\n%s\n' % (head, seq))
ADD COMMENTlink written 7.1 years ago by Will4.4k

Does this script only collapse sequences that are completely overlapping or will it collapse sequences where there is only partial overlap but the overlapping region is redundant?

ADD REPLYlink written 5.7 years ago by Rduncan60

For this one it only checks for complete identity ... the if seq not in all_seqs only does at membership test for the whole sequence. However, with a little work you could probably modify this to look at sequence regions.

ADD REPLYlink written 5.7 years ago by Will4.4k

Wow, very clever. I don't fully understand a lot of the tool packages in python, but this is very intuitive. I modified the script slightly to strip the extra carriage returns in the outhandle.write command. This removes whitespaces that are aesthetically appealing but causes some programs to choke.

 
 
ADD REPLYlink written 3.1 years ago by joelrosenbaum0
7
gravatar for Manu Prestat
5.8 years ago by
Manu Prestat3.8k
Marseille, France
Manu Prestat3.8k wrote:

Works with big sets, aa/prot seq files (at the opposite to fastx-toolkit), very fast and simple to use: genometools:

gt sequniq -o out.fasta in.fasta
ADD COMMENTlink modified 5.8 years ago • written 5.8 years ago by Manu Prestat3.8k
6
gravatar for Dave Lunt
7.0 years ago by
Dave Lunt2.0k
Hull, UK
Dave Lunt2.0k wrote:

uclust is about the fastest and one of the most flexible. If your sequences are not exactly the same length, or you also want to cluster some sequences that are almost, but not exactly, the same then look carefully at the flexibility and options of the thing you choose.

ADD COMMENTlink written 7.0 years ago by Dave Lunt2.0k
5
gravatar for Hanif Khalak
7.1 years ago by
Hanif Khalak1.2k
Doha, QA
Hanif Khalak1.2k wrote:

According to SeqAnswers: WU-blast, EBI-Exonerate, and bioperl all have stand-alone programs to make an "nrdb" = non-redundant database of sequences.

Note: WU-blast is now being distibuted commercially as AB-Blast - get a free personal license. Older archived versions can be found here.

ADD COMMENTlink written 7.1 years ago by Hanif Khalak1.2k

Strictly the nrdb program generates a non-identical database. The source for nrdb can be found in http://blast.advbiocomp.com/pub/nrdb/, no license required. This was a newer version than that bundled in WU-BLAST, no idea if this is still the case for AB-BLAST. Originally nrdb was used to produce NCBI's 'nr' and 'nt' databases. Of course 'nt' is no longer non-identical and 'nr' is produced using a slightly different process today, but nrdb is still a popular way to generate non-identical databases for use with BLAST.

ADD REPLYlink written 5.8 years ago by Hamish3.0k
5
gravatar for Khader Shameer
7.0 years ago by
Manhattan, NY
Khader Shameer17k wrote:

Already nice answers:

I would recommend to try the CD-HIT version designed for EST / nucleotides (CD-HIT-EST and CD-HIT-EST-2D) for this purpose.

ADD COMMENTlink written 7.0 years ago by Khader Shameer17k
5
gravatar for Rm
7.0 years ago by
Rm7.6k
Danville, PA
Rm7.6k wrote:

for smaller sets try this:

perl -ne 'BEGIN{$/=">";$"=";"}($d,$_)=/(.*?)\n(.+?)>?$/s;push @{$h{lc()}},$d if$_;END{for(keys%h){print">@{$h{$_}}$_"}}' multi.seq.fasta
ADD COMMENTlink written 7.0 years ago by Rm7.6k
1

Wonderful solution!!! Thank you for sharing it.

ADD REPLYlink written 5.2 years ago by deepthithomaskannan190

and so readable too. ;) now which has "... all the visual appeal of oatmeal with fingernail clippings mixed in." ?? http://en.wikiquote.org/wiki/Larry_Wall

ADD REPLYlink written 7.0 years ago by brentp22k
3
gravatar for Brian Bushnell
2.7 years ago by
Walnut Creek, USA
Brian Bushnell14k wrote:

I wrote a program for this problem, called Dedupe.  It's very fast, and also (optionally) handles contained sequences.  Usage:

dedupe.sh in=file.fasta out=nodupes.fasta

If you don't want fully-contained substrings removed, then add the flag ac=f (short for absorbcontainments=false).

 

ADD COMMENTlink written 2.7 years ago by Brian Bushnell14k

Really helpful, thank you Brian Bushnell it solved my desire to get unique sequences in my set

ADD REPLYlink written 18 months ago by nhaituan0
2
gravatar for User 0726
7.0 years ago by
User 072620
User 072620 wrote:

I have to say that the remove of duplicate sequences is not at al trivial, especialy when you consider real data with sequencing/alignment errors. Also problematic is that sequences are identical, but do not sully overlap. I can say with certainty that CD-HIT-EST will filter some duplicates, but far from all.

I am working on a pipeline that uses BLAST, filtering out those contigs that have significant similarity to more that just themselves.

ADD COMMENTlink written 7.0 years ago by User 072620

Why doesn't CD-HIT-EST filter all duplicates? Shouldn't it be possible to set this in the parameters?

ADD REPLYlink written 5.0 years ago by Yannick Wurm2.2k
2
gravatar for pyjiang2
2.7 years ago by
pyjiang220
United States
pyjiang220 wrote:

I tried fastx_collapser first, but it gives error for multiple aligned fasta sequences.

 

I found this useful website, which gives unique fasta sequences and concatenate the header name for the same sequences as well:

http://www.ncbi.nlm.nih.gov/CBBresearch/Spouge/html_ncbi/html/fasta/uniqueseq.cgi

 

 

ADD COMMENTlink modified 2.7 years ago • written 2.7 years ago by pyjiang220

Great find! Only tool that I could find that works with amino acids.

ADD REPLYlink written 8 months ago by tantrev10
2
gravatar for BioApps
21 months ago by
BioApps680
Spain
BioApps680 wrote:

On Windows you can use my Kitten Sequence Dereplicator (which by the way, was updated recently).

How to dereplicate sequences in Fasta file? Dereplication of a FASTA file via clustering

sequence dereplication

The program is based on CD-Hit which is pretty accurate and fast.

ADD COMMENTlink modified 21 months ago • written 21 months ago by BioApps680
1
gravatar for Shaun Jackman
3.6 years ago by
Shaun Jackman340
Vancouver, Canada
Shaun Jackman340 wrote:

seqmagick convert --deduplicate-sequences will remove duplicate sequence. See here:

http://fhcrc.github.io/seqmagick/convert_mogrify.html#examples

ADD COMMENTlink written 3.6 years ago by Shaun Jackman340

Hello Shaun:

I tried the seqmagic as you suggested, but it did not give what is expected.

>seq
ATCGATCGATATATATATAT
>seq2 part of seq1
CGATCGATATATATATA
>seq3 part of seq2
ATCGATATATAT
>seq4 reverse complementary of seq 2
TATATATATATCGATCG
>seq5 new seq
ATCGATCGACGATCGAGCGCG
>seq6 another new
ATCGATCGCGCGCGCGCGCGCGC
>seq7 psubstring of seq6
CGCGCGCGCGCGCGCG

Here is the command I used:

$ seqmagick convert --deduplicate-sequences test.fasta test_seqmagick.fasta

 $ cat test_seqmagick.fasta
>seq1
ATCGATCGATATATATATAT
>seq2 part of seq1
CGATCGATATATATATA
>seq3 part of seq2
ATCGATATATAT
>seq4 reverse complementary of seq 2
TATATATATATCGATCG
>seq5 new seq
ATCGATCGACGATCGAGCGCG
>seq6 another new
ATCGATCGCGCGCGCGCGCGCGC
>seq7 psubstring of seq6
CGCGCGCGCGCGCGCG

Did I miss anything?

ADD REPLYlink modified 3.1 years ago • written 3.1 years ago by yifangt860

yifangt, one of us is certainly confused.   None of your input sequences are duplicated.  Feed seqmagick a file with some duplicated sequences and it will work.

ADD REPLYlink written 2.1 years ago by Chris Warth90

Looks like he also wanted to remove contained sequences and reverse complements.  That can be done with Dedupe (see the dedupe post on this page).

ADD REPLYlink written 2.1 years ago by Brian Bushnell14k

Thanks Brian!

Yes, what I meant is to dedupe sequences with the original and reverse complemented sequences, and the containment of both strands.  Not sure how necassary this job is for any application like assembly, but that's another question.

ADD REPLYlink written 24 months ago by yifangt860

It's not usually necessary unless you want to combine multiple assemblies.  Sometimes it is also useful in RNA-seq transcriptome assembly.

ADD REPLYlink written 24 months ago by Brian Bushnell14k
0
gravatar for shobhitagrawal1
5.3 years ago by
shobhitagrawal10 wrote:

hi guys these are great replies. while you are at it can someone tell me how i can modify these scripts to counts Ns as a match. eg.

>seq1
ACGT
>seq2
ANGT
>seq3
AGGT

to collapse it to a file which contains

>seq1
ACGT
>seq3  
AGGT
ADD COMMENTlink modified 5.3 years ago by Neilfws47k • written 5.3 years ago by shobhitagrawal10
1

Please post this as a new question, not as an answer.

ADD REPLYlink written 5.3 years ago by Neilfws47k
0
gravatar for Eslam Samir
9 months ago by
Eslam Samir90
Egypt / Cairo / Microbiology & Immunology Department
Eslam Samir90 wrote:

Here is my free program on Github Sequence database curator (https://github.com/Eslam-Samir-Ragab/Sequence-database-curator)

It is a very fast program and it can deal with:

  1. Nucleotide sequences
  2. Protein sequences

It can work under Operating systems:

  1. Windows
  2. Mac
  3. Linux

It also works for:

  1. Fasta format
  2. Fastq format

Best Regards

ADD COMMENTlink modified 9 months ago • written 9 months ago by Eslam Samir90
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 707 users visited in the last hour