Remove Redundant Sequences Fasta
6
1
Entering edit mode
8.2 years ago
empyrean999 ▴ 180

Hi.. i have big fasta file with redundant sequences like

Seq_1 this_is_the_description_of_seq1 ATGACCAAGAGATAGATAACG

> Seq_1 
ATGACCAAGAGATAGATAACG

> Seq_2 this_is_the_description_of_seq2
ATATTTTTGTAGTTTGACAATAAAATAATTAAAAATGTAAAAAATAAAAATCCCAAAATA

> Seq_2 
ATATTTTTGTAGTTTGACAATAAAATAATTAAAAATGTAAAAAATAAAAATCCCAAAATA

The goal is to remove redundand sequence and keep the one which has description. the desired output for above example is

> Seq_1 this_is_the_description_of_seq1
ATGACCAAGAGATAGATAACG

> Seq_2 this_is_the_description_of_seq2
ATATTTTTGTAGTTTGACAATAAAATAATTAAAAATGTAAAAAATAAAAATCCCAAAATA

Thanks for your help !

unix awk perl sequence fasta • 7.9k views
ADD COMMENT
1
Entering edit mode

Are they always ordered like in your example and/or do they always have the same name (but a possibly different description)?

ADD REPLY
4
Entering edit mode
8.2 years ago
  • remove spaces after '>'
  • linearize fasta: print whole line,first token, the length of the description, the sequence
  • sort on 1st token length of description (inverse)
  • sort+unique on first token using a stable sort to keep the longest descriptions at the top
  • keep the columns 1 (header) and 4 (sequence)
  • tr back to fasta
     $ sed 's/^>[  \t]*/>/' < input.fa | \ 
     awk -F ' ' '/^>/ { printf("\n%s\t%s\t%d\t",$0,$1,length($0));next;} { printf("%s",$0);} END { printf("\n");}' | \ 
     sort -t '  ' -k2,2 -k3,3nr  | \
     sort -t '    ' -k2,2 --stable -u | \
     cut -d '  ' -f 1,4 | tr "\t" "\n"

     >Seq_1 this_is_the_description_of_seq1
     ATGACCAAGAGATAGATAACG
     >Seq_2 this_is_the_description_of_seq2
     ATATTTTTGTAGTTTGACAATAAAATAATTAAAAATGTAAAAAATAAAAATCCCAAAATA
ADD COMMENT
1
Entering edit mode

Hi Pierre! There is a pure Awk solution (split the fasta records with RS, store the sequences and the descriptions in arrays, store the description if it exists):

awk 'BEGIN {RS = "> "} NR > 1 {if ($3) {desc[$1] = $2 ; seqs[$1] = $3} else seqs[$1] = $2} END {for (seq in seqs) print ">"seq, desc[seq]"\n"seqs[seq]}' input.fa

ADD REPLY
0
Entering edit mode

It won't scale ;-)

ADD REPLY
0
Entering edit mode

Why do you think it won't scale? The memory footprint should be roughly equivalent to the size of the fasta file.

ADD REPLY
1
Entering edit mode

exactly: if your fasta is far too big, you cannot load it in memory :-)

ADD REPLY
0
Entering edit mode

Oh, you are right :-) If the file is larger than the available memory, the solution would be to read the file over and over to find paired records. Slow, but memory friendly.

ADD REPLY
0
Entering edit mode

Very helpful !! Thank you Pierre:)

ADD REPLY
0
Entering edit mode

Quick and easy version.. thank you pierre

ADD REPLY
1
Entering edit mode
8.2 years ago
PoGibas 5.0k

This is a quick answer, there might be faster and better ways of doing this.

 grep -v '>\|^$'  Original_file | sort -u > Sequences    # Extract only nucleotide sequences (non-redundant) 
 while read SEQUENCE; do
     grep -w -m 1 -B1 $SEQUENCE Original_file >> Wanted_output
 done < Sequences
ADD COMMENT
1
Entering edit mode

I think you're missing a '>' in your while loop. It should read "grep -w -m 1 -B1 $SEQUENCE Original_file >> Wanted_output", no?

ADD REPLY
0
Entering edit mode

Thanks! Fixed it.

ADD REPLY
1
Entering edit mode
8.2 years ago
Mitch Bekritsky ★ 1.3k

Another way to do it is to make use of the fact that keys are unique in perl hashes (or Python dictionaries, or C++ maps, etc.) to eliminate duplicate sequences. Here's an implementation in perl:

#!/usr/bin/perl

use strict;
use warnings;

use Getopt::Long;

my $fastaFile;

GetOptions(
    "fasta=s"    => \$fastaFile,
);

open FA, '<', $fastaFile or die "Could not open $fastaFile: $!";

my $fa_hash;

my $seq;
my $hdr;

ENTRY:while(my $line = <FA>)
{
    next ENTRY if $line =~ m/^$/; #skip empty lines
    #The first line that gets read in should be a header line
    #otherwise, we can write a check to make sure it is
    chomp $line;
    $hdr = $line if $line =~ m/^>/;
    $line = <FA>;
    chomp $line;
    $seq = $line;

    #if the sequence is already in the hash, see if the current header is longer (which I'm using as a proxy for having a description)
    if(defined $fa_hash->{$seq})
    {
        $fa_hash->{$seq} = $hdr if length($hdr) > length($fa_hash->{$seq});
    }
    else
    {
        $fa_hash->{$seq} = $hdr;
    }
}

foreach my $key (keys %{$fa_hash})
{
    print $fa_hash->{$key},"\n";
    print $key,"\n\n";
}

The output would be

> Seq_2 this_is_the_description_of_seq2
ATATTTTTGTAGTTTGACAATAAAATAATTAAAAATGTAAAAAATAAAAATCCCAAAATA

> Seq_1 
ATGACCAAGAGATAGATAACG

Assuming Seq_1 had no description. Although I have to say, Pierre's bash solution is really nice too...

ADD COMMENT
0
Entering edit mode
8.2 years ago
Kenosis ★ 1.3k

Another option is to use Bio::SeqIO and a hash which pairs the id with the larger description--used as a filter. This method makes only two passes through the fasta file (no sorting):

use strict;
use warnings;
use Bio::SeqIO;

my ( $file, %hash, %seen ) = shift;

for my $i ( 0 .. 1 ) {
    my $in = Bio::SeqIO->new( -file   => $file, -format => 'Fasta' );

    while ( my $seq = $in->next_seq() ) {
        if ( !$i ) {
            $hash{ $seq->id } = $seq->desc if !defined $hash{ $seq->id } or length $seq->desc > length $hash{ $seq->id };
        }
        else {
            print '>'. $seq->id . ' ' . $hash{ $seq->id } . "\n" . $seq->seq . "\n" if !$seen{ $seq->id }++;
        }
    }
}

Usage: perl script.pl inFile [>outFile]

The last, optional parameter directs output to a file.

Output on your dataset:

>Seq_1 this_is_the_description_of_seq1 
ATGACCAAGAGATAGATAACG
>Seq_2 this_is_the_description_of_seq2
ATATTTTTGTAGTTTGACAATAAAATAATTAAAAATGTAAAAAATAAAAATCCCAAAATA

If you don't have access to the Bio::SeqIO module, here's a solution that produces the same output:

use strict;
use warnings;

my ( $file, %hash, %seen ) = shift;
local $/ = '>';

for my $i ( 0 .. 1 ) {
    push @ARGV, $file;
    while (<>) {
        chomp;
        next unless my ( $id, $desc, $seq ) = $_ =~ /(\S+)\s+([^\n]+)\s+(.+)/s;

        if ( !$i ) {
            $hash{$id} = $desc if !defined $hash{$id} or length $desc > length $hash{$id};
        }
        else {
            print ">$id $hash{$id}\n$seq" if !$seen{$id}++;
        }
    }
}

Although you show sequences only on one line, both of the scripts above handle multi-line sequences--just in case your actual dataset contains such.

Hope this helps!

ADD COMMENT
0
Entering edit mode

Thank you so much. this worked well.

ADD REPLY
0
Entering edit mode
6.6 years ago
pengchy ▴ 450

cat seq.fa |perl -ne 'chomp;s/>\s+/>/;if(/>(\S+)/){$id{$1}++;$id2=$1;}if($id{$id2}==1){print "$_\n"}' > seq_nr.fa

ADD COMMENT
0
Entering edit mode
4.9 years ago
Eslam Samir ▴ 110

Here is my free program on Github Sequence database curator (https://github.com/Eslam-Samir-Ragab/Sequence-database-curator)

It is a very fast program and it can deal with:

  1. Nucleotide sequences
  2. Protein sequences

It can work under Operating systems:

  1. Windows
  2. Mac
  3. Linux

It also works for:

  1. Fasta format
  2. Fastq format

Best Regards

ADD COMMENT

Login before adding your answer.

Traffic: 2212 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6