Question: Remove Redundant Sequences Fasta
1
gravatar for empyrean999
5.8 years ago by
empyrean999160
Canada
empyrean999160 wrote:

Hi.. i have big fasta file with redundant sequences like

Seq_1 this_is_the_description_of_seq1 ATGACCAAGAGATAGATAACG

> Seq_1 
ATGACCAAGAGATAGATAACG

> Seq_2 this_is_the_description_of_seq2
ATATTTTTGTAGTTTGACAATAAAATAATTAAAAATGTAAAAAATAAAAATCCCAAAATA

> Seq_2 
ATATTTTTGTAGTTTGACAATAAAATAATTAAAAATGTAAAAAATAAAAATCCCAAAATA

The goal is to remove redundand sequence and keep the one which has description. the desired output for above example is

> Seq_1 this_is_the_description_of_seq1
ATGACCAAGAGATAGATAACG

> Seq_2 this_is_the_description_of_seq2
ATATTTTTGTAGTTTGACAATAAAATAATTAAAAATGTAAAAAATAAAAATCCCAAAATA

Thanks for your help !

perl fasta sequence unix awk • 6.8k views
ADD COMMENTlink modified 2.5 years ago by Eslam Samir100 • written 5.8 years ago by empyrean999160
1

Are they always ordered like in your example and/or do they always have the same name (but a possibly different description)?

ADD REPLYlink written 5.8 years ago by Devon Ryan91k
4
gravatar for Pierre Lindenbaum
5.8 years ago by
France/Nantes/Institut du Thorax - INSERM UMR1087
Pierre Lindenbaum122k wrote:
  • remove spaces after '>'
  • linearize fasta: print whole line,first token, the length of the description, the sequence
  • sort on 1st token length of description (inverse)
  • sort+unique on first token using a stable sort to keep the longest descriptions at the top
  • keep the columns 1 (header) and 4 (sequence)
  • tr back to fasta
     $ sed 's/^>[  \t]*/>/' < input.fa | \ 
     awk -F ' ' '/^>/ { printf("\n%s\t%s\t%d\t",$0,$1,length($0));next;} { printf("%s",$0);} END { printf("\n");}' | \ 
     sort -t '  ' -k2,2 -k3,3nr  | \
     sort -t '    ' -k2,2 --stable -u | \
     cut -d '  ' -f 1,4 | tr "\t" "\n"

     >Seq_1 this_is_the_description_of_seq1
     ATGACCAAGAGATAGATAACG
     >Seq_2 this_is_the_description_of_seq2
     ATATTTTTGTAGTTTGACAATAAAATAATTAAAAATGTAAAAAATAAAAATCCCAAAATA
ADD COMMENTlink modified 5.8 years ago • written 5.8 years ago by Pierre Lindenbaum122k
1

Hi Pierre! There is a pure Awk solution (split the fasta records with RS, store the sequences and the descriptions in arrays, store the description if it exists):

awk 'BEGIN {RS = "> "} NR > 1 {if ($3) {desc[$1] = $2 ; seqs[$1] = $3} else seqs[$1] = $2} END {for (seq in seqs) print ">"seq, desc[seq]"\n"seqs[seq]}' input.fa

ADD REPLYlink written 5.8 years ago by Frédéric Mahé2.9k

It won't scale ;-)

ADD REPLYlink written 5.8 years ago by Pierre Lindenbaum122k

Why do you think it won't scale? The memory footprint should be roughly equivalent to the size of the fasta file.

ADD REPLYlink written 5.8 years ago by Frédéric Mahé2.9k
1

exactly: if your fasta is far too big, you cannot load it in memory :-)

ADD REPLYlink written 5.8 years ago by Pierre Lindenbaum122k

Oh, you are right :-) If the file is larger than the available memory, the solution would be to read the file over and over to find paired records. Slow, but memory friendly.

ADD REPLYlink written 5.8 years ago by Frédéric Mahé2.9k

Very helpful !! Thank you Pierre:)

ADD REPLYlink modified 5.8 years ago • written 5.8 years ago by Prakki Rama2.3k

Quick and easy version.. thank you pierre

ADD REPLYlink written 5.8 years ago by empyrean999160
1
gravatar for PoGibas
5.8 years ago by
PoGibas4.8k
Vilnius
PoGibas4.8k wrote:

This is a quick answer, there might be faster and better ways of doing this.

 grep -v '>\|^$'  Original_file | sort -u > Sequences    # Extract only nucleotide sequences (non-redundant) 
 while read SEQUENCE; do
     grep -w -m 1 -B1 $SEQUENCE Original_file >> Wanted_output
 done < Sequences
ADD COMMENTlink modified 5.8 years ago • written 5.8 years ago by PoGibas4.8k
1

I think you're missing a '>' in your while loop. It should read "grep -w -m 1 -B1 $SEQUENCE Original_file >> Wanted_output", no?

ADD REPLYlink written 5.8 years ago by Mitch Bekritsky1.1k

Thanks! Fixed it.

ADD REPLYlink written 5.8 years ago by PoGibas4.8k
1
gravatar for Mitch Bekritsky
5.8 years ago by
Mitch Bekritsky1.1k
London, England
Mitch Bekritsky1.1k wrote:

Another way to do it is to make use of the fact that keys are unique in perl hashes (or Python dictionaries, or C++ maps, etc.) to eliminate duplicate sequences. Here's an implementation in perl:

#!/usr/bin/perl

use strict;
use warnings;

use Getopt::Long;

my $fastaFile;

GetOptions(
    "fasta=s"    => \$fastaFile,
);

open FA, '<', $fastaFile or die "Could not open $fastaFile: $!";

my $fa_hash;

my $seq;
my $hdr;

ENTRY:while(my $line = <FA>)
{
    next ENTRY if $line =~ m/^$/; #skip empty lines
    #The first line that gets read in should be a header line
    #otherwise, we can write a check to make sure it is
    chomp $line;
    $hdr = $line if $line =~ m/^>/;
    $line = <FA>;
    chomp $line;
    $seq = $line;

    #if the sequence is already in the hash, see if the current header is longer (which I'm using as a proxy for having a description)
    if(defined $fa_hash->{$seq})
    {
        $fa_hash->{$seq} = $hdr if length($hdr) > length($fa_hash->{$seq});
    }
    else
    {
        $fa_hash->{$seq} = $hdr;
    }
}

foreach my $key (keys %{$fa_hash})
{
    print $fa_hash->{$key},"\n";
    print $key,"\n\n";
}

The output would be

> Seq_2 this_is_the_description_of_seq2
ATATTTTTGTAGTTTGACAATAAAATAATTAAAAATGTAAAAAATAAAAATCCCAAAATA

> Seq_1 
ATGACCAAGAGATAGATAACG

Assuming Seq_1 had no description. Although I have to say, Pierre's bash solution is really nice too...

ADD COMMENTlink written 5.8 years ago by Mitch Bekritsky1.1k
0
gravatar for Kenosis
5.8 years ago by
Kenosis1.2k
Kenosis1.2k wrote:

Another option is to use Bio::SeqIO and a hash which pairs the id with the larger description--used as a filter. This method makes only two passes through the fasta file (no sorting):

use strict;
use warnings;
use Bio::SeqIO;

my ( $file, %hash, %seen ) = shift;

for my $i ( 0 .. 1 ) {
    my $in = Bio::SeqIO->new( -file   => $file, -format => 'Fasta' );

    while ( my $seq = $in->next_seq() ) {
        if ( !$i ) {
            $hash{ $seq->id } = $seq->desc if !defined $hash{ $seq->id } or length $seq->desc > length $hash{ $seq->id };
        }
        else {
            print '>'. $seq->id . ' ' . $hash{ $seq->id } . "\n" . $seq->seq . "\n" if !$seen{ $seq->id }++;
        }
    }
}

Usage: perl script.pl inFile [>outFile]

The last, optional parameter directs output to a file.

Output on your dataset:

>Seq_1 this_is_the_description_of_seq1 
ATGACCAAGAGATAGATAACG
>Seq_2 this_is_the_description_of_seq2
ATATTTTTGTAGTTTGACAATAAAATAATTAAAAATGTAAAAAATAAAAATCCCAAAATA

If you don't have access to the Bio::SeqIO module, here's a solution that produces the same output:

use strict;
use warnings;

my ( $file, %hash, %seen ) = shift;
local $/ = '>';

for my $i ( 0 .. 1 ) {
    push @ARGV, $file;
    while (<>) {
        chomp;
        next unless my ( $id, $desc, $seq ) = $_ =~ /(\S+)\s+([^\n]+)\s+(.+)/s;

        if ( !$i ) {
            $hash{$id} = $desc if !defined $hash{$id} or length $desc > length $hash{$id};
        }
        else {
            print ">$id $hash{$id}\n$seq" if !$seen{$id}++;
        }
    }
}

Although you show sequences only on one line, both of the scripts above handle multi-line sequences--just in case your actual dataset contains such.

Hope this helps!

ADD COMMENTlink modified 5.8 years ago • written 5.8 years ago by Kenosis1.2k

Thank you so much. this worked well.

ADD REPLYlink written 5.8 years ago by empyrean999160
0
gravatar for pengchy
4.2 years ago by
pengchy410
China/Beijing
pengchy410 wrote:

cat seq.fa |perl -ne 'chomp;s/>\s+/>/;if(/>(\S+)/){$id{$1}++;$id2=$1;}if($id{$id2}==1){print "$_\n"}' > seq_nr.fa

ADD COMMENTlink written 4.2 years ago by pengchy410
0
gravatar for Eslam Samir
2.5 years ago by
Eslam Samir100
Germany / Würzburg / Universität Würzburg (IMIB)
Eslam Samir100 wrote:

Here is my free program on Github Sequence database curator (https://github.com/Eslam-Samir-Ragab/Sequence-database-curator)

It is a very fast program and it can deal with:

  1. Nucleotide sequences
  2. Protein sequences

It can work under Operating systems:

  1. Windows
  2. Mac
  3. Linux

It also works for:

  1. Fasta format
  2. Fastq format

Best Regards

ADD COMMENTlink written 2.5 years ago by Eslam Samir100
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 859 users visited in the last hour