Question: How To Remove The Identical Sequences In The Fasta Files?
1
gravatar for izumi.kohsuke.10
2.8 years ago by
Japan
izumi.kohsuke.1010 wrote:

Some FASTA files  have sequences with different IDs that nonetheless have the same sequence. I want to remove duplicate sequences based on the nucleotide sequence, rather than ID.HOW to do?

Additionally, I want to counts Ns as a match. For example,

>seq1
AATGC
>seq2
ANTGC
>seq3
AGATC

collapse identical sequences into a single sequence.

>seq1_seq2
AATGC
>seq3 
AGATC

HOW to do?

sequence • 1.7k views
ADD COMMENTlink modified 2.8 years ago by Ido Tamir4.7k • written 2.8 years ago by izumi.kohsuke.1010

Thank you for answer.

ADD REPLYlink modified 2.8 years ago • written 2.8 years ago by izumi.kohsuke.1010
2
gravatar for Ido Tamir
2.8 years ago by
Ido Tamir4.7k
Austria
Ido Tamir4.7k wrote:
fastx collapse: http://hannonlab.cshl.edu/fastx_toolkit/  for identical reads

http://www.drive5.com/usearch/  for clustering with some identity
ADD COMMENTlink written 2.8 years ago by Ido Tamir4.7k

Thank you for your answer. I will try it.

ADD REPLYlink written 2.8 years ago by izumi.kohsuke.1010
1
gravatar for Alex Reynolds
2.8 years ago by
Alex Reynolds19k
Seattle, WA USA
Alex Reynolds19k wrote:

You might work with the following as a start:

​#!/usr/bin/perl

use strict;
use warnings;

my $header = "";
my $seq = "";
my $fastaRef;

# read FASTA in from standard input
while(<>) {
    chomp;
    if (/>/) {
        if (length $seq > 0) {
            push(@{$fastaRef->{$seq}}, $header);
        }
        $header = substr $_, 1;
    }
    else {
        $seq .= $_;
    }
}
push (@{$fastaRef->{$seq}}, $header);

# print out merged-header sequences
foreach my $seqKey (keys %{$fastaRef}) {
    my @headers = @{$fastaRef->{$seqKey}};
    print STDOUT ">".join("_", @headers)."\n".$seqKey."\n";
}

One thought is to pre-process the FASTA input to "explode" each N-sequence to four (or power-of-four) ACTG-sequences. Each exploded sequence is given a modified header name. When merging, you can filter out any unmerged N-sequences based on the modified header name; the matched sequence will merge its header with another matched sequence header.

>seq1
AATGC
>seq2
ANTGC
>seq3
AGATC

This could become something like:

>seq1
AATGC
>seq2-modA
AATGC
>seq2-modC
ACTGC
>seq2-modG
AGTGC
>seq2-modT
ATTGC
>seq3
AGATC

When you merge this modified FASTA input with the above script, you get something like this:

>seq1_seq2-modA
AATGC
>seq2-modC
ACTGC
>seq2-modG
AGTGC
>seq2-modT
ATTGC
>seq3
AGATC

Your last step is to remove *-modN sequences that only have one match (to themselves) to get a final answer:

>seq1_seq2-modA
AATGC
>seq3
AGATC

This should be an easy modification of the script above, looking for any sequences with a one-element array and then a *-modN suffix.

ADD COMMENTlink modified 2.8 years ago • written 2.8 years ago by Alex Reynolds19k

Thank you for your answer. That's a good idea. but some sequences have many N, so modified sequence increase more.

ADD REPLYlink written 2.8 years ago by izumi.kohsuke.1010

True, a sequence with n N will have 4^n modified sequences. Still, I think this approach should work reasonably well for small inputs.

ADD REPLYlink written 2.8 years ago by Alex Reynolds19k

I think so too. If there is an opportunity, I would like to use this approach.

ADD REPLYlink written 2.8 years ago by izumi.kohsuke.1010
0
gravatar for jgbradley1
2.8 years ago by
jgbradley170
United States
jgbradley170 wrote:

If you're just looking to do a very basic implementation for a short file, then I would go with the naive approach. It would be faster to implement, although slower. For each sequence, compare it to every other sequence in the file 1 by 1. If you consider the Ns a match, then just don't check for a match when you encounter a N as you iterate through the two sequences. This is basically a pairwise approach. If you're looking for a more efficient solution however, use suffix trees.

ADD COMMENTlink written 2.8 years ago by jgbradley170

Thank you for your answer.

ADD REPLYlink written 2.8 years ago by izumi.kohsuke.1010
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1175 users visited in the last hour