Question

Find the same repeat sequence in two fasta files

0

Entering edit mode

6.3 years ago

Damian_Civ • 0

Hello, I'm a little new at this, and I would like to know if there is a way in Perl to get the repeat sequences given two fasta files for example

I have two fasta files

File 1:

>seq1
ACTGACTGACTG
>seq2
AAAAAA
>seq3
AAAAAA
>seq4
AAAAA

File 2:

>seq1
AAAAAA
>seq2
AAAAAA
>seq3
AAAAAA
>seq4
ACTGACTGACTG

and I would like to have an outcome like this

FILE 3:

>seq1 ACTGACTGACTG from file 1 and
>seq4 ACTGACTGACTG from file 2 were found

Thanks...!!

Perl • 2.7k views

ADD COMMENT • link updated 6.3 years ago by Pierre Lindenbaum 161k • written 6.3 years ago by Damian_Civ • 0

0

Entering edit mode

Hello angelo.armijos.iv,

We prefer that you first try to solve this on your own and if it doesn't work show us what you tried. We are more eager to point out your mistakes and put you back on track, rather than giving a full solution which will learn you nothing.

In addition, I added markup to your post for increased readability. You can do this by selecting the text and clicking the 101010 button. When you compose or edit a post that button is in your toolbar, see image below:

101010 Button

ADD REPLY • link 6.3 years ago by WouterDeCoster 47k

0

Entering edit mode

Hello Angelo,

I do not know how to do it in Perl. Nevertheless, we are currently developing SEDA (http://www.sing-group.org/seda/), a Java application to easily process FASTA files.

Among its functions, the "Remove redundant sequences" option (see section 3.3 of the manual: http://www.sing-group.org/seda/downloads/manuals/seda-user-manual-1.0.0.pdf) has been precisely created for doing what you are looking for. With the "Save merged headers into a file" you can obtain a similar report to the one that you are looking for.

With best regards,

Hugo.

ADD REPLY • link 6.3 years ago by Hugo ▴ 380

1

Entering edit mode

6.3 years ago

5heikki 11k

Do we have to do it in perl?

blastn -task blastn-short -query file1 -subject file2 -outfmt '6 qseqid sseqid qlen slen length nident' | awk 'BEGIN{OFS=FS="\t"}{if($3==$4 && $3==$5 && $3==$6){print $1,$2}}'

ADD COMMENT • link 6.3 years ago by 5heikki 11k

1

Entering edit mode

6.3 years ago

Pierre Lindenbaum 161k

as far as I understand, you want a perfect match between the two set of file. Linearize, sort on sequence and use join.

join -t $'\t' -1 3 -2 3 \
      <(awk '/^>/ {printf("%sF1\t%s\t",(N>0?"\n":""),$0);N++;next;} {printf("%s",$0);} END {printf("\n");}' file1.fa | sort -t $'\t' -k3,3 ) \
      <(awk '/^>/ {printf("%sF2\t%s\t",(N>0?"\n":""),$0);N++;next;} {printf("%s",$0);} END {printf("\n");}' file2.fa | sort -t $'\t' -k3,3 )

AAAAAA  F1  >seq2   F2  >seq1
AAAAAA  F1  >seq2   F2  >seq2
AAAAAA  F1  >seq2   F2  >seq3
AAAAAA  F1  >seq3   F2  >seq1
AAAAAA  F1  >seq3   F2  >seq2
AAAAAA  F1  >seq3   F2  >seq3
ACTGACTGACTG    F1  >seq1   F2  >seq4

ADD COMMENT • link 6.3 years ago by Pierre Lindenbaum 161k

0

Entering edit mode

Thanks a lot, I was trying some utilitys like this but I allways got an error like this "sort: Cannot allocate memory" I think it doesn't work in a very large database.

ADD REPLY • link 6.3 years ago by Damian_Civ • 0

score 3 · Accepted Answer · 2018-01-22

In Perl, read the sequences and headers of one file into a hash table. Use the sequence as the key and the header as the value in the key-value pair stored in the hash table.

Once you have the "first-file" hash table, read through the second file in order. At each sequence, test if the first-file hash table has an entry with a matching key (sequence) with some header.

Given files A.fa and B.fa:

$ cat A.fa
>seq1
ACTGACTGACTG
>seq2
AAAAAA
>seq3
AAAAAA
>seq4
AAAAA

$ cat B.fa
>seq1
AAAAAA
>seq2
AAAAAA
>seq3
AAAAAA
>seq4
ACTGACTGACTG

You could have a Perl script called intersection.pl that keeps track of matching sequence/header combinations:

#!/usr/bin/env perl

use strict;
use warnings;

# read sequences into A hash table
my $aHt;
my $aCnt = 0;
my ($header, $sequence);
open(my $aFh, "<", "A.fa") or die "could not open A.fa";
while (<$aFh>) {
    chomp;
    if ($aCnt % 2 == 0) {
        $header = $_;
    }
    else {
        $sequence = $_;
        if (! defined $aHt->{$sequence}) {
            $aHt->{$sequence} = ();
        }
        push(@{$aHt->{$sequence}}, $header);
    }
    $aCnt++;
}
close($aFh);

# test sequences from B and report matches with A hash table
my $bCnt = 0;
open(my $bFh, "<", "B.fa") or die "could not open B.fa";
while (<$bFh>) {
    chomp;
    if ($bCnt % 2 == 0) {
        $header = $_;
    }
    else {
        $sequence = $_;
        if (defined $aHt->{$sequence}) {
            print STDOUT "matches on sequence [$sequence] from file A [@{$aHt->{$sequence}}] to file B [$header]\n";
        }
    }
    $bCnt++;
}
close($bFh);

Note that we are storing a list of headers in the hash table for a given key/sequence. This is because your test input has duplicate sequences, so we have to keep track of multiple headers.

We also assume that your FASTA files are single-line. In other words, a sequence follows a header on a single line. These may not be correct assumptions, so reformat your input, if necessary, or adjust the code to deal with multiline FASTA.

When we run this on your test input:

$ ./intersection.pl
matches on sequence [AAAAAA] from file A [>seq2 >seq3] to file B [>seq1]
matches on sequence [AAAAAA] from file A [>seq2 >seq3] to file B [>seq2]
matches on sequence [AAAAAA] from file A [>seq2 >seq3] to file B [>seq3]
matches on sequence [ACTGACTGACTG] from file A [>seq1] to file B [>seq4]