Question: Find the same repeat sequence in two fasta files
0
gravatar for Damian_Civ
21 months ago by
Damian_Civ0
Damian_Civ0 wrote:

Hello, I'm a little new at this, and I would like to know if there is a way in Perl to get the repeat sequences given two fasta files for example

I have two fasta files

File 1:

>seq1
ACTGACTGACTG
>seq2
AAAAAA
>seq3
AAAAAA
>seq4
AAAAA

File 2:

>seq1
AAAAAA
>seq2
AAAAAA
>seq3
AAAAAA
>seq4
ACTGACTGACTG

and I would like to have an outcome like this

FILE 3:

>seq1 ACTGACTGACTG from file 1 and
>seq4 ACTGACTGACTG from file 2 were found

Thanks...!!

perl • 894 views
ADD COMMENTlink modified 20 months ago by Pierre Lindenbaum123k • written 21 months ago by Damian_Civ0

Hello angelo.armijos.iv,

We prefer that you first try to solve this on your own and if it doesn't work show us what you tried. We are more eager to point out your mistakes and put you back on track, rather than giving a full solution which will learn you nothing.

In addition, I added markup to your post for increased readability. You can do this by selecting the text and clicking the 101010 button. When you compose or edit a post that button is in your toolbar, see image below:

101010 Button

ADD REPLYlink written 21 months ago by WouterDeCoster41k

Hello Angelo,

I do not know how to do it in Perl. Nevertheless, we are currently developing SEDA (http://www.sing-group.org/seda/), a Java application to easily process FASTA files.

Among its functions, the "Remove redundant sequences" option (see section 3.3 of the manual: http://www.sing-group.org/seda/downloads/manuals/seda-user-manual-1.0.0.pdf) has been precisely created for doing what you are looking for. With the "Save merged headers into a file" you can obtain a similar report to the one that you are looking for.

With best regards,

Hugo.

ADD REPLYlink written 21 months ago by Hugo150
3
gravatar for Alex Reynolds
21 months ago by
Alex Reynolds29k
Seattle, WA USA
Alex Reynolds29k wrote:

In Perl, read the sequences and headers of one file into a hash table. Use the sequence as the key and the header as the value in the key-value pair stored in the hash table.

Once you have the "first-file" hash table, read through the second file in order. At each sequence, test if the first-file hash table has an entry with a matching key (sequence) with some header.

Given files A.fa and B.fa:

$ cat A.fa
>seq1
ACTGACTGACTG
>seq2
AAAAAA
>seq3
AAAAAA
>seq4
AAAAA

$ cat B.fa
>seq1
AAAAAA
>seq2
AAAAAA
>seq3
AAAAAA
>seq4
ACTGACTGACTG

You could have a Perl script called intersection.pl that keeps track of matching sequence/header combinations:

#!/usr/bin/env perl

use strict;
use warnings;

# read sequences into A hash table
my $aHt;
my $aCnt = 0;
my ($header, $sequence);
open(my $aFh, "<", "A.fa") or die "could not open A.fa";
while (<$aFh>) {
    chomp;
    if ($aCnt % 2 == 0) {
        $header = $_;
    }
    else {
        $sequence = $_;
        if (! defined $aHt->{$sequence}) {
            $aHt->{$sequence} = ();
        }
        push(@{$aHt->{$sequence}}, $header);
    }
    $aCnt++;
}
close($aFh);

# test sequences from B and report matches with A hash table
my $bCnt = 0;
open(my $bFh, "<", "B.fa") or die "could not open B.fa";
while (<$bFh>) {
    chomp;
    if ($bCnt % 2 == 0) {
        $header = $_;
    }
    else {
        $sequence = $_;
        if (defined $aHt->{$sequence}) {
            print STDOUT "matches on sequence [$sequence] from file A [@{$aHt->{$sequence}}] to file B [$header]\n";
        }
    }
    $bCnt++;
}
close($bFh);

Note that we are storing a list of headers in the hash table for a given key/sequence. This is because your test input has duplicate sequences, so we have to keep track of multiple headers.

We also assume that your FASTA files are single-line. In other words, a sequence follows a header on a single line. These may not be correct assumptions, so reformat your input, if necessary, or adjust the code to deal with multiline FASTA.

When we run this on your test input:

$ ./intersection.pl
matches on sequence [AAAAAA] from file A [>seq2 >seq3] to file B [>seq1]
matches on sequence [AAAAAA] from file A [>seq2 >seq3] to file B [>seq2]
matches on sequence [AAAAAA] from file A [>seq2 >seq3] to file B [>seq3]
matches on sequence [ACTGACTGACTG] from file A [>seq1] to file B [>seq4]
ADD COMMENTlink modified 20 months ago • written 21 months ago by Alex Reynolds29k

humble suggestion for the final step:

1) for each key in hash 1

2) use "if exists" to see if it is also a key in hash 2

3) if true: print key + filename

ADD REPLYlink modified 21 months ago • written 21 months ago by YaGalbi1.4k

Ok thank you very much It works ideally..!!, previously I was trying it with the command grep (grep -Fwf <(sort -u fileA) -B 1 fileB | grep -v '^--$' > out.fasta) and it works too but the more I increased the number of sequences it was slowest. I will follow your suggestion.

Thanks..!!

ADD REPLYlink written 20 months ago by Damian_Civ0

If an answer was helpful you should upvote it, if the answer resolved your question you should mark it as accepted. Upvote|Bookmark|Accept

ADD REPLYlink written 20 months ago by WouterDeCoster41k

You're welcome! I don't often get a thanks, so I appreciate that.

ADD REPLYlink written 20 months ago by Alex Reynolds29k
1
gravatar for 5heikki
20 months ago by
5heikki8.5k
Finland
5heikki8.5k wrote:

Do we have to do it in perl?

blastn -task blastn-short -query file1 -subject file2 -outfmt '6 qseqid sseqid qlen slen length nident' | awk 'BEGIN{OFS=FS="\t"}{if($3==$4 && $3==$5 && $3==$6){print $1,$2}}'
ADD COMMENTlink modified 20 months ago • written 20 months ago by 5heikki8.5k
1
gravatar for Pierre Lindenbaum
20 months ago by
France/Nantes/Institut du Thorax - INSERM UMR1087
Pierre Lindenbaum123k wrote:

as far as I understand, you want a perfect match between the two set of file. Linearize, sort on sequence and use join.

join -t $'\t' -1 3 -2 3 \
      <(awk '/^>/ {printf("%sF1\t%s\t",(N>0?"\n":""),$0);N++;next;} {printf("%s",$0);} END {printf("\n");}' file1.fa | sort -t $'\t' -k3,3 ) \
      <(awk '/^>/ {printf("%sF2\t%s\t",(N>0?"\n":""),$0);N++;next;} {printf("%s",$0);} END {printf("\n");}' file2.fa | sort -t $'\t' -k3,3 )

AAAAAA  F1  >seq2   F2  >seq1
AAAAAA  F1  >seq2   F2  >seq2
AAAAAA  F1  >seq2   F2  >seq3
AAAAAA  F1  >seq3   F2  >seq1
AAAAAA  F1  >seq3   F2  >seq2
AAAAAA  F1  >seq3   F2  >seq3
ACTGACTGACTG    F1  >seq1   F2  >seq4
ADD COMMENTlink written 20 months ago by Pierre Lindenbaum123k

Thanks a lot, I was trying some utilitys like this but I allways got an error like this "sort: Cannot allocate memory" I think it doesn't work in a very large database.

ADD REPLYlink modified 20 months ago • written 20 months ago by Damian_Civ0
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 2068 users visited in the last hour