Question: Find the same repeat sequence in two fasta files
0
gravatar for Damian_Civ
15 months ago by
Damian_Civ0
Damian_Civ0 wrote:

Hello, I'm a little new at this, and I would like to know if there is a way in Perl to get the repeat sequences given two fasta files for example

I have two fasta files

File 1:

>seq1
ACTGACTGACTG
>seq2
AAAAAA
>seq3
AAAAAA
>seq4
AAAAA

File 2:

>seq1
AAAAAA
>seq2
AAAAAA
>seq3
AAAAAA
>seq4
ACTGACTGACTG

and I would like to have an outcome like this

FILE 3:

>seq1 ACTGACTGACTG from file 1 and
>seq4 ACTGACTGACTG from file 2 were found

Thanks...!!

perl • 666 views
ADD COMMENTlink modified 15 months ago by Pierre Lindenbaum119k • written 15 months ago by Damian_Civ0

Hello angelo.armijos.iv,

We prefer that you first try to solve this on your own and if it doesn't work show us what you tried. We are more eager to point out your mistakes and put you back on track, rather than giving a full solution which will learn you nothing.

In addition, I added markup to your post for increased readability. You can do this by selecting the text and clicking the 101010 button. When you compose or edit a post that button is in your toolbar, see image below:

101010 Button

ADD REPLYlink written 15 months ago by WouterDeCoster38k

Hello Angelo,

I do not know how to do it in Perl. Nevertheless, we are currently developing SEDA (http://www.sing-group.org/seda/), a Java application to easily process FASTA files.

Among its functions, the "Remove redundant sequences" option (see section 3.3 of the manual: http://www.sing-group.org/seda/downloads/manuals/seda-user-manual-1.0.0.pdf) has been precisely created for doing what you are looking for. With the "Save merged headers into a file" you can obtain a similar report to the one that you are looking for.

With best regards,

Hugo.

ADD REPLYlink written 15 months ago by Hugo150
3
gravatar for Alex Reynolds
15 months ago by
Alex Reynolds28k
Seattle, WA USA
Alex Reynolds28k wrote:

In Perl, read the sequences and headers of one file into a hash table. Use the sequence as the key and the header as the value in the key-value pair stored in the hash table.

Once you have the "first-file" hash table, read through the second file in order. At each sequence, test if the first-file hash table has an entry with a matching key (sequence) with some header.

Given files A.fa and B.fa:

$ cat A.fa
>seq1
ACTGACTGACTG
>seq2
AAAAAA
>seq3
AAAAAA
>seq4
AAAAA

$ cat B.fa
>seq1
AAAAAA
>seq2
AAAAAA
>seq3
AAAAAA
>seq4
ACTGACTGACTG

You could have a Perl script called intersection.pl that keeps track of matching sequence/header combinations:

#!/usr/bin/env perl

use strict;
use warnings;

# read sequences into A hash table
my $aHt;
my $aCnt = 0;
my ($header, $sequence);
open(my $aFh, "<", "A.fa") or die "could not open A.fa";
while (<$aFh>) {
    chomp;
    if ($aCnt % 2 == 0) {
        $header = $_;
    }
    else {
        $sequence = $_;
        if (! defined $aHt->{$sequence}) {
            $aHt->{$sequence} = ();
        }
        push(@{$aHt->{$sequence}}, $header);
    }
    $aCnt++;
}
close($aFh);

# test sequences from B and report matches with A hash table
my $bCnt = 0;
open(my $bFh, "<", "B.fa") or die "could not open B.fa";
while (<$bFh>) {
    chomp;
    if ($bCnt % 2 == 0) {
        $header = $_;
    }
    else {
        $sequence = $_;
        if (defined $aHt->{$sequence}) {
            print STDOUT "matches on sequence [$sequence] from file A [@{$aHt->{$sequence}}] to file B [$header]\n";
        }
    }
    $bCnt++;
}
close($bFh);

Note that we are storing a list of headers in the hash table for a given key/sequence. This is because your test input has duplicate sequences, so we have to keep track of multiple headers.

We also assume that your FASTA files are single-line. In other words, a sequence follows a header on a single line. These may not be correct assumptions, so reformat your input, if necessary, or adjust the code to deal with multiline FASTA.

When we run this on your test input:

$ ./intersection.pl
matches on sequence [AAAAAA] from file A [>seq2 >seq3] to file B [>seq1]
matches on sequence [AAAAAA] from file A [>seq2 >seq3] to file B [>seq2]
matches on sequence [AAAAAA] from file A [>seq2 >seq3] to file B [>seq3]
matches on sequence [ACTGACTGACTG] from file A [>seq1] to file B [>seq4]
ADD COMMENTlink modified 15 months ago • written 15 months ago by Alex Reynolds28k

humble suggestion for the final step:

1) for each key in hash 1

2) use "if exists" to see if it is also a key in hash 2

3) if true: print key + filename

ADD REPLYlink modified 15 months ago • written 15 months ago by YaGalbi1.4k

Ok thank you very much It works ideally..!!, previously I was trying it with the command grep (grep -Fwf <(sort -u fileA) -B 1 fileB | grep -v '^--$' > out.fasta) and it works too but the more I increased the number of sequences it was slowest. I will follow your suggestion.

Thanks..!!

ADD REPLYlink written 15 months ago by Damian_Civ0

If an answer was helpful you should upvote it, if the answer resolved your question you should mark it as accepted. Upvote|Bookmark|Accept

ADD REPLYlink written 15 months ago by WouterDeCoster38k

You're welcome! I don't often get a thanks, so I appreciate that.

ADD REPLYlink written 15 months ago by Alex Reynolds28k
1
gravatar for 5heikki
15 months ago by
5heikki8.4k
Finland
5heikki8.4k wrote:

Do we have to do it in perl?

blastn -task blastn-short -query file1 -subject file2 -outfmt '6 qseqid sseqid qlen slen length nident' | awk 'BEGIN{OFS=FS="\t"}{if($3==$4 && $3==$5 && $3==$6){print $1,$2}}'
ADD COMMENTlink modified 15 months ago • written 15 months ago by 5heikki8.4k
1
gravatar for Pierre Lindenbaum
15 months ago by
France/Nantes/Institut du Thorax - INSERM UMR1087
Pierre Lindenbaum119k wrote:

as far as I understand, you want a perfect match between the two set of file. Linearize, sort on sequence and use join.

join -t $'\t' -1 3 -2 3 \
      <(awk '/^>/ {printf("%sF1\t%s\t",(N>0?"\n":""),$0);N++;next;} {printf("%s",$0);} END {printf("\n");}' file1.fa | sort -t $'\t' -k3,3 ) \
      <(awk '/^>/ {printf("%sF2\t%s\t",(N>0?"\n":""),$0);N++;next;} {printf("%s",$0);} END {printf("\n");}' file2.fa | sort -t $'\t' -k3,3 )

AAAAAA  F1  >seq2   F2  >seq1
AAAAAA  F1  >seq2   F2  >seq2
AAAAAA  F1  >seq2   F2  >seq3
AAAAAA  F1  >seq3   F2  >seq1
AAAAAA  F1  >seq3   F2  >seq2
AAAAAA  F1  >seq3   F2  >seq3
ACTGACTGACTG    F1  >seq1   F2  >seq4
ADD COMMENTlink written 15 months ago by Pierre Lindenbaum119k

Thanks a lot, I was trying some utilitys like this but I allways got an error like this "sort: Cannot allocate memory" I think it doesn't work in a very large database.

ADD REPLYlink modified 15 months ago • written 15 months ago by Damian_Civ0
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1389 users visited in the last hour