Find the same repeat sequence in two fasta files
3
0
Entering edit mode
3.9 years ago
Damian_Civ • 0

Hello, I'm a little new at this, and I would like to know if there is a way in Perl to get the repeat sequences given two fasta files for example

I have two fasta files

File 1:

>seq1
ACTGACTGACTG
>seq2
AAAAAA
>seq3
AAAAAA
>seq4
AAAAA


File 2:

>seq1
AAAAAA
>seq2
AAAAAA
>seq3
AAAAAA
>seq4
ACTGACTGACTG


and I would like to have an outcome like this

FILE 3:

>seq1 ACTGACTGACTG from file 1 and
>seq4 ACTGACTGACTG from file 2 were found


Thanks...!!

Perl • 1.8k views
0
Entering edit mode

Hello angelo.armijos.iv,

We prefer that you first try to solve this on your own and if it doesn't work show us what you tried. We are more eager to point out your mistakes and put you back on track, rather than giving a full solution which will learn you nothing.

In addition, I added markup to your post for increased readability. You can do this by selecting the text and clicking the 101010 button. When you compose or edit a post that button is in your toolbar, see image below:

0
Entering edit mode

Hello Angelo,

I do not know how to do it in Perl. Nevertheless, we are currently developing SEDA (http://www.sing-group.org/seda/), a Java application to easily process FASTA files.

Among its functions, the "Remove redundant sequences" option (see section 3.3 of the manual: http://www.sing-group.org/seda/downloads/manuals/seda-user-manual-1.0.0.pdf) has been precisely created for doing what you are looking for. With the "Save merged headers into a file" you can obtain a similar report to the one that you are looking for.

With best regards,

Hugo.

3
Entering edit mode
3.9 years ago

In Perl, read the sequences and headers of one file into a hash table. Use the sequence as the key and the header as the value in the key-value pair stored in the hash table.

Once you have the "first-file" hash table, read through the second file in order. At each sequence, test if the first-file hash table has an entry with a matching key (sequence) with some header.

Given files A.fa and B.fa:

$cat A.fa >seq1 ACTGACTGACTG >seq2 AAAAAA >seq3 AAAAAA >seq4 AAAAA$ cat B.fa
>seq1
AAAAAA
>seq2
AAAAAA
>seq3
AAAAAA
>seq4
ACTGACTGACTG


You could have a Perl script called intersection.pl that keeps track of matching sequence/header combinations:

#!/usr/bin/env perl

use strict;
use warnings;

# read sequences into A hash table
my $aHt; my$aCnt = 0;
my ($header,$sequence);
open(my $aFh, "<", "A.fa") or die "could not open A.fa"; while (<$aFh>) {
chomp;
if ($aCnt % 2 == 0) {$header = $_; } else {$sequence = $_; if (! defined$aHt->{$sequence}) {$aHt->{$sequence} = (); } push(@{$aHt->{$sequence}},$header);
}
$aCnt++; } close($aFh);

# test sequences from B and report matches with A hash table
my $bCnt = 0; open(my$bFh, "<", "B.fa") or die "could not open B.fa";
while (<$bFh>) { chomp; if ($bCnt % 2 == 0) {
$header =$_;
}
else {
$sequence =$_;
if (defined $aHt->{$sequence}) {
print STDOUT "matches on sequence [$sequence] from file A [@{$aHt->{$sequence}}] to file B [$header]\n";
}
}
$bCnt++; } close($bFh);


Note that we are storing a list of headers in the hash table for a given key/sequence. This is because your test input has duplicate sequences, so we have to keep track of multiple headers.

We also assume that your FASTA files are single-line. In other words, a sequence follows a header on a single line. These may not be correct assumptions, so reformat your input, if necessary, or adjust the code to deal with multiline FASTA.

When we run this on your test input:

$./intersection.pl matches on sequence [AAAAAA] from file A [>seq2 >seq3] to file B [>seq1] matches on sequence [AAAAAA] from file A [>seq2 >seq3] to file B [>seq2] matches on sequence [AAAAAA] from file A [>seq2 >seq3] to file B [>seq3] matches on sequence [ACTGACTGACTG] from file A [>seq1] to file B [>seq4]  ADD COMMENT 0 Entering edit mode humble suggestion for the final step: 1) for each key in hash 1 2) use "if exists" to see if it is also a key in hash 2 3) if true: print key + filename ADD REPLY 0 Entering edit mode Ok thank you very much It works ideally..!!, previously I was trying it with the command grep (grep -Fwf <(sort -u fileA) -B 1 fileB | grep -v '^--$' > out.fasta) and it works too but the more I increased the number of sequences it was slowest. I will follow your suggestion.

Thanks..!!

0
Entering edit mode

0
Entering edit mode

You're welcome! I don't often get a thanks, so I appreciate that.

1
Entering edit mode
3.9 years ago
5heikki 10k

Do we have to do it in perl?

blastn -task blastn-short -query file1 -subject file2 -outfmt '6 qseqid sseqid qlen slen length nident' | awk 'BEGIN{OFS=FS="\t"}{if($3==$4 && $3==$5 && $3==$6){print $1,$2}}'

1
Entering edit mode
3.9 years ago

as far as I understand, you want a perfect match between the two set of file. Linearize, sort on sequence and use join.

join -t $'\t' -1 3 -2 3 \ <(awk '/^>/ {printf("%sF1\t%s\t",(N>0?"\n":""),$0);N++;next;} {printf("%s",$0);} END {printf("\n");}' file1.fa | sort -t$'\t' -k3,3 ) \
<(awk '/^>/ {printf("%sF2\t%s\t",(N>0?"\n":""),$0);N++;next;} {printf("%s",$0);} END {printf("\n");}' file2.fa | sort -t \$'\t' -k3,3 )

AAAAAA  F1  >seq2   F2  >seq1
AAAAAA  F1  >seq2   F2  >seq2
AAAAAA  F1  >seq2   F2  >seq3
AAAAAA  F1  >seq3   F2  >seq1
AAAAAA  F1  >seq3   F2  >seq2
AAAAAA  F1  >seq3   F2  >seq3
ACTGACTGACTG    F1  >seq1   F2  >seq4

0
Entering edit mode

Thanks a lot, I was trying some utilitys like this but I allways got an error like this "sort: Cannot allocate memory" I think it doesn't work in a very large database.