Question: Print Different Id From Sequence Comparison Of Two Fasta Files
4
gravatar for User 3566
8.4 years ago by
User 356640
User 356640 wrote:

Hello, I'm a beginner of bioPerl, I have two sequence files in FASTA format, in which there are a lot of similar sequence. I want to extract only different sequences and eventually to know if they come from file1 or file2. Thanks fo help. Marco

ADD COMMENTlink modified 8.4 years ago by Aleksandr Levchuk3.1k • written 8.4 years ago by User 356640
2

Hi Marco! You're welcome to ask beginner questions here. If you show us what you've tried so far in code and/or thought, I'm sure someone will be willing and able to answer your question.

ADD REPLYlink written 8.4 years ago by Michael Schubert6.9k

Hi Marco! You're welcome to ask beginner questions here. If you show us what you've tried so far in code and/or thought, I'm sure someone will be willing and able to answer your question. For instance, you might want to align your sequences first.

ADD REPLYlink written 8.4 years ago by Michael Schubert6.9k

Can you clarify whether, for example, seqA in file1 is similar only to sequences in file1, or file2, or both?

ADD REPLYlink written 8.4 years ago by Neilfws48k

Sai Liu asks: What's the meaning of similar sequence? You can give us some extract examples.

ADD REPLYlink written 8.4 years ago by Neilfws48k

I have 2 file in which there are a lot of sequence, but most of them are the same. I want to exclude the identical sequence and extract only different sequence ant to know from which file (1 or 2) they came. for similar sequence I mean identical sequence with identical name, ID etc.

ADD REPLYlink written 8.4 years ago by User 356640
4
gravatar for Heikki
8.4 years ago by
Heikki360
Heikki360 wrote:

I am assuming that by similar sequence you mean same ID.

Bioperl or not, you need to go through both files, take a note of IDs, and then test if the ID was in one or two sequences and act accordingly.

If you can fit all the data in memory, the easiest and most perlish thing is to read everything, filename, ID and sequence, into a hash and then query the hash. If you expect to be dealing with huge data sets, it is best to leave sequences out and deal with IDs and filenames only. When you know the subset you are interested in, you can go through sequence files again. Alternatively, you can create a database out of sequences that allows random access to the data.

Show us some code and we'll guide you on.

ADD COMMENTlink written 8.4 years ago by Heikki360
2
gravatar for Aleksandr Levchuk
8.3 years ago by
United States
Aleksandr Levchuk3.1k wrote:

The Shell Solution

diff <(cat file1.fasta | grep ">" | sort)  <(cat file2.fasta | grep ">" | sort)

IDs that are in file1 but not if file2

To the diff command add:

| grep "^<" | awk -F\> '{print $2}'

IDs that are in file2 but not if file1

To the diff command add:

| grep "^>" | awk -F\> '{print $3}'

Examples

diff <(cat file1.fasta | grep ">" | sort)  <(cat file2.fasta | grep ">" | sort) | grep "^>" | awk -F\> '{print $3}'

CPS1_PENJA, 623 bases, 9643C925 checksum.

Y5SS2_CAEEL, 623 bases, E99BFB checksum.

YUA6_CAEEL, 623 bases, A389AA05 checksum.

diff <(cat file1.fasta | grep ">" | sort)  <(cat file2.fasta | grep ">" | sort)  | grep "^<" | awk -F\> '{print $2}'

YSS2_CAEEL, 623 bases, E99BFB checksum.

Credits

The powerful command line tools are brought to you by GNU - the brilliant software written overwhelmingly in C an ported to all imaginable kernels: Solaris, OpenSolaris, OpenIndiana, FreeBSD, OpenBSD, NetBSD, Linux, Cygwin (Windows XP, Server, Vista, and 7).

ADD COMMENTlink modified 8.3 years ago • written 8.3 years ago by Aleksandr Levchuk3.1k

Hm, nice use of [?]process substitution[?], a bash trick that doesn't get nearly enough airtime. Some nits:

  • [?]Useless Use of Cat Award[?]: you can just grep the files directly.

  • patterns that can be anchored should be: grep ^>

  • comm is a cool tool for this sort of check:

comm -1 -2 <(grep '^>' file1.fasta | sort) <(grep '^>' file1.fasta | sort)

I'm not 100% clear on whether this is really what Marco wants to do.

ADD REPLYlink written 8.3 years ago by Adb0

Hm, nice use of process substitution, a bash trick that doesn't get nearly enough airtime. Some nits: you can just grep the files directly and skip the Useless Use Of Cat. Patterns that can be anchored should be: grep ^>. And finally, comm is a cool tool for this sort of check: comm -1 -2 <(grep '^>' file1.fasta | sort) <(grep '^>' file1.fasta | sort).

ADD REPLYlink written 8.3 years ago by Adb0

@adb, Some people may prefer cat file | grep. It's more like LEGO blocks - you can replace the left part or the right part. Diff's < and > notation is pretty intuitive while comm's -1 -2 -3 arguments are confusing. On that point, I think you wanted to say comm -2 -3 (-3 suppressed lines that appear in both files).

ADD REPLYlink written 8.3 years ago by Aleksandr Levchuk3.1k
0
gravatar for Panagiotis Alexiou
8.4 years ago by
Athens, Greece
Panagiotis Alexiou200 wrote:

Try putting sequences in one file as keys in a %hash and then checking the second file with a:

if (!exists $hash{$sequence})

or something like that?

ADD COMMENTlink written 8.4 years ago by Panagiotis Alexiou200
0
gravatar for 4Urelie
8.3 years ago by
4Urelie20
4Urelie20 wrote:

I am also a beginner and wanted to compare two fasta files, and keep ONLY the sequences that are not common. So I wrote that...:

#!/usr/bin/perl -w

#######################################################
# Author :  Aurelie K
# date   :  07 Apr 2011    
# email  :  4urelie.k@gmail.com
# Pupose :  compare to fasta files and keep only sequences not in common (header name comparison, i.e. the entire line)
#####################################################

use warnings;
use Bio::Perl;
use Bio::Seq;
use Bio::SeqIO;

my $usage = "perl scriptname.pl inputfile1 inputfile2 outputfile\n";

if (@ARGV != 3) {
    print "$usage\n";
    exit;
}

my $file1 = shift or die "$usage\n";
my $file2 = shift or die "$usage\n";
my $outputfile = shift or die "$usage\n";
my $tempoutputfile = "$outputfile.temp" or die "$usage\n";

# Reading the first file and store it into a hash
####################################################

my $FastaFile1 = Bio::SeqIO->new(-file => $file1, -format => 'fasta', -alphabet => 'dna') or die "Failed to create SeqIO object from $file1 $!\n";

my %fastaH1 =();
while( my $seqFile1 = $FastaFile1->next_seq() ) {
    unless (exists $fastaH1{$seqFile1->display_id."\t".$seqFile1->desc}) {
        $fastaH1{$seqFile1->display_id."\t".$seqFile1->desc} = $seqFile1->seq; #key of the hash is fasta header (all line) and value is sequence.
    }
}

# Opening output file
####################################################

my $output1 = Bio::SeqIO->new(-file => ">$tempoutputfile", -format => 'fasta') or die "Failed to create temp outputfile $tempoutputfile $!\n";

# Reading second file
# -> store it into a hash 
# -> and write sequences of file2 that are not in file1
####################################################

my $FastaFile2 = Bio::SeqIO->new(-file => $file2, -format => 'fasta', -alphabet => 'dna') or die "Failed to create SeqIO object from $file2 $!\n";

my %fastaH2 =();
while( my $seqFile2 = $FastaFile2->next_seq() ) {
    unless (exists $fastaH2{$seqFile2->display_id."\t".$seqFile2->desc}) {
        $fastaH2{$seqFile2->display_id."\t".$seqFile2->desc} = $seqFile2->seq; #key of the hash is fasta header (all line) and value is sequence.
        unless (exists $fastaH1{$seqFile2->display_id."\t".$seqFile2->desc}) {
            $output1->write_seq($seqFile2);
        }
    }
}

# Write sequences of file1 that are not in file2 
####################################################

open OUT, ">>$tempoutputfile" or die "Failed to open FH from $tempoutputfile $!\n";
foreach my $key (keys %fastaH1){
    unless (exists $fastaH2{$key}) {
        print OUT ">$key\n$fastaH1{$key}\n";
    }
}
close OUT;

my $input = Bio::SeqIO->new(-file => $tempoutputfile, -format => 'fasta') or die "Failed to create inputfile from $tempoutputfile $!\n";
my $output2 = Bio::SeqIO->new(-file => ">$outputfile", -format => 'fasta') or die "Failed to create outputfile $outputfile $!\n";

# Rewrite sequences in fasta format...
while( my $seq = $input->next_seq() ) {
    $output2->write_seq($seq);
}

unlink "$tempoutputfile";

print "\n--- Unique sequences are written in $outputfile ---\n\n";

exit;

I don't need to know where they are from, but you can probably modify the header to include this information... I also needed the whole lines and not only the basic header (display_id would write only the first word) Hope that can help you Please, if you find a way to make this better, simpler, more efficient, tell me!

ADD COMMENTlink written 8.3 years ago by 4Urelie20

Please be aware that you mention an email address in that code. I am not sure you really want to do that!

ADD REPLYlink written 8.3 years ago by Chris Evelo10.0k

That was in purpose, some people would only suggest improvements by email, and this one exists onlt for that ;)

ADD REPLYlink written 8.3 years ago by 4Urelie20
0
gravatar for 4Urelie
8.3 years ago by
4Urelie20
4Urelie20 wrote:

A friend found that on internet (very much simpler - you just need to concatenate your two fasta files before, and it will remove sequences which have same ID and same sequence):

use strict;
use Bio::SeqIO;
my %unique;

my $file = shift or die;
my $seqio  = Bio::SeqIO->new(-file => $file, -format => "fasta");
my $outseq = Bio::SeqIO->new(-file => ">$file.uniq", -format => "fasta");

while(my $seqs = $seqio->next_seq) {
    my $id  = $seqs->display_id;
    my $seq = $seqs->seq;
    unless(exists($unique{$seq})) {
        $outseq->write_seq($seqs);
        $unique{$seq} +=1;
    }
}
ADD COMMENTlink written 8.3 years ago by 4Urelie20
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1509 users visited in the last hour