Question: Array of hashes in Perl
natasha.sernova (3.7k) wrote, 5.5 years ago:

Dear colleagues,
I need your advice. I have a mixture of about a thousand protein sequences in FASTA format. Their names are always different, but their sequences are sometimes the same.
I would like to get rid of any repeats automatically.
Is there a simple way to do it? "Seen" hashes may help, but I don't know exactly how many hashes I would need to create.
Creating an array of these hashes seems too complicated to me.
Is that the only way to do it?
Many thanks for your help!

sequence • 1.1k views
modified 5.5 years ago by JC (9.1k) • written 5.5 years ago by natasha.sernova (3.7k)

Check the thread "How To Remove The Same Sequences In The Fasta Files?" for ways to remove duplicate sequences.

written 5.5 years ago by Prakki Rama (2.3k)

Many thanks! I am very bad at Python, but this is a good reason to learn it better.

I have to try; I have no choice.
 

written 5.5 years ago by natasha.sernova (3.7k)

Check Pierre's post in the thread linked above; there you can simply run a command to remove duplicates without writing any program.

modified 5.5 years ago • written 5.5 years ago by Prakki Rama (2.3k)

Really? That's great! But I didn't quite understand which "Pierre's post" you are talking about. It's my dream: just run a command to remove duplicates without writing any program. Please help me find it! A THOUSAND THANKS!

Natasha

written 5.5 years ago by natasha.sernova (3.7k)

:) Click the link "How To Remove The Same Sequences In The Fasta Files?"

modified 5.5 years ago • written 5.5 years ago by Prakki Rama (2.3k)

I assume that by "repeats" you mean "duplicate sequences" as opposed to sequence repeats.

written 5.5 years ago by Neilfws (48k)

Yes, exactly. I mean multiple duplicate sequences.

written 5.5 years ago by natasha.sernova (3.7k)

JC (9.1k, Mexico) wrote, 5.5 years ago:

Well, a single hash is enough: use the sequence as the "key" and the sequence ID as the "value", so that identical sequences collapse into one entry. For example:

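#!/usr/bin/perl
use strict;
use warnings;

my %seqs;     # sequence => list of IDs that share it
my @order;    # unique sequences in first-seen order
$/ = "\n>";   # read one FASTA record at a time
while (<>) {
    chomp;                      # drop the trailing "\n>" separator
    s/^>//;                     # drop the leading ">" of the first record
    my ($id, @seq) = split /\r?\n/, $_;
    next unless defined $id;    # skip empty records
    my $seq = join "", @seq;
    push @order, $seq unless exists $seqs{$seq};
    push @{ $seqs{$seq} }, $id;
}

# One record per unique sequence; IDs of duplicates are joined with commas
for my $seq (@order) {
    print ">", join(",", @{ $seqs{$seq} }), "\n$seq\n";
}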

then you can run it as:

perl removeDuplicates.pl < original.fasta > unique.fasta
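
The "seen" hash idea from the question can also be sketched directly, assuming it is acceptable to keep only the first ID for each sequence instead of joining all the IDs. The sub name `unique_records` and the sample sequences are made up for illustration and are not from the thread. One hash is all it takes; no array of hashes is needed.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): read FASTA records from a
# filehandle and return [id, sequence] pairs for sequences not seen before.
sub unique_records {
    my ($fh) = @_;
    local $/ = "\n>";              # one FASTA record per read
    my (%seen, @unique);
    while (my $record = <$fh>) {
        chomp $record;             # remove the trailing "\n>" separator
        $record =~ s/^>//;         # remove the leading ">" of the first record
        my ($id, @lines) = split /\r?\n/, $record;
        next unless defined $id;   # skip empty records
        my $seq = join "", @lines;
        next if $seen{$seq}++;     # true from the second occurrence onwards
        push @unique, [$id, $seq];
    }
    return @unique;
}

# Small in-memory example with made-up sequences
my $fasta = ">a\nMKV\nLLT\n>b\nMKVLLT\n>c\nGGA\n";
open my $in, '<', \$fasta or die "Cannot open in-memory file: $!";
print ">$_->[0]\n$_->[1]\n" for unique_records($in);
close $in;
```

On the sample data this prints records `a` and `c` and drops `b`, whose sequence duplicates that of `a`. To process a real file, the same sub could be called as `unique_records(\*STDIN)` with the input redirected as in the command above.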

 

modified 5.5 years ago by Alex Reynolds (29k) • written 5.5 years ago by JC (9.1k)

THANK YOU! I was always very much afraid of such complicated data structures...

That's great!

Natasha

written 5.5 years ago by natasha.sernova (3.7k)

You're welcome, I'm glad to help.

written 5.5 years ago by JC (9.1k)