Hello,
I am looking for the way to convert amino acid sequences to numerical values based on custom file.
Those numerical values were mixed up with positive and negative values.
Reference file: tab separated
A R N D C Q E G H K
-0.5 3 0.2 3 -1 0.2 3 0 -0.5 5
#!/usr/bin/perl
use strict;
use warnings;$ARGV[2] or die "use convert.pl ref_data fasta_in fasta_out\n";
my $ref_file=shift @ARGV;
my $in_file=shift @ARGV;
my $out_file=shift @ARGV;
my %code =();
my @aa =();
open (my $rh, "<", $ref_file) or die "cannot read $ref_file\n";
my $nl= 0;while(<$rh>){$nl++;
chomp;
my @a =split(/\t/, $_);if($nl== 1){# capture first line with aa
@aa = @a;}
elsif ($nl== 2){# capture second line with values, combine both in a Hashfor(my $i= 0;$i<=$#a; $i++) {$code{$aa[$i]}=$a[$i];}}else{ last;}}
close $rh;
open (my $ih, "<", $in_file) or die "cannot read $in_file\n";
open (my $oh, ">", $out_file) or die "cannot write $out_file\n";
$/ ="\n>";# read fasta in blockswhile(<$ih>){
s/>//g;
my ($id, @seq)=split(/\n/, $_);
my $seq=join"", @seq;# reconstruct sequence if multiline
print $oh">$id\n";# fasta header
@seq =split(//, $seq);
my @val =();
foreach my $aa(@seq){if(defined $code{$aa}){ push @val, $code{$aa};}else{ die "$aa is not defined in $ref_file\n";}}
print $ohjoin"\t", @val;
print $oh"\n";}
close $ih;
close $oh;
Making some major assumptions about the actual file formats, I'd do something like:
from Bio import SeqIO
with open('reference.txt', 'r') as rh:
content = rh.read()
mappings = dict(zip(content[0].split(), content[1].split())for record in SeqIO.parse('sequences.fasta', 'fasta'):
scores =[]for char in str(record.seq):
try:
scores.append(mappings[char])
except KeyError:
scores.append('.')
print(">" + record.description + "\n" +
" ".join(scores))
I have just written this from memory/off the top of my head so it's untested. I also expect there's a more concise and faster way to do it with str.translate but this will probably work (with some minor tweaks).
The try/except block will prevent this crashing in the even that you have a residue in the sequence that isn't in the numerical mappings, but it will put a dummy character in (.), so its up to you how you want to handle this.
this looks like a homework ... otherwise please explain why do you need the file in your research or offer to pay at least a coffee for a script ...
Those custom values are from AAindex by amino acid property. I will use it these results for research purpose only after converting.
I would do it in python and use dictionaries