Question

Converting amino acid sequences to numerical values

1

Entering edit mode

5.7 years ago

karish03 ▴ 10

Hello, I am looking for the way to convert amino acid sequences to numerical values based on custom file.
Those numerical values were mixed up with positive and negative values.

Reference file: tab separated

A   R   N   D   C   Q   E   G   H    K
-0.5    3   0.2 3   -1  0.2 3   0   -0.5    5

Input file: Multi-fasta file

>Example1
ARN
>Example2
DCQEGH

Expected output: tab separated

>Example1
 -0.5   3   0.2 
>Example2
  3 -1   0.2    3   0   -0.5

Thank you

R perl python sequence script • 1.8k views

ADD COMMENT • link updated 5.7 years ago by JC 13k • written 5.7 years ago by karish03 ▴ 10

0

Entering edit mode

this looks like a homework ... otherwise please explain why do you need the file in your research or offer to pay at least a coffee for a script ...

ADD REPLY • link 5.7 years ago by JC 13k

0

Entering edit mode

Those custom values are from AAindex by amino acid property. I will use it these results for research purpose only after converting.

ADD REPLY • link 5.7 years ago by karish03 ▴ 10

0

Entering edit mode

I would do it in python and use dictionaries

ADD REPLY • link 5.7 years ago by gb ★ 2.2k

score 1 · Answer 1 · 2019-10-31

1

Entering edit mode

5.7 years ago

zx8754 12k

These examples should help you to get started:

strsplit("XYZ", "")
# [[1]]
# [1] "X" "Y" "Z"

lookup <- setNames(c(11, 22, 33), c("X", "Y", "Z"))
lookup[ c("Z", "Z", "Y") ]
#  Z  Z  Y 
# 33 33 22

ADD COMMENT • link 5.7 years ago by zx8754 12k

0

Entering edit mode

Thank you. I will check it.

ADD REPLY • link 5.7 years ago by karish03 ▴ 10

score 1 · Answer 2 · 2019-11-05

here how I could do in Perl:

#!/usr/bin/perl

use strict;
use warnings;

$ARGV[2] or die "use convert.pl ref_data fasta_in fasta_out\n";
my $ref_file = shift @ARGV;
my $in_file  = shift @ARGV;
my $out_file = shift @ARGV;

my %code = ();
my @aa   = ();
open (my $rh, "<", $ref_file) or die "cannot read $ref_file\n";
my $nl = 0;
while (<$rh>) {
    $nl++;
    chomp;
    my @a = split (/\t/, $_);
    if ($nl == 1) { # capture first line with aa
        @aa = @a;
    }
    elsif ($nl == 2) { # capture second line with values, combine both in a Hash
        for (my $i = 0; $i<=$#a; $i++) {
            $code{ $aa[$i] } = $a[$i];
        }
    }
    else { last; }
}
close $rh;

open (my $ih, "<", $in_file) or die "cannot read $in_file\n";
open (my $oh, ">", $out_file) or die "cannot write $out_file\n";
$/ = "\n>"; # read fasta in blocks
while (<$ih>) {
    s/>//g;
    my ($id, @seq) = split (/\n/, $_);
    my $seq = join "", @seq; # reconstruct sequence if multiline
    print $oh ">$id\n"; # fasta header
    @seq = split (//, $seq);
    my @val = ();
    foreach my $aa (@seq) {
        if (defined $code{$aa}) { push @val, $code{$aa}; }
        else { die "$aa is not defined in $ref_file\n";  }
    }
    print $oh join "\t", @val;
    print $oh "\n";
}
close $ih;
close $oh;

Testing:

$ perl convert.pl ref.data seqs.pep seqs.vals
$ cat seqs.vals
>Example1
-0.5    3       0.2
>Example2
3       -1      0.2     3       0       -0.5

score 0 · Answer 3 · 2019-10-31

Making some major assumptions about the actual file formats, I'd do something like:

from Bio import SeqIO

with open('reference.txt', 'r') as rh:
    content = rh.read()
    mappings = dict(zip(content[0].split(), content[1].split())

for record in SeqIO.parse('sequences.fasta', 'fasta'):
    scores = []
    for char in str(record.seq):
        try:
            scores.append(mappings[char])
        except KeyError:
            scores.append('.')

    print(">" + record.description + "\n" +
          " ".join(scores))

I have just written this from memory/off the top of my head so it's untested. I also expect there's a more concise and faster way to do it with str.translate but this will probably work (with some minor tweaks).

The try/except block will prevent this crashing in the even that you have a residue in the sequence that isn't in the numerical mappings, but it will put a dummy character in (.), so its up to you how you want to handle this.