Question: Matching peptides which differ by one letter?
0
gravatar for kimmcl860
4 months ago by
kimmcl8600
kimmcl8600 wrote:

Is there any I can match peptides differing by only one letter in two different TSV files? My original peptides in file 1 contains a point mutation represented by a lowercase letter. File 2 contains mutated versions of the same peptides, but changed to the remaining 19 amino acids in the same location as the lowercase letter.

If there is a match I would also like to divide the binding affinity of the original in File 1, column 4 by the binding affinity of the mutations in File 2, column 4.

File 1:

TCGA-A3-3316-0  HLA-A0201   SLVFLPFnT   1.90<=WB
TCGA-A3-3316-0  HLA-A0201   KLLLAtKSL   1.70<=WB    
TCGA-A3-3316-0  HLA-A0201   sIWKHATPV   0.30<=SB    
TCGA-A3-3316-0  HLA-A0201   KVTSIQhWV   1.90<=WB
TCGA-A3-3316-0  HLA-A0201   MtYDRYVAI   1.10<=WB

File 2:

TCGA-A3-3316-0  HLA-A0201   KVTSIQAWV   2
TCGA-A3-3316-0  HLA-A0201   KVTSIQCWV   2.5
TCGA-A3-3316-0  HLA-A0201   KVTSIQDWV   4.5
TCGA-A3-3316-0  HLA-A0201   KVTSIQEWV   3.5
TCGA-A3-3316-0  HLA-A0201   KVTSIQFWV   0.4
TCGA-A3-3316-0  HLA-A0201   KVTSIQGWV   7.5
TCGA-A3-3316-0  HLA-A0201   KVTSIQIWV   1.2
TCGA-A3-3316-0  HLA-A0201   KVTSIQKWV   9
TCGA-A3-3316-0  HLA-A0201   KVTSIQLWV   1.4
TCGA-A3-3316-0  HLA-A0201   KVTSIQMWV   1.1
TCGA-A3-3316-0  HLA-A0201   KVTSIQNWV   3.5
TCGA-A3-3316-0  HLA-A0201   KVTSIQPWV   1.5
TCGA-A3-3316-0  HLA-A0201   KVTSIQQWV   3.5
TCGA-A3-3316-0  HLA-A0201   KVTSIQRWV   7.5
TCGA-A3-3316-0  HLA-A0201   KVTSIQSWV   3.5
TCGA-A3-3316-0  HLA-A0201   KVTSIQTWV   2.5
TCGA-A3-3316-0  HLA-A0201   KVTSIQVWV   1.6
programming script • 210 views
ADD COMMENTlink modified 4 months ago by cpad01128.3k • written 4 months ago by kimmcl8600

You're going to need a custom script for it. For each entry in file1, you can replace the lower case character by a ., then use that as a word-bounded regex to get matching lines from your file2.

I'm not giving you code as this is an excellent opportunity to try your hand at some text processing code. I'd recommend Python because it's easier to code in.

ADD REPLYlink written 4 months ago by Ram17k
1
gravatar for Alex Reynolds
4 months ago by
Alex Reynolds25k
Seattle, WA USA
Alex Reynolds25k wrote:
#!/usr/bin/env python

import sys
import re

f1 = 'file1.txt'
f2 = 'file2.txt'

aa_table = [ "F", "L", "I", "M", "V", "S", "P", "T", "A", "Y", "H", "Q", "N", "K", "D", "E", "C", "W", "R", "G" ]

patterns = []
affinities = {}

with open(f1, "r") as f1h:
    for f1l in f1h:
        (tcga, hla, res, aff) = f1l.rstrip().split('\t')
        idx = int([i for i, c in enumerate(res) if c.islower()][0])
        aa_table_filtered = [x for x in aa_table if x != res[idx].upper()]
        pat = res[:idx] + '[' + '|'.join(aa_table_filtered) + ']' + res[(idx+1):]
        aff_head, aff_sep, aff_tail = aff.partition('<=')
        patterns.append(pat)
        affinities[pat] = float(aff_head)

with open(f2, "r") as f2h:
    for f2l in f2h:
        (tcga, hla, res, aff) = f2l.rstrip().split('\t')
        aff = float(aff)
        for pattern in patterns:
            if re.search(r"%s" % (pattern), res):
                sys.stdout.write("%s\n" % '\t'.join([tcga, hla, res, str(affinities[pattern]/aff)]))
                break

Result:

$ ./findPointMutations.py
TCGA-A3-3316-0  HLA-A0201       KVTSIQAWV       0.95
TCGA-A3-3316-0  HLA-A0201       KVTSIQCWV       0.76
TCGA-A3-3316-0  HLA-A0201       KVTSIQDWV       0.4222222222222222
TCGA-A3-3316-0  HLA-A0201       KVTSIQEWV       0.5428571428571428
TCGA-A3-3316-0  HLA-A0201       KVTSIQFWV       4.749999999999999
TCGA-A3-3316-0  HLA-A0201       KVTSIQGWV       0.2533333333333333
TCGA-A3-3316-0  HLA-A0201       KVTSIQIWV       1.5833333333333333
TCGA-A3-3316-0  HLA-A0201       KVTSIQKWV       0.2111111111111111
TCGA-A3-3316-0  HLA-A0201       KVTSIQLWV       1.3571428571428572
TCGA-A3-3316-0  HLA-A0201       KVTSIQMWV       1.727272727272727
TCGA-A3-3316-0  HLA-A0201       KVTSIQNWV       0.5428571428571428
TCGA-A3-3316-0  HLA-A0201       KVTSIQPWV       1.2666666666666666
TCGA-A3-3316-0  HLA-A0201       KVTSIQQWV       0.5428571428571428
TCGA-A3-3316-0  HLA-A0201       KVTSIQRWV       0.2533333333333333
TCGA-A3-3316-0  HLA-A0201       KVTSIQSWV       0.5428571428571428
TCGA-A3-3316-0  HLA-A0201       KVTSIQTWV       0.76
TCGA-A3-3316-0  HLA-A0201       KVTSIQVWV       1.1874999999999998
ADD COMMENTlink modified 4 months ago • written 4 months ago by Alex Reynolds25k

I think you are dividing affinity_file2 by affinity_file1, and OP wants the other way around.

ADD REPLYlink written 4 months ago by h.mon18k

Thanks, should post a fix shortly.

ADD REPLYlink written 4 months ago by Alex Reynolds25k

I added a break at the end of a pattern match. That should speed this up a bit, on the assumption that if a match is found, there isn't any need to walk through the rest.

ADD REPLYlink written 4 months ago by Alex Reynolds25k
1
gravatar for h.mon
4 months ago by
h.mon18k
Brazil
h.mon18k wrote:

For a problem like this one, it is unfathomable there is a python answer and not a perl answer. So here it is:

#!/usr/bin/env perl

use warnings;
use strict;

my $file1 = shift @ARGV;
my $file2 = shift @ARGV;
my %hash = ();

open(IN1, "<", $file1) or die "Can't open input file. Please provide a valid filename.\n";
while (<IN1>) {
    s/\R//;
    my @line = split( "\t", $_ );
    my ($affinity, @trash) = split( "<=", $line[3] );
    $hash{$line[2]} = $affinity;
}
close(IN1);

open(IN2, "<", $file2) or die "Can't open input file. Please provide a valid filename.\n";
while (<IN2>) {
    s/\R//;
    my @line = split( "\t", $_ );
    foreach my $key (keys %hash) {
        if ( hd( $key, $line[2] ) == 1 ){
            my $new_aff = $hash{$key} / $line[3];
            print "$line[2]\t$hash{$key}\t$line[3]\t$new_aff\n";
        }
    }
}
close(IN2);

# http://www.perlmonks.org/?node_id=500235
sub hd{ length( $_[ 0 ] ) - ( ( $_[ 0 ] ^ $_[ 1 ] ) =~ tr[\0][\0] ) }

Save it as pep_affinity.pl, and call it with ./pep_affinity.pl file1 file2.

One assumption of the script is the the only peptides with a one amino-acid difference between file1 and file2 are the mutated ones - it doesn't specifically checks for the lower-case letters. It is pretty inefficient, as for every peptide in file2, every peptide in file1 is looped through to check for the 1 amino-acid difference.

ADD COMMENTlink modified 4 months ago • written 4 months ago by h.mon18k

That bit at the end is pretty darn clever. Wish I could give more than one vote, but I'm not Putin.

ADD REPLYlink written 4 months ago by Alex Reynolds25k
0
gravatar for cpad0112
4 months ago by
cpad01128.3k
India
cpad01128.3k wrote:

in R:

df1= read.csv("test1.txt", sep = "\t", stringsAsFactors = F, header = F)
df2= read.csv("test2.txt", sep = "\t", stringsAsFactors = F, header = F)

library(stringr)

df1$V6=as.numeric(str_split_fixed(df1$V4,"<",3)[,1])

library(fuzzyjoin)
df3=stringdist_inner_join(df1, df2, by=c("V3"="V3"), max_dist=1,ignore_case=T)
df4=df3[,c(1:3,6,10)]
df4$V7=df4$V6/df4$V4.y

output:

             V1.x      V2.x      V3.x  V6 V4.y        V7
1  TCGA-A3-3316-0 HLA-A0201 KVTSIQhWV 1.9  2.0 0.9500000
2  TCGA-A3-3316-0 HLA-A0201 KVTSIQhWV 1.9  2.5 0.7600000
3  TCGA-A3-3316-0 HLA-A0201 KVTSIQhWV 1.9  4.5 0.4222222
4  TCGA-A3-3316-0 HLA-A0201 KVTSIQhWV 1.9  3.5 0.5428571
5  TCGA-A3-3316-0 HLA-A0201 KVTSIQhWV 1.9  0.4 4.7500000
6  TCGA-A3-3316-0 HLA-A0201 KVTSIQhWV 1.9  7.5 0.2533333
7  TCGA-A3-3316-0 HLA-A0201 KVTSIQhWV 1.9  1.2 1.5833333
8  TCGA-A3-3316-0 HLA-A0201 KVTSIQhWV 1.9  9.0 0.2111111
9  TCGA-A3-3316-0 HLA-A0201 KVTSIQhWV 1.9  1.4 1.3571429
10 TCGA-A3-3316-0 HLA-A0201 KVTSIQhWV 1.9  1.1 1.7272727
11 TCGA-A3-3316-0 HLA-A0201 KVTSIQhWV 1.9  3.5 0.5428571
12 TCGA-A3-3316-0 HLA-A0201 KVTSIQhWV 1.9  1.5 1.2666667
13 TCGA-A3-3316-0 HLA-A0201 KVTSIQhWV 1.9  3.5 0.5428571
14 TCGA-A3-3316-0 HLA-A0201 KVTSIQhWV 1.9  7.5 0.2533333
15 TCGA-A3-3316-0 HLA-A0201 KVTSIQhWV 1.9  3.5 0.5428571
16 TCGA-A3-3316-0 HLA-A0201 KVTSIQhWV 1.9  2.5 0.7600000
17 TCGA-A3-3316-0 HLA-A0201 KVTSIQhWV 1.9  1.6 1.1875000
ADD COMMENTlink modified 4 months ago • written 4 months ago by cpad01128.3k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1511 users visited in the last hour