Question

how to find and replace the atypical aas in a protein fasta-file, remembering their protein position

0

Entering edit mode

9.1 years ago

natasha.sernova ★ 4.0k

Dear ALL,

I am trying to replace atypical aminoacid (one letter symbol like U), that is either leucine or isoleucine.

I have a long bacterial protein fasta-file with the single line headers and single line seqs.

I need to produce the modified fasta-file with the replaced atypical aminoacids and an additional file saying where they where situated before the substitution. I have a very strange result - probably I made a mistake in curly brace positions.

I am looking at this file for too long, I've not seen any obvious mistake.

Please, help me to find it! Thank you very much!

Sincerely yours,
Natasha

#!/bin/env perl
use strict;
use warnings;
my $file_name1;
my $file_name2;
my $file_name3;
my $file_name4;
my $line;
my @seq_line;
my @string;

$file_name1 = $ARGV[0];
$file_name2 = $ARGV[1];
$file_name3 = $ARGV[2];
$file_name4 = $ARGV[3];

open(IN1, "<".$file_name2) or die; # fasta-file with lines as headers and protein seqs
open(OUT1, ">".$file_name3) or die; # my file headers and seqpositions of the wrong aa symb
open(OUT2, ">".$file_name4) or die; #modified fasta –file - programs do not like atypical aas.

while(my $line=<IN1>) {
    next if $line =~ /^\s$/; # skip empty line
#  next if $line =~ /^>/;
#  I have to print this line to my output file
    if($line =~ /^>/) {
        print OUT1 $line;
#      print OUT2 $line."\n";
#      next;
    }

#  split the whole seq into the aminoacids. No new-line symbol
    my @seq_line = split(//, $line);
    for(my $i=0; $i< scalar(@seq_line); $i++) {
        if ($seq_line[$i] eq "U") {
            $line=~s/U/L/;
            print OUT1 "position".$i." .atypical.protsymbol"."_".$file_name3."\n";
        }
        elsif ($seq_line[$i] eq "u"){
            $line=~s/u/l/;
            print OUT1 "position ".$i." .atypical.protsymbol"."_".$file_name3."\n";
        }
    }
    print OUT2 $line."\n";
}
next;
}

close(IN1);
close(OUT1);
close(OUT2);

perl fasta protein • 2.6k views

ADD COMMENT • link updated 22 months ago by Ram 43k • written 9.1 years ago by natasha.sernova ★ 4.0k

Ram · Accepted Answer · 2015-04-02

3

Entering edit mode

9.1 years ago

dylan.storey ▴ 60

The fastest way to do this would be to use the unix tool sed

sed 's/U/I/g' bad.fasta > good.fasta

If you're actually interested in the position you'll need to use regex captures + pos or a special internal variable something along the lines of ... go look at

#Find all the positions
while ( $sequence =~/(U)/g){
    push (@positions , pos($sequence))
     }

#replace all U with L
$seqeunces =~ tr/U/L/;

OR:

while ( $sequence =~/U/g){    
    push (@positions ,$-[0]); save begining of match
    push (@positions ,$+[0]); # save end of match
    )
#replace all U with L
$seqeunces =~ tr/U/L/;

ADD COMMENT • link updated 22 months ago by Ram 43k • written 9.1 years ago by dylan.storey ▴ 60

0

Entering edit mode

Thank you very much, I will try. The position is the main point - I won't be able to find the particular L without it. And it will give me a respective header, am I right?

ADD REPLY • link 9.1 years ago by natasha.sernova ★ 4.0k

Ram · Accepted Answer · 2015-04-02

2

Entering edit mode

9.1 years ago

5heikki 11k

This replaces all "U" with "I" outside headers:

awk '{if(/^>/)print $0; else {gsub("U","I"); print $0}}' file.fasta

Whereas this prints all positions (1-based) of "U" outside headers (I googled this one so I'm not certain, but it looks ok):

awk '{if(/^>/) print $0; else {s=$0; i=1; idx=0; while(i>0){i=match(s, /U/); if(i>0) {idx += i;print idx;s=substr(s, i+1);}}}}' file.fasta

Also, how come there's no gmatch (like sub/gsub)?

ADD COMMENT • link updated 22 months ago by Ram 43k • written 9.1 years ago by 5heikki 11k

0

Entering edit mode

Thank you very much! So, I have to run the second expression first, and then, knowing all existing positions of the atypical leucines in the file, replace them to the normal leucines with the first expression, right? Have I understood the idea?

ADD REPLY • link 9.1 years ago by natasha.sernova ★ 4.0k

1

Entering edit mode

exactly. This second expression is required if you want to know the positions without replacing it to normal LEU, otherwise the simple sed or awk expressions are enough to replace U with L.

ADD REPLY • link 9.1 years ago by venu 7.1k

1

Entering edit mode

Yeah, note that neither edits the files in place but instead you have to direct the output with > newFile

ADD REPLY • link 9.1 years ago by 5heikki 11k

0

Entering edit mode

You can also use the -i flag with sed if you're feeling confident.

ADD REPLY • link updated 22 months ago by Ram 43k • written 9.1 years ago by dylan.storey ▴ 60

0

Entering edit mode

The sed command edits also headers though..

ADD REPLY • link 9.1 years ago by 5heikki 11k

0

Entering edit mode

Thank you, it's important to remember.

ADD REPLY • link 9.1 years ago by natasha.sernova ★ 4.0k

0

Entering edit mode

I've run it, it's perfect! Thousand thanks for your help!

ADD REPLY • link 9.1 years ago by natasha.sernova ★ 4.0k