Question: how to find and replace the atypical aas in a protein fasta-file, remembering their protein position
0
gravatar for natasha.sernova
4.1 years ago by
natasha.sernova3.4k
natasha.sernova3.4k wrote:

Dear ALL,

I am trying to replace atypical aminoacid  (one letter symbol like U), that is either leucine or isoleucine.

I have a long bacterial protein fasta-file with the single line headers and single line seqs.

I need to produce the modified fasta-file with the replaced atypical aminoacids and an additional file saying where they where situated before the substitution. I have a very strange result - probably I made a mistake in curly brace positions.

I’am looking at this file for too long, I’ve not seen any obvious mistake.

Please, help me to find it! Thank you very much!

Sincerely yours,

Natasha

#!/bin/env perl
use strict;
use warnings;
my $file_name1;
my $file_name2;
my $file_name3;
my $file_name4;
my $line;
my @seq_line;
my @string;

$file_name1 = $ARGV[0];
$file_name2 = $ARGV[1];
$file_name3 = $ARGV[2];
$file_name4 = $ARGV[3];

open(IN1, "<".$file_name2) or die; # fasta-file with lines as headers and protein seqs
open(OUT1, ">".$file_name3) or die; # my file headers and seqpositions of the wrong aa symb
open(OUT2, ">".$file_name4) or die; #modified fasta –file - programs do not like atypical aas.

while(my $line=<IN1>) {

                 next if $line =~ /^\s$/; # skip empty line

#                next if $line =~ /^>/;

#I have to print this line to my output file

                 if($line =~ /^>/) {

                                 print OUT1 $line;

#                                print OUT2 $line."\n";

#                                next;}

                

# split the whole seq into the aminoacids. No new-line symbol

                 my @seq_line = split(//, $line);

                 for(my $i=0; $i< scalar(@seq_line); $i++) {

         if ($seq_line[$i] eq "U"){

                 $line=~s/U/L/;

print OUT1 "position".$i." .atypical.protsymbol"."_".$file_name3."\n";

                                 }

         elsif ($seq_line[$i] eq "u"){

                 $line=~s/u/l/;

print OUT1 "position ".$i." .atypical.protsymbol"."_".$file_name3."\n";

                                 }                           

}

print OUT2 $line."\n";

}

next;

}

close(IN1);

close(OUT1);

close(OUT2);                   

 

 

 

 

protein perl fasta • 1.3k views
ADD COMMENTlink modified 4.1 years ago by 5heikki8.4k • written 4.1 years ago by natasha.sernova3.4k
3
gravatar for dylan.storey
4.1 years ago by
dylan.storey60
United States
dylan.storey60 wrote:

The fastest way to do this would be to use the unix tool sed

sed 's/U/I/g' bad.fasta > good.fasta  

If you're actually interested in the position you'll need to use regex captures + pos or a special internal variable something along the lines of ...  go look at 

http://perldoc.perl.org/functions/pos.html

http://perldoc.perl.org/perlvar.html

#Find all the positions
while ( $sequence =~/(U)/g){
    push (@positions , pos($sequence))
     }

#replace all U with L
$seqeunces =~ tr/U/L/;

OR: 

while ( $sequence =~/U/g){    
    push (@positions ,$-[0]); save begining of match
    push (@positions ,$+[0]); # save end of match
    )
#replace all U with L
$seqeunces =~ tr/U/L/;
ADD COMMENTlink modified 4.1 years ago • written 4.1 years ago by dylan.storey60

Thank you very much, I will try. The position is the main point - I won't be able to find the particular L without it. And it will give me a respective header, am I right?

ADD REPLYlink modified 4.1 years ago • written 4.1 years ago by natasha.sernova3.4k
2
gravatar for 5heikki
4.1 years ago by
5heikki8.4k
Finland
5heikki8.4k wrote:

This replaces all "U" with "I" outside headers:

awk '{if(/^>/)print $0; else {gsub("U","I"); print $0}}' file.fasta

Whereas this prints all positions (1-based) of "U" outside headers (I googled this one so I'm not certain, but it looks ok):

awk '{if(/^>/) print $0; else {s=$0; i=1; idx=0; while(i>0){i=match(s, /U/); if(i>0) {idx += i;print idx;s=substr(s, i+1);}}}}' file.fasta

 

Also, how come there's no gmatch (like sub/gsub)? 

ADD COMMENTlink modified 4.1 years ago • written 4.1 years ago by 5heikki8.4k

Thank you very much! So, I have to run the second expression first, and then, knowing all existing positions of the atypical leucines in the file, replace them to the normal leucines with the first expression, right? Have I understood the idea?

ADD REPLYlink written 4.1 years ago by natasha.sernova3.4k
1

exactly. This second expression is required if you want to know the positions without replacing it to normal LEU, otherwise the simple sed or awk expressions are enough to replace U with L.

ADD REPLYlink written 4.1 years ago by venu6.1k
1

Yeah, note that neither edits the files in place but instead you have to direct the output with > newFile

ADD REPLYlink modified 4.1 years ago • written 4.1 years ago by 5heikki8.4k

you can also use the -i flag with sed if you're feeling confident. 

ADD REPLYlink written 4.1 years ago by dylan.storey60

The sed command edits also headers though..

ADD REPLYlink written 4.1 years ago by 5heikki8.4k

Thank you, it's important to remember.

ADD REPLYlink written 4.1 years ago by natasha.sernova3.4k

I've run it, it's perfect! Thousand thanks for your help!

ADD REPLYlink modified 4.1 years ago • written 4.1 years ago by natasha.sernova3.4k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1171 users visited in the last hour