how to find and replace the atypical aas in a protein fasta-file, remembering their protein position
2
0
Entering edit mode
6.3 years ago
natasha.sernova ★ 3.8k

Dear ALL,

I am trying to replace atypical aminoacid  (one letter symbol like U), that is either leucine or isoleucine.

I have a long bacterial protein fasta-file with the single line headers and single line seqs.

I need to produce the modified fasta-file with the replaced atypical aminoacids and an additional file saying where they where situated before the substitution. I have a very strange result - probably I made a mistake in curly brace positions.

I’am looking at this file for too long, I’ve not seen any obvious mistake.

Please, help me to find it! Thank you very much!

Sincerely yours,

Natasha

#!/bin/env perl
use strict;
use warnings;
my $file_name1;
my $file_name2;
my $file_name3;
my $file_name4;
my $line;
my @seq_line;
my @string;

$file_name1 = $ARGV[0];
$file_name2 = $ARGV[1];
$file_name3 = $ARGV[2];
$file_name4 = $ARGV[3];

open(IN1, "<".$file_name2) or die; # fasta-file with lines as headers and protein seqs
open(OUT1, ">".$file_name3) or die; # my file headers and seqpositions of the wrong aa symb
open(OUT2, ">".$file_name4) or die; #modified fasta –file - programs do not like atypical aas.

while(my $line=<IN1>) {

                 next if $line =~ /^\s$/; # skip empty line

#                next if $line =~ /^>/;

#I have to print this line to my output file

                 if($line =~ /^>/) {

                                 print OUT1 $line;

#                                print OUT2 $line."\n";

#                                next;}

                

# split the whole seq into the aminoacids. No new-line symbol

                 my @seq_line = split(//, $line);

                 for(my $i=0; $i< scalar(@seq_line); $i++) {

         if ($seq_line[$i] eq "U"){

                 $line=~s/U/L/;

print OUT1 "position".$i." .atypical.protsymbol"."_".$file_name3."\n";

                                 }

         elsif ($seq_line[$i] eq "u"){

                 $line=~s/u/l/;

print OUT1 "position ".$i." .atypical.protsymbol"."_".$file_name3."\n";

                                 }                           

}

print OUT2 $line."\n";

}

next;

}

close(IN1);

close(OUT1);

close(OUT2);                   

 

 

 

 

protein fasta perl • 1.6k views
ADD COMMENT
3
Entering edit mode
6.3 years ago
dylan.storey ▴ 60

The fastest way to do this would be to use the unix tool sed

sed 's/U/I/g' bad.fasta > good.fasta  

If you're actually interested in the position you'll need to use regex captures + pos or a special internal variable something along the lines of ...  go look at 

http://perldoc.perl.org/functions/pos.html

http://perldoc.perl.org/perlvar.html

#Find all the positions
while ( $sequence =~/(U)/g){
    push (@positions , pos($sequence))
     }

#replace all U with L
$seqeunces =~ tr/U/L/;

OR: 

while ( $sequence =~/U/g){    
    push (@positions ,$-[0]); save begining of match
    push (@positions ,$+[0]); # save end of match
    )
#replace all U with L
$seqeunces =~ tr/U/L/;
ADD COMMENT
0
Entering edit mode

Thank you very much, I will try. The position is the main point - I won't be able to find the particular L without it. And it will give me a respective header, am I right?

ADD REPLY
2
Entering edit mode
6.3 years ago
5heikki 9.8k

This replaces all "U" with "I" outside headers:

awk '{if(/^>/)print $0; else {gsub("U","I"); print $0}}' file.fasta

Whereas this prints all positions (1-based) of "U" outside headers (I googled this one so I'm not certain, but it looks ok):

awk '{if(/^>/) print $0; else {s=$0; i=1; idx=0; while(i>0){i=match(s, /U/); if(i>0) {idx += i;print idx;s=substr(s, i+1);}}}}' file.fasta

 

Also, how come there's no gmatch (like sub/gsub)? 

ADD COMMENT
0
Entering edit mode

Thank you very much! So, I have to run the second expression first, and then, knowing all existing positions of the atypical leucines in the file, replace them to the normal leucines with the first expression, right? Have I understood the idea?

ADD REPLY
1
Entering edit mode

exactly. This second expression is required if you want to know the positions without replacing it to normal LEU, otherwise the simple sed or awk expressions are enough to replace U with L.

ADD REPLY
1
Entering edit mode

Yeah, note that neither edits the files in place but instead you have to direct the output with > newFile

ADD REPLY
0
Entering edit mode

you can also use the -i flag with sed if you're feeling confident. 

ADD REPLY
0
Entering edit mode

The sed command edits also headers though..

ADD REPLY
0
Entering edit mode

Thank you, it's important to remember.

ADD REPLY
0
Entering edit mode

I've run it, it's perfect! Thousand thanks for your help!

ADD REPLY

Login before adding your answer.

Traffic: 841 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6