Identifying Amino Acids In A FASTA Sequence File By Their Properties (Hydrophobic, Charged Etc)
11.6 years ago
Shweta ▴ 20

I have a protein sequence in a file. I want to check whether the pattern hxxhcxc is present in the sequence and, if so, print the matching stretch. Here, h = hydrophobic, c = charged, x = any residue (including the remaining ones). How can I do this in Perl?

What I could think of is make 3 arrays—of hydrophobic, charged and all residues. Compare each array with the file having the FASTA sequence. I can't think of anything beyond this, especially how to maintain the order—that's the main thing. I am a beginner in Perl, so please make the explanation as simple as possible.

Thanks in advance.

perl sequence protein • 3.6k views
11.6 years ago
Eric ▴ 40

What you need is a regular expression.

This script should do it. The code could be compacted a bit, but I thought this was more readable:

#!/usr/bin/perl
use strict;
use warnings;

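# Read one FASTA record at a time by using ">" as the input record separator
$/ = ">";
<>;    # discard the empty chunk before the first ">"

while (my $record = <>) {
    chomp $record;    # remove the trailing ">" that ends each record
    my ($header, @seq) = split /\r?\n/, $record;    # also tolerates Windows line endings
    my $sequence = join '', @seq;

    # Find all non-overlapping occurrences of the pattern.
    # hydrophobic = [AVILMFYW], charged = [RHKDE], "." matches any character
    while ( $sequence =~ m/([AVILMFYW]..[AVILMFYW][RHKDE].[RHKDE])/gi ) {
        print "$1\n";
    }
}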

Hi, thanks for the reply. But this is not showing any output. There are no errors though; the prompt just moves on to the next line. Also, I didn't quite understand the code (please pardon my ignorance), so if you could briefly walk through what it does, I would be able to refrain from asking silly questions in future.


If your FASTA-formatted file is named sequences.fa, usage would be:

perl script.pl sequences.fa > output

The script first changes the input record separator ($/) from a newline (\n) to ">", which is the first character of each FASTA record. This lets you cycle through the sequences one at a time. The bare <> that follows reads and discards the empty chunk before the first ">". I believe I saw this method in "Beginning Perl for Bioinformatics". I use it often.

The outer while loop cycles through each FASTA record and combines all of the lines of sequence into a single variable, $sequence, so that we can match the pattern even if it is on multiple lines.
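To make the record-by-record reading concrete, here is a minimal sketch that applies the same separator trick to a small in-memory example instead of a file. The helper name read_fasta_records and the two example sequences are made up for illustration, and the sketch assumes that no header line itself contains a ">" character.

```perl
use strict;
use warnings;

# Split FASTA text into [header, sequence] pairs using ">" as the record separator.
# read_fasta_records is a hypothetical helper name, not part of the original script.
sub read_fasta_records {
    my ($fasta_text) = @_;
    open my $fh, '<', \$fasta_text or die "Cannot open in-memory file: $!";
    local $/ = ">";                    # a "line" now ends at the next ">"
    my @records;
    while (my $record = <$fh>) {
        chomp $record;                 # strip the trailing ">" separator
        next unless $record =~ /\S/;   # skip the empty chunk before the first ">"
        my ($header, @seq) = split /\r?\n/, $record;
        push @records, [ $header, join('', @seq) ];   # sequence lines joined into one string
    }
    close $fh;
    return @records;
}

# Made-up example data: the first sequence is wrapped over two lines
my $example = ">seq1 made-up protein\nMKVL\nAARD\n>seq2\nGGGG\n";
for my $rec (read_fasta_records($example)) {
    print "$rec->[0]\t$rec->[1]\n";
}
```

Each record comes back with its header and a single unwrapped sequence string, which is what lets a pattern match even when it spans a line break in the file.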

The inner while loop finds each non-overlapping occurrence of the regular expression within $sequence and prints it to STDOUT.
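One thing to be aware of: a plain /g loop resumes searching after the end of the previous match, so a stretch that overlaps an earlier hit is skipped. If overlapping hits matter, one possible variation (not what the script above does) is to wrap the pattern in a lookahead. The sketch below also records the start position of each hit; find_motifs is a hypothetical helper name and the test sequence is made up.

```perl
use strict;
use warnings;

# find_motifs is a hypothetical helper; the residue classes are those from the script.
sub find_motifs {
    my ($sequence) = @_;
    my @hits;
    # The lookahead (?=...) matches without consuming characters, so the search
    # resumes one residue later and overlapping hits are reported as well.
    while ($sequence =~ /(?=([AVILMFYW]..[AVILMFYW][RHKDE].[RHKDE]))/gi) {
        push @hits, [ $-[1] + 1, $1 ];   # 1-based start position, matched stretch
    }
    return @hits;
}

for my $hit (find_motifs('AAAAKAKAK')) {
    print "position $hit->[0]: $hit->[1]\n";
}
# prints:
# position 1: AAAAKAK
# position 3: AAKAKAK
```

Without the lookahead, only the first of these two stretches would be printed, because the second one starts inside the first.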

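The real work of the script is the regular expression: m/([AVILMFYW]..[AVILMFYW][RHKDE].[RHKDE])/gi. Each bracketed class matches one residue from that set, each "." matches any single character, /g repeats the search to find every match, and /i ignores case.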
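Because each position in the regular expression corresponds to one letter of the motif, the order is preserved automatically. As a rough sketch of that correspondence, the motif string from the question can be translated letter by letter into the same pattern. The helper name motif_to_regex and the test sequence are made up for illustration, and the residue groupings are simply the ones used in the script; other classifications of hydrophobic and charged residues exist.

```perl
use strict;
use warnings;

# Residue groupings taken from the script above
my %class_for = (
    h => '[AVILMFYW]',   # hydrophobic
    c => '[RHKDE]',      # charged
    x => '.',            # any residue
);

# motif_to_regex is a hypothetical helper name, not part of the original script.
sub motif_to_regex {
    my ($motif) = @_;
    my $pattern = '';
    for my $symbol (split //, lc $motif) {
        die "Unknown motif symbol '$symbol'\n" unless exists $class_for{$symbol};
        $pattern .= $class_for{$symbol};   # append one class per motif letter, in order
    }
    return qr/($pattern)/i;                # capture the matched stretch, ignore case
}

my $re = motif_to_regex('hxxhcxc');
print "matched: $1\n" if 'GGAGGAKGKGG' =~ $re;
# prints: matched: AGGAKGK
```

Writing the motif this way makes it easy to try a different arrangement, such as hcxxh, without editing the regular expression by hand.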

Regular expressions can be tricky, but they are very powerful. There are several good books on regular expressions, and most Perl books will have at least a section on them. "Mastering Regular Expressions" is the serious guide, but there are many good tutorials online.
