Question: Perl code to get the start and end positions of a list of peptides within their corresponding protein sequences
genie66 (United States) wrote 4.4 years ago:

I have a list of protein names, each with a list of corresponding peptides. I want to find the start and end position of each peptide within its corresponding protein sequence. Since the number of peptides is very large, this cannot be done manually. How can I do this programmatically? Please help me out.

Example:

 Protein                     peptide

A1AT_HUMAN LSITGTYDLK
A1AT_HUMAN SVLGQLGITK
A1BG_HUMAN NGVAQEPVHLDSPAIK
A1BG_HUMAN SGLSTGWTQLSK
A2GL_HUMAN DLLLPQPDLR
A2GL_HUMAN VAAGAFQGLR
A4_HUMAN LVFFAEDVGSNK
A4_HUMAN THPHFVIPYR
A4_HUMAN WYFDVTEGK
modified 4.4 years ago by Siva • written 4.4 years ago by genie66

This is published data. These peptides were used in a mass spec analysis, and I need the positions of the peptides within the corresponding proteins.

written 4.4 years ago by genie66
RamRS (Houston, TX) wrote 4.4 years ago:

This would be my approach:

1. Read the file with IDs and peptide sequences (as shown in your question) into a hash, with the protein ID as the key and an array of its peptide sequences as the value. This is because you have multiple sequences to be searched per ID.

2. Iterate through the FASTA file sequence by sequence.

3. For each sequence, run a Perl regex match operation and print the match location (details on this here: How can I find the location of a regex match in Perl?)

That should do the trick.

written 4.4 years ago by RamRS
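
The three steps above can be sketched in Perl roughly as follows. This is only an illustration under stated assumptions: the protein sequences are short invented strings held in memory instead of a real FASTA file, and the names %peptides_for, %sequence_for and first_match are hypothetical. $-[0] and $+[0] are Perl's built-in match offsets (the start of the last match and the offset just past its end).

```perl
use strict;
use warnings;

# Step 1: protein ID => array of peptides (several peptides per ID).
# In practice this would be filled by reading the peptide list file.
my %peptides_for = (
    PROT1_TEST => [ 'LSITK', 'GQLGK' ],
    PROT2_TEST => [ 'WYFDK' ],
);

# Step 2: sequences keyed by the same IDs. These are invented toy
# sequences standing in for records read from a FASTA file.
my %sequence_for = (
    PROT1_TEST => 'MPSSLSITKAAGQLGKEE',
    PROT2_TEST => 'MKWYFDKTT',
);

# Step 3: regex match. Returns 1-based (start, end) of the first match,
# or an empty list if the peptide does not occur.
sub first_match {
    my ($seq, $pep) = @_;
    return unless length $pep;
    # \Q...\E makes the peptide match literally, not as a pattern.
    return unless $seq =~ /\Q$pep\E/;
    # $-[0] is the 0-based start; $+[0] is the offset just past the match,
    # which is the same number as the 1-based inclusive end.
    return ($-[0] + 1, $+[0]);
}

for my $id (sort keys %sequence_for) {
    for my $pep (@{ $peptides_for{$id} || [] }) {
        my ($start, $end) = first_match($sequence_for{$id}, $pep);
        if (defined $start) {
            print join("\t", $id, $pep, $start, $end), "\n";
        }
        else {
            print join("\t", $id, $pep, 'not found'), "\n";
        }
    }
}
```

With the toy data above, LSITK is reported at 5 to 9, GQLGK at 12 to 16, and WYFDK at 3 to 7. Only the first match per peptide is reported here; repeated occurrences would need a loop.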

I am sorry. I am very new to programming. Can't this be done with a simple program?

written 4.4 years ago by genie66

Not really. I could suggest a simple program, but it would be imperfect and would not address the scenarios I see here, let alone contingencies for scenarios that might come up.

written 4.4 years ago by RamRS

I'd suggest Python if you're new to programming. You could use the str.index() method or the technique shown here: http://stackoverflow.com/a/250303/1394178

modified 4.4 years ago • written 4.4 years ago by RamRS

Thanks for your suggestion. Can I use the same logic (which you described for Perl) in Python to run through all the peptide and protein sequences and find the positions?

modified 4.4 years ago • written 4.4 years ago by genie66

Yes, the logic should hold in Python.

written 4.4 years ago by RamRS
Siva (United States) wrote 4.4 years ago:

Once you have the protein sequences from UniProt, you can use Perl's index(), which takes a string (the protein sequence, in your case) and a substring (the peptide sequence, in your case) to search for in the string, and returns the start position (0-based) of the first occurrence of the substring, or -1 if there is none. Adding the length of the peptide to this 0-based start gives the end position in 1-based coordinates; add 1 to the start if you want it 1-based as well.

I hope you are taking into account that there may be more than one match for your peptide sequence in the protein sequence. In that case, you should first check how many times the peptide sequence occurs in the protein sequence. For a general idea, check this sample code

written 4.4 years ago by Siva
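
As a minimal sketch of the index() arithmetic described above: the sub name first_position and the toy sequence are made up for illustration, and only the first occurrence is handled.

```perl
use strict;
use warnings;

# Returns the 1-based start and end of the first occurrence of $pep in
# $seq, or an empty list if the peptide is absent.
sub first_position {
    my ($seq, $pep) = @_;
    return unless length $pep;
    my $start0 = index($seq, $pep);       # 0-based; -1 means "not found"
    return if $start0 < 0;
    my $start = $start0 + 1;              # convert to 1-based
    my $end   = $start0 + length($pep);   # 1-based, inclusive
    return ($start, $end);
}

# Invented toy sequence, not a real UniProt entry.
my ($start, $end) = first_position('MKTLLVAAGAFK', 'VAAGAF');
print "start=$start end=$end\n";          # prints "start=6 end=11"

# Sanity check: cutting the reported range back out returns the peptide.
print substr('MKTLLVAAGAFK', $start - 1, $end - $start + 1), "\n";
```

The substr() call at the end is a quick way to confirm that the reported coordinates are not off by one.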

Wouldn't a RegEx match with @+ be easier?

written 4.4 years ago by RamRS

Wow. I did not know about these built-in variables. Thank you. Though I might not recommend that someone unfamiliar with Perl work with built-in variables, with all those confusing "a-cat-walked-across-my-keyboard" symbols :)

modified 4.4 years ago • written 4.4 years ago by Siva

index() will be faster, so I wouldn't even worry about these variables. This task is exactly what index() was designed for, but of course, a regex solution can be applied to anything.

written 4.4 years ago by SES

I agree. But it's not just the index() - it's the iteration to find all occurrences. That is where RegEx makes it easier.

written 4.4 years ago by RamRS

Both approaches require a while() loop to find all matches, so there is no difference there. You need one more line to get the end position with index(), though I wouldn't say that regex is easier because of that. I would probably use index() because it is the more idiomatic choice, being designed specifically for this purpose, and it is also a bit more readable. I've also used regex for this, depending on the task. Apart from any performance difference between the two approaches, and there may be none, it probably doesn't really matter which method is used. Both are good choices and will be fast enough.

written 4.4 years ago by SES
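
To make the comparison above concrete, here is a hedged sketch of both loops. The sub names all_by_index and all_by_regex and the toy sequence are hypothetical. Both subs return 1-based [start, end] pairs and both are written to report overlapping occurrences.

```perl
use strict;
use warnings;

# All occurrences via index(): restart the search one residue after each
# hit, so overlapping occurrences are found too.
sub all_by_index {
    my ($seq, $pep) = @_;
    my @hits;
    return @hits unless length $pep;
    my $pos = index($seq, $pep);
    while ($pos >= 0) {
        push @hits, [ $pos + 1, $pos + length($pep) ];   # 1-based start, end
        $pos = index($seq, $pep, $pos + 1);
    }
    return @hits;
}

# All occurrences via a regex: $-[0] and $+[0] hold the match offsets.
sub all_by_regex {
    my ($seq, $pep) = @_;
    my @hits;
    return @hits unless length $pep;
    while ($seq =~ /\Q$pep\E/g) {
        my ($start0, $end0) = ($-[0], $+[0]);
        push @hits, [ $start0 + 1, $end0 ];
        # Without this, /g resumes after the match and skips overlaps.
        pos($seq) = $start0 + 1;
    }
    return @hits;
}

# Invented toy sequence containing two overlapping copies of the peptide.
for my $hit (all_by_index('AAKAKAKLL', 'AKAK')) {
    print 'index: ', join('-', @$hit), "\n";
}
for my $hit (all_by_regex('AAKAKAKLL', 'AKAK')) {
    print 'regex: ', join('-', @$hit), "\n";
}
```

Both loops report 2-5 and 4-7 for this input. The regex version needs the explicit pos() reset to behave like the index() version on overlapping matches; if overlaps cannot occur in the data, that line can be dropped.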

Thanks for the explanation.

written 4.4 years ago by Siva
Michael Dondrup (Bergen, Norway) wrote 4.4 years ago:

Did you try blastp with -task blastp-short against all human proteins?

Are the peptides exact matches (if so, how do you know)?

modified 4.4 years ago • written 4.4 years ago by Michael Dondrup
Prasad (India) wrote 4.4 years ago:

This might not be the best or easiest method, but it could serve the purpose. Based on the example, the data is from UniProt. Map the first column to UniProt to get the sequences, then write simple Perl code for string matching (each peptide against the sequence you obtained).

written 4.4 years ago by Prasad

Could you please help me map the protein names in UniProt to get the sequences of all the proteins? How can I do that?

written 4.4 years ago by genie66

If I understand correctly, you want to download the protein sequences for a list of UniProt protein names? If so, have a look at this thread: How To Find Sequences For Protein Names (A Challenge)

written 4.4 years ago by Siva

UniProt has a batch submission tool. Enter the first column (protein names), then download just the sequence file. (Although UniProt removes duplicates, I would suggest removing them yourself if you have a really huge list.) The code below might be useful.

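# Usage: perl script.pl sequences.fasta peptides.txt
use strict;
use warnings;

my $fasta_file   = shift or die "Usage: $0 sequences.fasta peptides.txt\n";
my $peptide_file = shift or die "Usage: $0 sequences.fasta peptides.txt\n";

my %seq_for;    # protein name (e.g. A1AT_HUMAN) => sequence

{
    # Read one FASTA record at a time by using ">" as the record separator.
    local $/ = ">";
    open my $fasta, '<', $fasta_file or die "$fasta_file - $!\n";
    while (my $record = <$fasta>) {
        chomp $record;    # strip the trailing ">"
        next unless $record =~ /(.*?)\n(.*)/s;
        my ($header, $seq) = ($1, $2);
        $seq =~ s/[\s\d\W]//g;    # remove digits, spaces, line breaks, ...
        # The key is the last word of the first header field,
        # e.g. A1AT_HUMAN from "sp|P01009|A1AT_HUMAN".
        my ($first_field) = split ' ', $header;
        next unless defined $first_field && $first_field =~ /(\w+)$/;
        $seq_for{$1} = $seq;
    }
    close $fasta;
}

open my $peptides, '<', $peptide_file or die "$peptide_file - $!\n";
while (my $line = <$peptides>) {
    chomp $line;
    # Split on any whitespace so tab- and space-separated lists both work.
    my ($id, $peptide) = split ' ', $line;
    next unless defined $id && defined $peptide;
    next unless exists $seq_for{$id};
    my $start = index($seq_for{$id}, $peptide);    # 0-based, -1 if absent
    next if $start < 0;
    # Report 1-based, inclusive start and end positions.
    print join("\t", $id, $peptide, $start + 1, $start + length($peptide)), "\n";
}
close $peptides;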

This gives the first occurrence of each peptide. If you want all occurrences, call index() in a loop, passing the previous start position plus one as the offset (third argument).

modified 4.4 years ago • written 4.4 years ago by Prasad

Always use Bio-packages. That is much less hacky than using > as the record separator. Also, this code doesn't deal with the fact that each ID has multiple sequences and that each sequence from the FASTA file might have multiple matches. Though you mention changing the offset, that would involve nesting the code in a loop. Handling multiple matches is much easier using the regex match arrays.

written 4.4 years ago by RamRS