Question: How Can I Count Dipeptides (Or Other Features Of Protein Sequences) In Multiple Protein Sequences?
2
7.9 years ago by
Tehran university

Hi, I have a FASTA file containing multiple protein sequences. I want to do feature selection on them, and for that I need features of each protein (the count of each amino acid, the count of each dipeptide, and so on), but I couldn't find any software or script that does this. Any help would be appreciated. Thanks a lot, and best regards.

protein • 4.5k views
modified 4.6 years ago by ddofer30 • written 7.9 years ago by Mohammad Reza Bakhtiarizadeh290
2
7.9 years ago by
Julien150
Julien150 wrote:

Try EMBOSS protein composition > compseq:

web mirror: http://pro.genomics.purdue.edu/emboss/

download: http://emboss.sourceforge.net/download/

written 7.9 years ago by Julien150

Thanks for your help. Compseq counts dipeptides for one sequence or for several sequences, but it displays a single table for all of them, whereas I need a table for each sequence. In fact, I need software that takes a FASTA file and returns a table with the number of each dipeptide in each sequence.

written 7.9 years ago by Mohammad Reza Bakhtiarizadeh290
1
7.9 years ago by
Chris1.6k
Munich
Chris1.6k wrote:

This is a very fundamental thing to do with protein sequences, hence you most likely won't find a readily accessible script for it. In your scripting language of choice (Python or Perl, for example), it should be easy to achieve with simple sequence parsing.
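As a rough illustration of what "simple sequence parsing" can look like, here is a minimal Python sketch using only the standard library. The function names `read_fasta` and `count_kmers` and the toy sequences are made up for this example, and the parser assumes plain, well-formed FASTA input.

```python
from collections import Counter


def read_fasta(lines):
    """Yield (sequence_id, sequence) pairs from an iterable of FASTA lines."""
    seq_id, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if seq_id is not None:
                yield seq_id, "".join(chunks)
            # Use the first word of the header line as the identifier
            header = line[1:].split()
            seq_id, chunks = (header[0] if header else ""), []
        elif seq_id is not None:
            chunks.append(line.upper())
    if seq_id is not None:
        yield seq_id, "".join(chunks)


def count_kmers(seq, k):
    """Count overlapping windows of length k (k=1: residues, k=2: dipeptides)."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


if __name__ == "__main__":
    # Toy input; with a real file, pass open("myfile.fasta") instead
    example = [">seq1 toy example", "MKKLAMK", ">seq2", "AAAC"]
    for seq_id, seq in read_fasta(example):
        print(seq_id, dict(count_kmers(seq, 1)), dict(count_kmers(seq, 2)))
```

Each sequence gets its own pair of counters, so the output is one set of counts per sequence rather than one pooled table.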

written 7.9 years ago by Chris1.6k

Thanks for your answer. I am fairly new to bioinformatics, especially protein bioinformatics. I need a script (Perl or Python, it doesn't matter) to do that. If anything like this exists, please let me know. Thanks.

written 7.9 years ago by Mohammad Reza Bakhtiarizadeh290

What Chris is saying is that for this type of task, there is rarely a ready-made program to generate exactly the output you want. There may be something close (such as the tools in EMBOSS), but you still need to parse the output into the required form. So an experienced bioinformatician would write the code themselves, probably using a library for sequence analysis. It would be nice if they then made that code available but frequently, this is not the case. Sorry if that's not helpful, but at least you're now aware of how people do these tasks and what you need to do: learn to script yourself!

written 7.9 years ago by Neilfws48k
1
7.9 years ago by
Woa2.7k
United States
Woa2.7k wrote:

Here is a quick Perl hack. Please check it thoroughly to make sure it gives you the desired result. Note that it reports raw counts rather than frequencies, and only non-zero counts: dipeptides that do not occur in a sequence are not reported (there is no entry with a count of 0).

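use strict;
use diagnostics;
use Bio::SeqIO;    # part of BioPerl

# Read protein sequences from a FASTA file
my $in = Bio::SeqIO->new(-file => "myfile.fasta", -format => 'Fasta');
while ( my $seq = $in->next_seq ) {
    # The lookahead captures every overlapping pair of residues
    my @dipeps = ( $seq->seq() =~ /(?=(.{2}))/g );
    my %di_count = ();
    $di_count{$_}++ for @dipeps;
    # One output line per sequence: >id, then "dipeptide,count" pairs
    print ">", $seq->id();
    print " ", $_, ",", $di_count{$_} for sort keys(%di_count);
    print "\n";
}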

This small modification can report counts for all possible dipeptides. Note that I've included two extra amino acids, U and X.
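The modified script is not reproduced in this thread. As a hedged illustration of the same idea in Python, the sketch below starts from a table of every possible dipeptide so that absent pairs show up with a count of 0. The names `ALPHABET`, `dipeptide_table` and `dipeptide_frequencies` are invented for this example, and skipping pairs that contain letters outside the alphabet is an assumption, not necessarily what the original modification did.

```python
from itertools import product

# The 20 standard amino acids plus U (selenocysteine) and X (unknown residue)
ALPHABET = "ACDEFGHIKLMNPQRSTVWYUX"


def dipeptide_table(seq, alphabet=ALPHABET):
    """Return a dict with a count for every possible dipeptide, zeros included."""
    table = {a + b: 0 for a, b in product(alphabet, repeat=2)}
    seq = seq.upper()
    for i in range(len(seq) - 1):
        pair = seq[i:i + 2]
        if pair in table:  # assumption: pairs with other letters are skipped
            table[pair] += 1
    return table


def dipeptide_frequencies(seq, alphabet=ALPHABET):
    """Convert counts to fractions of all counted dipeptides (0.0 if none)."""
    table = dipeptide_table(seq, alphabet)
    total = sum(table.values())
    return {pair: (n / total if total else 0.0) for pair, n in table.items()}
```

Because every sequence yields the same 484 keys in the same order, the resulting vectors line up column by column, which is what feature selection usually needs.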

modified 7.9 years ago • written 7.9 years ago by Woa2.7k

Thanks for your answer. I am a beginner in Perl, so this may be a stupid question: when I run the script, it gives an error (Can't locate Bio/SeqIO.pm in @INC ...). I don't know what needs to be installed. Could you help me with this? Regards.

written 7.9 years ago by Mohammad Reza Bakhtiarizadeh290

Bio::SeqIO is a BioPerl module (http://www.bioperl.org), so BioPerl is what you need to install. It's not easy for complete beginners, I'm afraid, but nor is it very difficult if you read and follow the instructions to the letter.

written 7.9 years ago by Neilfws48k
1
7.9 years ago by
Neilfws48k
Sydney, Australia
Neilfws48k wrote:

As others have indicated, most bioinformaticians would write their own code to perform this task.

Something like a "ready-made" solution exists in seqinr, an R package for sequence analysis. Of course, that requires that you know or are willing to learn some R. Some sample code:

# install seqinr if required
install.packages("seqinr")
library(seqinr)

# read fasta sequences in file "sequences.fasta"
seqs <- read.fasta("sequences.fasta")

# amino acid composition of first sequence
# note that you need to specify protein alphabet
count(seqs[[1]], 1, alphabet = s2c("acdefghiklmnpqrstvwy"))

 a  c  d  e  f  g  h  i  k  l  m  n  p  q  r  s  t  v  w  y 
60  2 43 44 19 50 14 30 31 49 15 21 37 20 23 28 41 37 11 21

# for dipeptides, change 1 to 2 (output not shown)
count(seqs[[1]], 2, alphabet = s2c("acdefghiklmnpqrstvwy"))
# for all sequences in list, use lapply()
lapply(seqs, function(x) count(x, 1, alphabet = s2c("acdefghiklmnpqrstvwy")))

$`gi|452453|gb|AAA93118.1|`

 a  c  d  e  f  g  h  i  k  l  m  n  p  q  r  s  t  v  w  y 
60  2 43 44 19 50 14 30 31 49 15 21 37 20 23 28 41 37 11 21

$`gi|76667604|dbj|BAE45629.1|`

 a  c  d  e  f  g  h  i  k  l  m  n  p  q  r  s  t  v  w  y 
36  6 35 29 17 43 15 32 53 39 12 20 41 16 26 16 36 42 14 23

$`gi|995617|emb|CAA62740.1|`

 a  c  d  e  f  g  h  i  k  l  m  n  p  q  r  s  t  v  w  y 
57  2 39 24 14 43 14 30 45 46 10 19 35 12 23 28 39 41 11 22
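
For readers who would rather stay in Python, the per-sequence output above can be approximated as a single table with one row per sequence. This is only a sketch under stated assumptions: `AMINO_ACIDS`, `composition_rows` and `to_csv` are hypothetical names, the sequences are assumed to be already loaded into a dict of id to sequence string, and the toy data is made up.

```python
import csv
import io
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def composition_rows(sequences, alphabet=AMINO_ACIDS):
    """Build a header row plus one row of residue counts per sequence."""
    rows = [["id"] + list(alphabet)]
    for seq_id, seq in sequences.items():
        counts = Counter(seq.upper())
        # Counter returns 0 for residues that never occur
        rows.append([seq_id] + [counts[aa] for aa in alphabet])
    return rows


def to_csv(rows):
    """Render rows as CSV text, ready to write to a file or load elsewhere."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(rows)
    return buffer.getvalue()


if __name__ == "__main__":
    toy = {"seq1": "MKKLAMK", "seq2": "AAAC"}  # made-up sequences
    print(to_csv(composition_rows(toy)), end="")
```

A CSV with one row per sequence and one column per residue can be loaded directly into most feature selection tools.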

written 7.9 years ago by Neilfws48k
0
7.9 years ago by
Rm7.8k
Danville, PA
Rm7.8k wrote:

This script: protein_dipeptid.pl might be useful to you. I haven't used it though.

written 7.9 years ago by Rm7.8k

Thanks for your help. It works like Compseq (in EMBOSS) and reports the dipeptide composition as a single table (not one per sequence). Regards.

written 7.9 years ago by Mohammad Reza Bakhtiarizadeh290
0
4.6 years ago by
ddofer30
Israel
ddofer30 wrote:

I'm actually in the final stages of a comprehensive package for doing exactly this right now! 

That said, here's an older prototype that should do all the things you're looking for (dipeptides, etc.).

 

http://www.protonet.cs.huji.ac.il/neuropid/code/index.php

 

NPID_WebPackage.zip   

(You want "SLEEK_FeatureGen+_new.py")

http://neuropid.cs.huji.ac.il/

written 4.6 years ago by ddofer30