How Can I Count Dipeptide (Or Features Of Protein Sequences) In Multiple Protein Sequences
6
2
Entering edit mode
13.0 years ago

Hi, i have a fasta file including multiple protein sequences. I want to do feature selection for them and to do this, i need features of the proteins (number of each amino acid, number of di-peptide and something like this) but i couldn't find any software or script. if possible help me. thanks a lot for any help best regards

protein protein protein • 7.4k views
ADD COMMENT
2
Entering edit mode
13.0 years ago
Julien ▴ 160

Try EMBOSS protein composition > compseq:

web mirror: http://pro.genomics.purdue.edu/emboss/

download: http://emboss.sourceforge.net/download/

ADD COMMENT
0
Entering edit mode

Thanks for your help. Compseq count dipeptide for a sequence or several sequences but it display a table (for all of them) but i need a table for each of sequences. in fact, i need a software that i can submit a fasta file and it bring me a table including number of dipeptides in each of sequences.

ADD REPLY
1
Entering edit mode
13.0 years ago
Chris ★ 1.6k

This is a very fundamental thing to do on protein sequences, hence you most likely won't find readily accessible scripts for that. Using your script-language of choice (like python, perl), this should be easily achievable by simple sequence parsing.

ADD COMMENT
0
Entering edit mode

thanks for your answer. i am a bit fresh about bioinformatis specially protein bioinformatics. i need a script (it doesn't matter perl or python) to do that. if there is any thing like this inform me. thanks

ADD REPLY
0
Entering edit mode

What Chris is saying is that for this type of task, there is rarely a ready-made program to generate exactly the output you want. There may be something close (such as the tools in EMBOSS), but you still need to parse the output into the required form. So an experienced bioinformatician would write the code themselves, probably using a library for sequence analysis. It would be nice if they then made that code available but frequently, this is not the case. Sorry if that's not helpful, but at least you're now aware of how people do these tasks and what you need to do: learn to script yourself!

ADD REPLY
1
Entering edit mode
13.0 years ago
Woa ★ 2.9k

Here is a quick perl hack. Please check it thoroughly until it gives you the desired result. This however doesn't report frequencies but just only non zero counts. If certain dipeptides are missing the dipeptides and correspnding counts(0) are not reported.

use strict;
use diagnostics;
use Bio::SeqIO;
my $in  = Bio::SeqIO->new(-file => "myfile.fasta" , '-format' => 'Fasta');
while ( my $seq= $in->next_seq ) {
my @dipeps=($seq->seq()=~/(?=(.{2}))/g);
my %di_count=();
$di_count{$_}++ for @dipeps;
print ">",$seq->id();
print " ",$_,",",$di_count{$_} for sort keys(%di_count);
print "\n";
}

This small modification can report all possible dipep counts. Note that I've included two extra amino acids U and X

ADD COMMENT
0
Entering edit mode

Thanks for your answer. i am freshman about perl and it is such as stupid question. when i run the script, it give error (cant locate Bio/seqI0.pm in @INC ....) I don't know that what things must be installed. if possible help me about this. regards

ADD REPLY
0
Entering edit mode

Bio::SeqIO is a Bioperl module - http://www.bioperl.org - so that's what you need to install. It's not easy for complete beginners I'm afraid, but nor is it very difficult, if you read and follow the instructions to the letter.

ADD REPLY
1
Entering edit mode
13.0 years ago
Neilfws 49k

As others have indicated, most bioinformaticians would write their own code to perform this task.

Something like a "ready-made" solution exists in seqinr, an R package for sequence analysis. Of course, that requires that you know or are willing to learn some R. Some sample code:

# install seqinr if required
install.packages("seqinr")
library(seqinr)

# read fasta sequences in file "sequences.fasta"
seqs <- read.fasta("sequences.fasta")

# amino acid composition of first sequence
# note that you need to specify protein alphabet
count(seqs[[1]], 1, alphabet = s2c("acdefghiklmnpqrstvwy"))

 a  c  d  e  f  g  h  i  k  l  m  n  p  q  r  s  t  v  w  y 
60  2 43 44 19 50 14 30 31 49 15 21 37 20 23 28 41 37 11 21

# for dipeptides, change 1 to 2 (output not shown)
count(seqs[[1]], 2, alphabet = s2c("acdefghiklmnpqrstvwy"))
# for all sequences in list, use lapply()
lapply(s1, function(x) count(x, 1, alphabet = s2c("acdefghiklmnpqrstvwy")))

$`gi|452453|gb|AAA93118.1|`

 a  c  d  e  f  g  h  i  k  l  m  n  p  q  r  s  t  v  w  y 
60  2 43 44 19 50 14 30 31 49 15 21 37 20 23 28 41 37 11 21

$`gi|76667604|dbj|BAE45629.1|`

 a  c  d  e  f  g  h  i  k  l  m  n  p  q  r  s  t  v  w  y 
36  6 35 29 17 43 15 32 53 39 12 20 41 16 26 16 36 42 14 23

$`gi|995617|emb|CAA62740.1|`

 a  c  d  e  f  g  h  i  k  l  m  n  p  q  r  s  t  v  w  y 
57  2 39 24 14 43 14 30 45 46 10 19 35 12 23 28 39 41 11 22
ADD COMMENT
0
Entering edit mode
13.0 years ago
Rm 8.3k

This script: protein_dipeptid.pl might be useful to you. I haven't used it though.

ADD COMMENT
0
Entering edit mode

Thanks for your help. it works like Compseq (in EMBOSS ) and bring composition of dipeptides as a table (not for each sequence). regards

ADD REPLY
0
Entering edit mode
9.7 years ago
ddofer ▴ 30

I'm actually in the final stages of a comprehensive package for doing exactly this right now!

That said, here's an older prototype, that should do all the things you're looking for (dipeptides, etc' ).

http://www.protonet.cs.huji.ac.il/neuropid/code/index.php

NPID_WebPackage.zip

(You want "SLEEK_FeatureGen+_new.py")

http://neuropid.cs.huji.ac.il/

ADD COMMENT

Login before adding your answer.

Traffic: 1832 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6