Question: Wanted: Basic Sequence Comparison Algorithm With Statistical Significance Output
Fabian wrote, 8.2 years ago:

In my dataset I have fragments ranging from 5 to 20 amino acids in length. For each length, I have 200,000 to 400,000 distinct sequences.

I want to compare same-length sequence fragments (all vs. all) with no gaps allowed. Thus even Needleman-Wunsch is overkill because I don't need any alignment, just comparison. I know I could implement this myself with just one for-loop and a method to access the PAM/BLOSUM matrices. I did this already, actually.
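
For reference, the gap-free comparison described above amounts to summing one substitution score per aligned position. Below is a minimal sketch, not the poster's actual implementation. The names `TOY_MATRIX`, `pair_score`, and `ungapped_score` are hypothetical, and the scores are toy values for a three-letter alphabet that stand in for a real PAM/BLOSUM table.

```python
# Toy substitution scores for illustration only. They stand in for a real
# PAM/BLOSUM table, which would cover all 20 amino acids.
TOY_MATRIX = {
    ("A", "A"): 4, ("A", "G"): 0, ("A", "W"): -3,
    ("G", "G"): 6, ("G", "W"): -2,
    ("W", "W"): 11,
}


def pair_score(a, b, matrix):
    """Look up the score for residues a and b in a symmetric matrix
    that stores each unordered pair only once."""
    if (a, b) in matrix:
        return matrix[(a, b)]
    return matrix[(b, a)]


def ungapped_score(x, y, matrix):
    """Sum the substitution scores position by position, with no gaps."""
    if len(x) != len(y):
        raise ValueError("fragments must have the same length")
    return sum(pair_score(a, b, matrix) for a, b in zip(x, y))
```

Calling `ungapped_score("AGW", "GAW", TOY_MATRIX)` adds the A/G, G/A, and W/W scores. The cost is linear in fragment length, with no dynamic-programming table.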

My problem is that I want the values for different-length fragments to be comparable. If I just sum up the log-odds values from the substitution matrix, they are not. I need some statistical significance value like BLAST's E-value, but I don't know how to determine K and lambda for my database.
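
One possible way to make scores comparable across lengths, not proposed in this thread, is to skip K and lambda and compute a p-value per fragment length. This assumes a null model in which residues are drawn independently from fixed background frequencies. Under that assumption the gap-free score is a sum of independent per-position scores, so its distribution is the repeated convolution of the single-position distribution. The sketch below uses hypothetical names (`TOY_MATRIX`, `TOY_FREQS`, `score_distribution`, `p_value`) and toy numbers, not real BLOSUM values or real amino acid frequencies.

```python
from collections import defaultdict

# Toy scores and background frequencies for a three-letter alphabet.
# Both are illustrative assumptions, not real BLOSUM/PAM data.
TOY_MATRIX = {
    ("A", "A"): 4, ("A", "G"): 0, ("A", "W"): -3,
    ("G", "G"): 6, ("G", "W"): -2,
    ("W", "W"): 11,
}
TOY_FREQS = {"A": 0.5, "G": 0.3, "W": 0.2}


def single_position_distribution(matrix, freqs):
    """Return {score: probability} for one aligned pair of residues,
    each drawn independently from the background frequencies."""
    dist = defaultdict(float)
    for a, pa in freqs.items():
        for b, pb in freqs.items():
            s = matrix[(a, b)] if (a, b) in matrix else matrix[(b, a)]
            dist[s] += pa * pb
    return dict(dist)


def convolve(d1, d2):
    """Distribution of the sum of two independent integer-valued scores."""
    out = defaultdict(float)
    for s1, p1 in d1.items():
        for s2, p2 in d2.items():
            out[s1 + s2] += p1 * p2
    return dict(out)


def score_distribution(matrix, freqs, length):
    """Null distribution of the gap-free score for fragments of a given length."""
    single = single_position_distribution(matrix, freqs)
    total = {0: 1.0}  # an empty fragment scores 0 with certainty
    for _ in range(length):
        total = convolve(total, single)
    return total


def p_value(score, dist):
    """Probability of a score at least this large under the null model."""
    return sum(p for s, p in dist.items() if s >= score)
```

The distribution needs to be built only once per fragment length and can then be reused for every pair of that length. Multiplying a p-value by the number of comparisons made at that length gives the expected number of chance hits, which plays a role similar to an E-value. Whether the independence assumption fits real peptide fragments would need to be checked against the data.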

Isn't there something off-the-shelf I can use?

Gustavo wrote, 8.2 years ago:

Perhaps try GGSEARCH from the FASTA package, with the gap opening penalty set to a value large enough that no gaps are introduced.


Hi Gustavo, my problem is that I need this to be quite fast, because I am comparing so many sequences all vs. all. The whole NW dynamic programming machinery is unnecessary overhead.

Reply written 8.2 years ago by Fabian
Woa (United States) wrote, 8.2 years ago:

Can this be of any help?

Estimation of P-values for global alignments of protein sequences
