Question

Wanted: Basic Sequence Comparison Algorithm With Statistical Significance Output

1

12.1 years ago

Fabian ▴ 50

My dataset contains fragments ranging from 5 to 20 amino acids in length. For each length, I have 200,000 to 400,000 distinct sequences.

I want to compare same-length sequence fragments (all vs. all) with no gaps allowed. Even Needleman-Wunsch is overkill here, because I don't need an alignment, just a position-by-position comparison. I know I could implement this myself with a single for-loop and a lookup into the PAM/BLOSUM matrices; in fact, I already did.
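
For readers following along, the single-loop comparison described above might look roughly like the sketch below. It is not the poster's actual code. The three-letter `TOY_MATRIX` is a hypothetical stand-in for a full PAM/BLOSUM table, and the names `pair_score` and `ungapped_score` are made up for illustration.

```python
# Illustrative scores for a 3-letter toy alphabet (an assumption; a real
# run would load a complete PAM or BLOSUM matrix instead).
TOY_MATRIX = {
    ("A", "A"): 4, ("A", "G"): 0, ("A", "W"): -3,
    ("G", "G"): 6, ("G", "W"): -2,
    ("W", "W"): 11,
}

def pair_score(a, b, matrix=TOY_MATRIX):
    """Look up the score for residues a and b in a symmetric matrix
    that stores only one of (a, b) / (b, a)."""
    if (a, b) in matrix:
        return matrix[(a, b)]
    return matrix[(b, a)]

def ungapped_score(seq1, seq2, matrix=TOY_MATRIX):
    """Sum substitution scores position by position (no gaps allowed)."""
    if len(seq1) != len(seq2):
        raise ValueError("fragments must have the same length")
    return sum(pair_score(a, b, matrix) for a, b in zip(seq1, seq2))

if __name__ == "__main__":
    print(ungapped_score("AGWAG", "AGWGA"))
```

The cost is linear in the fragment length, with no dynamic programming table, which is why it suits an all vs. all run over hundreds of thousands of fragments.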

My problem is that I want the scores for different-length fragments to be comparable. If I just sum up the log-odds values from the substitution matrix, they are not, because longer fragments accumulate more terms. I need some measure of statistical significance, like BLAST's E-value, but I don't know how to determine K and lambda for my database.
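
One possible way to put scores of different lengths on a common scale, offered here only as a sketch and not as BLAST's Karlin-Altschul statistics, is to compute the exact null distribution of the summed score for each length and report a tail probability. The sketch assumes that positions are independent and that residues are drawn from fixed background frequencies. `FREQS`, `SCORES`, and all function names are hypothetical toy values chosen for illustration.

```python
from collections import defaultdict

# Hypothetical background frequencies and scores for a toy 3-letter alphabet;
# real data would use 20 amino acids and a full PAM/BLOSUM matrix.
FREQS = {"A": 0.5, "G": 0.3, "W": 0.2}
SCORES = {
    ("A", "A"): 4, ("A", "G"): 0, ("A", "W"): -3,
    ("G", "G"): 6, ("G", "W"): -2,
    ("W", "W"): 11,
}

def score(a, b):
    """Symmetric lookup of the substitution score."""
    return SCORES[(a, b)] if (a, b) in SCORES else SCORES[(b, a)]

def single_position_distribution():
    """P(score = s) for one aligned pair of independently drawn residues."""
    dist = defaultdict(float)
    for a, pa in FREQS.items():
        for b, pb in FREQS.items():
            dist[score(a, b)] += pa * pb
    return dict(dist)

def length_distribution(length):
    """Distribution of the summed score over `length` independent positions,
    built by repeated convolution with the single-position distribution."""
    one = single_position_distribution()
    total = {0: 1.0}
    for _ in range(length):
        nxt = defaultdict(float)
        for s, p in total.items():
            for t, q in one.items():
                nxt[s + t] += p * q
        total = dict(nxt)
    return total

def p_value(observed, length):
    """P(random score >= observed) for two random fragments of this length."""
    dist = length_distribution(length)
    return sum(p for s, p in dist.items() if s >= observed)

if __name__ == "__main__":
    for n in (5, 10, 20):
        print(n, p_value(4 * n, n))
```

Because the distribution depends only on the length, it can be computed once per length (16 tables for lengths 5 to 20) and reused for every pair. Multiplying a p-value by the number of comparisons made would give a rough E-value-like quantity, under the same independence assumptions.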

Isn't there something off-the-shelf I can use?

sequence • 2.2k views

score 1 · Answer 1 · 2012-03-15

1

12.1 years ago

Gustavo ▴ 530

Perhaps GGSEARCH from the FASTA package, with the gap-opening penalty set to a sufficiently large value so that no gaps are introduced.

0

Hi Gustavo, my problem is that this needs to be quite fast, because I have so many sequences to compare all vs. all. The whole NW dynamic programming machinery is unnecessary overhead.

Reply • 12.1 years ago by Fabian ▴ 50

score 0 · Answer 2 · 2012-03-15

0

12.1 years ago

Woa ★ 2.9k

Could this be of any help?

Estimation of P-values for global alignments of protein sequences

http://www.ncbi.nlm.nih.gov/pubmed/11751224