Wanted: Basic Sequence Comparison Algorithm With Statistical Significance Output
12.1 years ago
Fabian ▴ 50

In my dataset I have fragments ranging from 5 to 20 amino acids. For each length, I have 200,000 to 400,000 distinct sequences.

I want to compare same-length sequence fragments (all vs. all) with no gaps allowed. Thus even Needleman-Wunsch is overkill because I don't need any alignment, just comparison. I know I could implement this myself with just one for-loop and a method to access the PAM/BLOSUM matrices. I did this already, actually.
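
For readers following along, the single-loop comparison described above can be sketched roughly as follows. This is a minimal illustration, not the poster's actual code: `TOY_MATRIX`, `pair_score` and `ungapped_score` are hypothetical names, and the scores are made-up values over a three-letter alphabet standing in for a real PAM/BLOSUM table.

```python
# Toy symmetric substitution scores over a 3-letter alphabet.
# Hypothetical values, standing in for a real PAM/BLOSUM table.
TOY_MATRIX = {
    ("A", "A"): 4, ("A", "C"): 0, ("A", "D"): -2,
    ("C", "C"): 9, ("C", "D"): -3,
    ("D", "D"): 6,
}


def pair_score(a, b, matrix=TOY_MATRIX):
    """Look up a substitution score, treating the matrix as symmetric."""
    if (a, b) in matrix:
        return matrix[(a, b)]
    return matrix[(b, a)]


def ungapped_score(s, t, matrix=TOY_MATRIX):
    """Sum the position-wise substitution scores of two equal-length fragments."""
    if len(s) != len(t):
        raise ValueError("fragments must have the same length")
    return sum(pair_score(a, b, matrix) for a, b in zip(s, t))
```

With the toy values, `ungapped_score("ACD", "ACD")` gives 19 and `ungapped_score("ACD", "CAD")` gives 6. The raw sum grows with fragment length, which is why scores of different lengths are not directly comparable.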

My problem is that I want the values for different-length fragments to be comparable. If I just sum up the log-odds values from the substitution matrix, they are not. I need some statistical significance value like BLAST's E-value, but I don't know how to determine K and lambda for my database.

Isn't there something off-the-shelf I can use?
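
One possible route that avoids estimating K and lambda, offered here only as a sketch and not as something proposed in this thread: for gap-free, fixed-length comparisons the null distribution of the summed score can be computed exactly by convolving the per-position score distribution. This assumes positions are independent and residues are drawn from fixed background frequencies, which real fragments may violate. The function names and the two-letter toy alphabet below are hypothetical.

```python
from collections import defaultdict


def per_position_distribution(freqs, score):
    """Score distribution for one position when both residues are drawn
    independently from the background frequencies `freqs`."""
    dist = defaultdict(float)
    for a, pa in freqs.items():
        for b, pb in freqs.items():
            dist[score(a, b)] += pa * pb
    return dict(dist)


def length_distribution(single, length):
    """Convolve the per-position distribution with itself `length` times."""
    total = {0: 1.0}
    for _ in range(length):
        nxt = defaultdict(float)
        for s1, p1 in total.items():
            for s2, p2 in single.items():
                nxt[s1 + s2] += p1 * p2
        total = dict(nxt)
    return total


def p_value(dist, observed):
    """Probability of a score at least as high as `observed` under the null."""
    return sum(p for s, p in dist.items() if s >= observed)


# Toy setup (assumption): two equally frequent letters, +1 match, -1 mismatch.
toy_freqs = {"A": 0.5, "B": 0.5}
toy_score = lambda a, b: 1 if a == b else -1

single = per_position_distribution(toy_freqs, toy_score)
null_len2 = length_distribution(single, 2)
```

Because the p-value is computed per length, values for 5-mers and 20-mers are on the same scale. Multiplying a p-value by the number of comparisons made gives an expected number of chance hits, which plays the role of an E-value. The null distribution only needs to be built once per fragment length, so the per-pair cost stays at the single scoring loop plus a table lookup.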

sequence • 2.2k views
12.1 years ago
Gustavo ▴ 530

Perhaps GGSEARCH from the FASTA package, setting the gap opening penalty to a sufficiently large value.


Hi Gustavo, my problem is that I need this to be quite fast because I have so many sequences to compare all vs. all. The Needleman-Wunsch dynamic programming is unnecessary overhead when no gaps are allowed.

12.1 years ago
Woa ★ 2.9k

Could this paper be of any help?

Estimation of P-values for global alignments of protein sequences

http://www.ncbi.nlm.nih.gov/pubmed/11751224
