In my dataset I have fragments ranging from 5 to 20 amino acids in length. For each length, I have 200,000 to 400,000 distinct sequences.
I want to compare same-length sequence fragments (all vs. all) with no gaps allowed. Thus even Needleman-Wunsch is overkill because I don't need any alignment, just comparison. I know I could implement this myself with just one for-loop and a method to access the PAM/BLOSUM matrices. I did this already, actually.
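To make the comparison concrete, here is a minimal sketch of that kind of gap-free scoring. The names `TOY_MATRIX`, `pair_score` and `ungapped_score` are hypothetical, and the scores in the table are made-up illustration values, not real PAM/BLOSUM entries.

```python
# Toy substitution scores for illustration only (NOT real BLOSUM/PAM values).
# Only one orientation of each pair is stored; lookups treat it as symmetric.
TOY_MATRIX = {
    ("A", "A"): 4, ("A", "G"): 0, ("A", "W"): -3,
    ("G", "G"): 6, ("G", "W"): -2,
    ("W", "W"): 11,
}


def pair_score(a, b, matrix):
    """Look up the substitution score for residues a and b (symmetric)."""
    if (a, b) in matrix:
        return matrix[(a, b)]
    return matrix[(b, a)]


def ungapped_score(seq1, seq2, matrix):
    """Sum position-wise substitution scores; equal lengths, no gaps."""
    if len(seq1) != len(seq2):
        raise ValueError("fragments must have the same length")
    return sum(pair_score(a, b, matrix) for a, b in zip(seq1, seq2))
```

With a real matrix loaded into the same dictionary shape, `ungapped_score(frag_a, frag_b, matrix)` gives the raw sum I am talking about.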
My problem is that I want the values for different-length fragments to be comparable. If I just sum up the log-odds values from the substitution matrix, they are not. I need some measure of statistical significance like BLAST's E-value, but I don't know how to determine K and lambda for my database.
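For reference, the E-value I have in mind is the Karlin-Altschul form E = K * m * n * exp(-lambda * S), with the related bit score S' = (lambda * S - ln K) / ln 2. The sketch below only shows where K and lambda enter. The function names are hypothetical, `lam` and `K` must be supplied by the caller (they are exactly the unknowns here), and how the search-space sizes `m` and `n` should be chosen for fixed-length, gap-free comparison is an assumption left open.

```python
import math


def karlin_altschul_evalue(raw_score, m, n, lam, K):
    """E = K * m * n * exp(-lam * S) for raw score S and search space m * n."""
    return K * m * n * math.exp(-lam * raw_score)


def bit_score(raw_score, lam, K):
    """Normalized score S' = (lam * S - ln K) / ln 2."""
    return (lam * raw_score - math.log(K)) / math.log(2)
```

The two are consistent in the sense that E equals m * n * 2^(-S'), so the bit score is comparable across scoring systems once K and lambda are known.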
Isn't there something off-the-shelf I can use?