Local Alignment Statistical Significance
7.9 years ago
Maria ▴ 170

I want to align two sequences locally (Smith-Waterman algorithm). The output will be several alignments, only some of which are significant. I want to test their significance with a randomization/permutation test. My approach is as follows:

1. Align S1 to S2.
2. Obtain, for example, 10 alignments (Ai) longer than some threshold.
3. For each Ai, test whether it is significant:
• Permute S1 and S2 and align the permuted sequences.
• Obtain a score for each of these alignments.
• Calculate the proportion of these alignments whose score is greater than the score of the original alignment.

Question: How do I continue? How do I decide whether an alignment is significant? And can the probabilities obtained this way be considered p-values? (My statistical knowledge is modest.) Thanks in advance.
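The steps above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not a recommended pipeline: the function names (`sw_score`, `shuffled`, `permutation_p_value`), the scoring values (match 2, mismatch -1, linear gap -2), and the number of permutations are all hypothetical choices. It scores only the single best local alignment, and it uses the common convention of adding 1 to both numerator and denominator so the empirical p-value is never exactly zero.

```python
import random

def sw_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best Smith-Waterman local alignment score (linear gap penalty)."""
    prev = [0] * (len(b) + 1)
    best = 0
    for ca in a:
        curr = [0]
        for j, cb in enumerate(b, start=1):
            diag = prev[j - 1] + (match if ca == cb else mismatch)
            cell = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            curr.append(cell)
            best = max(best, cell)
        prev = curr
    return best

def shuffled(seq, rng):
    """Return a random permutation of seq (same composition, new order)."""
    chars = list(seq)
    rng.shuffle(chars)
    return "".join(chars)

def permutation_p_value(s1, s2, n_perm=200, seed=0):
    """Empirical p-value: share of shuffled pairs scoring >= the observed score."""
    rng = random.Random(seed)
    observed = sw_score(s1, s2)
    null_scores = [
        sw_score(shuffled(s1, rng), shuffled(s2, rng)) for _ in range(n_perm)
    ]
    exceed = sum(score >= observed for score in null_scores)
    # +1 correction (assumed convention): counts the observed pair itself.
    return observed, (exceed + 1) / (n_perm + 1)

if __name__ == "__main__":
    s1 = "GATTACAGATTACA"
    s2 = "CCCC" + s1 + "TTTT"
    score, p = permutation_p_value(s1, s2)
    print(score, p)
```

The returned fraction is an empirical p-value under the null model "same letters, random order". Whether that null model is appropriate for your sequences is a separate question, and testing many alignments this way would also call for a multiple-testing correction.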
test statistics • 2.8k views
7.9 years ago

I am not clear on what exactly you are asking. Here are some ideas that may or may not be useful to you:

1) SW uses dynamic programming, so it returns the best alignment, i.e. the alignment with the highest alignment score. There is therefore no need for permutation testing to check whether the best alignment is the best one.

2) Sometimes there can be more than one best alignment, because the alignment score depends on the match, mismatch, and gap penalties.

3) Shuffling or randomizing the sequences would not help to test the significance of the alignment, because it destroys the order of the nucleotides in each sequence. You need a constraint that preserves that order. So, instead of randomizing the sequences, you can slide the two sequences against each other (preserving the order of nucleotides in each sequence) and calculate an alignment score at each offset.

4) A P-value can be calculated as: (number of alignment scores from step 3 that are greater than the Smith-Waterman alignment score from step 1) / (total number of alignments compared).

The P-value from step 4 for the best alignment generated by Smith-Waterman should always be zero.
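A small sketch of what "sliding" in step 3 and the ratio in step 4 could look like, assuming an ungapped comparison at every offset with the same hypothetical scoring values (match 2, mismatch -1) that the SW run would use. The helper names `ungapped_score`, `slide_scores`, and `fraction_above` are made up for this illustration.

```python
def ungapped_score(a, b, offset, match=2, mismatch=-1):
    """Score the overlapping columns when b[0] is placed under a[offset]."""
    score = 0
    for i, ca in enumerate(a):
        j = i - offset
        if 0 <= j < len(b):
            score += match if ca == b[j] else mismatch
    return score

def slide_scores(a, b):
    """Ungapped score for every offset that leaves at least one overlapping column."""
    return {off: ungapped_score(a, b, off) for off in range(-(len(b) - 1), len(a))}

def fraction_above(scores, reference):
    """Step 4: share of slide scores strictly greater than the reference score."""
    scores = list(scores)
    return sum(s > reference for s in scores) / len(scores)

if __name__ == "__main__":
    a, b = "TTACGTTT", "ACGT"
    scores = slide_scores(a, b)
    best_offset = max(scores, key=scores.get)
    print(best_offset, scores[best_offset])
    # Reference here is the SW score of the exact 4-base match (4 * 2 = 8).
    print(fraction_above(scores.values(), 8))
```

Each ungapped overlap is itself a valid local alignment, so under the same scoring scheme none of these scores can exceed the Smith-Waterman optimum. That is why the ratio in step 4 comes out as zero for the best alignment.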


1- But the query sequence could be found in more than one region of the reference sequence, so I don't need only the alignment with the highest score; I need all alignments whose length is above some threshold.

3- How do I slide the two sequences? Can you please give a small example?


The standard SW only gives a single best alignment. There are variants that can give you multiple, but they are not SW and I have not seen them used in practice. You should just pick up an aligner. Forget about SW.
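As a toy illustration of the "more than one region" case raised in this thread: the filled SW matrix can hold the top score in several end cells, even though a standard traceback starts from just one of them. The sketch below is not one of the multi-alignment variants mentioned above; it only lists tied best end positions. The names `sw_matrix` and `best_end_cells` and the scoring values are assumptions for the example.

```python
def sw_matrix(a, b, match=2, mismatch=-1, gap=-2):
    """Fill the Smith-Waterman score matrix (linear gap penalty)."""
    H = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            diag = H[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            H[i][j] = max(0, diag, H[i - 1][j] + gap, H[i][j - 1] + gap)
    return H

def best_end_cells(H):
    """Return the top score and every (i, j) cell where it occurs."""
    best = max(max(row) for row in H)
    cells = [
        (i, j)
        for i, row in enumerate(H)
        for j, value in enumerate(row)
        if value == best
    ]
    return best, cells

if __name__ == "__main__":
    query = "ACGT"
    reference = "ACGTTTACGT"  # contains the query twice
    best, cells = best_end_cells(sw_matrix(query, reference))
    print(best, cells)  # → 8 [(4, 4), (4, 10)]
```

Hits that score below the maximum are not found this way, which is one practical reason to use an established aligner that reports multiple hits.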