I computed global pairwise alignment scores between a reference DNA sequence (length of 25) and a list of sequences (length varies from 20 to 150) using function pairwiseAlignment in R package Biostrings. The computed score value ranges from -260 to 50. Can you please help me how to compute a statistical cutoff, so as to filter sequences with a clear mapping? Please let me know if you need any additional information.
Thank you.
Thank you @Asaf. I am little bit confused about what you meant by "I'm just wondering if single and di-nucleotide distribution is enough or triplets quads etc. should be used". My reference sequence is a combination of "C" "G" "T" "A" and my sequenced observation has "N" along with the combinations of the "C" "G" "T" "A . I hope this is what you are referring to.
No, what I meant is there are dimers that appear more than others in the DNA for biological reasons, AG might be more common than AC although C and G are equally probable, If a sequence has a lot of AG and it has some score X for the alignment, if you shuffle the sequence using the single nucleotide distribution you will end up with more AC than expected in a random biological DNA sequence so their alignment scores will be lower than expected at random (biologically). I was just wondering if nucleotide triplets and quadruplets are also prominent in DNA or di-nucs will answer this bias.
Thank you @Asaf for your patience. I am new to NGS analysis. Is the analysis specified here a good approach https://a-little-book-of-r-for-bioinformatics.readthedocs.io/en/latest/src/chapter4.html#calculating-the-statistical-significance-of-a-pairwise-global-alignment
Yes, computing an empirical p-value described there is what I suggested, the problem with their approach is that it will generate non-biological sequences so you might end up with a (statistically) significant score that doesn't necessarily reflect biological significance. I'm sorry, I didn't understand what you're trying to show with the R code. By the way, this is not a standard NGS processing step, make sure this is what you want.