Computing statistical cutoff from global pairwise alignment scores
1
0
Entering edit mode
4.0 years ago
rrgnu2007 ▴ 10

I computed global pairwise alignment scores between a reference DNA sequence (length of 25) and a list of sequences (length varies from 20 to 150) using function pairwiseAlignment in R package Biostrings. The computed score value ranges from -260 to 50. Can you please help me how to compute a statistical cutoff, so as to filter sequences with a clear mapping? Please let me know if you need any additional information.

Thank you.

sequencing alignment • 1.2k views
ADD COMMENT
1
Entering edit mode
4.0 years ago
Asaf 10k

If you don't want to make assumption about nucleotides and di-nucleotides distributions, you can shuffle each sequence while keeping these distributions the same and run the alignment. Do the shuffling 1000 times (or 10000 etc.) to get an empirical cut-off for the score. I'm just wondering if single and di-nucleotide distribution is enough or triplets quads etc. should be used.

The algorithm for shuffling is pretty simple, you first collect the probability of observing a di-nucleotides, then when you generate the sequence you sample the first nucleotide and the sample the next one using the four di-nucleotides that start with the first one, repeat for the entire length.

ADD COMMENT
0
Entering edit mode

Thank you @Asaf. I am little bit confused about what you meant by "I'm just wondering if single and di-nucleotide distribution is enough or triplets quads etc. should be used". My reference sequence is a combination of "C" "G" "T" "A" and my sequenced observation has "N" along with the combinations of the "C" "G" "T" "A . I hope this is what you are referring to.

ADD REPLY
0
Entering edit mode

No, what I meant is there are dimers that appear more than others in the DNA for biological reasons, AG might be more common than AC although C and G are equally probable, If a sequence has a lot of AG and it has some score X for the alignment, if you shuffle the sequence using the single nucleotide distribution you will end up with more AC than expected in a random biological DNA sequence so their alignment scores will be lower than expected at random (biologically). I was just wondering if nucleotide triplets and quadruplets are also prominent in DNA or di-nucs will answer this bias.

ADD REPLY
0
Entering edit mode

Thank you @Asaf for your patience. I am new to NGS analysis. Is the analysis specified here a good approach https://a-little-book-of-r-for-bioinformatics.readthedocs.io/en/latest/src/chapter4.html#calculating-the-statistical-significance-of-a-pairwise-global-alignment

ADD REPLY
0
Entering edit mode

Yes, computing an empirical p-value described there is what I suggested, the problem with their approach is that it will generate non-biological sequences so you might end up with a (statistically) significant score that doesn't necessarily reflect biological significance. I'm sorry, I didn't understand what you're trying to show with the R code. By the way, this is not a standard NGS processing step, make sure this is what you want.

ADD REPLY

Login before adding your answer.

Traffic: 2657 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6