Question

Computing statistical cutoff from global pairwise alignment scores

0

Entering edit mode

4.0 years ago

rrgnu2007 ▴ 10

I computed global pairwise alignment scores between a reference DNA sequence (length of 25) and a list of sequences (length varies from 20 to 150) using function pairwiseAlignment in R package Biostrings. The computed score value ranges from -260 to 50. Can you please help me how to compute a statistical cutoff, so as to filter sequences with a clear mapping? Please let me know if you need any additional information.

Thank you.

sequencing alignment • 1.2k views

ADD COMMENT • link updated 4.0 years ago by Asaf 10k • written 4.0 years ago by rrgnu2007 ▴ 10

score 1 · Accepted Answer · 2020-04-23

1

Entering edit mode

4.0 years ago

Asaf 10k

If you don't want to make assumption about nucleotides and di-nucleotides distributions, you can shuffle each sequence while keeping these distributions the same and run the alignment. Do the shuffling 1000 times (or 10000 etc.) to get an empirical cut-off for the score. I'm just wondering if single and di-nucleotide distribution is enough or triplets quads etc. should be used.

The algorithm for shuffling is pretty simple, you first collect the probability of observing a di-nucleotides, then when you generate the sequence you sample the first nucleotide and the sample the next one using the four di-nucleotides that start with the first one, repeat for the entire length.

ADD COMMENT • link 4.0 years ago by Asaf 10k

0

Entering edit mode

Thank you @Asaf. I am little bit confused about what you meant by "I'm just wondering if single and di-nucleotide distribution is enough or triplets quads etc. should be used". My reference sequence is a combination of "C" "G" "T" "A" and my sequenced observation has "N" along with the combinations of the "C" "G" "T" "A . I hope this is what you are referring to.

ADD REPLY • link 4.0 years ago by rrgnu2007 ▴ 10

0

Entering edit mode

No, what I meant is there are dimers that appear more than others in the DNA for biological reasons, AG might be more common than AC although C and G are equally probable, If a sequence has a lot of AG and it has some score X for the alignment, if you shuffle the sequence using the single nucleotide distribution you will end up with more AC than expected in a random biological DNA sequence so their alignment scores will be lower than expected at random (biologically). I was just wondering if nucleotide triplets and quadruplets are also prominent in DNA or di-nucs will answer this bias.

ADD REPLY • link 4.0 years ago by Asaf 10k

0

Entering edit mode

Thank you @Asaf for your patience. I am new to NGS analysis. Is the analysis specified here a good approach https://a-little-book-of-r-for-bioinformatics.readthedocs.io/en/latest/src/chapter4.html#calculating-the-statistical-significance-of-a-pairwise-global-alignment

ADD REPLY • link 4.0 years ago by rrgnu2007 ▴ 10

0

Entering edit mode

Yes, computing an empirical p-value described there is what I suggested, the problem with their approach is that it will generate non-biological sequences so you might end up with a (statistically) significant score that doesn't necessarily reflect biological significance. I'm sorry, I didn't understand what you're trying to show with the R code. By the way, this is not a standard NGS processing step, make sure this is what you want.

ADD REPLY • link 4.0 years ago by Asaf 10k