Testing independence between 5' and 3' terminal sequences in a DNA database
4 weeks ago
Anand Rao ▴ 360

I have a small database of ~1000 biologically validated DNA sequences, varying in length from 5 to 20 kb. From each of them, I have extracted the 1000 bp at the 5' terminus and, separately, the 1000 bp at the 3' terminus. How can I test for independence of these 5' and 3' sequences?

If these were numbers instead of sequences, I would perform a Chi-Square test, right? But what is the equivalent test for DNA sequence independence? In other words, does the presence of a certain 5' terminal sequence correlate (directly or inversely) with the presence of a certain 3' terminal sequence, and vice versa?

I cannot think of how to perform this independence test directly using DNA sequences, so I seek BioStars help for this.

Currently, I am thinking of this pipeline:

1. Cluster 5' terminal sequences at varying identities, record the cluster memberships.
2. Repeat step 1 in exactly the same way, for 3' sequences.
3. For each of my 1000 DNA sequences, report cluster memberships separately for both ends, at each clustering identity %.
4. From the table generated in step 3, determine whether, and at what % identity, there appears to be any correlation. Would this step be based on multinomial logistic regression? (note to self: wiki link)
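As a rough illustration of step 4, the table from step 3 can be read as a contingency table of 5' cluster labels against 3' cluster labels. The sketch below is a minimal, standard-library-only Python example that computes a Pearson Chi-Square statistic from two label lists. The function name `chi_square_statistic` and the toy labels are hypothetical, and the Chi-Square statistic is shown only as one candidate for this step, not as the required method.

```python
from collections import Counter


def chi_square_statistic(labels_5p, labels_3p):
    """Return (Pearson chi-square statistic, degrees of freedom).

    labels_5p[i] and labels_3p[i] are the cluster labels assigned to the
    5' and 3' ends of the same sequence i (hypothetical input format).
    """
    if len(labels_5p) != len(labels_3p):
        raise ValueError("need one 5' and one 3' label per sequence")
    n = len(labels_5p)
    observed = Counter(zip(labels_5p, labels_3p))  # joint counts
    row_totals = Counter(labels_5p)                # 5' cluster sizes
    col_totals = Counter(labels_3p)                # 3' cluster sizes

    stat = 0.0
    for r, r_count in row_totals.items():
        for c, c_count in col_totals.items():
            # Count expected in cell (r, c) if the two ends were independent.
            expected = r_count * c_count / n
            stat += (observed[(r, c)] - expected) ** 2 / expected
    dof = (len(row_totals) - 1) * (len(col_totals) - 1)
    return stat, dof


# Toy labels (made up): 5' cluster "A" always pairs with 3' cluster "x".
five_prime = ["A", "A", "B", "B", "A", "B"]
three_prime = ["x", "x", "y", "y", "x", "y"]
stat, dof = chi_square_statistic(five_prime, three_prime)
print(stat, dof)
```

The statistic would then be compared against a Chi-Square distribution with `dof` degrees of freedom. With many small clusters the expected counts per cell become tiny and that approximation is unreliable, so this is only a starting point.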

Please suggest any changes in approach and / or implementation. Thanks in advance!

sequence cluster DNA independence testing

I feel that your approach lacks a proper definition. What do you mean by "independence"?

In what way would a 5' and a 3' sequence be "dependent" on one another?

What you seem to propose is measuring some sort of similarity. That may be OK, but it would be better for everyone if you called it that. Renaming "similarity" and presenting it as "dependence" feels like a stretch.

And the Chi-Square test does not measure dependence either. It measures whether an observed effect size could have been caused by random chance alone when sampling identical distributions. As a matter of fact, all tests require independence of measurements; none would produce correct values if the measurements were not independent.


Thanks for your reply. Yes, I think I could have explained my problem much better, but rather than change my original post, I will add some clarification in my comment here.

I am not looking to measure sequence similarity, but whether a certain sequence at the 5' end indicates a high probability of a certain other sequence at the 3' end, and vice versa, i.e. whether the terminal sequences are 'dependent' or 'independent'.

In this context, I believe testing for independence using a Chi-Square test does make some sense (see https://online.stat.psu.edu/stat500/lesson/8/8.1). Perhaps you agree now? If not, does my clarification at least help you suggest a different approach? TIA!


With that definition I would suspect that all of your sequences are "dependent". I would expect every functional DNA/RNA region to capture some relationship between its start and end.

Thus I think the question is more about the magnitude of that dependence rather than existence.

I would start with simpler measures of correlation, for example the GC content of the 5' end plotted against the GC content of the 3' end, or perhaps codon usage.
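A minimal sketch of the GC-content comparison, in plain Python with no external libraries. The helper names (`gc_content`, `pearson`, `terminal_gc_correlation`) and the toy sequences are made up for illustration. The real data would use `flank=1000`, and the sequences are assumed to be already loaded as strings.

```python
from math import sqrt


def gc_content(seq):
    """Fraction of G/C among the unambiguous bases (A, C, G, T) of seq."""
    seq = seq.upper()
    acgt = sum(seq.count(base) for base in "ACGT")
    if acgt == 0:
        raise ValueError("sequence contains no A/C/G/T bases")
    return (seq.count("G") + seq.count("C")) / acgt


def pearson(xs, ys):
    """Pearson correlation coefficient; undefined if either list is constant."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


def terminal_gc_correlation(sequences, flank=1000):
    """Correlate GC content of the first and last `flank` bases per sequence."""
    gc_5p = [gc_content(s[:flank]) for s in sequences]
    gc_3p = [gc_content(s[-flank:]) for s in sequences]
    return pearson(gc_5p, gc_3p)


# Toy sequences (made up) with a 4 bp flank so the example stays readable.
toy = ["GGCCAAAAGGGC", "ATATCCCCATTA", "GCATGGGGGCAA"]
r = terminal_gc_correlation(toy, flank=4)
print(r)
```

A scatter plot of the two GC lists is worth looking at alongside the coefficient, since a single number can hide clusters or outliers.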

The method you describe might work as well, but I would run it at different clustering levels and, importantly, you need to properly estimate what the expected frequencies ought to be.
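One way to estimate what the frequencies would look like under no association is a permutation test: shuffle the 3' cluster labels across sequences many times and see how often the shuffled data look as associated as the real data. The sketch below is a swapped-in illustration of that idea rather than anything specified in this thread. The function names, the default number of permutations, and the toy labels are all assumptions.

```python
import random
from collections import Counter


def association_stat(labels_a, labels_b):
    """Pearson chi-square statistic for two paired lists of cluster labels."""
    n = len(labels_a)
    observed = Counter(zip(labels_a, labels_b))
    totals_a, totals_b = Counter(labels_a), Counter(labels_b)
    stat = 0.0
    for a, count_a in totals_a.items():
        for b, count_b in totals_b.items():
            expected = count_a * count_b / n  # expected under independence
            stat += (observed[(a, b)] - expected) ** 2 / expected
    return stat


def permutation_p_value(labels_5p, labels_3p, n_perm=2000, seed=0):
    """Share of label shuffles at least as associated as the real pairing."""
    rng = random.Random(seed)  # fixed seed so the result is reproducible
    real = association_stat(labels_5p, labels_3p)
    shuffled = list(labels_3p)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # breaks any 5'/3' pairing, keeps cluster sizes
        if association_stat(labels_5p, shuffled) >= real - 1e-12:
            hits += 1
    # Add-one correction so the estimate is never exactly zero.
    return (hits + 1) / (n_perm + 1)


# Toy labels (made up): the two ends always fall in matching clusters.
five_prime = ["A"] * 20 + ["B"] * 20
three_prime = ["x"] * 20 + ["y"] * 20
p = permutation_p_value(five_prime, three_prime, n_perm=500)
print(p)
```

Running this once per clustering identity level gives a p-value per level. Because several levels are tested, some correction for multiple comparisons would be needed before reading much into any single one.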