I have a small database of ~1000 biologically validated DNA sequences, varying in length from 5 to 20 kb. From each of them, I have extracted the 1000 bp at the 5' terminus and, separately, the 1000 bp at the 3' terminus. How can I test whether these 5' and 3' sequences are independent?
If these were categorical variables instead of sequences, I would perform a chi-square test of independence, right? But what is the equivalent test for DNA sequence independence? In other words, does the presence of a certain 5' terminal sequence correlate (directly or inversely) with the presence of a certain 3' terminal sequence, and vice versa?
I cannot think of a way to perform this independence test directly on DNA sequences, so I am asking BioStars for help.
Currently, I am thinking of this pipeline:
1. Cluster the 5' terminal sequences at varying identity thresholds and record the cluster memberships.
2. Repeat step 1 in exactly the same way for the 3' terminal sequences.
3. For each of my 1000 DNA sequences, report the cluster memberships separately for both ends, at each clustering identity %.
4. From the table generated in step 3, determine whether, and at what % identity, there appears to be any correlation between 5' and 3' cluster membership. Would this step be based on multinomial logistic regression?
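To make the last step concrete, here is a minimal sketch of one option I could imagine for a single identity threshold. It is not multinomial logistic regression: it swaps in a Pearson chi-square statistic on the 5'-cluster by 3'-cluster contingency table, with a permutation p-value, on the assumption that many cells of that table will be too sparse for the usual chi-square approximation. The function names and the toy labels are hypothetical, and the cluster labels are assumed to come from whatever clustering tool is used in the earlier steps.

```python
import random
from collections import Counter


def chi_square_stat(labels_5p, labels_3p):
    """Pearson chi-square statistic for the 5'-cluster x 3'-cluster table."""
    n = len(labels_5p)
    row_totals = Counter(labels_5p)                 # sequences per 5' cluster
    col_totals = Counter(labels_3p)                 # sequences per 3' cluster
    observed = Counter(zip(labels_5p, labels_3p))   # joint counts
    stat = 0.0
    for r, n_r in row_totals.items():
        for c, n_c in col_totals.items():
            expected = n_r * n_c / n                # count expected under independence
            stat += (observed.get((r, c), 0) - expected) ** 2 / expected
    return stat


def permutation_p_value(labels_5p, labels_3p, n_perm=2000, seed=0):
    """Shuffle the 3' labels to break any 5'/3' pairing and compare statistics."""
    rng = random.Random(seed)
    observed_stat = chi_square_stat(labels_5p, labels_3p)
    shuffled = list(labels_3p)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if chi_square_stat(labels_5p, shuffled) >= observed_stat:
            hits += 1
    # Add-one correction so the estimated p-value is never exactly zero.
    return observed_stat, (hits + 1) / (n_perm + 1)


# Hypothetical toy labels: every 5' cluster pairs with exactly one 3' cluster.
toy_5p = ["A"] * 20 + ["B"] * 20 + ["C"] * 20
toy_3p = ["X"] * 20 + ["Y"] * 20 + ["Z"] * 20
toy_stat, toy_p = permutation_p_value(toy_5p, toy_3p, n_perm=500)
print(toy_stat, toy_p)
```

In practice this would be run once per identity threshold, so the resulting p-values would need a multiple-testing correction before being compared across thresholds.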
Please suggest any changes in approach and/or implementation. Thanks in advance!