Hi all,

I am trying to work out the correct statistical test in R for the following problem:

I have 1000000 sequences, each five base pairs long. E.g.:

ATTTG

ATGCG

ATTTT

GCCCT
...

(1000000 items in total.)

These sequences can also be divided into 12 categories. So, say, 40000 out of the 1000000 sequences belong to category 1, and so on.

Now I have to develop a statistical test to determine how significant the presence of each nucleotide is at each location. To elaborate, say location 1 of category 1 has 30% A, 40% G and so on. These percentages are calculated using:

(number of As in location 1)/1000000 etc. So, how significant is the proportion of A at location 1? Is there an overrepresentation of As?

I thought of two ways. The first is to calculate the genome average of A, T, G and C. In the hg38 reference genome I have, say, 27% A, 21% G etc. Taking category 1 (all 40000 sequences) and calculating %A at location 1 then clearly indicates an over-representation at that location. But I am skeptical about this method, mainly because it doesn't involve any statistical analysis. Can you please suggest an alternative?
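To make the first idea concrete, here is a minimal sketch of how I imagine attaching a p-value to it: a one-sided z-test of the observed proportion against the genome-wide frequency. It is written in Python only for illustration (I believe binom.test or prop.test would do the equivalent in R). The function name and the numbers (12000 As out of 40000, background 27%) are made up for the example, and I am not sure this is the right test, since it assumes the sequences are independent.

```python
import math

def proportion_z_test(count, n, p0):
    """One-sided z-test: is the observed proportion count/n greater than p0?"""
    p_hat = count / n
    se = math.sqrt(p0 * (1 - p0) / n)            # standard error under the null
    z = (p_hat - p0) / se
    p_value = 0.5 * math.erfc(z / math.sqrt(2))  # upper-tail normal probability
    return z, p_value

# Hypothetical numbers: 30% A at location 1 of category 1 (12000 of 40000),
# tested against an assumed genome-wide A frequency of 27%.
z, p = proportion_z_test(count=12000, n=40000, p0=0.27)
print(f"z = {z:.2f}, one-sided p = {p:.3g}")
```

With 40000 sequences even a small difference from the background gives a tiny p-value, which is part of why I am unsure whether this comparison is meaningful.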

Second, shuffle the nucleotides on a positional basis: I shuffle all the nucleotides in position 1, then likewise in position 2, and so on, across all 1000000 sequences. Then I pick 40000 sequences (for category 1) at random from this shuffled list and see what percentage of A I get. Similarly for category 2 and so on. Is this correct?

Any suggestions are welcome. Thanks in advance for reading such a long question.
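Here is a rough sketch of how I picture this second approach for a single position, again in Python just for illustration. As far as I can tell, shuffling a position and then picking sequences at random amounts to drawing nucleotides from that position's column without replacement, so that is what the sketch does. The function name, the toy data and the observed count are all hypothetical; the real column would hold the 1000000 nucleotides at position 1 and the sample size would be 40000.

```python
import random

def empirical_p_value(column, observed_count, sample_size, base="A",
                      n_iter=1000, seed=0):
    """Fraction of random draws whose count of `base` is at least the observed one."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_iter):
        sample = rng.sample(column, sample_size)  # draw without replacement
        if sample.count(base) >= observed_count:
            hits += 1
    # Add 1 to numerator and denominator so the estimate is never exactly zero.
    return (hits + 1) / (n_iter + 1)

# Toy stand-in for the nucleotides at position 1 of every sequence.
toy_rng = random.Random(1)
column = [toy_rng.choice("ACGT") for _ in range(10000)]

# Hypothetical category: 400 sequences, 150 of which have A at position 1.
p = empirical_p_value(column, observed_count=150, sample_size=400)
print(f"empirical one-sided p = {p:.4f}")
```

The smallest p-value this can report is 1/(n_iter + 1), so the number of iterations limits the resolution.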

