Question: Correct statistical test to determine the significance of nucleotides present
0
7 months ago by
banerjeeshayantan80 wrote:

Hi all,
So I am trying to develop the correct statistical test in R to determine the following:
I have 1000000 sequences each of five base pair length. Eg:
ATTTG
ATGCG
ATTTT
GCCCT .
.
.
1000000 items.
Also there are 12 catgories in which these sequences can be divided. So say, 40000 out of 10000000 sequences belong to category1 and so on.
Now I have to develop a statistical test to determine how significant are the presence of nucleotides in each location. To elaborate, say, location 1 of category 1 30% A, 40% G and so on. These percentages are calculated using :
(number of As in location 1)/1000000 etc. So, how significant is the proportion of A at location 1. Is there an overrepresentation of As?

I thought of two ways: One is to calculate the genome average of A,T, G,C. So in hg38 ref genome, I have say, 27% A, 21%G etc for hg38. Now taking category1 (all 40000 sequences) and calculating %A in my location 1 clearly indicates an over-representation at that location. But I am skeptical about this method mainly because it doesn't involve any statistical analysis. Can you please suggest otherwise?

Second, shuffle each nucleotide on a positional basis. Say, I shuffle all the nucleotides in Position 1, similarly for Position 2 and so on for all 10000000 sequences. Now I pick 40000 sequences(for category 1) at random from these already shuffled list and see what percentage of A I get. Similarly for category 2 and so on. Is this correct?

next-gen R sequence • 419 views
modified 7 months ago by Kevin Blighe37k • written 7 months ago by banerjeeshayantan80

Could you tell us why you're doing this please? This reads like creating a plot to represent motifs, like the one here:

Ok please pardon my ignorance, but how to give feedback to the posts? Should I accept an answer?

1

By doing one (or more) of the following as appropriate.

If an answer was helpful, you should upvote it; if the answer resolved your question, you should mark it as accepted. You can accept more than one if they all work. These buttons can be found on the left edge of answers/comments.

Thanks for informing.

4
7 months ago by
Kevin Blighe37k
Republic of Ireland
Kevin Blighe37k wrote:

I would generate an equivalent number of random sequences where each base had an equal likelihood of appearing at each position.

Then, compare the distribution/frequency of the bases in your test versus your random data using a Chi Square test.

Your own suggestions don't seem too bad. There will be limitations to each method, even mine, though. Regarding the hg38 idea, you could simply identify all 5-mers in that genome and then use that as the reference against which you compare your test sequences, if you wish. I have AWK code that I can share that can do this, if you wish.

Kevin

Ps - I would normally write more but Im on my phone at the dentist

Thanks for taking the time out and writing an answer to this question. I would definitely try the random sequence generation idea and get back to you. Can you please share the kmer code? It would be very helpful.

1

Sure, will send the info later after I have fled from the dentist

1

Hey again, here are some simple AWK commands that count 5-mer sequences. It really helps if your FASTA files are formatted such that each line is a multiple of 5.

``````awk '!/^>/ {for (i=1; i<=NF; i+=5) {m5[\$(i)\$(i+1)\$(i+2)\$(i+3)\$(i+4)]+=1}} END {for (motif in m5) print motif","m5[motif]}' FS='' /ReferenceMaterial/K.hansenii/GCF_000164395.1_ASM16439v1_genomic.fa | head -10
CACGA,755
AGCTC,155
CACCA,1155
ACCTC,393
CACGC,1449
GACAG,1056
CACCC,860
AGCTG,419
ACCTG,1007
CACGG,1167

echo "ATGCCATGCCTTTTTAAAAAATGCC" | awk '!/^>/ {for (i=1; i<=NF; i+=5) {m5[\$(i)\$(i+1)\$(i+2)\$(i+3)\$(i+4)]+=1}} END {for (motif in m5) print motif","m5[motif]}' FS='' -
TTTTT,1
ATGCC,3
AAAAA,1
``````

As you can see, these move along the sequence by shifting a 'window' of 5 bases each time. It's also possible to count 5-mers by shifting by a single base (just requires an extra if statement):

``````awk '!/^>/ {for (i=1; i<=NF; i+=1) if (i+4<=NF) {m5[\$(i)\$(i+1)\$(i+2)\$(i+3)\$(i+4)]+=1}} END {for (motif in m5) print motif","m5[motif]}' FS='' /ReferenceMaterial/K.hansenii/GCF_000164395.1_ASM16439v1_genomic.fa | head -10
NNTTA,4
CACGA,3692
AGCTC,738
ACCTC,1771
CACCA,5283
CACGC,6670
CACCC,4019
GACAG,5057
AGCTG,1943
ACCTG,5178

echo "ATGCCATGCCTTTTTAAAAAATGCC" | awk '!/^>/ {for (i=1; i<=NF; i+=1) if (i+4<=NF) {m5[\$(i)\$(i+1)\$(i+2)\$(i+3)\$(i+4)]+=1}} END {for (motif in m5) print motif","m5[motif]}' FS='' -
TTAAA,1
TAAAA,1
CTTTT,1
TTTTT,1
GCCTT,1
AAAAT,1
TTTAA,1
CCATG,1
TGCCA,1
ATGCC,3
CCTTT,1
CATGC,1
AATGC,1
AAATG,1
TTTTA,1
TGCCT,1
GCCAT,1
AAAAA,2
``````

When I think about this from a biological context, I think that it's more correct to do it by shifting by a single base each time, but requires more processing time. Also, if your FASTA files are formatted to non-multiples of 5, you may consider reformatting them or else the script becomes a lot more difficult.

Other things to consider:

• the genome I use here is a bacterial genome that has masked (N) bases. It's possible to obtain the human genome without masked bases
• when looking at the human genome files, be aware that the high GC content telomeric and centromeric regions are not included in the reference builds (this statement needs some verifying)
• for the human genome, you may consider tabulating the 5-mer frequencies separately per chromosome
1

...and, apparently, for hg38.fasta:

``````awk '!/^>/ {for (i=1; i<=NF; i+=5) {m5[toupper(\$(i))toupper(\$(i+1))toupper(\$(i+2))toupper(\$(i+3))toupper(\$(i+4))]+=1}} END {for (motif in m5) print motif","m5[motif]}' FS='' /ReferenceMaterial/hg38/hg38.fasta | sort -t, -nrk2 | head

NNNNN,31993236
TTTTT,4074644
AAAAA,4038749
AAAAT,2427645
ATTTT,2425798
TATTT,2045200
AAATA,2033061
TAAAA,2021046
TTTTA,1996175
AGAAA,1964212
``````

Hi again, In your above command how can I generate the file hg38.5mer.freq?

1

Sorry, that was a type error from pasting from my command line. I ran the first AWK command overnight and saved it to hg38.5mer.freq, which was unsorted. I then sorted it this morning and, for the purposes of display here on Biostars, I merged the AWK and sort commands together by pipe.

It is possible to obtain hg38 without the masked bases, but I cannot recall from where, exactly. Note that I've also added a toupper command due to the fact that, in this reference genome that I have, there are `ATGC` and `atgc`. Again, it's possible to obtain hg38 without masked bases and with all bases captalised.