Question: How to calculate dinucleotide A/T patterns in a given human reference genome region?
1
gravatar for Ashley
13 months ago by
Ashley50
China
Ashley50 wrote:
library("Biostrings")

I want to calculate the probability of di-nucleotide AA, TT, AT, and TA in each 2 location.

My DNA sequence is as follows:

DNA.set
  A DNAStringSet instance of length 5
    width seq                                                              names               
[1]    20 TCCGTATTGGAAAGCTCGTC                                             SEQ-1
[2]    20 TTAGACCACTCCGCATGTAG                                             SEQ-2
[3]    20 CTGTGGTACGGCTCAAACGG                                             SEQ-3
[4]    20 CTCCCGCCTATCTCCCTTCT                                             SEQ-4
[5]    20 TCGCCTAGAAAAAGTTTCCT                                             SEQ-5

I want to obtain the result as follows:

AA=0,0,0,0,1/5,2/5,0,1/5,0,0
TT=1/5,0,0,0,0,0,0,1/5,1/5,0
AT=0,0,0,0,0,0,0,0,1/5,0,0
TA=0,0,1/5,1/5,1/5,0,0,0,0,0

Any help would be great appreciate.

sequence R • 376 views
ADD COMMENTlink modified 12 months ago by Biostar ♦♦ 20 • written 13 months ago by Ashley50

dinucleotideFrequency function of biostrings could give those 2mers. than you can take the subset of your desired ones.

ADD REPLYlink written 13 months ago by morovatunc400
> dinucleotideFrequency(DNA.set)
     AA AC AG AT CA CC CG CT GA GC GG GT TA TC TG TT
[1,]  2  0  1  1  0  1  2  1  1  1  1  2  1  3  1  1
[2,]  0  2  2  1  2  2  1  1  1  1  0  1  2  1  1  1
[3,]  2  2  0  0  1  0  2  2  0  1  3  2  1  1  2  0
[4,]  0  0  0  1  0  5  1  5  0  1  0  0  1  4  0  1
[5,]  4  0  2  0  0  2  1  2  1  1  0  1  1  2  0  2

Thanks for your reply. But I want to know the frequency or probability of A/T in each position. Not total number. So dinucleotideFrequency maybe isn't suitable for me.

ADD REPLYlink written 13 months ago by Ashley50

consensusMatrix(dinucleotideFrequency(DNA.set)) ? maybe ?

ADD REPLYlink written 13 months ago by morovatunc400

For our example,

> consensusMatrix(dinucleotideFrequency(DNA.set))
  [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12] [,13] [,14] [,15] [,16]
0    2    3    2    2    3    1    0    0    2     0     3     1     0     0     2     1
1    0    0    1    3    1    1    3    2    3     5     1     2     4     2     2     3
2    2    2    2    0    1    2    2    2    0     0     0     2     1     1     1     1
3    0    0    0    0    0    0    0    0    0     0     1     0     0     1     0     0
4    1    0    0    0    0    0    0    0    0     0     0     0     0     1     0     0
5    0    0    0    0    0    1    0    1    0     0     0     0     0     0     0     0

Thanks for your kind help. But I think the column of result should be the length(DNA.seq)/2=10, however, the column is 16. And it didn't show which column represents for AA, AT, TA and TT. I am the newcomer of bioinformatics, could you help me figure it out? Thank you so much. With my best wishes.

I think the number of column is always the 16. For another example,

> dna2
  A DNAStringSet instance of length 3
    width seq                                                              names               
[1]     4 ACGT                                                             seq1
[2]     4 GTCA                                                             seq2
[3]     4 GCTA                                                             seq3
> consensusMatrix(dna2, as.prob = TRUE)
       [,1]      [,2]      [,3]      [,4]
A 0.3333333 0.0000000 0.0000000 0.6666667
C 0.0000000 0.6666667 0.3333333 0.0000000
G 0.6666667 0.0000000 0.3333333 0.0000000
T 0.0000000 0.3333333 0.3333333 0.3333333
M 0.0000000 0.0000000 0.0000000 0.0000000
R 0.0000000 0.0000000 0.0000000 0.0000000
W 0.0000000 0.0000000 0.0000000 0.0000000
S 0.0000000 0.0000000 0.0000000 0.0000000
Y 0.0000000 0.0000000 0.0000000 0.0000000
K 0.0000000 0.0000000 0.0000000 0.0000000
V 0.0000000 0.0000000 0.0000000 0.0000000
H 0.0000000 0.0000000 0.0000000 0.0000000
D 0.0000000 0.0000000 0.0000000 0.0000000
B 0.0000000 0.0000000 0.0000000 0.0000000
N 0.0000000 0.0000000 0.0000000 0.0000000
- 0.0000000 0.0000000 0.0000000 0.0000000
+ 0.0000000 0.0000000 0.0000000 0.0000000
. 0.0000000 0.0000000 0.0000000 0.0000000
> dim(consensusMatrix(dna2, as.prob = TRUE))
[1] 18  4
> consensusMatrix(dinucleotideFrequency(dna2))
  [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12] [,13] [,14] [,15] [,16]
0    3    2    3    3    2    3    2    2    3     2     3     1     2     2     3     3
1    0    1    0    0    1    0    1    1    0     1     0     2     1     1     0     0
ADD REPLYlink modified 13 months ago • written 13 months ago by Ashley50

my example is: (original link: https://www.dropbox.com/s/h8de0hcc8vc193t/data.jpg?dl=0 )

ADD REPLYlink modified 12 months ago by Devon Ryan94k • written 13 months ago by Ashley50
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1629 users visited in the last hour