Question

Find out over-represented nucleotides

0

Entering edit mode

5.8 years ago

Gene_MMP8 ▴ 240

I have 10000 entries as a fasta file that contains nucleotide sequences of length 5.
Eg:- AGGTC, AGCTC, CGCTC, .... 10000 entries.
Now I have plotted the sequence logo plots using the "ggseqlogo" package of R. Now, although I am getting the relative abundance of the bases at each of the five locations(visually), I want to quantify it. Say, location 1 has an over representation of base A by 30% , G by 20% etc.
Is there any other method/package to do that?
or
In ggseqlogo, how do I find the height of each letter and accordingly measure for over representation.
Also sometimes the bases look similar in heights, but I definitely know from literature that they are not equal.

R sequencing assembly • 1.2k views

ADD COMMENT • link updated 5.8 years ago by Alex Reynolds 35k • written 5.8 years ago by Gene_MMP8 ▴ 240

score 2 · Answer 1 · 2018-07-06

Look at the seqLogo package in Bioconductor. https://bioconductor.org/packages/release/bioc/html/seqLogo.html You could use the ic property of a seqLogo object: https://bioconductor.org/packages/release/bioc/vignettes/seqLogo/inst/doc/seqLogo.pdf

If you are comfortable with the command line, you could use WebLogo 3: http://weblogo.threeplusone.com/

The idea here is that frequency counts or, additionally, background correction, gives you information content — this "informs" you of quantitative overrepresentation, or conservation, of one or more bases to the exclusion of others.

score 1 · Answer 2 · 2018-07-07

did you try prob method in plotting ggseqlogo? use biostrings package if you require counts and probability.

Let us say following is part of your data:

head(test$x)
[1] "CCATATATAG" "CCATATATAG" "CCATAAATAG" "CCATAAATAG" "CCATAAATAG" "CCATAAATAG"

convert that into DNAstringset:

library(Biostrings)
testdss=DNAStringSet(head(test$x))

Now you can calculate per row counts and probabilities:

> alphabetFrequency(testdss, baseOnly=TRUE)
     A C G T other
[1,] 4 2 1 3     0
[2,] 4 2 1 3     0
[3,] 5 2 1 2     0
[4,] 5 2 1 2     0
[5,] 5 2 1 2     0
[6,] 5 2 1 2     0
> alphabetFrequency(testdss, baseOnly=TRUE, as.prob = T)
       A   C   G   T other
[1,] 0.4 0.2 0.1 0.3     0
[2,] 0.4 0.2 0.1 0.3     0
[3,] 0.5 0.2 0.1 0.2     0
[4,] 0.5 0.2 0.1 0.2     0
[5,] 0.5 0.2 0.1 0.2     0
[6,] 0.5 0.2 0.1 0.2     0

At each position, alphabet count and probability can be printed for 10 mer:

> consensusMatrix(testdss,baseOnly=T)
      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
A        0    0    6    0    6    4    6    0    6     0
C        6    6    0    0    0    0    0    0    0     0
G        0    0    0    0    0    0    0    0    0     6
T        0    0    0    6    0    2    0    6    0     0
other    0    0    0    0    0    0    0    0    0     0
> consensusMatrix(testdss,baseOnly=T, as.prob = T)
      [,1] [,2] [,3] [,4] [,5]      [,6] [,7] [,8] [,9] [,10]
A        0    0    1    0    1 0.6666667    1    0    1     0
C        1    1    0    0    0 0.0000000    0    0    0     0
G        0    0    0    0    0 0.0000000    0    0    0     1
T        0    0    0    1    0 0.3333333    0    1    0     0
other    0    0    0    0    0 0.0000000    0    0    0     0