Question: Find out over-represented nucleotides
0
gravatar for banerjeeshayantan
11 weeks ago by
banerjeeshayantan70 wrote:

I have 10000 entries as a fasta file that contains nucleotide sequences of length 5.
Eg:- AGGTC, AGCTC, CGCTC, .... 10000 entries.
Now I have plotted the sequence logo plots using the "ggseqlogo" package of R. Now, although I am getting the relative abundance of the bases at each of the five locations(visually), I want to quantify it. Say, location 1 has an over representation of base A by 30% , G by 20% etc.
Is there any other method/package to do that?
or
In ggseqlogo, how do I find the height of each letter and accordingly measure for over representation.
Also sometimes the bases look similar in heights, but I definitely know from literature that they are not equal.

assembly sequencing R • 185 views
ADD COMMENTlink modified 11 weeks ago by Alex Reynolds25k • written 11 weeks ago by banerjeeshayantan70
2
gravatar for Alex Reynolds
11 weeks ago by
Alex Reynolds25k
Seattle, WA USA
Alex Reynolds25k wrote:

Look at the seqLogo package in Bioconductor. https://bioconductor.org/packages/release/bioc/html/seqLogo.html You could use the ic property of a seqLogo object: https://bioconductor.org/packages/release/bioc/vignettes/seqLogo/inst/doc/seqLogo.pdf

If you are comfortable with the command line, you could use WebLogo 3: http://weblogo.threeplusone.com/

The idea here is that frequency counts or, additionally, background correction, gives you information content — this "informs" you of quantitative overrepresentation, or conservation, of one or more bases to the exclusion of others.

ADD COMMENTlink modified 11 weeks ago • written 11 weeks ago by Alex Reynolds25k
1
gravatar for cpad0112
11 weeks ago by
cpad01129.0k
India
cpad01129.0k wrote:

did you try prob method in plotting ggseqlogo? use biostrings package if you require counts and probability.

Let us say following is part of your data:

head(test$x)
[1] "CCATATATAG" "CCATATATAG" "CCATAAATAG" "CCATAAATAG" "CCATAAATAG" "CCATAAATAG"

convert that into DNAstringset:

library(Biostrings)
testdss=DNAStringSet(head(test$x))

Now you can calculate per row counts and probabilities:

> alphabetFrequency(testdss, baseOnly=TRUE)
     A C G T other
[1,] 4 2 1 3     0
[2,] 4 2 1 3     0
[3,] 5 2 1 2     0
[4,] 5 2 1 2     0
[5,] 5 2 1 2     0
[6,] 5 2 1 2     0
> alphabetFrequency(testdss, baseOnly=TRUE, as.prob = T)
       A   C   G   T other
[1,] 0.4 0.2 0.1 0.3     0
[2,] 0.4 0.2 0.1 0.3     0
[3,] 0.5 0.2 0.1 0.2     0
[4,] 0.5 0.2 0.1 0.2     0
[5,] 0.5 0.2 0.1 0.2     0
[6,] 0.5 0.2 0.1 0.2     0

At each position, alphabet count and probability can be printed for 10 mer:

> consensusMatrix(testdss,baseOnly=T)
      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
A        0    0    6    0    6    4    6    0    6     0
C        6    6    0    0    0    0    0    0    0     0
G        0    0    0    0    0    0    0    0    0     6
T        0    0    0    6    0    2    0    6    0     0
other    0    0    0    0    0    0    0    0    0     0
> consensusMatrix(testdss,baseOnly=T, as.prob = T)
      [,1] [,2] [,3] [,4] [,5]      [,6] [,7] [,8] [,9] [,10]
A        0    0    1    0    1 0.6666667    1    0    1     0
C        1    1    0    0    0 0.0000000    0    0    0     0
G        0    0    0    0    0 0.0000000    0    0    0     1
T        0    0    0    1    0 0.3333333    0    1    0     0
other    0    0    0    0    0 0.0000000    0    0    0     0
ADD COMMENTlink modified 11 weeks ago • written 11 weeks ago by cpad01129.0k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1250 users visited in the last hour