Question: Question concerning a for loop in R with strsplit to generate amino acid abundance visuals
0
gravatar for MEGA3000
4.6 years ago by
MEGA30000
MEGA30000 wrote:

Hello Biostars,

I'd like to preface this post to say that I am new to R and bioinformatics coding, and that I'd really appreciate some input from this knowledgable community.  My goal for the code posted below is to generate pie charts that show amino acid abundance per protein from BLAST results.  I uploaded a csv file from UniProt, converted it to a matrix, and wrote out the code below.  I keep getting the error: In AAs[i] = table(strsplit(BLAST_AA_seqs[i], "", useBytes = TRUE)) :number of items to replace is not a multiple of replacement length.  Column 8 is the output column that contains the amino acid sequences.  Thanks in advance!

mydata=read.table("C:/Users/du0/Desktop/Downloads/CDPKbeta_BLAST_results.csv", header=TRUE,sep=",")
mydata=as.matrix(mydata)

AAs=c()
BLAST_AA_seqs=c()
for(i in 1:nrow(mydata)){
  print(i)
  BLAST_AA_seqs[i]=mydata[i,8]
  AAs[i]=table(strsplit(BLAST_AA_seqs[i],"", useBytes=TRUE))
  pie(AAs, col=rainbow(length(AAs)), main="Residue abundance")
}

alignment blast graphic sequence R • 2.8k views
ADD COMMENTlink modified 4.6 years ago by Michael Dondrup46k • written 4.6 years ago by MEGA30000

http://stackoverflow.com/questions/27245246/a-for-loop-with-strsplit-in-r-error

ADD REPLYlink written 4.6 years ago by zx87547.6k

first answer there is wrong...

ADD REPLYlink written 4.6 years ago by Michael Dondrup46k
5
gravatar for Michael Dondrup
4.6 years ago by
Bergen, Norway
Michael Dondrup46k wrote:

So, putting the final solution you should use on top. In R it is best to use existing packages and functions. Package Biostrings has a function alphabetFrequency:

library(Biostrings)
mydata=read.table("C:...", header=TRUE,sep=",")[,8]
sapply(mydata, function (seq) {
  AAs=alphabetFrequency(AAString(seq))
  pie(AAs[AAs>0]) # remove all 0 counts, otherwise ugly pie
})

Let's start with making the code work as it is, because you are close to a working solution, after that let's remove all the beginner mistakes.

AAs=list()
BLAST_AA_seqs=c()
for(i in 1:nrow(mydata)){
  print(i)
  BLAST_AA_seqs[i]=mydata[i,8]
  AAs[[i]]=table(strsplit(BLAST_AA_seqs[i],"", useBytes=TRUE))
  pie(AAs[[i]], col=rainbow(length(AAs)), main="Residue abundance")
}

look at the output of table:

A L M T U X 
2 1 3 1 1 3 

A list is the right datatype to store a collection of other data objects in R, not a vector, as vector elements must be of the same type. AAs[i] was referring to a vector element of length 1 while the table has a different length and hence cannot replace a single vector element, instead a list can contain objects of any type and length. Note that pie can only plot one vector at a time.


Reduced solution with loop, there's a lot of redundancy in the first code, it is in principle a one-liner.

mydata=read.table("C:...", header=TRUE,sep=",")[,8]
# we don't need to keep the data we do not need
mydata=as.matrix(mydata) # not necessary, a dataframe can be indexed just fine
AAs=c()
BLAST_AA_seqs=c() # Not needed as you do not want to store the data for later
for(i in 1:length(mydata)){
  seq = mydata[i] # use local variables instead
  AAs = table(strsplit(seq,"", useBytes=TRUE))
  pie (AAs, col=rainbow(length(AAs)), main="Residue abundance")
}

In R, for loops shouldn't be used, instead use apply, sapply, etc.:

mydata=read.table("C:...", header=TRUE,sep=",")[,8]
pdf() # you will get a lot of plots, this will make a pdf of your plots with one plot per page
sapply(mydata, function (seq) {  
  AAs = table(strsplit(seq, "", useBytes=TRUE))
  pie (AAs, col=rainbow(length(AAs)), main="Residue abundance")
} )
dev.off()
ADD COMMENTlink modified 4.6 years ago • written 4.6 years ago by Michael Dondrup46k

This is pretty good. You could also run strsplit on the entire column (e.g. AA <- strsplit(mydata$seq, "") if it were headed seq). Then use lapply much as sapply to table and plot the data in each list element.

ADD REPLYlink written 4.6 years ago by Neilfws48k
5
gravatar for Michael Dondrup
4.6 years ago by
Bergen, Norway
Michael Dondrup46k wrote:

So, now for a serious final answer, let's approach the visualization you are attempting, and therefore do a complete rewrite of the solution. You are proposing pie charts, but pie charts are not good for displaying data (See help page of pie function). In addition they are really very bad for comparison. Therefore, I am proposing stacked bar charts and dotcharts as an alternative. In contrast to pie charts they allow you to depict multiple rows of data in one image. We are also using a AASStringSet to hold our data, use alphabet frequency on it to get a matrix like type. No more apply or loops are needed.

## example data, contains also stop codons
seqs <- c("AAULMMMTXXX", "ALUMMTXXXX*", "MTXXMMM*M", "MTTTALSSSUUX*") 
AAs <- alphabetFrequency(AAStringSet(seqs)) ## alphabetFrequency also works on sets
layout(matrix(c(1,2), 1, 2,  byrow = TRUE), width=c(0.7,0.3)) ## allow to place the legend
nAAs = AAs/rowSums(AAs) ## if you want to normalize by total length, use this
barplot(t(nAAs),col=rainbow(ncol(AAs))) ## yields a stacked barplot, one bar per sequence
plot.new()
legend(x="bottom", legend=rev(colnames(AAs)), fill=rev(rainbow(ncol(AAs))),cex=0.7)

Output:

 

ADD COMMENTlink modified 4.6 years ago • written 4.6 years ago by Michael Dondrup46k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1459 users visited in the last hour