Question concerning a for loop in R with strsplit to generate amino acid abundance visuals
2
0
Entering edit mode
7.5 years ago
MEGA3000 • 0

Hello Biostars,

I'd like to preface this post to say that I am new to R and bioinformatics coding, and that I'd really appreciate some input from this knowledgable community. My goal for the code posted below is to generate pie charts that show amino acid abundance per protein from BLAST results. I uploaded a csv file from UniProt, converted it to a matrix, and wrote out the code below. I keep getting the error:

In AAs[i] = table(strsplit(BLAST_AA_seqs[i], "", useBytes = TRUE)) :number of items to replace is not a multiple of replacement length.


Column 8 is the output column that contains the amino acid sequences. Thanks in advance!

mydata=read.table("C:/Users/du0/Desktop/Downloads/CDPKbeta_BLAST_results.csv", header=TRUE,sep=",")
mydata=as.matrix(mydata)

AAs=c()
BLAST_AA_seqs=c()
for(i in 1:nrow(mydata)){
print(i)
BLAST_AA_seqs[i]=mydata[i,8]
AAs[i]=table(strsplit(BLAST_AA_seqs[i],"", useBytes=TRUE))
pie(AAs, col=rainbow(length(AAs)), main="Residue abundance")
}

sequence blast alignment Graphic R • 3.9k views
0
Entering edit mode
0
Entering edit mode

5
Entering edit mode
7.5 years ago

So, putting the final solution you should use on top. In R it is best to use existing packages and functions. Package Biostrings has a function alphabetFrequency:

library(Biostrings)
sapply(mydata, function (seq) {
AAs=alphabetFrequency(AAString(seq))
pie(AAs[AAs>0]) # remove all 0 counts, otherwise ugly pie
})


Let's start with making the code work as it is, because you are close to a working solution, after that let's remove all the beginner mistakes.

AAs=list() # list instead of vector
BLAST_AA_seqs=c()
for(i in 1:nrow(mydata)){
print(i)
BLAST_AA_seqs[i]=mydata[i,8]
AAs[[i]]=table(strsplit(BLAST_AA_seqs[i],"", useBytes=TRUE)) ## Lists use [[]] instead of [] for indexing
pie(AAs[[i]], col=rainbow(length(AAs)), main="Residue abundance")
}


look at the output of table:

A L M T U X
2 1 3 1 1 3


A list is the right datatype to store a collection of other data objects in R, not a vector, as vector elements must be of the same type. AAs[i] was referring to a vector element of length 1 while the table has a different length and hence cannot replace a single vector element, instead a list can contain objects of any type and length. Note that pie can only plot one vector at a time.

Reduced solution with loop, there's a lot of redundancy in the first code, it is in principle a one-liner.

mydata=read.table("C:...", header=TRUE,sep=",")[,8]
# we don't need to keep the data we do not need

# mydata=as.matrix(mydata) # not necessary, a dataframe can be indexed just fine
# AAs=c()
# BLAST_AA_seqs=c() # Not needed as you do not want to store the data for later
for(i in 1:length(mydata)){
seq = mydata[i] # use local variables instead
AAs = table(strsplit(seq,"", useBytes=TRUE))
pie (AAs, col=rainbow(length(AAs)), main="Residue abundance")
}


In R, for loops shouldn't be used, instead use apply, sapply, etc.:

mydata=read.table("C:...", header=TRUE,sep=",")[,8]
pdf() # you will get a lot of plots, this will make a pdf of your plots with one plot per page
sapply(mydata, function (seq) {
AAs = table(strsplit(seq, "", useBytes=TRUE))
pie (AAs, col=rainbow(length(AAs)), main="Residue abundance")
} )
dev.off()

0
Entering edit mode

This is pretty good. You could also run strsplit on the entire column (e.g. AA <- strsplit(mydata\$seq, "") if it were headed seq). Then use lapply much as sapply to table and plot the data in each list element.

5
Entering edit mode
7.5 years ago

So, now for a serious final answer, let's approach the visualization you are attempting, and therefore do a complete rewrite of the solution. You are proposing pie charts, but pie charts are not good for displaying data (See help page of pie function). In addition they are really very bad for comparison. Therefore, I am proposing stacked bar charts and dotcharts as an alternative. In contrast to pie charts they allow you to depict multiple rows of data in one image. We are also using a AASStringSet to hold our data, use alphabet frequency on it to get a matrix like type. No more apply or loops are needed.

## example data, contains also stop codons
seqs <- c("AAULMMMTXXX", "ALUMMTXXXX*", "MTXXMMM*M", "MTTTALSSSUUX*")
AAs <- alphabetFrequency(AAStringSet(seqs)) ## alphabetFrequency also works on sets
layout(matrix(c(1,2), 1, 2, byrow = TRUE), width=c(0.7,0.3)) ## allow to place the legend
nAAs = AAs/rowSums(AAs) ## if you want to normalize by total length, use this
barplot(t(nAAs),col=rainbow(ncol(AAs))) ## yields a stacked barplot, one bar per sequence
plot.new()
legend(x="bottom", legend=rev(colnames(AAs)), fill=rev(rainbow(ncol(AAs))),cex=0.7)


Output: