Question: "missing value where TRUE/FALSE needed" in 16Stimator R script
1
gravatar for charles.bridges
9 months ago by
charles.bridges70 wrote:

Hello,

I'm using the program 16Stimator (paper here and source here) to estimate 16S rRNA gene copy number from draft genomes. It's comprised of a shell script that calls several other programs and R scripts. I run R on a linux cluster and have gotten most scripts to run successfully, but upon running the R script Estimate_16S23S_PriceBonCIR1R2.R, I receive the following error:

Rscript ./Scripts/Estimate_16S23S_PriceBonCIR1R2.R $Isolate $Lib $ReadLen $n
Error in if (Df16S$GeneLen[r] > 1400) { : 
  missing value where TRUE/FALSE needed
Execution halted

The required seqinr packages is installed. I've found a general answer regarding this error here, but I can't seem to gleen enough information to apply it to my situation. I haven't found anyone else running into this problem. I should mention that the authors state that16Stimator requires R v3.1.0, and I'm using R v3.1.1. I'm not sure if this is significant or not (E.g. some syntax has changed during the version change). Thank you in advance for your insight. Here's the entire script below:

## Estimate 16S and 23S copy numbers
CurrentDir = getwd()
setwd(CurrentDir)
## Start code
args = commandArgs(trailingOnly = TRUE)
Isolate = as.character(args[1])
Lib = as.character(args[2])
ReadLen = as.numeric(args[3])
Seq = as.numeric(args[4])
## Load in gene GC% table and mapped read table
GCtab = read.table(paste("./Beds/", Isolate, ".", Lib, ".gene.gc.txt", sep = ""),
                   sep = "\t", header = FALSE, stringsAsFactors = FALSE)
colnames(GCtab) = c("Node", "Start", "End", "Gene", "pct_at", "pct_gc", "num_A",
                    "num_C", "num_G", "num_T", "num_N", "num_oth", "seq_len")
MappedReads = read.table(paste("./Beds/", Isolate, ".", Lib, ".R", Seq,
                               ".gene.mapped.reads.txt", sep = ""),
                         sep = "\t", header = FALSE, quote = "",
                         comment.char = "", stringsAsFactors = FALSE)
MappedReads = MappedReads[,(ncol(MappedReads) - 4):ncol(MappedReads)]
colnames(MappedReads) = c("Node", "Start", "End", "Gene", "Overlap")
RDPerGene = data.frame(table(MappedReads$Gene), stringsAsFactors = FALSE)
colnames(RDPerGene) = c("Gene", "Reads")
RDPerGene$Gene = as.character(RDPerGene$Gene)
## Apply GC% correction (Yoon et al, 2009)
## multiply coverage for each gene win by 
## looking up corresponding GC cov correction factor
CovCorrectDF = read.csv(paste("./CovCorrect/", Isolate, ".", Lib, ".R", Seq,
                              ".CovCorrectForGC.csv", sep = ""), header = TRUE) 
Gene_gc = numeric()
Reads_gc = numeric()
## Gene name has class, name, len, win, separated by "_"
Class = character()
Name = character()
GeneLen = numeric()
GeneWin = numeric()
for(q in 1:nrow(RDPerGene)){
  Gene_gc[q] = (round(GCtab[which(GCtab$Gene == RDPerGene$Gene[q]),]$pct_gc,
                      digits = 2))
  Reads_gc[q] = RDPerGene$Reads[q] * 
    CovCorrectDF[CovCorrectDF$GCprop == Gene_gc[q],]$CovCorrect
  String = strsplit(RDPerGene$Gene[q], "_")
  Class[q] = String[[1]][1]
  Name[q] = String[[1]][2]
  GeneLen[q] = as.numeric(String[[1]][3])
  GeneWin[q] = as.numeric(String[[1]][4])
}
RDPerGene = data.frame(RDPerGene, Gene_gc, Reads_gc, Class, Name, GeneLen, GeneWin,
                       stringsAsFactors = FALSE)
write.csv(RDPerGene, paste("./CovCorrect/", Isolate, ".", Lib, ".R", Seq,
                           ".GCorCovPerGeneWin.csv", sep = ""),
          quote = FALSE, row.names = FALSE)
## Calculate median 16S/23S coverages, take account of full and partial genes
## if multiple full length 16S or 23S, then sum coverages for each win
## if partial 16S or 23S, then just add wins to vector
Df16S = RDPerGene[which(RDPerGene$Class == "16S"),]
Big16S = data.frame()
Lit16S = data.frame()
SixteenCov = numeric()
for(r in 1:nrow(Df16S)){
  if(Df16S$GeneLen[r] > 1400){
    Big16S = rbind(Big16S, Df16S[r,])
  }else{
    Lit16S = rbind(Lit16S, Df16S[r,])
  }
}
if(nrow(Big16S) > 0){
  for(s in as.numeric(unique(Big16S$GeneWin))){
    SixteenCov = c(SixteenCov, sum(Big16S[which(Big16S$GeneWin == s),]$Reads_gc))
  }
}
if(nrow(Lit16S) > 0){
  SixteenCov = c(SixteenCov, as.numeric(Lit16S$Reads_gc))
}
Df23S = RDPerGene[which(RDPerGene$Class == "23S"),]
Big23S = data.frame()
Lit23S = data.frame()
TwentyThreeCov = numeric()
for(t in 1:nrow(Df23S)){
  if(Df23S$GeneLen[t] > 2800){
    Big23S = rbind(Big23S, Df23S[t,])
  }else{
    Lit23S = rbind(Lit23S, Df23S[t,])
  }
}
if(nrow(Big23S) > 0){
  for(u in as.numeric(unique(Big23S$GeneWin))){
    TwentyThreeCov = c(TwentyThreeCov, sum(Big23S[which(Big23S$GeneWin == u),]$Reads_gc))
  }
}
if(nrow(Lit23S) > 0){
  TwentyThreeCov = c(TwentyThreeCov, as.numeric(Lit23S$Reads_gc))
}
## Apply confidence intervals from Price + Bonett 2002
## Distribution-Free Confidence Intervals for Difference and Ratio of Medians
## Set alpha to desired confidence level
alpha = 0.05
SingCovs = RDPerGene[which(RDPerGene$Class == "Sing"),]$Reads_gc
## Calculate medians for each class (16S, 23S, SingCopyGenes)
Med16S = median(SixteenCov)
Med23S = median(TwentyThreeCov)
MedSings = median(SingCovs)

Ord16S = SixteenCov[order(SixteenCov)]
Ord23S = TwentyThreeCov[order(TwentyThreeCov)]
OrdSings = SingCovs[order(SingCovs)]
c16S = round((length(Ord16S) + 1)/2 - sqrt(length(Ord16S)), digits = 0)
c23S = round((length(Ord23S) + 1)/2 - sqrt(length(Ord23S)), digits = 0)
cSings = round((length(OrdSings) + 1)/2 - sqrt(length(OrdSings)), digits = 0)
if(length(Ord16S) < 132){
  p16S = 0
  for(i in 1:(c16S - 1)){
    p16S = p16S + choose(length(Ord16S), i)
  }
  p16S = p16S * 0.5^length(Ord16S)
  z16S = qnorm(1 - p16S)
}else{
  z16S = 2
}
if(length(Ord23S) < 132){
  p23S = 0
  for(i in 1:(c23S - 1)){
    p23S = p23S + choose(length(Ord23S), i)
  }
  p23S = p23S * 0.5^length(Ord23S)
  z23S = qnorm(1 - p23S)
}else{
  z23S = 2
}
if(length(OrdSings) < 132){
  pSings = 0
  for(i in 1:(cSings - 1)){
    pSings = pSings + choose(length(OrdSings), i)
  }
  pSings = pSings * 0.5^length(OrdSings)
  zSings = qnorm(1 - pSings)
}else{
  zSings = 2
}
N16S = ((log(Ord16S[length(Ord16S) - c16S + 1]) - 
           log(Ord16S[c16S]))/(2 * z16S))^2
N23S = ((log(Ord23S[length(Ord23S) - c23S + 1]) - 
           log(Ord23S[c23S]))/(2 * z23S))^2
NSings = ((log(OrdSings[length(OrdSings) - cSings + 1]) - 
             log(OrdSings[cSings]))/(2 * zSings))^2
zalpha = qnorm(1 - alpha/2)
Est16S = Med16S/MedSings
LowEst16S = (Med16S/MedSings) * exp(-zalpha * sqrt(N16S + NSings))
HighEst16S = (Med16S/MedSings) * exp(zalpha * sqrt(N16S + NSings))
Est23S = Med23S/MedSings
LowEst23S = (Med23S/MedSings) * exp(-zalpha * sqrt(N23S + NSings))
HighEst23S = (Med23S/MedSings) * exp(zalpha * sqrt(N23S + NSings))

Df1623S = data.frame(Isolate, Lib, Seq, Est16S, LowEst16S, HighEst16S,
                     Est23S, LowEst23S, HighEst23S,
                     stringsAsFactors = FALSE)
write.csv(Df1623S, paste("./Estimates/", Isolate, ".", Lib, ".R", Seq,
                         ".CopyNumEst.csv", sep = ""),
          quote = FALSE, row.names = FALSE)
print("Estimated 16S and 23S copy numbers with Price + Bonett CIs")
16stimator is.na copy number 16s R • 3.2k views
ADD COMMENTlink modified 9 months ago by zx87547.5k • written 9 months ago by charles.bridges70

So after a bit of investigation, here's what I've come up with:

  1. The RDPerGene variable does contain the information that it should.
  2. It is (seemingly) correctly written and formatted to a*.csv file.
  3. The following print commands showed what each variable contains:

print(RDPerGene)
                 Gene Reads Gene_gc  Reads_gc Class  Name GeneLen GeneWin
1    Sing_alaS_2615_1   118    0.38 116.03333  Sing  alaS    2615       1
2    Sing_alaS_2615_2   111    0.40 111.94872  Sing  alaS    2615       2
3    Sing_alaS_2615_3    98    0.40  98.83761  Sing  alaS    2615       3
4    Sing_alaS_2615_4   101    0.40 101.86325  Sing  alaS    2615       4
5    Sing_alaS_2615_5   101    0.41 101.86325  Sing  alaS    2615       5
6    Sing_alaS_2615_6   105    0.37 103.68201  Sing  alaS    2615       6
7    Sing_argS_1793_1   152    0.37 150.09205  Sing  argS    1793       1 .....

print(Df16S)
[1] Gene     Reads    Gene_gc  Reads_gc Class    Name     GeneLen  GeneWin 
<0 rows> (or 0-length row.names)

print(Big16S)
data frame with 0 columns and 0 rows

print(Lit16S)
data frame with 0 columns and 0 rows

Since the variable RDPerGene and it's corresponding output to *.csv file DO contain the information it's supposed to, it seems to me like there's a syntax or contextual issue with how the following commands are assigning information to the Df16S variable:

for(r in 1:nrow(Df16S)){
  if(Df16S$GeneLen[r] > 1400){
    Big16S = rbind(Big16S, Df16S[r,])
  }else{
    Lit16S = rbind(Lit16S, Df16S[r,])
  }
}

As I said in another comment, I'm very new to R and unfamiliar with syntax. Also of note, is that I'm using R v3.1.1, and the author's code was written in R v3.1.0; I'm unsure if there were any major (or minor) changes to syntax, etc.

Does anyone have any further ideas for this issue? Perhaps I'm thinking about this incorrectly?

ADD REPLYlink modified 8 months ago • written 8 months ago by charles.bridges70
1
gravatar for Jean-Karim Heriche
9 months ago by
EMBL Heidelberg, Germany
Jean-Karim Heriche19k wrote:

The error means that the evaluation of the comparison produced an NA value instead of TRUE or FALSE. Since the right hand side of the comparison is a constant, it means that Df16S$GeneLen[r] is most likely not what you think it is. Check what Df16S contains.

ADD COMMENTlink written 9 months ago by Jean-Karim Heriche19k

Hi Jean-Karim Heriche,

Thanks for your reply. I'm not well versed in R, how would I check to see what Df16S contains?

Charles

ADD REPLYlink written 9 months ago by charles.bridges70

If you're using Rstudio, you could just click on Df16S in the list of objects in your environment or type View(Df16S) in the console. In an R terminal, you just type Df16S and the top 100 or so rows would be displayed. A common debugging technique is to output the variable value just before the line that produces the error.

ADD REPLYlink written 9 months ago by Jean-Karim Heriche19k
1
gravatar for zx8754
9 months ago by
zx87547.5k
London
zx87547.5k wrote:

As per error, it is related to this line:

if(Df16S$GeneLen[r] > 1400){

To reproduce the error with simple example, try:

x <- NA
if(x > 1400){"x is more"} else {"x is less"}
# Error in if (x > 1400) { : missing value where TRUE/FALSE needed

We get above error, because x > 1400 gives NA, and if statement expects TRUE/FALSE value. Look into is.na() to prevent this error:

ifis.na(x)){"x is NA"} else if(x > 1400){"x is more"} else {"x is less"}
# [1] "x is NA"

# now test with non NA value
x <- 1000
ifis.na(x)){"x is NA"} else if(x > 1400){"x is more"} else {"x is less"}
# [1] "x is less"
ADD COMMENTlink written 9 months ago by zx87547.5k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 680 users visited in the last hour