Question: Difference in means between sexes over time
selplat2110 wrote:

Hello,

I have phenotype data for males and females across time. I began by binning my data in 20-year increments.

```r
Data$cuts <- cut(Data$year, breaks = c(seq(min(Data$year), max(Data$year), 20), max(Data$year)), labels = FALSE)
```

This assigns every individual in my dataset a bin number from 1 to 8.
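As a side note on the binning step, here is a minimal sketch with made-up years (not the poster's data). By default `cut()` builds intervals that are open on the left, so the earliest year gets `NA` unless `include.lowest = TRUE` is set. Wrapping the breaks in `unique()` is an extra precaution assumed here: it avoids a "breaks are not unique" error when the year range happens to be an exact multiple of 20.

```r
# Toy years, made up for illustration
year <- c(1900, 1905, 1920, 1921, 1965)
breaks <- unique(c(seq(min(year), max(year), 20), max(year)))

# Default intervals are (lower, upper], so the minimum year falls outside bin 1
cuts_default <- cut(year, breaks = breaks, labels = FALSE)

# include.lowest = TRUE closes the first interval on the left
cuts_closed <- cut(year, breaks = breaks, labels = FALSE, include.lowest = TRUE)

print(cuts_default)  # NA 1 1 2 4
print(cuts_closed)   # 1 1 1 2 4
```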

I am then trying to compute the difference in mean trait value between males and females for each time bin.

```r
for (i in 1:8) {
  difmean <- c()
  Mcuts <- DataM[ which(DataM$cuts=='i'),]
  Fcuts <- DataF[ which(DataF$cuts=='i'),]
  Mmean <- mean(Mcuts$trait, na.rm = TRUE)
  Fmean <- mean(Fcuts$trait, na.rm = TRUE)
  difmean <- c(Mmean-Fmean)
  print (difmean)
}
```

I get an output of the following:

 NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN

Any help would be greatly appreciated!

R • 110 views
modified 11 days ago • written 12 days ago by selplat2110

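Got it: you wrote `'i'` (a string) instead of `i` (the loop variable) in `DataM$cuts=='i'`. `cuts` never equals the string `'i'`, so every subset is empty and the mean of an empty vector is `NaN`.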
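To make that concrete, here is a small sketch on made-up data (the values and the three-column layout are assumptions, not the poster's dataset). It reproduces the `NaN` and then shows one loop-free way to get the male minus female difference per bin with `tapply()`.

```r
# Made-up data for illustration
Data <- data.frame(
  sex   = c("M", "F", "M", "F", "M", "F"),
  cuts  = c(1, 1, 1, 2, 2, 2),
  trait = c(10, 8, 12, 7, 11, 9)
)

# Comparing against the string 'i' never matches, so the subset is empty
empty <- Data[Data$cuts == 'i', ]
nrow(empty)                      # 0
mean(empty$trait, na.rm = TRUE)  # NaN

# Loop-free alternative: mean trait per bin and sex, then M minus F
means <- tapply(Data$trait, list(Data$cuts, Data$sex), mean, na.rm = TRUE)
difmean <- means[, "M"] - means[, "F"]
print(difmean)
```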

Thank you!! It is working now, much appreciated.

Is there a way to assess significance of a linear model with binned data? I pasted some code below that generates the regression line, but I don't get p-values from the summary. Maybe I need to bootstrap and just look at confidence intervals?


I think you should start a new thread for that question

Do `Data` and `DataM` and `DataF` have the same number of rows? Is `trait` a column in `DataM` and `DataF`?

`DataM` and `DataF` have different numbers of rows but the same columns. `trait` is a column in both datasets.

DataM and DataF were generated like so:

```r
DataM <- Data[which(Data$sex=="M"),]
DataF <- Data[which(Data$sex=="F"),]
```

Side note: Why use `which()` when just specifying `DataM <- Data[Data$sex=="M",]` would work just fine?

You're right, it was just how I left it during processing.
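For reference, the two forms give the same result when `sex` has no missing values, but they differ if it contains `NA`. A minimal sketch with made-up data (whether the real `sex` column has any `NA` is not stated in the thread):

```r
# Made-up data with one missing sex value
Data <- data.frame(sex = c("M", NA, "F", "M"), trait = c(1.2, 3.4, 2.2, 0.9))

# Logical indexing keeps a placeholder row of NAs where the comparison is NA
by_logical <- Data[Data$sex == "M", ]

# which() drops NA comparisons and returns only the true matches
by_which <- Data[which(Data$sex == "M"), ]

nrow(by_logical)  # 3
nrow(by_which)    # 2
```

Since the means are computed with `na.rm = TRUE`, the extra all-`NA` row would not change the means here, but it would inflate any sample size taken from the row count.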

selplat2110 wrote:

Update,

I was able to loop through and provide a mean difference, sample size for each sex, and total sample size.

```r
Data$cuts <- cut(Data$year, breaks = c(seq(min(Data$year), max(Data$year), 20), max(Data$year)), labels = FALSE)

DataM <- Data[Data$sex=="M",]
DataF <- Data[Data$sex=="F",]

mean.df <- as.data.frame(c())

for (i in 2:8) {
  Mcuts <- DataM[which(DataM$cuts==i),]
  Fcuts <- DataF[which(DataF$cuts==i),]
  Mmean <- mean(Mcuts$trait, na.rm = TRUE)
  Fmean <- mean(Fcuts$trait, na.rm = TRUE)
  mean.df[i, "bin"] <- paste(i)
  mean.df[i, "mean_dif"] <- paste(Mmean-Fmean)
  # sample sizes: males, females, total (labels were swapped originally)
  mean.df[i, "ss_m"] <- paste(length(Mcuts$cuts))
  mean.df[i, "ss_f"] <- paste(length(Fcuts$cuts))
  mean.df[i, "ss_t"] <- paste(sum(length(Fcuts$cuts),length(Mcuts$cuts)))
}

lm1 <- lm(mean_dif ~ bin, data=mean.df)
plot(mean.df$bin, mean.df$mean_dif)
abline(lm1)
summary(lm1)
```

Unfortunately, `summary(lm1)` does not report p-values, which I assume is because the data are binned. Is there a way to assess the significance of the above trendline with binned data while accounting for the different sample sizes of the bins?
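One possible explanation, judging only from the code shown and not confirmed in the thread: `paste()` stores every column as character, so `lm()` treats `bin` as a factor with one row per level. That model is saturated, with zero residual degrees of freedom, so no standard errors or p-values can be computed. The sketch below keeps the columns numeric and passes the total bin size to the `weights` argument of `lm()`. The data are simulated stand-ins, and weighting by `ss_t` is just one simple choice; an inverse-variance style weight such as `ss_m * ss_f / ss_t` may be more appropriate.

```r
set.seed(1)
# Simulated stand-in for the real data (hypothetical values)
Data <- data.frame(
  year = rep(1850:2009, times = 2),
  sex  = rep(c("M", "F"), each = 160)
)
Data$trait <- rnorm(nrow(Data), mean = ifelse(Data$sex == "M", 10, 9))

breaks <- unique(c(seq(min(Data$year), max(Data$year), 20), max(Data$year)))
Data$cuts <- cut(Data$year, breaks = breaks, labels = FALSE, include.lowest = TRUE)

# One row per bin, with numeric columns throughout
bins <- sort(unique(Data$cuts))
mean.df <- data.frame(bin = bins, mean_dif = NA_real_,
                      ss_m = NA_integer_, ss_f = NA_integer_)

for (k in seq_along(bins)) {
  Mcuts <- Data[Data$sex == "M" & Data$cuts == bins[k], ]
  Fcuts <- Data[Data$sex == "F" & Data$cuts == bins[k], ]
  mean.df$mean_dif[k] <- mean(Mcuts$trait, na.rm = TRUE) - mean(Fcuts$trait, na.rm = TRUE)
  mean.df$ss_m[k] <- nrow(Mcuts)
  mean.df$ss_f[k] <- nrow(Fcuts)
}
mean.df$ss_t <- mean.df$ss_m + mean.df$ss_f

# Weighted regression of the mean difference on bin number
lm1 <- lm(mean_dif ~ bin, data = mean.df, weights = ss_t)
coef(summary(lm1))  # includes a Pr(>|t|) column
```

With eight numeric bins the fit has six residual degrees of freedom, so `summary()` can report a p-value for the slope.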