# Power Analysis Through Simulation

So, you’re designing an experiment and you’re faced with answering the
age-old question: *How many participants do I need for this experiment
to work?* Probably, your advisor sent you down a barren path of finding
a power analysis tool that will work for your particular design,
hopefully at least with the good intention of hopping on the
open/reproducible science bandwagon. But, if you’ve looked, you’ll have
noticed that most of those sorts of packages aren’t exactly *easy* to
use, and worse is that they don’t work for all sorts of experimental
designs.

Well, if this sounds like you, I’ve got some good news and some bad news. The good news is that, in principle, there is a fantastic way to determine power for pretty much any experimental design and statistical analysis you might be interested in. The bad news is that you will have to work, and more importantly think, to get the results you want.

# The main idea

The reason that it can be difficult to precisely calculate power for certain experiments is that some poor schmuck has to mathematically derive a formula for each experimental design. There might be some overlap between similar cases, but in many cases doing the math right is just really hard. So, more recently, the simulation approach to power analysis has become more and more popular, since it is relatively simple and easy to implement, and provides results that are quantitatively just as good. The only downside is that, as you’ll see, running the simulations can sometimes take a really long time.

The simulation-based approach has 5 main steps (the whole recipe is sketched in miniature just below):

1. Decide on the a-priori effect size of interest
2. Simulate some data assuming that effect size
3. Analyze the simulated data and check whether the effect is significant
4. Repeat steps 2-3 until you’re confident enough in the power estimate
5. Repeat steps 2-4 for a range of possible sample sizes

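To make those steps concrete before we build them up piece by piece, here’s a minimal base-R sketch of the whole recipe for a one-sample t-test (the effect size of `-.4`, standard deviation of `.3`, and the particular sample sizes are just placeholder assumptions):

```r
# A miniature version of the recipe above. Step 1: assume an effect size
# of -.4 and a standard deviation of .3 (placeholder values).
n_sims <- 1000

power_for_n <- function(N) {
  p_values <- replicate(n_sims, {
    judgments <- rnorm(N, mean=-.4, sd=.3)  # step 2: simulate a dataset
    t.test(judgments)$p.value               # step 3: analyze it
  })
  mean(p_values < .05)                      # step 4: proportion significant across repeats
}

sapply(c(5, 10, 20, 40), power_for_n)       # step 5: repeat for several sample sizes
```

The rest of this post does exactly this, just with tidyverse tools that keep the simulated datasets around and scale more gracefully to complicated designs.
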
# A simple example

To start, let’s use one of the simplest designs possible: a one-sample
t-test, which tests whether the mean of some value is equal to some
reference value like `0`. Let’s say you’re studying moral judgments on a
scale from `-1` to `1`, where `-1` indicates that people think an action
was immoral and `1` indicates that they think it is morally praiseworthy
(and `0` is neutral). We want to know whether people think that forcing
your friends to sit through Minions: The Rise of Gru is morally wrong.

The first thing we need to do is decide on our effect size. In this
case, let’s assume that people think that it is morally wrong (I mean,
come on), with a true value of `-.4`. We also expect people to vary, say
with a standard deviation of `.3`.

Once we’ve decided, we can simulate a dataset with these “true” values:

```r
library(tidyverse)
library(emmeans)
library(viridis)
library(gganimate)
library(mvtnorm)
library(lme4)
library(lmerTest)

d <- tibble(participant=1:10) %>%
    mutate(moral_judgment=rnorm(n(), mean=-.4, sd=.3))
d
```

```
## # A tibble: 10 × 2
## participant moral_judgment
## <int> <dbl>
## 1 1 -0.412
## 2 2 -1.01
## 3 3 -0.234
## 4 4 -0.478
## 5 5 -0.562
## 6 6 -0.538
## 7 7 -0.476
## 8 8 -0.643
## 9 9 -0.237
## 10 10 0.141
```

As you can see, we’ve created a tibble (a dataframe) with 10
participants, whose moral judgments are normally distributed around
`-.4` with a standard deviation of `.3`. In this case, we could’ve just
used the `rnorm` function to get a vector of judgments, but in the
future it’ll be easier to keep things in a tibble. The next thing we
want to do is run our test of interest:

```r
t <- t.test(d$moral_judgment)
t
```

```
##
## One Sample t-test
##
## data: d$moral_judgment
## t = -4.6858, df = 9, p-value = 0.001143
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
## -0.6589673 -0.2298638
## sample estimates:
## mean of x
## -0.4444155
```

This gives us the mean moral judgment, a t-value, the degrees of
freedom, confidence intervals, and a p-value. That’s a lot of
information, but in power analysis people are typically interested in
just the p-value. We can test for “significance” by comparing the
p-value to a cutoff like `.05`:

```r
t$p.value < .05
```

```
## [1] TRUE
```

And look at that, our result was significant! Sadly, this alone doesn’t
really tell us anything, since our actual experiment could turn out
completely different due to sampling variation. So we need to be able to
quantify our uncertainty about power *across possible replications of
the experiment*. The solution is to repeat this process a large-ish
number of times (step 4). To do this, I’m going to make use of
`tidyverse`’s list-columns, which allow you to store dataframes or lists
inside of a column of an outer data-frame. So, let’s first make our
dataframe a hundred times larger than it already is:

```r
d <- expand_grid(simulation=1:100, participant=1:10) %>%
    mutate(moral_judgment=rnorm(n=n(), mean=-.2, sd=.3))
d
```

```
## # A tibble: 1,000 × 3
## simulation participant moral_judgment
## <int> <int> <dbl>
## 1 1 1 -0.458
## 2 1 2 -0.650
## 3 1 3 -0.478
## 4 1 4 0.356
## 5 1 5 0.203
## 6 1 6 -0.0851
## 7 1 7 -0.103
## 8 1 8 -0.247
## 9 1 9 -0.0446
## 10 1 10 0.00239
## # … with 990 more rows
```

You can see that we have `100` simulations of our `10`-participant
dataset all wrapped up in the same dataframe. We can make it tidier
using the function `nest`, which wraps up the dataset for each
simulation:

```r
d <- d %>% group_by(simulation) %>% nest()
d
```

```
## # A tibble: 100 × 2
## # Groups: simulation [100]
## simulation data
## <int> <list>
## 1 1 <tibble [10 × 2]>
## 2 2 <tibble [10 × 2]>
## 3 3 <tibble [10 × 2]>
## 4 4 <tibble [10 × 2]>
## 5 5 <tibble [10 × 2]>
## 6 6 <tibble [10 × 2]>
## 7 7 <tibble [10 × 2]>
## 8 8 <tibble [10 × 2]>
## 9 9 <tibble [10 × 2]>
## 10 10 <tibble [10 × 2]>
## # … with 90 more rows
```

As you can see, now we have one row per simulation, and a new `data`
column that contains the dataset for each simulation. We can confirm
that the datasets look like our old ones by picking one out (here I’m
using R’s double brackets, `d$data[[1]]`, rather than single brackets,
`d$data[1]`, since the `data` column is a list-column):

```r
d$data[[1]]
```

```
## # A tibble: 10 × 2
## participant moral_judgment
## <int> <dbl>
## 1 1 -0.458
## 2 2 -0.650
## 3 3 -0.478
## 4 4 0.356
## 5 5 0.203
## 6 6 -0.0851
## 7 7 -0.103
## 8 8 -0.247
## 9 9 -0.0446
## 10 10 0.00239
```

Cool, so now we have 100 datasets. As before, we want to calculate a
t-test for each possible dataset. To do that, we need to make use of the
`map` function:

```r
d <- d %>% mutate(t.test=map(data, ~ t.test(.$moral_judgment)))
d
```

```
## # A tibble: 100 × 3
## # Groups: simulation [100]
## simulation data t.test
## <int> <list> <list>
## 1 1 <tibble [10 × 2]> <htest>
## 2 2 <tibble [10 × 2]> <htest>
## 3 3 <tibble [10 × 2]> <htest>
## 4 4 <tibble [10 × 2]> <htest>
## 5 5 <tibble [10 × 2]> <htest>
## 6 6 <tibble [10 × 2]> <htest>
## 7 7 <tibble [10 × 2]> <htest>
## 8 8 <tibble [10 × 2]> <htest>
## 9 9 <tibble [10 × 2]> <htest>
## 10 10 <tibble [10 × 2]> <htest>
## # … with 90 more rows
```

The `map` function is pretty sweet if you ask me: it takes a list-column
(like `d$data`) and a function, and applies that function to each
element of the list-column. In our case, we used a formula syntax to
define something called an *anonymous function*, which is just like any
old function except you don’t care about it enough to give it a name.
Notably, the formula syntax is probably slightly different than you
might’ve seen before in R. The tilde (`~`) tells R that we’re defining a
function. Everything to the right of the tilde is what the function
computes: in our case, we’re just running a t-test. Finally, the period
(`.`) is a placeholder for whatever dataset is getting passed to
`t.test`.

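If the formula syntax is new to you, here’s a tiny toy example (separate from our power analysis) showing the same pattern on a plain list:

```r
library(purrr)  # already loaded as part of the tidyverse

# The tilde defines an anonymous function, and the period stands in for
# whatever element of the list is currently being processed.
map(list(1:3, 4:6, 7:9), ~ mean(.))  # returns a list containing 2, 5, and 8
```
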
If we want to extract particular information from the analyses (like the
p-values), we can use `map` again:

```r
d <- d %>% mutate(p.value=map_dbl(t.test, ~ .$p.value),
                  significant=p.value < .05)
d
```

```
## # A tibble: 100 × 5
## # Groups: simulation [100]
## simulation data t.test p.value significant
## <int> <list> <list> <dbl> <lgl>
## 1 1 <tibble [10 × 2]> <htest> 0.163 FALSE
## 2 2 <tibble [10 × 2]> <htest> 0.0376 TRUE
## 3 3 <tibble [10 × 2]> <htest> 0.417 FALSE
## 4 4 <tibble [10 × 2]> <htest> 0.210 FALSE
## 5 5 <tibble [10 × 2]> <htest> 0.0102 TRUE
## 6 6 <tibble [10 × 2]> <htest> 0.0405 TRUE
## 7 7 <tibble [10 × 2]> <htest> 0.0379 TRUE
## 8 8 <tibble [10 × 2]> <htest> 0.0562 FALSE
## 9 9 <tibble [10 × 2]> <htest> 0.00281 TRUE
## 10 10 <tibble [10 × 2]> <htest> 0.00431 TRUE
## # … with 90 more rows
```

Here I’ve used `map_dbl` instead of regular-old `map` to ensure that our
`p.value` column is a regular number column, not a list column.

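To see the difference with a toy example (again separate from our simulated data):

```r
map(1:3, ~ .^2)      # a list containing 1, 4, and 9
map_dbl(1:3, ~ .^2)  # a plain numeric vector: 1 4 9
```
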
From here we have everything we need to calculate power! To make sure we get a power estimate with good confidence intervals, I’m going to use a logistic regression to predict the rate of significance (given we know the effect is real):

```r
power <- glm(significant ~ 1, data=d, family='binomial')
summary(power)
```

```
##
## Call:
## glm(formula = significant ~ 1, family = "binomial", data = d)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.093 -1.093 -1.093 1.264 1.264
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.2007 0.2010 -0.998 0.318
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 137.63 on 99 degrees of freedom
## Residual deviance: 137.63 on 99 degrees of freedom
## AIC: 139.63
##
## Number of Fisher Scoring iterations: 3
```

Sadly this estimate isn’t super helpful since our intercept is on the
scale of log-odds (though we can back-transform it by hand, as sketched
below). Also, don’t get worked up about whether the intercept is
“significant”, since this is just testing against a power of .5 (which
is not what we want).

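The inverse-logit (`plogis`) of the intercept is just the raw proportion of significant simulations (the specific numbers here come from my simulated data; yours will differ slightly):

```r
plogis(coef(power)[1])  # back-transform the log-odds intercept: plogis(-0.2007) is about 0.45
mean(d$significant)     # which is just the raw proportion of significant simulations
```
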
To get more useable estimates and confidence intervals, we can use
`emmeans`:

```r
emmeans(power, ~1, type='response')
```

```
## 1 prob SE df asymp.LCL asymp.UCL
## overall 0.45 0.0497 Inf 0.356 0.548
##
## Confidence level used: 0.95
## Intervals are back-transformed from the logit scale
```

This tells us that for a dataset of size 10, we have a pretty poor power of 45% [35.6%, 54.8%].

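Since a one-sample t-test is simple enough to have a closed-form power formula, we can also sanity-check the simulation against the built-in `power.t.test` function, plugging in the mean of `-.2` and standard deviation of `.3` used in the simulation code above; the two approaches should agree closely:

```r
# Analytic power for a one-sample t-test with n = 10; this should land
# close to the simulated estimate of ~45%.
power.t.test(n=10, delta=.2, sd=.3, sig.level=.05, type='one.sample')

# We can also solve for the sample size needed for 80% power, which
# foreshadows where the simulated power curve below should cross the
# dashed line.
power.t.test(delta=.2, sd=.3, sig.level=.05, power=.8, type='one.sample')
```
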
## So… about that sample size…

We’re so close to being able to answer our question about sample size, but somehow we still don’t know how many participants to run. To do that, we need to repeat the entire process above for a range of sample sizes, until we find one that gives us the power we need. Thankfully, since we went through the effort of using nested data-frames, it’s pretty easy to extend what we already have:

```r
d.t2 <- expand_grid(N=2:40, simulation=1:100) %>%
    group_by(N, simulation) %>%
    mutate(data=list(tibble(participant=1:N, moral_judgment=rnorm(n=N, mean=-.2, sd=.3))),
           t.test=map(data, ~ t.test(.$moral_judgment)),
           p.value=map_dbl(t.test, ~ .$p.value),
           significant=p.value < .05)
d.t2
```

```
## # A tibble: 3,900 × 6
## # Groups: N, simulation [3,900]
## N simulation data t.test p.value significant
## <int> <int> <list> <list> <dbl> <lgl>
## 1 2 1 <tibble [2 × 2]> <htest> 0.113 FALSE
## 2 2 2 <tibble [2 × 2]> <htest> 0.656 FALSE
## 3 2 3 <tibble [2 × 2]> <htest> 0.879 FALSE
## 4 2 4 <tibble [2 × 2]> <htest> 0.150 FALSE
## 5 2 5 <tibble [2 × 2]> <htest> 0.333 FALSE
## 6 2 6 <tibble [2 × 2]> <htest> 0.468 FALSE
## 7 2 7 <tibble [2 × 2]> <htest> 0.0928 FALSE
## 8 2 8 <tibble [2 × 2]> <htest> 0.654 FALSE
## 9 2 9 <tibble [2 × 2]> <htest> 0.228 FALSE
## 10 2 10 <tibble [2 × 2]> <htest> 0.451 FALSE
## # … with 3,890 more rows
```

As before, we have a single dataset per row, except now the datasets vary by size. To generate a power curve, we can predict power as a function of sample size (treating sample size as a factor variable to estimate the power for each size independently):

```r
power.t2 <- glm(significant ~ factor(N), data=d.t2, family='binomial')
emmeans(power.t2, ~ N, type='response') %>%
    as.data.frame %>%
    mutate(asymp.LCL=ifelse(asymp.LCL<1e-10, 1, asymp.LCL)) %>% ## remove unnecessarily large error bars
    ggplot(aes(x=N, y=prob, ymin=asymp.LCL, ymax=asymp.UCL)) +
    geom_hline(yintercept=.8, linetype='dashed') +
    geom_ribbon(fill='red', alpha=.25) + geom_line(color='red', size=1) +
    xlab('Sample Size') + ylab('Power') + theme_classic()
```

If we’re interested in a power of 80% (i.e., an 80% probability of detecting a significant result), then we will need at least 20 or so participants in our experiment (indicated by the dashed line). Of course these results are fairly noisy, but we can always get tighter confidence intervals by running more simulations:

```r
d.t3 <- expand_grid(N=2:40, simulation=1:1000) %>%
    group_by(N, simulation) %>%
    mutate(data=list(tibble(participant=1:N, moral_judgment=rnorm(n=N, mean=-.2, sd=.3))),
           t.test=map(data, ~ t.test(.$moral_judgment)),
           significant=map_lgl(t.test, ~ .$p.value < .05))
power.t3 <- glm(significant ~ factor(N), data=d.t3, family='binomial')
emmeans(power.t3, ~ N, type='response') %>%
    as.data.frame %>%
    ggplot(aes(x=N, y=prob, ymin=asymp.LCL, ymax=asymp.UCL)) +
    geom_hline(yintercept=.8, linetype='dashed') +
    geom_ribbon(fill='red', alpha=.25) + geom_line(color='red', size=1) +
    xlab('Sample Size') + ylab('Power') + theme_classic()
```

This should take a bit longer, but we can see the results are definitely less noisy. If you’re interested, you can also look at other quantities given by the t-tests, like the p-values themselves. Here we can look at how sample size (and hence power) shapes the distribution of p-values:

```r
anim <- d.t3 %>%
    mutate(p.value=map_dbl(t.test, ~ .$p.value)) %>%  # pull the p-values back out of the stored t-tests
    ggplot(aes(x=p.value)) +
    geom_histogram(binwidth=.02) +
    theme_classic() +
    transition_time(N) +
    ggtitle('Number of Participants: {frame_time}')
animate(anim, height=400, width=800)
```

This confirms what we saw in the plot above: as the sample size increases, the p-values concentrate closer and closer to zero.

## What if I *don’t* know the size of my effect?

The above simulation is all well and good, but it assumes that you
*already know* how large your effect should be. If you’re working on a
new effect, you’re pretty much SOL. One thing we can try is to vary the
true effect size and construct a power curve for each:

```r
d.t4 <- expand_grid(N=2:40, simulation=1:1000, effect_size=seq(0, 1.5, .25)) %>%
    group_by(N, simulation, effect_size) %>%
    mutate(data=list(tibble(participant=1:N, moral_judgment=rnorm(n=N, mean=effect_size, sd=1))),
           t.test=map(data, ~ t.test(.$moral_judgment)),
           significant=map_lgl(t.test, ~ .$p.value < .05))
power.t4 <- glm(significant ~ factor(N) * factor(effect_size), data=d.t4, family='binomial')
emmeans(power.t4, ~ N * effect_size, type='response') %>%
    as.data.frame %>%
    mutate(asymp.LCL=ifelse(asymp.LCL<1e-10, 1, asymp.LCL)) %>% ## remove unnecessarily large error bars
    ggplot(aes(x=N, y=prob, ymin=asymp.LCL, ymax=asymp.UCL, group=effect_size)) +
    geom_hline(yintercept=.8, linetype='dashed') +
    geom_ribbon(aes(fill=factor(effect_size)), alpha=.25) +
    geom_line(aes(color=factor(effect_size)), size=1) +
    scale_color_viridis(name='Effect Size', discrete=TRUE) +
    scale_fill_viridis(name='Effect Size', discrete=TRUE) +
    xlab('Sample Size') + ylab('Power') + theme_classic()
```

Here I’m using a standardized effect size—the mean divided by the standard deviation—so that I don’t need to vary both the mean and the standard deviation. We can see that if we have a large effect, we don’t need many datapoints to reach 80% power. With an effect smaller than .5, though, we quickly start to need a lot more data points for the same amount of power.

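For reference, converting the raw values from the earlier t-test simulations onto this standardized scale is just one division (values taken from the simulation code above):

```r
# Standardized effect size implied by the earlier settings (mean -.2, sd .3):
# roughly 0.67 in absolute value.
abs(-.2 / .3)
```
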
# More Complex Designs

The main advantage of the simulation-based approach to power analysis is that once we know the basic process, it’s really easy to do the same thing for all kinds of experimental designs and statistical analyses.

## Correlational designs

For example, we can do a power analysis for correlational designs by tweaking our code slightly:

```r
r <- .4
d.corr <- expand_grid(N=seq(5, 75, 5), simulation=1:100) %>%
    group_by(N, simulation) %>%
    mutate(data=rmvnorm(N, mean=c(0, 0), sigma=matrix(c(1, r, r, 1), ncol=2)) %>%
               as.data.frame() %>%
               list(),
           cor.test=map(data, ~ cor.test(.$V1, .$V2)),
           significant=map_lgl(cor.test, ~ .$p.value < .05))
power.corr <- glm(significant ~ factor(N), data=d.corr, family='binomial')
emmeans(power.corr, ~ N, type='response') %>%
    as.data.frame %>%
    ggplot(aes(x=N, y=prob, ymin=asymp.LCL, ymax=asymp.UCL)) +
    geom_hline(yintercept=.8, linetype='dashed') +
    geom_ribbon(fill='red', alpha=.25) + geom_line(color='red', size=1) +
    xlab('Sample Size') + ylab('Power') + theme_classic()
```

In the above code I’m sampling two variables from a multivariate normal
distribution with zero means, unit standard deviations, and a
correlation of `.4`. Our simulation here says that we need a sample
size of at least 50 or so to get 80% power.

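This simple case also has an analytic answer to compare against. For instance, the `pwr` package (an extra dependency, not used elsewhere in this post) can solve for the required sample size directly, and should land in the same ballpark as the simulation:

```r
# Closed-form check: the n needed for 80% power to detect a correlation of .4
library(pwr)
pwr.r.test(r=.4, power=.8, sig.level=.05)
```
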
## Repeated measures

If we have repeated measures, we just need to make sure that we simulate our data so that each participant actually has repeated measures. One way is to assume (a la mixed effects models) that people’s individual means are normally distributed around the grand mean, and their responses are normally distributed around that. Let’s try a two-condition repeated measures design with a control and an experimental condition:

```r
effect_size <- 0.4
participant_sd <- 0.5
d.lmer <- expand_grid(N_participants=c(5, 10, 25, 50), N_trials=c(5, 10, 25, 50), simulation=1:100) %>%
    group_by(N_participants, N_trials, simulation) %>%
    mutate(data=expand_grid(participant=1:N_participants, trial=1:N_trials) %>%
               group_by(participant) %>%
               mutate(participant_intercept=rnorm(1, 0, participant_sd)) %>%  # one intercept per participant
               ungroup() %>%
               mutate(x=ifelse(trial <= N_trials/2, 'Control', 'Manipulation'),
                      y=rnorm(n(), ifelse(x=='Control', participant_intercept,
                                          effect_size+participant_intercept), 1)) %>%
               list(),
           lmer=map(data, ~ lmer(y ~ x + (1|participant), data=.)),
           significant=map_lgl(lmer, ~ coef(summary(.))[2,5] < .05))
power.lmer <- glm(significant ~ factor(N_participants) * factor(N_trials), data=d.lmer, family='binomial')
emmeans(power.lmer, ~ N_participants*N_trials, type='response') %>%
    as.data.frame %>%
    mutate(asymp.LCL=ifelse(asymp.LCL<1e-10, 1, asymp.LCL)) %>% ## remove unnecessarily large error bars
    ggplot(aes(x=N_participants, y=prob, ymin=asymp.LCL, ymax=asymp.UCL, group=N_trials)) +
    geom_hline(yintercept=.8, linetype='dashed') +
    geom_ribbon(aes(fill=factor(N_trials)), alpha=.25) + geom_line(aes(color=factor(N_trials)), size=1) +
    scale_color_viridis(name='Number of Trials\nPer Participant', discrete=TRUE) +
    scale_fill_viridis(name='Number of Trials\nPer Participant', discrete=TRUE) +
    xlab('Sample Size') + ylab('Power') + theme_classic()
```

As you can tell, simulating within-participant data is slightly
trickier, since we needed to first randomly sample participant-level
intercepts before sampling their responses using those means. However,
the process is the same overall, where we create a grid of modeling
conditions (here I’m looking at number of participants and number of
trials per participant), create a fake dataset within each one of those
conditions, look for a significant difference, and finally use a
logistic regression to figure out how much power we have in each case.
The plot above tells us that for an effect size of `.4` with a
participant-level standard deviation of `.5`, the number of participants
we need depends on how many trials each participant completes. In this
case, we need 50 participants if each is completing only 5 trials, but
only 5 participants if each is completing 50 trials. Neat!

# Wrapping up

It might seem like we haven’t covered a *ton* today, but the beauty of
the simulation-based approach to power analysis is that if you can
simulate data for your experiment, then you can calculate power in
exactly the same way we’ve been doing! For many (most) design changes
(e.g. using categorical vs continuous variables, changing distribution
families or link functions, adding covariates), the process is exactly
the same. But as your design gets more complex, it will take more effort
to simulate, and generally you will have to make more assumptions to
calculate power. For instance, in the repeated measures case, we have to
make an assumption about how the participant-level means are
distributed, in addition to the overall effect size. If we wanted to add
random slopes, we would also need to make assumptions about the
distribution of those random slopes, as well as the correlation between
the random slopes and intercepts. But the basic process is always the
same.

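To make that last point concrete, here is one way you might extend the data-simulation step to include correlated random intercepts and slopes, reusing `rmvnorm` from the `mvtnorm` package we loaded earlier. The particular standard deviations and correlation are made-up assumptions that you would need to choose for your own design:

```r
# Hypothetical assumptions about the random effects
int_sd <- 0.5          # SD of participant intercepts
slope_sd <- 0.3        # SD of participant slopes (variability in the effect)
int_slope_cor <- -0.2  # correlation between intercepts and slopes
effect_size <- 0.4
N_participants <- 25
N_trials <- 10

# One correlated (intercept, slope) pair per participant
vcov_ranef <- matrix(c(int_sd^2, int_slope_cor*int_sd*slope_sd,
                       int_slope_cor*int_sd*slope_sd, slope_sd^2), ncol=2)
participant_effects <- rmvnorm(N_participants, mean=c(0, 0), sigma=vcov_ranef)

d.slopes <- expand_grid(participant=1:N_participants, trial=1:N_trials) %>%
    mutate(x=ifelse(trial <= N_trials/2, 0, 1),  # 0 = control, 1 = manipulation
           intercept=participant_effects[participant, 1],
           slope=participant_effects[participant, 2],
           y=rnorm(n(), mean=intercept + (effect_size + slope)*x, sd=1))
```

A dataset like this could then be dropped into the same nested-simulation machinery as above, with the model upgraded to `lmer(y ~ x + (1 + x | participant), data=.)`.
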
However, there’s always more to learn! For one, you’ve probably noticed
that these simulations can quickly take up some CPU time, especially if
you’re varying many different simulation parameters or using really big
datasets. In this case, it is useful to run simulations in parallel.
Thankfully, since this problem is what’s known as *embarrassingly
parallel* (i.e., none of the individual simulations depend on each
other), it is very easy to speed up; one possible approach is sketched
below.

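Here’s a rough sketch of what that could look like using the `furrr` package (one option among several; it isn’t used elsewhere in this post), which provides parallel, drop-in versions of `purrr`’s map functions:

```r
# A hypothetical parallel version of the earlier t-test simulation,
# assuming furrr is installed. future_map works like map but spreads
# the work across multiple worker processes.
library(furrr)
plan(multisession, workers=4)

d.parallel <- expand_grid(N=2:40, simulation=1:1000) %>%
    mutate(t.test=future_map(N, ~ t.test(rnorm(n=., mean=-.2, sd=.3)),
                             .options=furrr_options(seed=TRUE)),  # parallel-safe random seeds
           significant=map_lgl(t.test, ~ .$p.value < .05))
```
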
Another extension is to look at power analyses for Bayesian statistics,
where we work with distributions of true values instead of fixed true
values (Solomon Kurz has a wonderful blog post series on the topic).
Hopefully we’ll get to cover these topics in future meetings!