Kevin O'Neill
Kevin O'Neill PhD Candidate in Cognitive Neuroscience at Duke University working with Dr. Felipe De Brigard and Dr. John Pearson on causal reasoning.

Interpreting Regression Coefficients

Interpreting Regression Coefficients

Have you ever ran a regression and wondered where the coefficients come from or what they mean? Or perhaps you’ve tried the same analysis with different coding schemes, and the effect was significant one way but not the other way and you just couldn’t put together why. Maybe you’re new to regression and you just want to know what that’s all about. Or maybe you’ve recently came across my intro to Bayesian regression tutorial and you’re trying to figure out how to specify priors for regression coefficients. In any case, this tutorial is for you!

Table of Contents

Setup

First, we’ll just get everything set up. We need to tweak some settings, load packages, and simulate some data. This time around, let’s simulate what some data might look like for a version of the classic Stroop task. If you’re not familiar with the Stroop task, the idea is that you want to say the color of some word on the screen, regardless of what the text actually says. Typically, people find it easier to do this when the text and the colors are the same (e.g., “blue” in blue text) compared to when the text and the colors are different (e.g., “red” in blue text). So, people are usually faster to respond, as well as less likely to make mistakes, when the color and text match. In this version of our task, let’s imagine that we also manipulated the saturation of the colors to see if that has an effect as well.

To start, let’s simulate some data. Below I use the expand_grid function to generate a number of trials in the two conditions for a set of fake participants. Next I use group_by and mutate to randomly generate a participant-level mean and effect size. Finally, I randomly sample a reaction time and a response for each trial centered around the participant-level means, and sort the dataframe. This setup (where the trial-level data are centered around participant-level means, which are centered around the group-level mean) reproduces the kind of data that you would typically obtain experimentally, since participants are likely to be somewhat different from each other. I made up the means and effect sizes for this example, but in practice you will want to look at previous research to find numbers that make sense. Though ideally we would use mixed-effect regression for this type of data, today we’re going to keep things simple with good old OLS regression.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
## change some settings
## options(contrasts = c("contr.sum","contr.poly")) 
## this tweaks makes sure that contrasts are interpretable as main effects

# load packages
library(tidyverse)   # for data wrangling
library(emmeans)     # for estimating means

## simulate some fake data for a stroop task
set.seed(2021)   ## make sure we get the same data every time
colors <- c('red', 'blue', 'green')
df <- expand_grid(condition=c('same', 'different'),
                  trial=1:25,
                  participant=1:25) %>%
    group_by(participant) %>%
    mutate(color=sample(colors, n(), replace=TRUE),
           saturation=runif(n()),
           participant=as.factor(participant),
           mean=rnorm(1, 3, 0.4)-saturation*rnorm(1, 0.5, 0.2),
           effect=.3+saturation*rnorm(1, 0.25, 0.2) - .1*mean) %>%
    rowwise() %>%
    mutate(text=ifelse(condition=='same', color, sample(colors[!colors %in% color], 1, replace=TRUE))) %>%
    ungroup() %>%
    mutate(RT=rnorm(n(), ifelse(condition=='same', mean, mean+effect), 0.4),
           correct=rbinom(n(), size=1,
                          plogis(ifelse(condition=='same',
                                        mean-2.8+3*saturation*plogis(effect),
                                        mean-2.8+saturation*plogis(effect)))),
           response=ifelse(correct, color, text),
           saturation=saturation*100) %>%
    select(-mean, -effect) %>%
    relocate(participant, trial, condition, saturation, color, text, RT, response, correct) %>%
    arrange(participant, trial, condition)

Let’s take a peek at this data and see what we’ve made:

1
df
1
2
3
4
5
6
7
8
9
10
11
12
13
14
## # A tibble: 1,250 x 9
##    participant trial condition saturation color text     RT response correct
##    <fct>       <int> <chr>          <dbl> <chr> <chr> <dbl> <chr>      <int>
##  1 1               1 different      18.0  blue  green  3.60 green          0
##  2 1               1 same           93.8  green green  2.99 green          0
##  3 1               2 different       9.01 green red    2.66 green          1
##  4 1               2 same           79.4  blue  blue   2.13 blue           1
##  5 1               3 different      43.7  red   green  3.19 green          0
##  6 1               3 same           99.3  blue  blue   2.33 blue           1
##  7 1               4 different      28.8  blue  red    2.37 red            0
##  8 1               4 same           32.8  blue  blue   2.59 blue           1
##  9 1               5 different      95.8  red   blue   2.47 red            1
## 10 1               5 same           10.4  green green  2.36 green          1
## # … with 1,240 more rows

As we can see, we now have a nice & tidy dataframe where each row is a trial for a particular participant/condition, level of saturation. As dependent variables, we have the reaction time, the chosen response, and whether or not the response was correct.

The base case

To kick things off, let’s do the simplest model we can make: the null model. This model predicts reaction time with only a constant intercept (denoted by the 1 in our formula):

1
2
m <- lm(RT ~ 1, df)
summary(m)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
## 
## Call:
## lm(formula = RT ~ 1, data = df)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.70202 -0.39197 -0.01827  0.38892  1.86988 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  2.74466    0.01647   166.6   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.5825 on 1249 degrees of freedom

You can see that our model indeed has only one parameter, an intercept. What does it correspond to? Well, given no other information, the model’s best guess of a given RT is simply the mean RT. We can test the mean RT against 0 using a one-sample t-test:

1
t.test(df$RT)
1
2
3
4
5
6
7
8
9
10
11
## 
##  One Sample t-test
## 
## data:  df$RT
## t = 166.6, df = 1249, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
##  2.712341 2.776982
## sample estimates:
## mean of x 
##  2.744662

Unsurprisingly, the one-sample t-test gives us exactly the same results. So, our null model is testing whether the mean RT is different from zero (and it is!). This isn’t especially helful in our case, but this is all that’s needed for many research questions.

Simple linear regression

To do something a little more useful, let’s look to see if our fake participants were indeed slower when the color of the text was different from the text itself:

1
2
3
4
5
6
ggplot(df, aes(x=condition, y=RT)) +
    geom_violin() +
    geom_jitter(aes(color=participant), height=0, show.legend=F, alpha=0.2) +
    stat_summary(fun=mean, size=1) +
    xlab('Condition') + ylab('Reaction Time (s)') +
    theme_classic(base_size=24)

The means are pretty close to each other, but it looks like we might have the effect we observed. To find out for sure, let’s run a regression!

1
2
m <- lm(RT ~ condition, df)
summary(m)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
## 
## Call:
## lm(formula = RT ~ condition, data = df)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.68792 -0.38280 -0.02851  0.39654  1.76337 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    2.85118    0.02291 124.428  < 2e-16 ***
## conditionsame -0.21303    0.03241  -6.574 7.19e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.5729 on 1248 degrees of freedom
## Multiple R-squared:  0.03347,    Adjusted R-squared:  0.03269 
## F-statistic: 43.22 on 1 and 1248 DF,  p-value: 7.187e-11

Based on the number of stars in this output, it looks like we have a significant effect! But what exactly is significant? More specifically, what do these coefficients actually mean? The answer actually depends on how the contrasts over your variables are coded.

Dummy coding

By default, R uses a system of contrasts called dummy coding (or contr.treatment). With dummy coding, we select one of the levels of the variable as a reference level. For categorical variables, by default, R assumes that you want use the earliest level of the variable as the reference level (in our case, this is the ‘different’ condition). Then the Intercept is simply the mean RT for this reference level. We can confirm this by calculating the mean manually:

1
2
mean.different <- df %>% filter(condition=='different') %>% pull(RT) %>% mean
mean.different
1
## [1] 2.851177

Then, the coefficient conditionsame is the difference between the means for our two conditions:

1
2
mean.same <- df %>% filter(condition=='same') %>% pull(RT) %>% mean
mean.same - mean.different
1
## [1] -0.2130316

Putting this together, our significant result is that reaction times were slightly shorter for the “same” condition compared to the “different” condition. What if we want the other way around? Easy- we just set the desired reference level of the condition variable:

1
2
3
df <- df %>% mutate(condition=factor(condition, levels=c('same', 'different')))
m <- lm(RT ~ condition, df)
summary(m)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
## 
## Call:
## lm(formula = RT ~ condition, data = df)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.68792 -0.38280 -0.02851  0.39654  1.76337 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         2.63815    0.02291 115.131  < 2e-16 ***
## conditiondifferent  0.21303    0.03241   6.574 7.19e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.5729 on 1248 degrees of freedom
## Multiple R-squared:  0.03347,    Adjusted R-squared:  0.03269 
## F-statistic: 43.22 on 1 and 1248 DF,  p-value: 7.187e-11

We can see that the Intercept now refers to the average RT in the “same” condition, and the conditiondifferent coefficient now refers to the difference between the same and different conditions. Notably, this is the same value as before, but negative.

Sum-to-zero coding

Another popular way of coding regression coefficients is to use sum-to-zero or effects coding (or contr.sum), which is popular in psychology because it produces orthogonal (read: independent) contrasts similar to an ANOVA. There are a couple different ways to change the contrast coding in R, but I recommend doing it like so:

1
2
m <- lm(RT ~ condition, df, contrasts=list(condition=contr.sum))
summary(m)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
## 
## Call:
## lm(formula = RT ~ condition, data = df, contrasts = list(condition = contr.sum))
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.68792 -0.38280 -0.02851  0.39654  1.76337 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   2.7447     0.0162 169.393  < 2e-16 ***
## condition1   -0.1065     0.0162  -6.574 7.19e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.5729 on 1248 degrees of freedom
## Multiple R-squared:  0.03347,    Adjusted R-squared:  0.03269 
## F-statistic: 43.22 on 1 and 1248 DF,  p-value: 7.187e-11

Under sum-to-zero coding, the intercept is now the grand mean RT over all of our data (more specifically, the mean of the mean RTs in each condition), and the slope condition1 is the difference between this grand mean and the mean RT of the “same” condition.

1
2
3
4
mean.grand <- mean(c(mean.same, mean.different))
mean.grand

mean.same - mean.grand
1
2
## [1] 2.744662
## [1] -0.1065158

Thinking like a linear model

From the last two examples, hopefully it’s obvious that regression coefficients are just differences between means. Specifically, our regression model sees the world through a table like this:

condition
same different


That is, for each cell in this table, our model predicts a unique mean reaction time. In this case, the predicted means are just the raw means that we calculated above. But since this won’t work for every design, I’m going to introduce you to the treasure that is emmeans. emmeans stands for “expected marginal means” and does exactly that: it takes a regression model and a set of independent variables, and tells you what the model thinks the mean is for setting of the independent variables. For example, let’s get the expected marginal means of RT by condition:

1
emmeans(m, ~ condition)
1
2
3
4
5
##  condition emmean     SE   df lower.CL upper.CL
##  same        2.64 0.0229 1248     2.59     2.68
##  different   2.85 0.0229 1248     2.81     2.90
## 
## Confidence level used: 0.95

To reproduce the model’s coefficients, we can just pipe the output of this statement into the contrast function. I don’t know why, but emmeans names the contrasts differently from base R (helpful, right?), such that contr.treatment is called trt.vs.ctrl, and contr.sum is called eff:

1
2
3
4
5
6
7
## for dummy contrasts
emmeans(m, ~ condition) %>%
    contrast('trt.vs.ctrl')

## for sum-to-zero contrasts
emmeans(m, ~ condition) %>%
    contrast('eff')
1
2
3
4
5
6
7
8
##  contrast         estimate     SE   df t.ratio p.value
##  different - same    0.213 0.0324 1248 6.574   <.0001 
## 
##  contrast         estimate     SE   df t.ratio p.value
##  same effect        -0.107 0.0162 1248 -6.574  <.0001 
##  different effect    0.107 0.0162 1248  6.574  <.0001 
## 
## P value adjustment: fdr method for 2 tests

Alternatively, we can put the contrasts directly into the left side of the formula in the emmeans call to see the contrasts alongside the means:

1
2
3
4
5
## for dummy contrasts
emmeans(m, trt.vs.ctrl ~ condition)

## for sum-to-zero contrasts
emmeans(m, eff ~ condition)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
## $emmeans
##  condition emmean     SE   df lower.CL upper.CL
##  same        2.64 0.0229 1248     2.59     2.68
##  different   2.85 0.0229 1248     2.81     2.90
## 
## Confidence level used: 0.95 
## 
## $contrasts
##  contrast         estimate     SE   df t.ratio p.value
##  different - same    0.213 0.0324 1248 6.574   <.0001 
## 
## 
## $emmeans
##  condition emmean     SE   df lower.CL upper.CL
##  same        2.64 0.0229 1248     2.59     2.68
##  different   2.85 0.0229 1248     2.81     2.90
## 
## Confidence level used: 0.95 
## 
## $contrasts
##  contrast         estimate     SE   df t.ratio p.value
##  same effect        -0.107 0.0162 1248 -6.574  <.0001 
##  different effect    0.107 0.0162 1248  6.574  <.0001 
## 
## P value adjustment: fdr method for 2 tests

I’ll also note here that this is a fantastic way to get predicted means/CIs to make a plot of your results, without having to calculate anything manually (and yes, this even works for mixed-effect models and repeated measures designs)! Let’s try it out:

1
2
3
4
5
6
ggplot(df, aes(x=condition, y=RT)) +
    geom_jitter(aes(color=participant), height=0, show.legend=F, alpha=0.2) +
    geom_pointrange(aes(y=emmean, ymin=lower.CL, ymax=upper.CL),
                    data=as.data.frame(emmeans(m, ~ condition))) +
    xlab('Condition') + ylab('Reaction Time (s)') +
    theme_classic(base_size=24)

Regression with additive predictors

Until now, we’ve focused on interpreting regression coefficients when we have just one predictor in our model. But things get a little messier when we use two or more predictors. For example, perhaps we wanted to know if the color of the stimulus also effects people’s reaction time in the Stroop task. More specifically, we’d like to estimate a mean RT for each cell of the following table:

condition
color same, blue different, blue
same, green different, green
same, red different, red


There are two ways we could add this into our model. In the simplest case, we might want to see if both (a) condition and (b) color independently. That is, we force the effect of condition to be the same across the different levels of color and vice versa. We can do this using the + symbol in our regression formula:

1
2
m <- lm(RT ~ condition + color, df)
summary(m)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
## 
## Call:
## lm(formula = RT ~ condition + color, data = df)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.64646 -0.38321 -0.03082  0.40313  1.73237 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         2.63205    0.03324  79.173  < 2e-16 ***
## conditiondifferent  0.21597    0.03240   6.666 3.93e-11 ***
## colorgreen         -0.03829    0.04012  -0.954    0.340    
## colorred            0.04998    0.03974   1.258    0.209    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.5721 on 1246 degrees of freedom
## Multiple R-squared:  0.03742,    Adjusted R-squared:  0.0351 
## F-statistic: 16.14 on 3 and 1246 DF,  p-value: 2.674e-10

It looks like we have a significant effect of condition, and no significant effects of color. To replicate these coefficients, we can use emmeans (the adjust='none' option is used to make sure the p-values match what we get from lm):

1
2
emmeans(m, ~ condition) %>% contrast('trt.vs.ctrl')
emmeans(m, ~ color) %>% contrast('trt.vs.ctrl', adjust='none')
1
2
3
4
5
6
7
8
9
##  contrast         estimate     SE   df t.ratio p.value
##  different - same    0.216 0.0324 1246 6.666   <.0001 
## 
## Results are averaged over the levels of: color 
##  contrast     estimate     SE   df t.ratio p.value
##  green - blue  -0.0383 0.0401 1246 -0.954  0.3400 
##  red - blue     0.0500 0.0397 1246  1.258  0.2087 
## 
## Results are averaged over the levels of: condition

We can see that the Intercept is our estimate of the mean RT for trials in the “same” condition with a blue stimulus, the conditiondifferent coefficient is the average increase in RT for the different condition over the three colors, and the two color coefficients are the difference between mean RT in the red/green and blue colors, averaging over the two conditions. To confirm that our model actually treats the effects of condition and color independently, we can compute contrasts of one variable over the levels of the other variable:

1
2
emmeans(m, ~condition|color) %>% contrast('trt.vs.ctrl')
emmeans(m, ~color|condition) %>% contrast('trt.vs.ctrl', adjust='none')
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
## color = blue:
##  contrast         estimate     SE   df t.ratio p.value
##  different - same    0.216 0.0324 1246 6.666   <.0001 
## 
## color = green:
##  contrast         estimate     SE   df t.ratio p.value
##  different - same    0.216 0.0324 1246 6.666   <.0001 
## 
## color = red:
##  contrast         estimate     SE   df t.ratio p.value
##  different - same    0.216 0.0324 1246 6.666   <.0001 
## 
## condition = same:
##  contrast     estimate     SE   df t.ratio p.value
##  green - blue  -0.0383 0.0401 1246 -0.954  0.3400 
##  red - blue     0.0500 0.0397 1246  1.258  0.2087 
## 
## condition = different:
##  contrast     estimate     SE   df t.ratio p.value
##  green - blue  -0.0383 0.0401 1246 -0.954  0.3400 
##  red - blue     0.0500 0.0397 1246  1.258  0.2087

Since the contrasts look the same over each level of the variable, we can be sure that our model does not allow condition and color to interact. If we want to use different contrast coding for this model, the coefficients are just as interpretable. For example, we can estimate the same model with sum-to-zero coefficients:

1
2
m <- lm(RT ~ condition + color, df, contrasts=list(condition='contr.sum', color='contr.sum'))
summary(m)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
## 
## Call:
## lm(formula = RT ~ condition + color, data = df, contrasts = list(condition = "contr.sum", 
##     color = "contr.sum"))
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.64646 -0.38321 -0.03082  0.40313  1.73237 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  2.743929   0.016195 169.428  < 2e-16 ***
## condition1  -0.107984   0.016198  -6.666 3.93e-11 ***
## color1      -0.003897   0.023196  -0.168   0.8666    
## color2      -0.042191   0.022882  -1.844   0.0654 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.5721 on 1246 degrees of freedom
## Multiple R-squared:  0.03742,    Adjusted R-squared:  0.0351 
## F-statistic: 16.14 on 3 and 1246 DF,  p-value: 2.674e-10

The coefficients can be interpreted in the same way as sum-to-zero contrasts with only one variable: the intercept corresponds to our estimate of the grand mean, the coefficient for condition is the difference between the grand mean and the same condition, and the color coefficients are the difference between the grand mean and the red/green colors:

1
2
emmeans(m, ~ condition) %>% contrast('eff')
emmeans(m, ~ color) %>% contrast('eff', adjust='none')
1
2
3
4
5
6
7
8
9
10
11
12
##  contrast         estimate     SE   df t.ratio p.value
##  same effect        -0.108 0.0162 1246 -6.666  <.0001 
##  different effect    0.108 0.0162 1246  6.666  <.0001 
## 
## Results are averaged over the levels of: color 
## P value adjustment: fdr method for 2 tests 
##  contrast     estimate     SE   df t.ratio p.value
##  blue effect   -0.0039 0.0232 1246 -0.168  0.8666 
##  green effect  -0.0422 0.0229 1246 -1.844  0.0654 
##  red effect     0.0461 0.0227 1246  2.034  0.0422 
## 
## Results are averaged over the levels of: condition

Of course, we can always mix + match the two coding systems together if we so choose:

1
2
m <- lm(RT ~ condition + color, df, contrasts=list(color='contr.sum'))
summary(m)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
## 
## Call:
## lm(formula = RT ~ condition + color, data = df, contrasts = list(color = "contr.sum"))
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.64646 -0.38321 -0.03082  0.40313  1.73237 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         2.635945   0.022922 114.999  < 2e-16 ***
## conditiondifferent  0.215969   0.032396   6.666 3.93e-11 ***
## color1             -0.003897   0.023196  -0.168   0.8666    
## color2             -0.042191   0.022882  -1.844   0.0654 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.5721 on 1246 degrees of freedom
## Multiple R-squared:  0.03742,    Adjusted R-squared:  0.0351 
## F-statistic: 16.14 on 3 and 1246 DF,  p-value: 2.674e-10

As an exercise, see if you can figure out where these numbers come from (hint: look up!).

Regression with interacting predictors

While the additive model is nice and easy to interpret, it’s not always the case that the effect of one variable will be the same across the domain of another variable. To relax this assumption, we can fit a multiplicative model that includes interaction terms:

1
2
m <- lm(RT ~ condition * color, df)
summary(m)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
## 
## Call:
## lm(formula = RT ~ condition * color, data = df)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.67893 -0.37766 -0.03314  0.40012  1.74843 
## 
## Coefficients:
##                               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                    2.63452    0.04128  63.820  < 2e-16 ***
## conditiondifferent             0.21117    0.05751   3.672 0.000251 ***
## colorgreen                    -0.07548    0.05765  -1.309 0.190709    
## colorred                       0.07582    0.05586   1.357 0.174949    
## conditiondifferent:colorgreen  0.07197    0.08026   0.897 0.370050    
## conditiondifferent:colorred   -0.05540    0.07950  -0.697 0.486034    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.572 on 1244 degrees of freedom
## Multiple R-squared:  0.03946,    Adjusted R-squared:  0.0356 
## F-statistic: 10.22 on 5 and 1244 DF,  p-value: 1.293e-09

You can see that now we have two more interaction coefficients. Under dummy coding, the presence of these interaction coefficients changes how the above coefficients are interpreted: now, each coefficient is conditional with respect to the reference level of each variable. The Intercept is still the mean RT for trials in the “same” condition with a blue stimulus. But, whereas the conditiondifferent used to tell us the average difference between same/different trials over the three colors, now it only tells us that difference for the blue trials. We call this a “simple effect” since its interpretation is more simple than that of a “main effect” which we’ll talk about soon. In emmeans, we can get this contrast for the blue trials only using the at argument:

1
2
emmeans(m, ~ condition | color, at=list(color=c('blue'))) %>%
    contrast('trt.vs.ctrl')
1
2
3
## color = blue:
##  contrast         estimate     SE   df t.ratio p.value
##  different - same    0.211 0.0575 1244 3.672   0.0003

The colorgreen and colorred coefficients have also changed: now they are the difference between the blue and red/green colors within the “same” condition:

1
2
emmeans(m, ~ color | condition, at=list(condition=c('same'))) %>%
    contrast('trt.vs.ctrl', adjust='none')
1
2
3
4
## condition = same:
##  contrast     estimate     SE   df t.ratio p.value
##  green - blue  -0.0755 0.0577 1244 -1.309  0.1907 
##  red - blue     0.0758 0.0559 1244  1.357  0.1749

Finally, we have the two interaction terms. There are two equally good ways of thinking about them. On one hand, they are the difference in the same/different Stroop effect between the red/green and blue colored trials:

1
2
3
emmeans(m, ~ condition | color) %>%
    contrast('trt.vs.ctrl') %>%
    contrast(method='trt.vs.ctrl', by='contrast', adjust='none')
1
2
3
4
## contrast = different - same:
##  contrast1    estimate     SE   df t.ratio p.value
##  green - blue   0.0720 0.0803 1244  0.897  0.3700 
##  red - blue    -0.0554 0.0795 1244 -0.697  0.4860

This tells us that the Stroop effect was larger for green than for blue stimuli, and larger for blue than for red stimuli (though none of these differences were significant). But another way of looking at it is that they describe the difference in the differences between stimulus colors across the same/different conditions:

1
2
3
emmeans(m, ~ color | condition) %>%
    contrast('trt.vs.ctrl') %>%
    contrast(method='trt.vs.ctrl', by='contrast', adjust='none')
1
2
3
4
5
6
7
## contrast = green - blue:
##  contrast1        estimate     SE   df t.ratio p.value
##  different - same   0.0720 0.0803 1244  0.897  0.3700 
## 
## contrast = red - blue:
##  contrast1        estimate     SE   df t.ratio p.value
##  different - same  -0.0554 0.0795 1244 -0.697  0.4860

This perspective tells us that the difference between RTs following green and blue stimuli in the “different” condition is larger than in the “same” condition, and that the difference between RTs following red and blue stimuli in the “different” condition is smaller than in the “same” condition. In this case the first interpretation is more natural, but mathematically both interpretations are equivalent, so feel free to take whichever interpretation seems more natural.

If you’re used to running ANOVAs, then it might seem that the approach of looking at effects of one variable conditional on another is more complicated than it needs to be. What you really want to know is, on average, what does effect does each predictor have on the outcome variable- you want a main effect. We can get main effects by switching from dummy contrasts to sum-to-zero contrasts:

1
2
m <- lm(RT ~ condition * color, df, contrasts=list(condition='contr.sum', color='contr.sum'))
summary(m)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
## 
## Call:
## lm(formula = RT ~ condition * color, data = df, contrasts = list(condition = "contr.sum", 
##     color = "contr.sum"))
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.67893 -0.37766 -0.03314  0.40012  1.74843 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        2.742980   0.016206 169.258  < 2e-16 ***
## condition1        -0.108349   0.016206  -6.686 3.46e-11 ***
## color1            -0.002875   0.023201  -0.124   0.9014    
## color2            -0.042367   0.022887  -1.851   0.0644 .  
## condition1:color1  0.002762   0.023201   0.119   0.9052    
## condition1:color2 -0.033224   0.022887  -1.452   0.1469    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.572 on 1244 degrees of freedom
## Multiple R-squared:  0.03946,    Adjusted R-squared:  0.0356 
## F-statistic: 10.22 on 5 and 1244 DF,  p-value: 1.293e-09

Now we can go back to treating each coefficient in isolation: the Intercept is the grand mean, the condition1 coefficient tells us the average difference between the same condition and the grand mean over all stimulus colors, the color1 and color2 are the average differences between the blue/green colors and the grand mean over both conditions, and finally the interaction terms are the difference in the difference in the same condition and the grand mean between the blue/green colors and the grand mean. Even though this sounds like a mouthful, sum-to-zero contrasts have the advantage that each coefficient is independent of the others.

Continuous variables

We’ve been through interpreting the null regression model, interpreting single-variable regression models, and interpreting regression models with more than one variable. What could possibly be left? Well, if you remember back to when we simulated this data, we also simulated the saturation level of the stimulus color (ranging from 0 to 100%). This is a continuous variable, which has some quirks of its own in terms of interpreting regression coefficients. Let’s start by forgetting about the Stroop effect, and just looking at whether saturation influences reaction time during the Stroop task:

1
2
3
4
5
ggplot(df, aes(x=saturation, y=RT)) +
    geom_point(alpha=.1) +
    geom_smooth(color='black', method='lm') +
    xlab('Saturation (%)') + ylab('Reaction Time (s)') +
    theme_classic(base_size=24)

To β or not to β?

It turns out that there are a few different ways to code this very same regression. Let’s start by running a regression using R’s default settings:

1
2
m <- lm(RT ~ saturation, df)
summary(m)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
## 
## Call:
## lm(formula = RT ~ saturation, data = df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.7448 -0.3810 -0.0215  0.3804  2.0264 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  2.8988005  0.0318273  91.079  < 2e-16 ***
## saturation  -0.0031243  0.0005544  -5.636 2.16e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.5754 on 1248 degrees of freedom
## Multiple R-squared:  0.02482,    Adjusted R-squared:  0.02404 
## F-statistic: 31.76 on 1 and 1248 DF,  p-value: 2.155e-08

With this default setting, we interpret the coefficients similarly to dummy-coded categorical variables. That is, the Intercept is the mean RT when saturation is zero, and the saturation coefficient is the mean increase/decrease in RTs when saturation is increased by 1. Since our saturation variable is in percentage units, this is the increase in RT with a 1% increase in saturation. If you want to make your Intercept more interpretable, you can standardize the predictor variable saturation by subtracting the mean and dividing by the standard deviation (done by the scale function):

1
2
3
df$saturation.std <- scale(df$saturation)
m <- lm(RT ~ saturation.std, df)
summary(m)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
## 
## Call:
## lm(formula = RT ~ saturation.std, data = df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.7448 -0.3810 -0.0215  0.3804  2.0264 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     2.74466    0.01628 168.640  < 2e-16 ***
## saturation.std -0.09176    0.01628  -5.636 2.16e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.5754 on 1248 degrees of freedom
## Multiple R-squared:  0.02482,    Adjusted R-squared:  0.02404 
## F-statistic: 31.76 on 1 and 1248 DF,  p-value: 2.155e-08

Though the Intercept is still the mean RT when saturation.std is zero, this has a different interpretation before. Since saturation.std is standardized, it is centered at zero and has a standard deviation of

  1. So, the Intercept is the mean RT at the mean value of saturation, and the coefficient saturation.std is the mean increase in RT with a standard deviation increase in saturation. In cases where the predictor variable has a meaningless scale (like a Likert scale), this coefficient can be more interpretable than the default coefficient. Finally, in addition to standardizing the predictor variable saturation, we can also standardize the outcome variable RT:
1
2
3
df$RT.std <- scale(df$RT)
m <- lm(RT.std ~ saturation.std, df)
summary(m)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
## 
## Call:
## lm(formula = RT.std ~ saturation.std, data = df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.9956 -0.6541 -0.0369  0.6531  3.4791 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     9.950e-17  2.794e-02   0.000        1    
## saturation.std -1.575e-01  2.795e-02  -5.636 2.16e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.9879 on 1248 degrees of freedom
## Multiple R-squared:  0.02482,    Adjusted R-squared:  0.02404 
## F-statistic: 31.76 on 1 and 1248 DF,  p-value: 2.155e-08

Now, the Intercept is extremely close to zero (representing the mean standardized RT at the mean saturation), and the coefficient saturation.std is the standardi deviation increase in RT given a standard deviation increase in saturation: in other words, it’s a correlation!

1
cor.test(df$RT.std, df$saturation.std)
1
2
3
4
5
6
7
8
9
10
11
## 
##  Pearson's product-moment correlation
## 
## data:  df$RT.std and df$saturation.std
## t = -5.6355, df = 1248, p-value = 2.155e-08
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.2111345 -0.1029865
## sample estimates:
##        cor 
## -0.1575328

You can also run the regression with standardized RTs, but not standardized saturation values. As an exercise, try interpreting these (standardized) coefficients:

1
2
m <- lm(RT.std ~ saturation, df)
summary(m)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
## 
## Call:
## lm(formula = RT.std ~ saturation, data = df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.9956 -0.6541 -0.0369  0.6531  3.4791 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.2646347  0.0546429   4.843 1.44e-06 ***
## saturation  -0.0053641  0.0009518  -5.636 2.16e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.9879 on 1248 degrees of freedom
## Multiple R-squared:  0.02482,    Adjusted R-squared:  0.02404 
## F-statistic: 31.76 on 1 and 1248 DF,  p-value: 2.155e-08

If you have been paying attention, you’ll have noticed that the significance of the saturation coefficient was exactly the same between the four versions of the model. Ultimately, the choice to standardize or not to standardize is totally up to you and your preferences.

Combining continuous & categorical

We found that saturation decreased RTs in the Stroop task on average. Could this effect differ depending on whether the color of the stimulus matched the text? First, let’s make a plot:

1
2
3
4
5
ggplot(df, aes(x=saturation, y=RT, color=condition)) +
    geom_point(alpha=.1) +
    geom_smooth(method='lm') +
    xlab('Saturation (%)') + ylab('Reaction Time (s)') +
    theme_classic()

It looks like RTs only decrease with saturation when the color is the same as the text. To confirm, let’s run a (multiplicative) regression:

1
2
m <- lm(RT ~ condition * saturation, df)
summary(m)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
## 
## Call:
## lm(formula = RT ~ condition * saturation, data = df)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.67095 -0.39058 -0.03076  0.37961  1.79690 
## 
## Coefficients:
##                                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                    2.9146551  0.0436345  66.797  < 2e-16 ***
## conditiondifferent            -0.0300392  0.0620630  -0.484    0.628    
## saturation                    -0.0056408  0.0007634  -7.389 2.70e-13 ***
## conditiondifferent:saturation  0.0049673  0.0010811   4.595 4.77e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.561 on 1246 degrees of freedom
## Multiple R-squared:  0.07459,    Adjusted R-squared:  0.07236 
## F-statistic: 33.48 on 3 and 1246 DF,  p-value: < 2.2e-16

Just like before, the Intercept is the mean RT when condition is at its reference level (“same”), and when saturation is at zero, conditiondifferent tells us the mean Stroop effect when saturation is at zero, and saturation tells us the increase in RT with a 1% increase in saturation in the “same” condition:

1
2
3
4
5
6
7
8
9
10
## Intercept
emmeans(m, ~condition, at=list(condition='same', saturation=0))

## conditiondifferent
emmeans(m, ~condition, at=list(saturation=0)) %>%
    contrast('trt.vs.ctrl')

## saturation
emmeans(m, ~saturation, at=list(saturation=0:1, condition='same')) %>%
    contrast('trt.vs.ctrl')
1
2
3
4
5
6
7
8
9
##  condition emmean     SE   df lower.CL upper.CL
##  same        2.91 0.0436 1246     2.83        3
## 
## Confidence level used: 0.95 
##  contrast         estimate     SE   df t.ratio p.value
##  different - same    -0.03 0.0621 1246 -0.484  0.6285 
## 
##  contrast estimate       SE   df t.ratio p.value
##  1 - 0    -0.00564 0.000763 1246 -7.389  <.0001

As with the interaction between condition and color, there are two different ways we can choose to interpret the interaction coefficient. First, we could say that it is the difference in size of the Stroop effect between when saturation = 1% and when saturation = 0%. But it’s equally accurate to say that it represents the difference in the effect of saturation between the “same” and “different” conditions:

1
2
3
4
5
6
7
emmeans(m, ~ saturation | condition, at=list(saturation=0:1)) %>%
    contrast('trt.vs.ctrl') %>%
    contrast(method='trt.vs.ctrl', by='contrast', adjust='none')

emmeans(m, ~ condition | saturation, at=list(saturation=0:1)) %>%
    contrast('trt.vs.ctrl') %>%
    contrast(method='trt.vs.ctrl', by='contrast', adjust='none')
1
2
3
4
5
6
7
## contrast = 1 - 0:
##  contrast1        estimate      SE   df t.ratio p.value
##  different - same  0.00497 0.00108 1246 4.595   <.0001 
## 
## contrast = different - same:
##  contrast1 estimate      SE   df t.ratio p.value
##  1 - 0      0.00497 0.00108 1246 4.595   <.0001

As an exercise, try repeating this regression with sum-to-zero contrasts and interpreting those coefficients!

There’s one more case where you might be having trouble interpreting regression coefficients: when using a family other than the default Gaussian, particilarly families with non-identity link functions. In logistic regression, for example, we predict a binary outcome with a linear regression on the logit scale (from  − ∞ to ∞), which gets pumped through the inverse logit function to yield probabilities (from 0 to 1). This transformation can sometimes make interpreting the coefficients difficult. As an example, let’s try looking at whether, in addition to being slower, people also make more mistakes when the stimulus text and color are different. First, as always, let’s check out a plot:

1
2
3
4
5
ggplot(df, aes(x=condition, y=correct)) +
    geom_jitter(alpha=.1, height=0) +
    stat_summary(fun.data=mean_cl_normal) +
    xlab('Condition') + ylab('Probability of Correct Response') +
    theme_classic()

It appears as if people perform above chance in the “same” condition, but not the “different” condition. To confirm, let’s look at the underlying model:

1
2
m <- glm(correct ~ condition, df, family='binomial')
summary(m)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
## 
## Call:
## glm(formula = correct ~ condition, family = "binomial", data = df)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.4514  -1.1706   0.9262   0.9262   1.1842  
## 
## Coefficients:
##                    Estimate Std. Error z value Pr(>|z|)    
## (Intercept)         0.62432    0.08393   7.439 1.02e-13 ***
## conditiondifferent -0.64032    0.11595  -5.522 3.35e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1705.7  on 1249  degrees of freedom
## Residual deviance: 1674.8  on 1248  degrees of freedom
## AIC: 1678.8
## 
## Number of Fisher Scoring iterations: 4

We can see that both the Intercept and the conditiondifferent coefficients are significant. But what do they mean? Similarly to before, the Intercept is the mean X for the “same” condition, and conditiondifferent is the mean change in X between the “same” and “different” conditions. To find out what X exactly is, let’s consult emmeans:

1
emmeans(m, ~ condition)
1
2
3
4
5
6
##  condition emmean     SE  df asymp.LCL asymp.UCL
##  same       0.624 0.0839 Inf     0.460     0.789
##  different -0.016 0.0800 Inf    -0.173     0.141
## 
## Results are given on the logit (not the response) scale. 
## Confidence level used: 0.95

emmeans tells us that our estimates are on the logit scale, not the response scale. Although we could transform these manually back to probabilities using the function plogis, emmeans can do it for you:

1
emmeans(m, ~ condition, type='response')
1
2
3
4
5
6
##  condition  prob     SE  df asymp.LCL asymp.UCL
##  same      0.651 0.0191 Inf     0.613     0.688
##  different 0.496 0.0200 Inf     0.457     0.535
## 
## Confidence level used: 0.95 
## Intervals are back-transformed from the logit scale

Now we can see that the probability of responding correctly is 65% in the same condition and 50% in the different condition. Are these two probabilities different? Let’s find out:

1
emmeans(m, ~ condition, type='response') %>% contrast('trt.vs.ctrl')
1
2
3
4
##  contrast         odds.ratio     SE  df null z.ratio p.value
##  different / same      0.527 0.0611 Inf    1 -5.522  <.0001 
## 
## Tests are performed on the log odds ratio scale

Whoah- what’s going on there? Instead of differences in probabilities, emmeans is giving us something called an odds ratio. What’s that? Well an odds is the proportion between something happening and not happening. We can calculate these manually:

1
2
3
4
5
6
7
8
p.same <- as.data.frame(emmeans(m, ~ condition, type='response'))$prob[1]
p.diff <- as.data.frame(emmeans(m, ~ condition, type='response'))$prob[2]

odds.same <- p.same / (1 - p.same)
odds.diff <- p.diff / (1 - p.diff)

odds.same
odds.diff
1
2
## [1] 1.866972
## [1] 0.984127

We can see that in the “same” condition, participants were 1.9 times more likely to make a correct response compared to an incorrect response. In the “different” condition, people were about as likely to be correct as they were incorrrect. The odds ratio, not surprisingly, is the ratio between these two odds:

1
2
odds.ratio <- odds.diff / odds.same
odds.ratio
1
## [1] 0.5271245

So we can piece all this back together to say that the odds of being correct were about half as large in the different condition than in the same condition. Why does emmeans give us this weird number? Well, this is just what logistic regression tests for. Specifically, the logit scale is also known as the log odds scale, which means that our coefficients are on the scale of log odds. To convert our coefficients to the odds scale, then, we can just exponentiate:

1
2
3
4
exp(coef(m))  ## get the coefficients on the odds scale

## get the odds with CIs with emmeans
emmeans(m, ~condition, tran='log', type='response')
1
2
3
4
5
6
7
8
##        (Intercept) conditiondifferent 
##          1.8669725          0.5271245 
##  condition  prob     SE  df asymp.LCL asymp.UCL
##  same      1.867 0.1567 Inf     1.584      2.20
##  different 0.984 0.0787 Inf     0.841      1.15
## 
## Confidence level used: 0.95 
## Intervals are back-transformed from the log scale

If odds ratios aren’t very intuitive for you, you are in good company. Some have argued that it’s not only fine, but it’s preferable to use standard linear regression for estimating differences in probabilities. But if you’re looking for more practice, as a parting exercise, try running this logistic regression with condition and saturation as predictors! The coefficients might take some work to interpret, but the basic logic is exactly the same.

1
2
3
4
ggplot(df, aes(x=saturation, y=correct, color=condition)) +
    geom_point() +
    geom_smooth(method='glm', method.args=list(family='binomial')) +
    theme_classic()

Wrapping up

If you made it this far, congrats! We covered a lot in this workshop, including how to interpret linear regression coefficients with zero-two categorical variables, with standardized and unstandardized continuous variables, and with non-trivial link functions (using the example of logistic regression). In fact, we covered so much, that there’s not much more to know here! With a little imagination, you can use the same logic to interpret coefficients for models with 3+ way interactions, models with random effects, and models with different link functions and distribution families.