Ari Khoudary
Ari Khoudary PhD Student in Cognitive Science at UC Irvine working with Dr. Megan Peters and Dr. Aaron Bornstein on memory, perception, and metacognition.

ggplot2 workshop

ggplot2 workshop


Welcome! This is an intermediate-level workshop about the R package ggplot2. It’s co-authored by Allie and Maria. We’ll cover custom aesthetics, plotting model estimates, multi-layered plots, and advanced geoms (spaghetti and rainbow plots!).


Table of Contents



Tidy data


Like other packages in the tidyverse, ggplot2 works best when your data are “tidy”: each column is a variable, and each row is an observation. How exactly you define an “observation” is going to be dependent on your question and desired visualization, and it’s not uncommon to use a couple different versions of your dataframe when playing around with visualizations.

Today we’ll be working with the dragons dataset that we’ve all come to know and love. This is a fictional dataset about dragons. Each dragon has one row. We have information about each dragon’s body length and cognitive test score. We also have some other information about each dragon: We know about the mountain range where it lives, color, diet, and whether or not it breathes fire.

Let’s load it in, along with some other packages.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
#change some settings
options(scipen=999) #turn off scientific notation
options(contrasts = c("contr.sum","contr.poly")) #this tweaks the sum-of-squares settings to make sure the output of Anova(model) and summary(model) are consistent and appropriate when a model has interaction terms

# load in packages
library(dplyr) #for data wrangling
library(lme4) #fit the models
library(lmerTest) #gives p-values and more info
library(Rmisc) #summary stats
library(emmeans) #for extracting model estimates
library(ggplot2) #for plotting, duh! 

# load the data
dragons <- read.csv("2020-10-21-dragon-data.csv")


Just as a reminder, let’s take a peak at our dataframe:

1
2
#take a peek at the header
head(dragons)
1
2
3
4
5
6
7
##    testScore bodyLength mountainRange color      diet breathesFire
## 1  0.0000000   175.5122      Bavarian  Blue Carnivore            1
## 2  0.7429138   190.6410      Bavarian  Blue Carnivore            1
## 3  2.5018247   169.7088      Bavarian  Blue Carnivore            1
## 4  3.3804301   188.8472      Bavarian  Blue Carnivore            1
## 5  4.5820954   174.2217      Bavarian  Blue Carnivore            0
## 6 12.4536350   183.0819      Bavarian  Blue Carnivore            1
1
2
#view the full dataset
View(dragons)

We have 2 continuous variables: testScore and bodyLength. The rest of the variables are categorical. This gives us lots of options for creatively visualizing relationships between different variables of interest!




Multi-Layered Plots

I’m going to assume that you have some prior knowledge of R and ggplot2, but not much. Let’s start with a basic plot.

1
ggplot(data = dragons)

If we call ggplot() without specifying any variables, it just generates a blank square! This is actually the bottom layer of every plot. It’s a blank slate, waiting for us to layer data on top.

Let’s say our first research question is whether the color of the dragon is related to its intelligence. What happens when we specify those variables, but don’t add any geoms? We initialize a grid and axes, but don’t actually plot any data.

1
ggplot(data = dragons, aes(x = color, y = breathesFire))


Let’s add a basic bar plot layer. We can also color the bars using the fill argument.

1
2
ggplot(data = dragons, aes(x = color, y = testScore, fill = color)) +
  geom_bar(stat = "summary")


Yikes, we’ve generated a Stroop effect plot. Let’s specify a custom color palette and add that palette as a layer to our plot.

1
2
3
4
5
6
7
8
scale_fill_dragon <- function(...){
      library(scales)
      discrete_scale("fill","dragon",manual_pal(values = c("#3574AC","#AD343C","#FBD774")), ...)
}

ggplot(data = dragons, aes(x = color, y = testScore, fill = color)) +
  scale_fill_dragon() +
  geom_bar(stat = "summary")


Great! That makes more sense. Now, let’s add some error bars for 95% confidence intervals. To do this, we’ll first calculate a small summary dataframe that will give us CIs. If you open this dataframe, you’ll see that you also get SE and SD measures.

1
2
3
4
5
6
dragons_sum <- summarySE(dragons, measurevar = "testScore", groupvars = c("color"))

ggplot(data = dragons, aes(x = color, y = testScore, fill = color)) +
  scale_fill_dragon() +
  geom_bar(stat = "summary") +
  geom_errorbar(data = dragons_sum, aes(ymin=testScore-ci, ymax=testScore+ci), width = 0.4)


But, this bar plot doesn’t tell us much about the distribution of the data, does it? Let’s add some points on top of the bars. We’ll also apply some horizontal jitter and adjust the opacity to make the points visible. We can also make the error bars thicker so they stand out.

1
2
3
4
5
ggplot(data = dragons, aes(x = color, y = testScore, fill = color)) +
  scale_fill_dragon() +
  geom_bar(stat = "summary") +
  geom_errorbar(data = dragons_sum, aes(ymin=testScore-ci, ymax=testScore+ci), width = 0.4, size = 1) +
  geom_jitter(width = 0.1, height = 0, alpha = 0.2)


Now, let’s add layers for axis labels and a title.

1
2
3
4
5
6
7
8
ggplot(data = dragons, aes(x = color, y = testScore, fill = color)) +
  scale_fill_dragon() +
  geom_bar(stat = "summary") +
  geom_errorbar(data = dragons_sum, aes(ymin=testScore-ci, ymax=testScore+ci), width = 0.4, size = 1) +
  geom_jitter(width = 0.1, height = 0, alpha = 0.2) +
  xlab("Dragon Color") +
  ylab("Cognitive Test Score") +
  ggtitle("Test Score by Color")


Finally, just to make it pretty, let’s apply a custom theme that I use for my plots. For the sake of time, we won’t be covering the details of custom themes today, but you can find great tutorials online. These functions allow you to specify all your favorite aesthetic choices, and then easily apply those tweaks to future plots with just one line of code.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
#if you want to access extra fonts, you can run the following lines of code.
#if you don't care during the tutorial, disable them. Loading the fonts (only the first time) can take several minutes if they are not already installed into R Studio.
library(extrafont)
#font_import() #first time only
#loadfonts(device= "win")

#here's my custom plotting theme. It's a function that specifies background colors, text font/size/position, and spacing.
theme_allie <- function(base_size = 14, base_family = "Gill Sans MT"){
  theme_gray(base_size = base_size, base_family = base_family) %+replace%
    theme(
      plot.margin=grid::unit(c(1,1,1,1), "cm"),
      plot.title = element_text(face = "bold", size = rel(1.15), hjust = 0.5, vjust = 5),
      panel.border = element_blank(),
      panel.background = element_rect(fill = "#EAEAF2", colour = "#EAEAF2"), 
      strip.background = element_rect(fill = NA),
      panel.grid.major.y = element_line(size = 1, colour = "white"),
      panel.grid.minor.y = element_blank(),
      panel.grid.major.x = element_blank(),
      panel.grid.minor.x = element_blank()
      )
}

ggplot(data = dragons, aes(x = color, y = testScore, fill = color)) +
  scale_fill_dragon("Dragon Color") +
  geom_bar(stat = "summary") +
  geom_errorbar(data = dragons_sum, aes(ymin=testScore-ci, ymax=testScore+ci), width = 0.4, size = 1) +
  geom_jitter(width = 0.1, height = 0, alpha = 0.2) +
  xlab("Dragon Color") +
  ylab("Cognitive Test Score") +
  ggtitle("Test Score by Color") +
  theme_allie()

1
2
# if we wanted to save the plot in our current working directory, we could run the line below.
#ggsave("dragon_color.png", dpi = 300, width = 4, height = 4, units = "in")




Plotting model estimates


Now that we know how to create beautiful multi-layered plots, let’s learn how to plot the results of our mixed models! The emmeans package is the star of the show here. Short for “estimated marginal means”, it extracts the beta values, CIs, and more from a fitted lmer object. Let’s start with the same question as we started with above – how is color related to intelligence?

Linear mixed model with a categorical predictor

First we have to fit our lmer model. If you’re still new to emmeans, it’s best to save the fitted model as an object so that you can play around with different calls. At the end of the section I’ll show you the fastest way to go from model -> plot.

1
2
model <- lmer(testScore ~ color + (1|mountainRange), data=dragons)
summary(model)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
## Linear mixed model fit by REML. t-tests use Satterthwaite's method [
## lmerModLmerTest]
## Formula: testScore ~ color + (1 | mountainRange)
##    Data: dragons
## 
## REML criterion at convergence: 3977.8
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -3.4828 -0.6343 -0.0026  0.6986  3.0139 
## 
## Random effects:
##  Groups        Name        Variance Std.Dev.
##  mountainRange (Intercept) 349.1    18.68   
##  Residual                  220.9    14.86   
## Number of obs: 480, groups:  mountainRange, 8
## 
## Fixed effects:
##             Estimate Std. Error       df t value Pr(>|t|)    
## (Intercept)  50.3860     6.6407   7.0000   7.587 0.000128 ***
## color1       -0.3499     0.9594 470.0000  -0.365 0.715490    
## color2       -2.0290     0.9594 470.0000  -2.115 0.034970 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Correlation of Fixed Effects:
##        (Intr) color1
## color1  0.000       
## color2  0.000 -0.500


The results are not too exciting, but let’s make a dataframe with the marginal means anyway.

1
2
model_df <- emmeans(model, ~ color) %>% as.data.frame()
head(model_df)
1
2
3
4
##    color   emmean       SE       df lower.CL upper.CL
## 1   Blue 50.03612 6.709633 7.295226 34.29959 65.77265
## 2    Red 48.35704 6.709633 7.295226 32.62051 64.09356
## 3 Yellow 52.76494 6.709633 7.295226 37.02841 68.50147


Sweet! After converting the emGrid object to a dataframe, we now have estimates of test score as a function of color (the column “emmean”), as well as standard error, degrees of freedom, and upper/lower CIs. Let’s plot these using Allie’s custom theme:

1
2
3
4
5
6
7
8
ggplot(model_df, aes(x = color, y = emmean, fill = color)) +
  scale_fill_dragon("Dragon Color") +
  geom_col() +
  geom_errorbar(aes(ymin=lower.CL, ymax=upper.CL), width = 0.4, size = 1) +
  xlab("Dragon Color") +
  ylab("Cognitive Test Score") +
  ggtitle("Test Score by Color") +
  theme_allie()

The biggest difference between plotting in the emmeans-verse and in the way Allie showed above is that your y variable will almost always be “emmean” and you don’t have to compute confidence intervals yourself.

But if you want to add information about the raw data (like the jitter above), then you’ll need to add a layer that takes data from the source dataframe. Note that you only need to specify the dimension on which your variable is different from the one in the main ggplot call – i.e., I only had to specify y=testScore in the geom_jitter() call.

1
2
3
4
5
6
7
8
9
ggplot(model_df, aes(x = color, y = emmean, fill = color)) +
  scale_fill_dragon("Dragon Color") +
  geom_col() +
  geom_errorbar(aes(ymin=lower.CL, ymax=upper.CL), width = 0.4, size = 1) +
  geom_jitter(data=dragons, aes(y=testScore), width = 0.1, height = 0, alpha = 0.2) +
  xlab("Dragon Color") +
  ylab("Cognitive Test Score") +
  ggtitle("Test Score by Color") +
  theme_allie()


Another way to convey this information is with a violin plot, where the estimate is plotted with geom_pointrange() and raw data as violin widths:

1
2
3
4
5
6
7
8
ggplot(model_df, aes(x = color, y = emmean, fill = color)) +
  scale_fill_dragon("Dragon Color") +
  geom_violin(data=dragons, aes(y=testScore)) +
  geom_pointrange(aes(ymin=lower.CL, ymax=upper.CL), size = 0.85) +
  xlab("Dragon Color") +
  ylab("Cognitive Test Score") +
  ggtitle("Test Score by Color") +
  theme_allie()


If you want to adjust the amount of smoothing in the violin (i.e., how lumpy it is), you do that with the bw argument. Because bw sets the standard deviation of the smoothing kernel for the distributions, the magnitude of the bw value needed to achieve your desired results will vary as a function of your data.

1
2
3
4
5
6
7
8
ggplot(model_df, aes(x = color, y = emmean, fill = color)) +
  scale_fill_dragon("Dragon Color") +
  geom_violin(data=dragons, aes(y=testScore), bw=4.5) +
  geom_pointrange(aes(ymin=lower.CL, ymax=upper.CL), size = 0.85) +
  xlab("Dragon Color") +
  ylab("Cognitive Test Score") +
  ggtitle("Test Score by Color") +
  theme_allie()


Linear mixed model with an interaction between categorical predictors

So far, so good. But what if we have a model with an interaction term? Let’s fit one and find out!

1
2
model2 <- lmer(testScore ~ color*diet + (1|mountainRange), data=dragons)
summary(model2)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
## Linear mixed model fit by REML. t-tests use Satterthwaite's method [
## lmerModLmerTest]
## Formula: testScore ~ color * diet + (1 | mountainRange)
##    Data: dragons
## 
## REML criterion at convergence: 3671.2
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -3.0978 -0.5708 -0.0243  0.5190  4.6698 
## 
## Random effects:
##  Groups        Name        Variance Std.Dev.
##  mountainRange (Intercept)  82.64    9.091  
##  Residual                  121.35   11.016  
## Number of obs: 480, groups:  mountainRange, 8
## 
## Fixed effects:
##               Estimate Std. Error        df t value             Pr(>|t|)    
## (Intercept)   49.98257    3.25369   6.74874  15.362           0.00000168 ***
## color1        -0.46020    0.71397 463.78929  -0.645                0.520    
## color2        -0.42132    0.71766 464.03951  -0.587                0.557    
## diet1        -16.25262    0.92164 470.83945 -17.634 < 0.0000000000000002 ***
## diet2         17.15942    0.89337 470.92818  19.208 < 0.0000000000000002 ***
## color1:diet1  -0.19162    1.02754 463.93696  -0.186                0.852    
## color2:diet1  -0.02199    1.00207 464.03628  -0.022                0.983    
## color1:diet2  -0.70805    1.01244 464.39095  -0.699                0.485    
## color2:diet2   0.42549    1.02280 464.32294   0.416                0.678    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Correlation of Fixed Effects:
##             (Intr) color1 color2 diet1  diet2  clr1:1 clr2:1 clr1:2
## color1      -0.001                                                 
## color2       0.000 -0.497                                          
## diet1        0.005  0.027 -0.120                                   
## diet2       -0.006  0.007  0.084 -0.663                            
## color1:dit1  0.004  0.041  0.035  0.012 -0.021                     
## color2:dit1 -0.015  0.036 -0.042 -0.051  0.018 -0.473              
## color1:dit2  0.003 -0.021 -0.036  0.032 -0.042 -0.511  0.246       
## color2:dit2  0.007 -0.029  0.008 -0.014  0.074  0.243 -0.482 -0.520


Whoa! Major effects of diet on test scores. Converting the model estimates to a dataframe is as simple as the following command:

1
2
3
4
5
6
7
model2_df <- emmeans(model2, ~ color | diet) %>% as.data.frame()

ggplot(model2_df, aes(x=diet, y=emmean, fill=color, color=color)) +
  scale_fill_dragon("Dragon Color") +
  geom_pointrange(aes(ymin=lower.CL, ymax=upper.CL), position=position_dodge(0.5)) +
  labs(x="Diet", y="Cognitive Test Score", color="Dragon color", fill="Dragon color") +
  theme_allie()

Three things to note: first, we indicate that we want to extract marginal means for the main terms and interactions with the | operator (did you know it’s called a ‘pipe’??? TIL!). The * operator will do exactly the same thing, and it has the added benefit of being more similar to lmer syntax.

Second, we made another Stroop plot! The colors of the geoms don’t match the color names, and we no longer have Allie’s beautiful colors. Unfortunately, this shows the limits of custom color palettes – since they’re hard-coded, they need to have enough colors to cover all the different combinations of your discrete variables. So if you wanted to make an interaction plot like this with the same colors as Allie’s, you’d need to make another palette that has 9 values instead of 3.

Third, we have two legends! One where the geoms are colored and another where they are black. This highlights the difference between the color and fill arguments. Color is used to set the color of line-based objects, like the outline of a bar, violin, ribbon, etc. It’s also used to set the color of points. Given that our geom is a combination of point & line, it’s only going to be modifiable by color arguments. Fill is used to “fill in” empty spaces, like the inside of a column, ribbon, or violin. Since we don’t have any geoms that take the fill argument in this plot, then it just shows up as black. Getting rid of fill=color and scale_fill_dragon() will get rid of the superfluous legend.


Generalized linear mixed model & continuous predictor variables

What if you want to plot the results of a generalized linear model? And what if you have a continuous predictor variable? Let’s investigate the relationship between color, bodyLength, and breathesFire to find out:

1
2
3
model3 <- glmer(breathesFire ~ color*bodyLength + (1|mountainRange), data=dragons, family=binomial(link = "logit"))
model3_df <- emmeans(model3, ~ color*bodyLength, type='response') %>% as.data.frame()
head(model3_df)
1
2
3
4
##    color bodyLength       prob         SE  df  asymp.LCL asymp.UCL
## 1   Blue   201.3165 0.90383086 0.02537801 Inf 0.84134741 0.9433625
## 2    Red   201.3165 0.52274341 0.05436631 Inf 0.41676412 0.6267137
## 3 Yellow   201.3165 0.06265896 0.02400686 Inf 0.02912871 0.1296326
1
2
# to confirm that we only get point estimates for each categorical predictor
length(model3_df)
1
## [1] 7

First thing to note: I added the argument type='response' to the emmeans call. This ensures that our estimates are back-transformed to the same scale as the original variable, instead of being on the log-odds scale. If you’re using only categorical predictors, then this is the only emmeans modification you have to worry about.

Second thing to note: the names of our variables in the dataframe are different from those in the lmer() dataframe! Instead of emmeans, we have prob; degrees of freedom are now infinite; and the CIs are now values of the asymptote (asymp.LCL, asymp.UCL) instead of the fixed upper and lower bounds (lower.CL, upper.CL). These differences all have to do with the logit link, but all you need to know in terms of plotting is that you’ll have to adjust your variable names accordingly.

Finally, because we have a continuous predictor, there is a third thing to note: we only got point estimates for the effect of color at the mean value of bodyLength. But that’s not how we think about or want to plot the effect of a continuous predictor! Thus, we need to specify the range of bodyLength values that we want emmeans to return estimates from. There are a number of ways to do this, and the “best” way is contingent on your research question and/or desired visualization. And fortunately, this method remains the same regardless of your model type (i.e., if it’s lmer() or glmer()).

One approach is to get estimates from the lowest and highest values (i.e., the range) of bodyLength. Here’s how you do that:

1
2
3
# extracting estimates at the extremes and mean values of bodyLength
model3_df <- emmeans(model3, ~ color*bodyLength, type='response', cov.reduce=range) %>% as.data.frame()
head(model3_df)
1
2
3
4
5
6
7
##    color bodyLength       prob         SE  df   asymp.LCL asymp.UCL
## 1   Blue   162.3266 0.90822227 0.06654721 Inf 0.674224946 0.9793036
## 2    Red   162.3266 0.81107636 0.07497809 Inf 0.621985544 0.9180434
## 3 Yellow   162.3266 0.01460516 0.01656817 Inf 0.001549897 0.1239744
## 4   Blue   236.3625 0.89971296 0.06519867 Inf 0.685204776 0.9736679
## 5    Red   236.3625 0.24283279 0.10726832 Inf 0.092736596 0.5015619
## 6 Yellow   236.3625 0.20571388 0.09259853 Inf 0.078587479 0.4402323
1
2
3
4
5
6
7
8
# plot those estimates
ggplot(model3_df, aes(x=bodyLength, y=prob,fill=color)) +
  geom_ribbon(aes(ymin=asymp.LCL, ymax=asymp.UCL), alpha=0.3) + 
  geom_line(aes(color=color)) +
  scale_fill_dragon("Dragon Color") +
  scale_color_manual(values=c("#3574AC","#AD343C","#FBD774")) +
  labs(y='Probability of breathing fire', x='Body Length', color='Dragon Color', title='Effects of color and body length on fire-breathing') +
  theme_allie()


But if you want a more detailed representation of the effect, you can write a quick function that tells emmeans() to extract estimates along the whole range of bodyLength values in X-sized increments:

1
2
3
4
5
6
7
8
9
10
11
# write a function to extract values of bodyLength in "precision"-sized increments
bodyLength_range <- function(bodyLength, precision=5) {
  return(seq(min(dragons[,bodyLength], na.rm=TRUE),
             max(dragons[,bodyLength], na.rm=TRUE),
             precision))
}

# tell emmeans to extract estimates at values of bodyLength specified by the function bodyLength_range
model3_df <- emmeans(model3, ~ color*bodyLength, type='response', at=list(bodyLength=bodyLength_range('bodyLength'))) %>% as.data.frame()

head(model3_df)
1
2
3
4
5
6
7
##    color bodyLength       prob         SE  df   asymp.LCL asymp.UCL
## 1   Blue   162.3266 0.90822227 0.06654721 Inf 0.674224946 0.9793036
## 2    Red   162.3266 0.81107636 0.07497809 Inf 0.621985544 0.9180434
## 3 Yellow   162.3266 0.01460516 0.01656817 Inf 0.001549897 0.1239744
## 4   Blue   167.3266 0.90766865 0.05958259 Inf 0.709312449 0.9753722
## 5    Red   167.3266 0.78275913 0.07300889 Inf 0.608334845 0.8931490
## 6 Yellow   167.3266 0.01766287 0.01818495 Inf 0.002299758 0.1230034
1
2
3
4
5
6
7
8
# let's see the difference this makes in the plot! 
ggplot(model3_df, aes(x=bodyLength, y=prob,fill=color)) +
  geom_ribbon(aes(ymin=asymp.LCL, ymax=asymp.UCL), alpha=0.3) + 
  geom_line(aes(color=color)) +
  scale_fill_dragon("Dragon Color") +
  scale_color_manual(values=c("#3574AC","#AD343C","#FBD774")) +
  labs(y='Probability of breathing fire', x='Body Length', color='Dragon Color',title='Effects of color and body length on fire-breathing') +
  theme_allie()

Cool! The trends are still there, but now we have a more detailed depiction of how the probability of fire-breathing changes as a function of body length.

emmeans is an immensely useful companion package for lme4 and lmerTest. The vignettes on the page linked here are very helpful, and can be quite illuminating about all the things you can do with fitted mixed models. I highly recommend playing around with it to build up your knowledge about mixed models!

And once you get familiar with emmeans & lmer, you can take the fast-track to plotting model estimates:

1
2
3
4
5
6
7
8
9
10
glmer(breathesFire ~ color*bodyLength + (1|mountainRange), data=dragons, family=binomial(link = "logit")) %>%
  emmeans(., ~ color*bodyLength, type='response', at=list(bodyLength=bodyLength_range('bodyLength'))) %>%
  as.data.frame() %>%
  ggplot(., aes(x=bodyLength, y=prob,fill=color)) +
  geom_ribbon(aes(ymin=asymp.LCL, ymax=asymp.UCL), alpha=0.3) + 
  geom_line(aes(color=color)) +
  scale_fill_dragon("Dragon Color") +
  scale_color_manual(values=c("#3574AC","#AD343C","#FBD774")) +
  labs(y='Probability of breathing fire', x='Body Length', color='Dragon Color',title='Effects of color and body length on fire-breathing') +
  theme_allie()




Customizing your plots


Perhaps my favorite thing about ggplot2 is how much flexibility and control you have over your plots. Since I don’t know what you don’t know, this section is going to be a compilation of aesthetic tweaks that I’ve had to make in my very short research career. If there’s a specific thing you want to know how to do, let me know and I’ll update the post!


Creating a custom color palette


Like all things programming, there are multiple ways to do this. Allie demonstrated one method above that I’ll reproduce here:

1
2
3
4
scale_fill_dragon <- function(...){
      library(scales)
      discrete_scale("fill","dragon",manual_pal(values = c("#3574AC","#AD343C","#FBD774")), ...)
}


I had no idea you could make a custom palette like this! I’ve always used the paletti package to do this. All you do is specify a vector of HEX codes and pass it to some nested arguments from the packge:

1
2
3
4
5
6
# load the package
library(paletti)
# specify the colors
dragon_colors <- c("#3574AC","#AD343C","#FBD774")
# preview the palette
viz_palette(dragon_colors)

1
2
3
# create the palette
dragon_fill <- get_scale_fill(get_pal(dragon_colors))
dragon_color <- get_scale_color(get_pal(dragon_colors))


Custom palettes are great for a lot of reasons, but they can take a lot of work to hone. Sourcing colors for palettes is one of my favoite forms of procrastination, but if you just want to get things done, here is an excellent resource canvassing color palettes available to R users. I don’t know if it’s completely comprehensive, but it’s pretty darn close. It also has some great resources for generating HEX codes if you are interested in making a plotting theme but have no idea where to start.

Modifying axes


Let’s take a look at one of our glmer() plots. You’ll notice that I’m saving this ggplot call to the variable p. I’m doing this to take advantage of the syntax and demonstrate that you can modify plots even after they’ve been saved. That’s one of the many beautiful things about ggplots: you can just keep adding layers until you get the plot of your dreams. [am i the only one who dreams about beautiful plots?]

1
2
3
4
5
6
7
8
9
p <- ggplot(model3_df, aes(x=bodyLength, y=prob,fill=color)) +
  geom_ribbon(aes(ymin=asymp.LCL, ymax=asymp.UCL), alpha=0.3) + 
  geom_line(aes(color=color)) +
  scale_fill_dragon("Dragon Color") +
  scale_color_manual(values=c("#3574AC","#AD343C","#FBD774")) +
  labs(y='Probability of breathing fire', x='Body Length', color='Dragon Color', title='Effects of color and body length on fire-breathing') +
  theme_allie()

p


It’s beautiful, right? But what if you don’t want that extra gray space around the edges of the x-axis? That’s where the expand argument of scale_x_continuous comes in. This allows you to set the “range expansion constants” the ggplot uses to add padding around the axes. For continuous variables, like we have, the default is to expand the scale by 5% on each side. For discrete variables (scale_x_discrete), the default is 0.6 units.

Based on my tinkering, it seems like small increases in the first or third value of expand will result in big changes at the leftmost and rightmost points on the x-axis, respectively. Small increases to the second or fourth values will results in small changes at the leftmost and rightmost points on the x-axis. There is definitely a better explanation for this (probably having to do with y-values), but I’ll leave that for you to discover! For now, I’ll set all the values except the last to 0 so you can get an idea of how this shifts the plotting space:

1
2
3
4
5
6
7
8
ggplot(model3_df, aes(x=bodyLength, y=prob,fill=color)) +
  geom_ribbon(aes(ymin=asymp.LCL, ymax=asymp.UCL), alpha=0.3) + 
  geom_line(aes(color=color)) +
  dragon_fill() +
  dragon_color() +
  scale_x_continuous(expand = c(0, 0, 0, 0.5)) +
  labs(y='Probability of breathing fire', x='Body Length', color='Dragon Color', fill='Dragon Color', title='Effects of color and body length on fire-breathing') +
  theme_allie()


If you wanted to change the breaks on your x-axis, change the labels of the breaks, or adjust the limits, you can do that all with different arguments inside the scale_x_continuous call.


Have you ever wondered how people get their x-axis labels on a slant? Here’s how:
1
p + theme(axis.text.x = element_text(angle = 45, vjust = 0, hjust=0))

angle adjusts the angle (setting it to 90 would make the labels perpendicular to the plot), and vjust and hjust move the labels along the vertical and horizontal dimensions.


Themes

One of the most powerful ways to customize your plots beyond the ggplot defaults is to play with different themes. Allie built her own (which is next-level), but there are many really nice and elegant themes that can really take your plot to the next level. Some are built into ggplot (like theme_minimal() or theme_classic), but others come from devoted packages. One of my favorites is theme_pubr(). Let’s see how it changes our glmer() plot:

1
2
library(ggpubr)
p + theme_pubr()

If you don’t like the legend on top, you can adjust that very simply:

1
p + theme_pubr(legend='right')

To adjust the position of your legend, or really anything about the plotting space that doesn’t have to do with geoms (so things like axes, backgrounds, text size, etc.), you’ll pass a command to the theme() function of ggplot. Something I can’t recommend highly enough is taking time to learn about and play around with all the different arguments that can be passed to theme(). That’s where the vast majority of customization comes in!

An important note: the layering of arguments within a ggplot object does indeed matter. Geoms get stacked on top of each other based on the order they’re listed. So if you want to overlay a jitter on top of a bar plot, you need to make sure geom_jitter comes after geom_col. Likewise with theme arguments that aren’t included in the base ggplot2 package. If you specify, for example, theme_pubr(legend='right'), but then add legend.position='center' in your theme() call that comes after the theme_pubr() call, then your legend is going to be placed in the location of the most recent call; in this example, it would be centered. I’ve run into this with things like font styles before, and endured a fair share of head-bashing before realizing what I was doing. Please learn from my mistakes!

Writing a plotting function


Let’s say you have a plot style that you want to retain for a number of different plots. Instead of copying and pasting the same lines of code, you can simply write a function that defines all your plot parameters and use that to generate a bunch of plots. Let’s step out of the emmeans()-verse for a minute and look at how testScore and bodyLength are related to color:

1
2
3
4
5
6
7
8
9
10
11
dragon_plot <- function(yvar, title) {
  ggplot(dragons, aes_(x=~color, y=yvar, fill=~color)) +
  geom_bar(stat = "summary") +
  geom_jitter(width = 0.1, height = 0, alpha = 0.2) +
  ggtitle(title) +
  scale_fill_dragon() +
  theme_allie()
}

plot_bodyLength <- dragon_plot(~bodyLength, 'Body length by color') 
plot_testScore <- dragon_plot(~testScore, 'Test score by color') 

The important thing to note here is that you use aes_ instead of the regular aes call, that the variables that you want to stay constant across plots are preceded by a ~ when you define the function, and that the variables you want to sub in are preceded by a ~ when you call the function.

During the workshop, Kevin noted that you can use aes_string() instead of aes_ and then substitute all the ~s for character strings.


Patchwork!


Let’s make these two plots into one figure using the awesome package patchwork. This is another package I can’t recommend enough. It uses the syntax of ggplot2 to allow you to combine multiple plots into one figure. I won’t go into details here, but you should definitely check out the (beautiful and easy to use) website to learn more.

1
2
library(patchwork)
plot_bodyLength + plot_testScore + plot_layout(guides='collect')

1
2
# and if we wanted to export this figure as one single image, we would run:
#ggsave('filepath', dpi='retina')




Spaghetti Plots


Next, let’s learn how to create spaghetti plots that connect the dots for within-subjects measures. Unfortunately, we only measured our dragons once (after all, they are hard to catch). So we’ll quickly load one of Allie’s real datasets to look at some repeated measures.

1
2
3
4
5
6
7
8
9
10
PE_recon <- read.csv("PE_recon.csv", fileEncoding="UTF-8-BOM")
PE_recon_sum <- summarySE(PE_recon, measurevar = "Errors_mean", groupvars = c("Group", "Type"))

ggplot(PE_recon, aes(x = Type, y = Errors_mean)) +
  theme_allie() +
  geom_point(aes(group=ID), color="#2b2e33", alpha = 0.25, size = 1.5) +
  geom_line(aes(group=ID), color="#2b2e33", alpha = 0.25) +
  xlab("Reactivation Type") +
  ylab("Average False Memories per Video") +
  ggtitle("False Memories")


The “group=ID” argument is doing the magic here. Try removing it from the geom_point and geom_line layers, and see what happens. Not pretty.
But wait! We actually have another grouping variable in this dataset. This was a 2x2 design, where we had a between-subjects contrast and a within-subjects contrast. Let’s create separate panels for our between-subjects variable, Group.

1
2
3
4
5
6
7
8
ggplot(PE_recon, aes(x = Type, y = Errors_mean)) +
  theme_allie() +
  geom_point(aes(group=ID), color="#2b2e33", alpha = 0.25, size = 1.5) +
  geom_line(aes(group=ID), color="#2b2e33", alpha = 0.25) +
  xlab("Reactivation Type") +
  ylab("Average False Memories per Video") +
  ggtitle("False Memories") +
  facet_wrap(.~Group)


Much more informative, but this plot is still very bland. Let’s add some bars to display the average for each condition, and give those bars some color. We can add errorbars while we’re at it.
Note that we are putting the geoms for the new bars BEFORE the geoms for the points and lines. What happens if we move these geoms around? Try changing the order to see how it affects the plot.

1
2
3
4
5
6
7
8
9
10
11
12
ggplot(PE_recon, aes(x = Type, y = Errors_mean, fill = Type)) +
  theme_allie() +
  scale_fill_dragon(name = "Reactivation Type") +
  scale_x_discrete(name = "Reactivation Type") +
  geom_bar(stat = "summary", fun.y = "mean", na.action=na.omit) +
  geom_errorbar(data = PE_recon_sum, aes(ymin = Errors_mean - ci, ymax = Errors_mean + ci), width=.2, size = 0.75, position=position_dodge(.9)) +
    xlab("Reactivation Type") +
  geom_point(aes(group=ID), color="#2b2e33", alpha = 0.25, size = 1.5) +
  geom_line(aes(group=ID), color="#2b2e33", alpha = 0.25) +
  ylab("Average False Memories per Video") +
  ggtitle("False Memories") +
  facet_wrap(. ~ Group)




Raincloud Plots


In the final section, we’ll use our new layering skills to create a raincloud plot! These are pretty plots that display the distribution of your data, as well as the summary metrics for comparison.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
#we'll borrow this function that allows us to plot half violin plots
source("https://raw.githubusercontent.com/datavizpyr/data/master/half_flat_violinplot.R", local = knitr::knit_global())

#this is the same as the palette we generated earlier, but we need to specify it for COLOR instead of FILL (allowing us to apply it to dots in addition to bars)
scale_color_dragon <- function(...){
      library(scales)
      discrete_scale("color","dragon",manual_pal(values = c("#3574AC","#AD343C","#FBD774")), ...)
}

#think through what each of these layers adds to the plot. try turning them off and on!
#try enabling or disabling coord_flip() to change the orientation.

ggplot(data = PE_recon, aes(x = Group, y = Errors_mean, fill = Group, color = Group)) +
theme_allie() +
geom_flat_violin(position = position_nudge(x = .2, y = 0), alpha = .8, width = 0.7) +
geom_jitter(width = 0.1, height = 0, size = 2, alpha = 0.5) +
geom_boxplot(width = .1, alpha = 0.75, position = position_nudge(x = -.2, y = 0), outlier.shape = NA) +
scale_fill_dragon() +
scale_color_dragon() +
guides(fill = FALSE) +
guides(color = FALSE) +
#coord_flip() +
xlab("False Memories") +
ylab("Group") +
ggtitle("It's Raining Data!")

If we call ggplot() without specifying any variables, it just generates a blank square! This is actually the bottom layer of every plot. It’s a blank slate, waiting for us to layer data on top.

Let’s say our first research question is whether the color of the dragon is related to its intelligence. What happens when we specify those variables, but don’t add any geoms? We initialize a grid and axes, but don’t actually plot any data.

1
ggplot(data = dragons, aes(x = color, y = breathesFire))







1
2
# if we wanted to save the plot in our current working directory, we could run the line below.
#ggsave("dragon_color.png", dpi = 300, width = 4, height = 4, units = "in")


The biggest difference between plotting in the emmeans-verse and in the way Allie showed above is that your y variable will almost always be “emmean” and you don’t have to compute confidence intervals yourself.

But if you want to add information about the raw data (like the jitter above), then you’ll need to add a layer that takes data from the source dataframe. Note that you only need to specify the dimension on which your variable is different from the one in the main ggplot call – i.e., I only had to specify y=testScore in the geom_jitter() call.

1
2
3
4
5
6
7
8
9
ggplot(model_df, aes(x = color, y = emmean, fill = color)) +
  scale_fill_dragon("Dragon Color") +
  geom_col() +
  geom_errorbar(aes(ymin=lower.CL, ymax=upper.CL), width = 0.4, size = 1) +
  geom_jitter(data=dragons, aes(y=testScore), width = 0.1, height = 0, alpha = 0.2) +
  xlab("Dragon Color") +
  ylab("Cognitive Test Score") +
  ggtitle("Test Score by Color") +
  theme_allie()




Three things to note: first, we indicate that we want to extract marginal means for the main terms and interactions with the | operator (did you know it’s called a ‘pipe’??? TIL!). The * operator will do exactly the same thing, and it has the added benefit of being more similar to lmer syntax.

Second, we made another Stroop plot! The colors of the geoms don’t match the color names, and we no longer have Allie’s beautiful colors. Unfortunately, this shows the limits of custom color palettes – since they’re hard-coded, they need to have enough colors to cover all the different combinations of your discrete variables. So if you wanted to make an interaction plot like this with the same colors as Allie’s, you’d need to make another palette that has 9 values instead of 3.

Third, we have two legends! One where the geoms are colored and another where they are black. This highlights the difference between the color and fill arguments. Color is used to set the color of line-based objects, like the outline of a bar, violin, ribbon, etc. It’s also used to set the color of points. Given that our geom is a combination of point & line, it’s only going to be modifiable by color arguments. Fill is used to “fill in” empty spaces, like the inside of a column, ribbon, or violin. Since we don’t have any geoms that take the fill argument in this plot, then it just shows up as black. Getting rid of fill=color and scale_fill_dragon() will get rid of the superfluous legend.



Cool! The trends are still there, but now we have a more detailed depiction of how the probability of fire-breathing changes as a function of body length.

emmeans is an immensely useful companion package for lme4 and lmerTest. The vignettes on the page linked here are very helpful, and can be quite illuminating about all the things you can do with fitted mixed models. I highly recommend playing around with it to build up your knowledge about mixed models!

And once you get familiar with emmeans & lmer, you can take the fast-track to plotting model estimates:

1
2
3
4
5
6
7
8
9
10
glmer(breathesFire ~ color*bodyLength + (1|mountainRange), data=dragons, family=binomial(link = "logit")) %>%
  emmeans(., ~ color*bodyLength, type='response', at=list(bodyLength=bodyLength_range('bodyLength'))) %>%
  as.data.frame() %>%
  ggplot(., aes(x=bodyLength, y=prob,fill=color)) +
  geom_ribbon(aes(ymin=asymp.LCL, ymax=asymp.UCL), alpha=0.3) + 
  geom_line(aes(color=color)) +
  scale_fill_dragon("Dragon Color") +
  scale_color_manual(values=c("#3574AC","#AD343C","#FBD774")) +
  labs(y='Probability of breathing fire', x='Body Length', color='Dragon Color',title='Effects of color and body length on fire-breathing') +
  theme_allie()


1
2
3
# create the palette
dragon_fill <- get_scale_fill(get_pal(dragon_colors))
dragon_color <- get_scale_color(get_pal(dragon_colors))




angle adjusts the angle (setting it to 90 would make the labels perpendicular to the plot), and vjust and hjust move the labels along the vertical and horizontal dimensions.


If you don’t like the legend on top, you can adjust that very simply:

1
p + theme_pubr(legend='right')

To adjust the position of your legend, or really anything about the plotting space that doesn’t have to do with geoms (so things like axes, backgrounds, text size, etc.), you’ll pass a command to the theme() function of ggplot. Something I can’t recommend highly enough is taking time to learn about and play around with all the different arguments that can be passed to theme(). That’s where the vast majority of customization comes in!

An important note: the layering of arguments within a ggplot object does indeed matter. Geoms get stacked on top of each other based on the order they’re listed. So if you want to overlay a jitter on top of a bar plot, you need to make sure geom_jitter comes after geom_col. Likewise with theme arguments that aren’t included in the base ggplot2 package. If you specify, for example, theme_pubr(legend='right'), but then add legend.position='center' in your theme() call that comes after the theme_pubr() call, then your legend is going to be placed in the location of the most recent call; in this example, it would be centered. I’ve run into this with things like font styles before, and endured a fair share of head-bashing before realizing what I was doing. Please learn from my mistakes!

Writing a plotting function


1
2
# and if we wanted to export this figure as one single image, we would run:
#ggsave('filepath', dpi='retina')





-->