When building plots with categorical data, ggplot orders categories alphabetically by default, which may not make sense for your data. (For example, in a barplot showing days of the week on the x-axis, alphabetical order is not the correct sequence.) The forcats package provides a family of functions that help with this type of problem.
Make a plot comparing how many penguins of each species are in the palmerpenguins dataset. Notice that the bars are ordered alphabetically, when perhaps you would rather have them ordered by n, the number of penguins of each species.
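library(tidyverse)       # loads ggplot2, dplyr, forcats, tibble, and the %>% pipe
library(palmerpenguins)  # provides the penguins dataset
penguins %>%
  count(species) %>%
  ggplot(aes(x = species, y = n)) +
  geom_col()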
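The alphabetical default can be seen without drawing a plot. The sketch below uses base R only and a made-up days vector (not part of the penguins data). It assumes that ggplot places a character column on a discrete axis in the same order that factor() gives its levels, which matches the behavior described above.

```r
# Hypothetical character vector, listed in calendar order
days <- c("Mon", "Tue", "Wed", "Thu", "Fri")

# factor() sorts the unique values, so the calendar order is lost
alphabetical_levels <- levels(factor(days))
print(alphabetical_levels)
```

The level order, not the order of the rows, is what decides where each bar appears.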
If you wish to order a categorical variable (species) by some quantitative variable (n), fct_reorder() is a useful option. Generally the syntax is as follows:
fct_reorder(.f = categorical_variable_to_be_ordered, .x = quantitative_variable_to_order_other_variable_by)
To change the “factor levels” of the data, use this function inside of mutate:
penguins %>%
  count(species) %>%
  mutate(species_ordered = fct_reorder(.f = species, .x = n)) %>%
  ggplot(aes(x = species_ordered, y = n)) +
  geom_col()
Note the change in the order of the bars in the graph produced by the code above. To order the bars in descending rather than ascending (numeric) order, set the .desc argument to TRUE in fct_reorder():
penguins %>%
  count(species) %>%
  mutate(species_ordered = fct_reorder(.f = species, .x = n, .desc = TRUE)) %>%
  ggplot(aes(x = species_ordered, y = n)) +
  geom_col()
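To check what fct_reorder() changes without looking at a plot, you can inspect the factor levels directly. The following is a minimal sketch that needs only forcats. The species and n vectors are typed in by hand as a stand-in for the output of count(species), so treat the numbers as illustrative.

```r
library(forcats)

# Hand-typed stand-in for the result of count(species); illustrative values
species <- factor(c("Adelie", "Chinstrap", "Gentoo"))
n <- c(152, 68, 124)

# Default levels are alphabetical
default_levels <- levels(species)

# Ascending: levels run from the smallest n to the largest
ascending_levels <- levels(fct_reorder(.f = species, .x = n))

# Descending: levels run from the largest n to the smallest
descending_levels <- levels(fct_reorder(.f = species, .x = n, .desc = TRUE))

print(default_levels)
print(ascending_levels)
print(descending_levels)
```

The values stored in the column do not change. Only the level order is different, and that is what ggplot uses to position the bars.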
Boxplots can be trickier because the dataset may not contain an explicit variable that fct_reorder can use as a reference. Instead, you can use fct_reorder to order boxplots by a function, generally a summary function applied to your y-variable. (A few examples include max, min, median, and mean.)
For example, start with a plot featuring multiple boxplots (one for each species), where the boxplots are ordered alphabetically:
penguins %>%
  ggplot(aes(x = species, y = bill_length_mm)) +
  geom_boxplot()
For contrast, create a plot with multiple boxplots ordered by the median value of bill_length_mm.
The code below uses the .fun argument to reorder by median. (Note that the function is written as median and not as median().) You may notice the specification na.rm = T; it is passed along to median so that missing values are removed before the median is calculated. Without it, median returns NA for any species that has a missing bill_length_mm value, and the reordering may not work properly.
penguins %>%
  mutate(species = fct_reorder(.f = species, .x = bill_length_mm, .fun = median, na.rm = T)) %>%
  ggplot(aes(x = species, y = bill_length_mm)) +
  geom_boxplot()
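The effect of .fun and na.rm can be checked on a small example. The sketch below uses made-up group and value vectors (hypothetical data, not the penguins dataset) and compares the levels from fct_reorder() with the per-group medians computed by tapply(). Depending on your forcats version, you may also see a warning about missing values being removed.

```r
library(forcats)

# Hypothetical data: three groups, one missing value in group "a"
group <- factor(c("a", "a", "a", "b", "b", "b", "c", "c", "c"))
value <- c(10, 12, NA, 1, 2, 3, 5, 6, 7)

# Without na.rm, the median of any group containing an NA is itself NA
medians_with_na <- tapply(value, group, median)

# With na.rm = TRUE, every group gets a usable median
medians <- tapply(value, group, median, na.rm = TRUE)

# fct_reorder() passes na.rm = TRUE on to median()
reordered_levels <- levels(
  fct_reorder(.f = group, .x = value, .fun = median, na.rm = TRUE)
)

# Group names sorted by their median, for comparison
expected_levels <- names(medians)[order(medians)]

print(medians_with_na)
print(medians)
print(reordered_levels)
```

The reordered levels follow the medians from smallest to largest, which is the order the boxplots take on the x-axis.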
Similar to the earlier example, you can put groups in descending order using .desc:
penguins %>%
  mutate(species = fct_reorder(.f = species, .x = bill_length_mm, .fun = median, na.rm = T, .desc = T)) %>%
  ggplot(aes(x = species, y = bill_length_mm)) +
  geom_boxplot()
For a different example, create a new demo dataset called days_of_week, showing the hours worked on each day:
days_of_week <- tibble(
  days = c('Fri', 'Sat', 'Sun', 'Mon', 'Tue', 'Wed', 'Thu'),
  hours_worked = seq(20, 50, 5)
)
days_of_week
## # A tibble: 7 x 2
##   days  hours_worked
##   <chr>        <dbl>
## 1 Fri             20
## 2 Sat             25
## 3 Sun             30
## 4 Mon             35
## 5 Tue             40
## 6 Wed             45
## 7 Thu             50
Plotting this data using a barplot yields:
days_of_week %>%
  ggplot(aes(x = days, y = hours_worked)) +
  geom_col()
The columns are automatically sorted alphabetically, but days of the week are ordinal data, meaning that the order of the values is important. Here fct_reorder is not the correct tool because the order is defined not by another variable but by external information (in this case, knowledge of a calendar).
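To illustrate why a data-driven order is the wrong fit here, the sketch below reorders the days by a made-up hours vector. The hypothetical_hours values are invented for this example and are not the hours_worked column from the demo dataset.

```r
library(forcats)

days <- factor(c("Fri", "Sat", "Sun", "Mon", "Tue", "Wed", "Thu"))

# Hypothetical hours, one value per day, in the same order as `days`
hypothetical_hours <- c(6, 1, 2, 9, 7, 8, 5)

# Ordering by hours follows the data, not the calendar
levels_by_hours <- levels(fct_reorder(.f = days, .x = hypothetical_hours))
print(levels_by_hours)
```

The resulting order depends entirely on the numbers, so it would change whenever the hours change, while the calendar order should stay fixed.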
The forcats function fct_relevel will help here. The syntax is generally:
fct_relevel(.f = factor_that_needs_reordering, levels_in_desired_order)
The code below creates a vector with the days (levels) in the desired order, and then feeds that vector into the fct_relevel call.
day_order <- c('Sun', 'Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat')
days_of_week %>%
  mutate(days = fct_relevel(.f = days, day_order)) %>%
  ggplot(aes(x = days, y = hours_worked)) +
  geom_col()
After releveling the factor, the order of the bars in the bar graph has changed.
To see how this works with boxplots, make some changes to the demo dataset:
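As with fct_reorder, the change can be confirmed by looking at the levels. This is a minimal sketch that rebuilds the days and day_order vectors so it runs on its own. The comparison with base R's factor(levels = ...) is added here for reference and is not part of the workflow above.

```r
library(forcats)

days <- c("Fri", "Sat", "Sun", "Mon", "Tue", "Wed", "Thu")
day_order <- c("Sun", "Mon", "Tue", "Wed", "Thu", "Fri", "Sat")

# Levels before: alphabetical
levels_before <- levels(factor(days))

# Levels after: exactly the order given in day_order
releveled <- fct_relevel(factor(days), day_order)
levels_after <- levels(releveled)

# Base R alternative that sets the levels when the factor is created
levels_base <- levels(factor(days, levels = day_order))

print(levels_before)
print(levels_after)
```

The values themselves are untouched; as.character(releveled) still returns the days in their original row order.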
set.seed(42)  # make the random draws reproducible
days_of_week_new <- tibble(
  days = rep(c('Fri', 'Sat', 'Sun', 'Mon', 'Tue', 'Wed', 'Thu'), 10),
  hours_worked = rnorm(70, 25, 7)
)
Plotting once again yields an x-axis ordered alphabetically, which is not what we want.
days_of_week_new %>%
  ggplot(aes(x = days, y = hours_worked)) +
  geom_boxplot()
Use the same vector describing the levels inside of fct_relevel to get:
days_of_week_new %>%
  mutate(days = fct_relevel(.f = days, day_order)) %>%
  ggplot(aes(x = days, y = hours_worked)) +
  geom_boxplot()