library(tidytuesdayR)
library(tidyverse)
library(RColorBrewer)
library(ggforce)
library(knitr)
library(kableExtra)
20
Overview
In my first ever #tidytuesday we are taking a look at a coffee survey. I’ll make a note up front that I am prioritizing self education over analytical integrity here. Curiosity is in the drivers seat, so we are going to see where this takes us. Anything I present here is not intended to draw any conclusions or correlations (I tried, and any statistically sound correlation just wasn’t there with this data).
ELT
Load Up
Nothing crazy.
Theme Setup
Coffee Colors?
<- brewer.pal(n = 6, name = "YlOrBr")
brew_colors
<- theme_minimal() +
coffee_theme theme(
plot.title = element_text(family = "Hot Ink", face = "plain", size = 26, color = brew_colors[1], hjust = 0.5),
axis.title.x = element_text(family = "Fira Sans", face = "italic", size = 12, color = brew_colors[1], margin = margin(t = 20, r = 20, b = 20, l = 20, unit = "pt")),
axis.title.y = element_text(family = "Fira Sans", face = "italic", size = 12, color = brew_colors[1], margin = margin(t = 20, r = 20, b = 20, l = 20, unit = "pt")),
axis.text.x = element_text(family = "Fira Sans", face = "plain", size = 10, color = brew_colors[2], angle = 45, hjust = 1),
axis.text.y = element_text(family = "Fira Sans", face = "plain", size = 10, color = brew_colors[2]),
legend.title = element_text(family = "Fira Sans", face = "bold", size = 12, color = brew_colors[1]),
legend.text = element_text(family = "Fira Sans", face = "plain", size = 10, color = brew_colors[2]),
plot.background = element_rect(fill = "black", color = NA),
panel.background = element_rect(fill = "black", color = NA),
panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),
plot.margin = margin(t = 35, r = 10)
)
theme_set(coffee_theme)
Extract
I tried the tidytuesdayR::tt_load method and kept getting an error, so we url it.
<- read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2024/2024-05-14/coffee_survey.csv', show_col_types = FALSE) coffee_survey
Transform
Nice initial cleaning. I decided straight away that my analysis would not be considering open-response answers. I organized the rest of the columns into two categories: multiple response, and factored. Here I factor the latter group. Later I’ll decide what to do with the former.
Finally I have the pleasure of correcting the “spend” column typo. People need to stop using “spend” as a noun. It is a verb. The noun is “expenditure”.
# Trim out free-text data, fix "spend" typo
# "spend" is not a noun, stop trying to make it one
<- coffee_survey %>%
clean_tib select(-matches("other|specify|coffee_a|coffee_b|coffee_c|coffee_d|prefer")) %>%
rename(monthly_expenditure = total_spend) %>%
mutate(across(everything(), ~ if(is.character(.)) tolower(.) else .)) %>%
mutate(across(c(age, cups, favorite, style, strength, roast_level, caffeine,
expertise, wfh, monthly_expenditure, taste, know_source,
most_paid, most_willing, value_cafe, spent_equipment,
value_equipment, gender, education_level, ethnicity_race,
employment_status, number_children, political_affiliation
), as.factor
) )
EDA
I saw the multiple-response columns and decided to try to learn more about the best ways to handle these. I could make each cell a vector, turning the column into a list of character vectors, or I could try one-hot encoding. I brought in the caret library and played around with dummy variables enough to get a basic transformation of all the multiple-response columns into multiple single Boolean columns for each response.
This was messy and I didn’t love my function but it was enough to try some Chi Square/Cramérs V analysis comparing every column pair for possible correlation. I would love to have found ANY surprise coffee trend that correlated to personal data (like maybe all the cinnamon takers are all unemployed), but the highest Cramérs V value was .5 and nothing compelling. So I gave up before I started showing more bias than I had already in wanting to find something.
I may clean that code up and include it later.
Strip Columns
But for now I stripped out all of those multiple-response columns. This allowed me to drop all NA cells. These columns were the questions that many people seem to have just skipped, so stripping them saved more observations.
# List of columns with multi-value cells
<- c("where_drink",
hot_columns "brew",
"purchase",
"additions",
"dairy",
"sweetener",
"why_drink"
)
# Drop multi-value columns before dropping NA cells
<- clean_tib %>%
strip_tib select(-all_of(hot_columns)) %>%
drop_na()
First Plot
I wanted to plot some kind of two-variable comparison anyway so I chose Age vs Cups per day, just to have a look.
# Ordinal ordering
= c("<18 years old", "18-24 years old", "25-34 years old",
age_levels "35-44 years old", "45-54 years old", "55-64 years old",
">65 years old")
= c("less than 1", "1", "2", "3", "4", "more than 4")
cups_levels
<- strip_tib %>%
ageVcups select(age, cups) %>%
mutate(
age = factor(age, levels = age_levels, ordered = TRUE),
cups = factor(cups, levels = cups_levels, ordered = TRUE)
)
ggplot(ageVcups, aes(x = age, fill = cups)) +
geom_bar(
position = "fill"
+
) labs(
title = "Cups of Coffee per Day by Age Group",
x = "Age Group",
y = "Percentage",
fill = "Number of Cups"
+
) scale_fill_brewer(
palette = "YlOrBr"
+
) scale_y_continuous(
labels = ~ paste0(. * 100, "%")
+
) theme(
legend.position = "right",
axis.text.x = element_text(angle = 45, hjust = 1)
)
No considerable surprise that older age groups tend to show a greater proportion of higher cups-per-day.
Finding Our Target
What interested me is the self-reported coffee expertise. These questions are such a labyrinth of psychology I don’t know what analysts really use them for other than profiling over confidence maybe? I thought Dunning-Kruger did away with the presumed value of self-assessed expertise.
What other question might be fun to compare to this self-expertise?
“Do you know the source of your coffee”
Lets take a look. A jitter-box plot will show this nicely.
<- strip_tib %>%
expertiseVknowsource_tib select(expertise, know_source) %>%
mutate(know_source = case_when(
tolower(know_source) == "yes" ~ "Yes",
tolower(know_source) == "no" ~ "No",
TRUE ~ know_source))
<- expertiseVknowsource_tib %>%
evk_lite filter(!(know_source == "No" & expertise == 10))
<- expertiseVknowsource_tib %>%
evk_hilite filter(know_source == "No" & expertise == 10)
<- ggplot(evk_lite, aes(
p x = know_source,
y = as.numeric(expertise))
+
) geom_boxplot(
data = expertiseVknowsource_tib,
outlier.shape = NA,
fill = "brown",
color = "white",
alpha = 0.25
+
) geom_point(
data = evk_lite,
position = position_jitter(seed = 119, width = 0.12, height = 0.69),
color = "tan",
alpha = 0.25
+
) geom_point(
data = evk_hilite,
position = position_jitter(seed = 119, width = 0.1, height = 0.69),
color = "tan",
alpha = 0.25
+
) scale_y_continuous(
breaks = seq(0, 10, by = 1)
+
) labs(
title = "Do Coffee Experts Know Their Sources?",
x = "Do You Know Where Your Coffee Comes From?",
y = "Claimed Level of Coffee Expertise"
+
) theme(
legend.position = "none"
)
p
Results
Not too suprising. The more you claim to know about coffee the more likely you are to also claim to know where your coffee comes from.
But who are those four coffee geniuses who don’t care where their coffee comes from?
<- expertiseVknowsource_tib %>%
hilite filter(know_source == "No" & expertise == 10)
# set.seed(357)
# jitter <- position_jitter(width = 0.25, height = 0.5)
<- p +
p_hilite geom_point(data = hilite, aes(
x = know_source,
y = as.numeric(expertise)),
position = position_jitter(seed = 119, width = 0.1, height = 0.69),
color = "brown",
alpha = .88
+
) annotate(
"text",
x = .75,
y = 10,
label = "EXPERTS",
size = 2.5,
#fontface = "italic",
color = "brown"
)
p_hilite
Drill Down
I must know more about them
<- strip_tib %>%
expVsource_outliers mutate(expertise = as.integer(expertise)) %>%
select(-submission_id) %>%
filter(expertise == 10) %>%
filter(str_detect(know_source, "no"))
kable(expVsource_outliers, "html") %>%
kable_styling() %>%
row_spec(0, bold = TRUE, color = brew_colors[1], background = "brown") %>%
row_spec(1:nrow(expVsource_outliers), color = brew_colors[6], background = brew_colors[1])
age | cups | favorite | style | strength | roast_level | caffeine | expertise | wfh | monthly_expenditure | taste | know_source | most_paid | most_willing | value_cafe | spent_equipment | value_equipment | gender | education_level | ethnicity_race | employment_status | number_children | political_affiliation |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
18-24 years old | 2 | americano | caramalized | very strong | light | full caffeine | 10 | i primarily work from home | <$20 | yes | no | $6-$8 | $4-$6 | yes | $100-$300 | yes | female | bachelor's degree | asian/pacific islander | employed full-time | none | no affiliation |
25-34 years old | 1 | regular drip coffee | chocolatey | somewhat strong | dark | full caffeine | 10 | i primarily work from home | $40-$60 | yes | no | $10-$15 | $15-$20 | yes | $20-$50 | yes | female | bachelor's degree | asian/pacific islander | employed full-time | none | no affiliation |
18-24 years old | 1 | latte | chocolatey | very strong | medium | full caffeine | 10 | i primarily work in person | $60-$80 | yes | no | $6-$8 | $6-$8 | yes | $50-$100 | yes | female | bachelor's degree | asian/pacific islander | employed full-time | none | independent |
25-34 years old | 1 | pourover | chocolatey | somewhat strong | dark | full caffeine | 10 | i primarily work from home | <$20 | yes | no | $8-$10 | $10-$15 | yes | $300-$500 | yes | male | bachelor's degree | white/caucasian | employed full-time | none | independent |
Responsible Analysis
Let’s remind ourselves that we left sound analysis practices behind at this point. We are simply having fun. These are outliers and ZERO conclusions can be drawn from digging into their data. So lets just stop here.