
Overview

In my first ever #tidytuesday we are taking a look at a coffee survey. I'll note up front that I am prioritizing self-education over analytical integrity here. Curiosity is in the driver's seat, so we are going to see where this takes us. Nothing I present here is intended to draw conclusions or establish correlations (I tried, and a statistically sound correlation just wasn't there in this data).

ELT

Load Up

Nothing crazy.

library(tidytuesdayR)
library(tidyverse)
library(RColorBrewer)
library(ggforce)
library(knitr)
library(kableExtra)

Theme Setup

Coffee Colors?

brew_colors <- brewer.pal(n = 6, name = "YlOrBr")

coffee_theme <- theme_minimal() +
  theme(
    plot.title = element_text(family = "Hot Ink", face = "plain", size = 26, color = brew_colors[1], hjust = 0.5),
    axis.title.x = element_text(family = "Fira Sans", face = "italic", size = 12, color = brew_colors[1], margin = margin(t = 20, r = 20, b = 20, l = 20, unit = "pt")),
    axis.title.y = element_text(family = "Fira Sans", face = "italic", size = 12, color = brew_colors[1], margin = margin(t = 20, r = 20, b = 20, l = 20, unit = "pt")),
    axis.text.x = element_text(family = "Fira Sans", face = "plain", size = 10, color = brew_colors[2], angle = 45, hjust = 1),
    axis.text.y = element_text(family = "Fira Sans", face = "plain", size = 10, color = brew_colors[2]),
    legend.title = element_text(family = "Fira Sans", face = "bold", size = 12, color = brew_colors[1]),
    legend.text = element_text(family = "Fira Sans", face = "plain", size = 10, color = brew_colors[2]),
    plot.background = element_rect(fill = "black", color = NA),
    panel.background = element_rect(fill = "black", color = NA),
    panel.grid.major = element_blank(),
    panel.grid.minor = element_blank(),
    plot.margin = margin(t = 35, r = 10)
  )

theme_set(coffee_theme)

Extract

I tried the tidytuesdayR::tt_load method and kept getting an error, so we read the CSV straight from the URL instead.
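
For reference, the tt_load route I attempted looked roughly like this (a sketch, left commented out since it errored in my session):

# The tt_load route (commented out because it failed for me)
# tt <- tidytuesdayR::tt_load("2024-05-14")
# coffee_survey <- tt$coffee_survey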

coffee_survey <- read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2024/2024-05-14/coffee_survey.csv', show_col_types = FALSE)

Transform

Some initial cleaning. I decided straight away that my analysis would not consider the open-response answers. I organized the rest of the columns into two categories: multiple-response and factored. Here I factor the latter group; later I'll decide what to do with the former.

Finally I have the pleasure of correcting the “spend” column typo. People need to stop using “spend” as a noun. It is a verb. The noun is “expenditure”.

# Trim out free-text data, fix "spend" typo 
# "spend" is not a noun, stop trying to make it one
clean_tib <- coffee_survey %>%
  select(-matches("other|specify|coffee_a|coffee_b|coffee_c|coffee_d|prefer")) %>%
  rename(monthly_expenditure = total_spend) %>%
  mutate(across(where(is.character), tolower)) %>%
  mutate(across(c(age, cups, favorite, style, strength, roast_level, caffeine,
                  expertise, wfh, monthly_expenditure, taste, know_source,
                  most_paid, most_willing, value_cafe, spent_equipment,
                  value_equipment, gender, education_level, ethnicity_race,
                  employment_status, number_children, political_affiliation
                  ), as.factor
                )
         )

EDA

I saw the multiple-response columns and decided to learn more about the best ways to handle them. I could make each cell a vector, turning the column into a list of character vectors, or I could try one-hot encoding. I brought in the caret library and played around with dummy variables enough to get a basic transformation of each multiple-response column into a set of Boolean columns, one per response (a rough sketch of the idea follows).
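
For illustration, here is a minimal sketch of that one-hot idea using tidyr rather than caret, applied to a single column. It assumes the multiple responses are comma-separated within each cell, which I haven't verified here:

# Sketch: explode one multiple-response column ("brew") into Boolean
# indicator columns, one per distinct response.
# Assumes responses within a cell are comma-separated.
brew_onehot <- clean_tib %>%
  mutate(row_id = row_number()) %>%
  separate_rows(brew, sep = ",\\s*") %>%
  filter(!is.na(brew)) %>%
  mutate(value = TRUE) %>%
  pivot_wider(
    id_cols = row_id,
    names_from = brew,
    values_from = value,
    values_fill = FALSE,
    names_prefix = "brew_"
  )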

My caret function was messy and I didn't love it, but it was enough to try some chi-square/Cramér's V analysis comparing every column pair for possible association. I would love to have found ANY surprise coffee trend that correlated with personal data (like maybe the cinnamon takers are all unemployed), but the highest Cramér's V value was 0.5 and nothing compelling turned up. So I gave up before I showed any more bias than I already had in wanting to find something.

I may clean that code up and include it later.
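
In the meantime, here is a minimal sketch of the Cramér's V piece for a single pair of columns, built on base R's chisq.test (the loop over every column pair is omitted):

# Sketch: Cramér's V for one pair of categorical columns,
# derived from the chi-square statistic
cramers_v <- function(x, y) {
  tbl <- table(x, y)
  chi2 <- suppressWarnings(chisq.test(tbl)$statistic)
  n <- sum(tbl)
  k <- min(nrow(tbl), ncol(tbl)) - 1
  unname(sqrt(chi2 / (n * k)))
}

# Example pair
cramers_v(clean_tib$age, clean_tib$cups)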

Strip Columns

But for now I stripped out all of those multiple-response columns, which let me drop every row with NA values. Those columns were the questions many people seem to have simply skipped, so removing them first preserves more observations.

# List of columns with multi-value cells
hot_columns <- c("where_drink",
                 "brew",
                 "purchase",
                 "additions",
                 "dairy",
                 "sweetener",
                 "why_drink"
                 )

# Drop multi-value columns before dropping NA cells
strip_tib <- clean_tib %>%
  select(-all_of(hot_columns)) %>%
  drop_na()
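
Just to back up that claim, a quick row-count comparison (the exact numbers depend on the survey data, so treat this as a sanity check rather than a result):

# Rows surviving drop_na() with the multi-value columns still in place
nrow(clean_tib %>% drop_na())

# Rows surviving after stripping them first (should be noticeably higher)
nrow(strip_tib)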

First Plot

I wanted to plot some kind of two-variable comparison anyway, so I chose age vs. cups per day, just to have a look.

# Ordinal ordering
age_levels <- c("<18 years old", "18-24 years old", "25-34 years old",
                "35-44 years old", "45-54 years old", "55-64 years old",
                ">65 years old")
cups_levels <- c("less than 1", "1", "2", "3", "4", "more than 4")

ageVcups <- strip_tib %>% 
  select(age, cups) %>% 
  mutate(
    age = factor(age, levels = age_levels, ordered = TRUE),
    cups = factor(cups, levels = cups_levels, ordered = TRUE)
  )

ggplot(ageVcups, aes(x = age, fill = cups)) +
  geom_bar(
    position = "fill"
    ) +
  labs(
    title = "Cups of Coffee per Day by Age Group",
    x = "Age Group",
    y = "Percentage",
    fill = "Number of Cups"
    ) +
  scale_fill_brewer(
    palette = "YlOrBr"
    ) +
  scale_y_continuous(
    labels = ~ paste0(. * 100, "%")
    ) +
  theme(
    legend.position = "right",
    axis.text.x = element_text(angle = 45, hjust = 1)
  )
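
Side note: the scales package (which ggplot2 pulls in) has a ready-made labeller that I believe does the same multiply-by-100-and-append-"%" formatting, if you prefer not to hand-roll it:

# Equivalent y-axis labelling via scales
scale_y_continuous(labels = scales::label_percent())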

No great surprise: older age groups tend to show a greater proportion of higher cups-per-day responses.

Finding Our Target

What interested me is the self-reported coffee expertise. These questions are such a labyrinth of psychology that I don't know what analysts really use them for, other than maybe profiling overconfidence. I thought Dunning-Kruger did away with the presumed value of self-assessed expertise.

What other question might be fun to compare against this self-rated expertise?

“Do you know the source of your coffee?”

Let's take a look. A jittered box plot will show this nicely.

expertiseVknowsource_tib <- strip_tib %>%
  select(expertise, know_source) %>% 
  mutate(know_source = case_when(
    tolower(know_source) == "yes" ~ "Yes",
    tolower(know_source) == "no" ~ "No",
    TRUE ~ know_source))

evk_lite <- expertiseVknowsource_tib %>% 
  filter(!(know_source == "No" & expertise == 10))

evk_hilite <- expertiseVknowsource_tib %>% 
  filter(know_source == "No" & expertise == 10)

p <- ggplot(evk_lite, aes(
    x = know_source, 
    y = as.numeric(as.character(expertise)))
  ) +
  geom_boxplot(
    data = expertiseVknowsource_tib,
    outlier.shape = NA, 
    fill = "brown", 
    color = "white", 
    alpha = 0.25
  ) +
  geom_point(
    data = evk_lite,
    position = position_jitter(seed = 119, width = 0.12, height = 0.69),
    color = "tan",
    alpha = 0.25
  ) +
  geom_point(
    data = evk_hilite,
    position = position_jitter(seed = 119, width = 0.1, height = 0.69),
    color = "tan",
    alpha = 0.25
  ) +
  scale_y_continuous(
    breaks = seq(0, 10, by = 1)
  ) +
  labs(
    title = "Do Coffee Experts Know Their Sources?",
    x = "Do You Know Where Your Coffee Comes From?",
    y = "Claimed Level of Coffee Expertise"
  ) +
  theme(
    legend.position = "none"
  )

p

Results

Not too surprising. The more you claim to know about coffee, the more likely you are to also claim to know where your coffee comes from.

But who are those four coffee geniuses who don’t care where their coffee comes from?

hilite <- expertiseVknowsource_tib %>% 
  filter(know_source == "No" & expertise == 10)

# set.seed(357)
# jitter <- position_jitter(width = 0.25, height = 0.5)

p_hilite <- p +
  geom_point(data = hilite, aes(
    x = know_source, 
    y = as.numeric(as.character(expertise))),
    position = position_jitter(seed = 119, width = 0.1, height = 0.69), 
    color = "brown",
    alpha = .88
    ) +
  annotate(
    "text",
    x = .75,
    y = 10,
    label = "EXPERTS",
    size = 2.5,
    #fontface = "italic",
    color = "brown"
  )

p_hilite

Drill Down

I must know more about them.

expVsource_outliers <- strip_tib %>%
  mutate(expertise = as.integer(as.character(expertise))) %>%  # factor labels back to integers
  select(-submission_id) %>% 
  filter(expertise == 10) %>%
  filter(know_source == "no")


kable(expVsource_outliers, "html") %>%
  kable_styling() %>%
  row_spec(0, bold = TRUE, color = brew_colors[1], background = "brown") %>%
  row_spec(1:nrow(expVsource_outliers), color = brew_colors[6], background = brew_colors[1])
| age | cups | favorite | style | strength | roast_level | caffeine | expertise | wfh | monthly_expenditure | taste | know_source | most_paid | most_willing | value_cafe | spent_equipment | value_equipment | gender | education_level | ethnicity_race | employment_status | number_children | political_affiliation |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 18-24 years old | 2 | americano | caramalized | very strong | light | full caffeine | 10 | i primarily work from home | <$20 | yes | no | $6-$8 | $4-$6 | yes | $100-$300 | yes | female | bachelor's degree | asian/pacific islander | employed full-time | none | no affiliation |
| 25-34 years old | 1 | regular drip coffee | chocolatey | somewhat strong | dark | full caffeine | 10 | i primarily work from home | $40-$60 | yes | no | $10-$15 | $15-$20 | yes | $20-$50 | yes | female | bachelor's degree | asian/pacific islander | employed full-time | none | no affiliation |
| 18-24 years old | 1 | latte | chocolatey | very strong | medium | full caffeine | 10 | i primarily work in person | $60-$80 | yes | no | $6-$8 | $6-$8 | yes | $50-$100 | yes | female | bachelor's degree | asian/pacific islander | employed full-time | none | independent |
| 25-34 years old | 1 | pourover | chocolatey | somewhat strong | dark | full caffeine | 10 | i primarily work from home | <$20 | yes | no | $8-$10 | $10-$15 | yes | $300-$500 | yes | male | bachelor's degree | white/caucasian | employed full-time | none | independent |

Responsible Analysis

Let's remind ourselves that we left sound analysis practices behind some time ago. We are simply having fun. These are outliers, and ZERO conclusions can be drawn from digging into their data. So let's just stop here.