Dimensionality Reduction on Cocktail Data in R
An example of Dimensionality Reduction using PCA and UMAP with Tidymodels in R
NOTE: These days I am following Julia Silge to learn the tidymodels framework better. This post is inspired by what I learned from her. You can find screencasts of her videos here
Drinking fancy cocktails always puts us in the mood for a good time, like crafting edible garnishes and floral beverages. But sometimes we don’t feel like mixing up an elaborate drink (even if it’s supposedly healthier than our usual tipple) just to quench our cocktail craving. Luckily, mixology doesn’t have to be complicated. Here we will showcase how to find out which ingredients in a cocktail go together and which should not be mixed, using dimensionality reduction techniques. We will demonstrate two of the most popular and powerful methods:
- Principal Component Analysis (also known as PCA)
- Uniform Manifold Approximation and Projection (also known as UMAP)
Sounds fun! Let’s get started.
Load the data
Our data comes from the Tidy Tuesday project, a weekly data project aimed at the R ecosystem from the R4DS Online Learning Community.
library(tidyverse)  # loads dplyr, stringr, ggplot2, readr, etc. used throughout

boston_cocktails <- readr::read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-05-26/boston_cocktails.csv")

boston_cocktails %>%
  head() %>%
  knitr::kable()
name | category | row_id | ingredient_number | ingredient | measure |
---|---|---|---|---|---|
Gauguin | Cocktail Classics | 1 | 1 | Light Rum | 2 oz |
Gauguin | Cocktail Classics | 1 | 2 | Passion Fruit Syrup | 1 oz |
Gauguin | Cocktail Classics | 1 | 3 | Lemon Juice | 1 oz |
Gauguin | Cocktail Classics | 1 | 4 | Lime Juice | 1 oz |
Fort Lauderdale | Cocktail Classics | 2 | 1 | Light Rum | 1 1/2 oz |
Fort Lauderdale | Cocktail Classics | 2 | 2 | Sweet Vermouth | 1/2 oz |
We see 3,643 observations of 6 variables. Let’s explore the data a bit more to see whether it needs further transformation.
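For instance, a quick count of the most common raw ingredient names gives a feel for the clean-up we will need; a minimal sketch using dplyr, with no assumptions beyond the columns shown above:

# which raw ingredient names appear most often?
boston_cocktails %>%
  count(ingredient, sort = TRUE) %>%
  head(10)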
Data Transformation
cocktail_parsed <- boston_cocktails %>%
  mutate(
    # standardise ingredient names: lower case, no hyphens, drop " liqueur"
    ingredient = str_to_lower(ingredient),
    ingredient = str_replace_all(ingredient, "-", " "),
    ingredient = str_remove(ingredient, " liqueur"),
    # collapse near-duplicate ingredients into a single label
    ingredient = case_when(
      str_detect(ingredient, "bitters") ~ "bitters",
      str_detect(ingredient, "orange") ~ "orange juice",
      str_detect(ingredient, "lemon") ~ "lemon juice",
      str_detect(ingredient, "lime") ~ "lime juice",
      str_detect(ingredient, "grapefruit") ~ "grapefruit juice",
      TRUE ~ ingredient
    ),
    # bitters are measured in dashes, not ounces
    measure = case_when(
      str_detect(ingredient, "bitters") ~ str_replace(measure, "oz", "dash"),
      TRUE ~ measure
    ),
    # convert fractions to decimals so parse_number() can read them
    measure = str_replace(measure, " ?1/2", ".5"),
    measure = str_replace(measure, " ?1/4", ".25"),
    measure = str_replace(measure, " ?3/4", ".75"),
    measure_number = parse_number(measure),
    # treat a dash as roughly 1/50 of an ounce
    measure_number = if_else(str_detect(measure, "dash$"), measure_number / 50, measure_number)
  ) %>%
  add_count(ingredient, sort = TRUE) %>%
  filter(n > 15) %>%   # keep only reasonably common ingredients
  select(-n) %>%
  distinct(row_id, ingredient, .keep_all = TRUE) %>%
  na.omit()
cocktail_parsed %>%
head() %>%
knitr::kable()
name | category | row_id | ingredient_number | ingredient | measure | measure_number |
---|---|---|---|---|---|---|
Gauguin | Cocktail Classics | 1 | 3 | lemon juice | 1 oz | 1.00 |
John Collins | Whiskies | 6 | 2 | lemon juice | 1 oz | 1.00 |
Amber Amour | Cordials and Liqueurs | 10 | 2 | lemon juice | .25 oz | 0.25 |
The Joe Lewis | Whiskies | 11 | 4 | lemon juice | .5 oz | 0.50 |
Apple Pie No. 1 | Cocktail Classics | 13 | 5 | lemon juice | 1 oz | 1.00 |
Why Not? | Cocktail Classics | 16 | 4 | lemon juice | 1 oz | 1.00 |
We have cleaned up the ingredient column, separated the quantity out of the measure column, and stored it in a new variable called measure_number. This wrangling leaves us with 2,546 rows and 7 columns.
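As a quick sanity check, we can count the distinct ingredients that survive the filter; each one will become a column when we widen the data below:

# how many distinct ingredients survive the n > 15 filter?
dplyr::n_distinct(cocktail_parsed$ingredient)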
Let’s widen the dataset into a form where dimensionality reduction can be applied.
cocktail_df <- cocktail_parsed %>%
select(-ingredient_number, -row_id, -measure) %>%
pivot_wider(names_from = ingredient, values_from = measure_number,
values_fill = 0) %>%
janitor::clean_names() %>%
na.omit()
DataExplorer::introduce(cocktail_df)
## # A tibble: 1 x 9
## rows columns discrete_columns continuous_colu~ all_missing_col~
## <int> <int> <int> <int> <int>
## 1 937 42 2 40 0
## # ... with 4 more variables: total_missing_values <int>, complete_rows <int>,
## # total_observations <int>, memory_usage <dbl>
Finally, our tidy dataframe is ready for some modelling. We have 937 observations of 42 variables, of which 40 are continuous and 2 are discrete. We will now apply dimensionality reduction.
Principal Component Analysis (PCA)
We will start with PCA. To apply PCA, we create a recipe in an unsupervised fashion, without specifying any outcome variable. We won’t use the name and category features as predictors, so their roles are updated to “ID”. step_pca() performs PCA in the tidymodels framework.
pca_rec <- recipe(~., data = cocktail_df) %>%
update_role(name, category, new_role = "ID") %>%
step_normalize(all_predictors()) %>%
step_pca(all_predictors())
pca_prep <- prep(pca_rec)
pca_prep
## Data Recipe
##
## Inputs:
##
## role #variables
## ID 2
## predictor 40
##
## Training data contained 937 data points and no missing data.
##
## Operations:
##
## Centering and scaling for lemon_juice, lime_juice, gin, ... [trained]
## PCA extraction with lemon_juice, lime_juice, gin, ... [trained]
When we prep the recipe, it performs all the step transformations specified in it. If we want to see the details, we can use the tidy() function on the second step (the PCA step) to get a tabular summary.
tidied_pca <- tidy(pca_prep, 2)
tidied_pca
## # A tibble: 1,600 x 4
## terms value component id
## <chr> <dbl> <chr> <chr>
## 1 lemon_juice 0.0167 PC1 pca_I3IqE
## 2 lime_juice -0.224 PC1 pca_I3IqE
## 3 gin 0.260 PC1 pca_I3IqE
## 4 bitters -0.0210 PC1 pca_I3IqE
## 5 orange_juice -0.0405 PC1 pca_I3IqE
## 6 simple_syrup -0.312 PC1 pca_I3IqE
## 7 vodka -0.200 PC1 pca_I3IqE
## 8 light_rum -0.163 PC1 pca_I3IqE
## 9 triple_sec -0.0137 PC1 pca_I3IqE
## 10 dry_vermouth 0.0874 PC1 pca_I3IqE
## # ... with 1,590 more rows
Visualise Principal Components
Let’s plot how each ingredient contributes to the first five components.
tidied_pca %>%
  filter(component %in% paste0("PC", 1:5)) %>%
  mutate(component = fct_inorder(component)) %>%
  ggplot(aes(value, terms, fill = terms)) +
  geom_col(alpha = 0.5, show.legend = FALSE) +
  facet_wrap(~ component, nrow = 1)
That plot is hard to read with so many ingredients. Let’s zoom in on the ingredients that contribute most to each of the first four components, using tidytext’s reorder_within() to order the bars within each facet.
library(tidytext)
tidied_pca %>%
filter(component %in% paste0("PC", 1:4)) %>%
group_by(component) %>%
top_n(10, abs(value)) %>%
ungroup() %>%
mutate(terms = reorder_within(terms, abs(value), component)) %>%
ggplot(aes(abs(value), terms, fill = value > 0)) +
geom_col(alpha = 0.8) +
scale_y_reordered() +
facet_wrap( ~ component, scales = "free_y") +
labs(x = NULL, y = NULL, fill = "Positive?")
Finally, let’s see how the cocktails themselves are distributed in the plane of the first two principal components.
juice(pca_prep) %>%
ggplot(aes(PC1, PC2, label = name)) +
geom_point(aes(color = category), size = 2) +
geom_text(check_overlap = T, hjust = "inward") +
labs(color = NULL)
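Before moving on, it is worth checking how much of the total variance the leading components actually capture. A minimal sketch that peeks at the prcomp object stored inside the trained PCA step (step 2 of our recipe), so treat it as illustrative rather than official API:

# percent of total variance explained by each principal component
sdev <- pca_prep$steps[[2]]$res$sdev
percent_variation <- sdev^2 / sum(sdev^2)
round(percent_variation[1:5] * 100, 1)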
Uniform Manifold Approximation and Projection (UMAP)
We saw how PCA converts high-dimensional data into principal components that explain a large part of the variation in the data. Let us now use UMAP to transform the same data. We will create a recipe with a UMAP step and then prep it, just as we did for PCA.
library(embed)
umap_rec <- recipe(~., data = cocktail_df) %>%
update_role(name, category, new_role = "ID") %>%
step_normalize(all_predictors()) %>%
step_umap(all_predictors())
umap_prep <- prep(umap_rec)
umap_prep
## Data Recipe
##
## Inputs:
##
## role #variables
## ID 2
## predictor 40
##
## Training data contained 937 data points and no missing data.
##
## Operations:
##
## Centering and scaling for lemon_juice, lime_juice, gin, ... [trained]
## UMAP embedding for lemon_juice, lime_juice, gin, ... [trained]
Visualise UMAP Components
From our 40 predictors, UMAP has created two embedding components. Let’s check them visually.
juice(umap_prep) %>%
ggplot(aes(umap_1, umap_2, label = name)) +
geom_point(aes(color = category), size = 2) +
geom_text(check_overlap = T, hjust = "inward") +
labs(color = NULL)
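Note that the UMAP picture can change quite a bit with its hyperparameters. A minimal sketch of how they can be set in step_umap() follows; the neighbors, min_dist, and seed arguments come from the embed package, and the values here are purely illustrative:

# illustrative settings: fewer neighbors emphasise local structure,
# a smaller min_dist packs similar cocktails closer together,
# and a fixed two-element seed makes the embedding reproducible
umap_rec_tuned <- recipe(~., data = cocktail_df) %>%
  update_role(name, category, new_role = "ID") %>%
  step_normalize(all_predictors()) %>%
  step_umap(all_predictors(), neighbors = 10, min_dist = 0.001, seed = c(123, 456))

prep(umap_rec_tuned) %>% juice() %>% head()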
Conclusion
In this post, we saw how the PCA and UMAP algorithms can be used to conveniently reduce the dimensions of a high-dimensional dataset. These algorithms can be used not only to reduce dimensions but also, to some extent, to understand the patterns in our data and what goes with what.
I hope this helps. Thank you so much for reading. See you in the next post!