A Classification Model on Volcanic Eruptions in R

An example of Multinomial Classification using Random Forest and Tidymodels in R

Image by Julius Silver from Pixabay

NOTE: These days I am following Julia Silge for learning tidymodels framework better. This post is inspired from what I learned there.

I love doing data science in R. It is so easy to follow along when you work in R. In this post, we will talk about a volcano eruption dataset available here. This dataset talks about various types of volcanic eruptions that take place around the world. We are interested in understanding that by having demographic information present in data, Is it possible to correctly identify the type of the volcanic eruption. It is a multiclass or multinomial classification probelm and we will use random forest algorithm to train the model. Let’s get started..!

Let’s load the data and see what we have here.

volcano_raw <- readr::read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-05-12/volcano.csv")
volcano_raw %>% 
    head() %>% 
    knitr::kable()
volcano_numbervolcano_nameprimary_volcano_typelast_eruption_yearcountryregionsubregionlatitudelongitudeelevationtectonic_settingsevidence_categorymajor_rock_1major_rock_2major_rock_3major_rock_4major_rock_5minor_rock_1minor_rock_2minor_rock_3minor_rock_4minor_rock_5population_within_5_kmpopulation_within_10_kmpopulation_within_30_kmpopulation_within_100_km
283001AbuShield(s)-6850JapanJapan, Taiwan, MarianasHonshu34.500131.600641Subduction zone / Continental crust (>25 km)Eruption DatedAndesite / Basaltic AndesiteBasalt / Picro-BasaltDacite359795941178054071152
355096AcamarachiStratovolcanoUnknownChileSouth AmericaNorthern Chile, Bolivia and Argentina-23.292-67.6186023Subduction zone / Continental crust (>25 km)Evidence CredibleDaciteAndesite / Basaltic Andesite072949092
342080AcatenangoStratovolcano(es)1972GuatemalaMéxico and Central AmericaGuatemala14.501-90.8763976Subduction zone / Continental crust (>25 km)Eruption ObservedAndesite / Basaltic AndesiteDaciteBasalt / Picro-Basalt43296073010428367634778
213004Acigol-NevsehirCaldera-2080TurkeyMediterranean and Western AsiaTurkey38.53734.6211683Intraplate / Continental crust (>25 km)Eruption DatedRhyoliteDaciteBasalt / Picro-BasaltAndesite / Basaltic Andesite1278631278632184692253483
321040AdamsStratovolcano950United StatesCanada and Western USAUSA (Washington)46.206-121.4903742Subduction zone / Continental crust (>25 km)Eruption DatedAndesite / Basaltic AndesiteBasalt / Picro-BasaltDacite0704019393303
283170AdatarayamaStratovolcano(es)1996JapanJapan, Taiwan, MarianasHonshu37.647140.2811728Subduction zone / Continental crust (>25 km)Eruption ObservedAndesite / Basaltic AndesiteDaciteBasalt / Picro-Basalt42839367170785024654
221170AdwaStratovolcanoUnknownEthiopiaAfrica and Red SeaAfrica (northeastern) and Red Sea10.07040.8401733Rift zone / Intermediate crust (15-25 km)Evidence CredibleRhyoliteBasalt / Picro-BasaltTrachyte / Trachydacite101485186451242922
221110AfderaStratovolcanoUnknownEthiopiaAfrica and Red SeaAfrica (northeastern) and Red Sea13.08840.8531250Rift zone / Intermediate crust (15-25 km)Evidence UncertainRhyoliteBasalt / Picro-BasaltTrachybasalt / Tephrite BasaniteAndesite / Basaltic Andesite5160428611161009
284160AgriganStratovolcano1917United StatesJapan, Taiwan, MarianasIzu, Volcano, and Mariana Islands18.770145.670965Subduction zone / Crustal thickness unknownEruption ObservedBasalt / Picro-BasaltAndesite / Basaltic Andesite0000
342100AguaStratovolcanoUnknownGuatemalaMéxico and Central AmericaGuatemala14.465-90.7433760Subduction zone / Continental crust (>25 km)Evidence CredibleAndesite / Basaltic AndesiteDacite989011440425304497441660

Now, Let’s explore the data

We will check the different type of volcanic eruptions that are given in the dataset. We see that there are 26 types of eruptions listed here.

volcano_raw %>% 
    count(primary_volcano_type, sort = T)
## # A tibble: 26 x 2
##    primary_volcano_type     n
##    <chr>                <int>
##  1 Stratovolcano          353
##  2 Stratovolcano(es)      107
##  3 Shield                  85
##  4 Volcanic field          71
##  5 Pyroclastic cone(s)     70
##  6 Caldera                 65
##  7 Complex                 46
##  8 Shield(s)               33
##  9 Submarine               27
## 10 Lava dome(s)            26
## # ... with 16 more rows

It shows that there are 26 different types of eruptions. Although it can be modelled to classify for all the 26 types but seeing the size of the data it makes more sense to group the eruptions in three categories. Lets take the first two categories and put rest of the data in “other” category to keep it simple.

volcano_df <- volcano_raw %>% 
    transmute(volcano_type = case_when(str_detect(primary_volcano_type, "Stratovolcano") ~ "Stratovolcano",
                                       str_detect(primary_volcano_type, "Shield") ~ "Shield",
                                       TRUE ~ "Others"),
              volcano_number, latitude, longitude, tectonic_settings, elevation, major_rock_1) %>% 
    mutate_if(is.character, factor)

Lets plot some visualizations

Since we have spatial data, Lets see where these volcanos are situated on a world map. We will put these volcanos on their location and will color them according to the volcanic type. Let’s see, how it looks..!

world <- map_data("world")

ggplot() +
    geom_map(data = world, map = world,
             aes(x = long, y = lat, map_id = region), color = "white", fill = "gray50", alpha = 0.2) +
    geom_point(data = volcano_df, aes(longitude, latitude, color = volcano_type), alpha = 0.5) +
    labs(x = NULL, y = NULL,
         title = "Different Type of Volcanic Eruptions around the world") +
    scale_x_continuous(labels = NULL) +
    scale_y_continuous(labels = NULL) 

Volcano Map

It looks like these volcanos create a ring of fire around the pacific ocean. Also, they are spreaded all over the world but we can see most of them in coastal areas.

Let’s prepare the data

Rather than creating a split we will create bootstrap resamples as we don’t have much data. Hence resampling will help in preventing the model from overfitting and will also give us a robust fit.

volcano_boot <- volcano_df %>% 
    bootstraps()
volcano_boot
## # Bootstrap sampling 
## # A tibble: 25 x 2
##    splits            id         
##    <list>            <chr>      
##  1 <split [958/353]> Bootstrap01
##  2 <split [958/357]> Bootstrap02
##  3 <split [958/340]> Bootstrap03
##  4 <split [958/358]> Bootstrap04
##  5 <split [958/363]> Bootstrap05
##  6 <split [958/361]> Bootstrap06
##  7 <split [958/336]> Bootstrap07
##  8 <split [958/353]> Bootstrap08
##  9 <split [958/353]> Bootstrap09
## 10 <split [958/368]> Bootstrap10
## # ... with 15 more rows

Let’s pre-process the data

We will create a recipe now. We will use smote analysis to overcome class imbalance. smote() function is a part of themis package. So, remember to load themis package before creating recipe. We will also remove volcano number as it is not useful in modelling. We will group the less frequent classes and put them into a class called ‘other’. We will do the same thing with major_rock_1 variable also. Then we will create dummy variable for all categorical variables. will remove zero variance variable and normalize the numerical variable

library(themis)

volcano_rec <- recipe(volcano_type~., data = volcano_df) %>% 
    update_role(volcano_number, new_role = "id") %>% 
    step_other(tectonic_settings) %>% 
    step_other(major_rock_1) %>% 
    step_dummy(tectonic_settings, major_rock_1) %>% 
    step_zv(all_predictors()) %>% 
    step_normalize(all_predictors()) %>% 
    step_smote(volcano_type)

volcano_prep <- prep(volcano_rec)
volcano_prep

Lets create a model spec

Lets create a random forest model specification. We will set engine as ranger and mode as classification. We keep number of trees equal to 1000 to be grown.

rf_spec <- rand_forest(trees = 1000) %>% 
    set_engine(engine = "ranger") %>% 
    set_mode(mode = "classification")

Lets create a workflow

Workflow helps us in putting all the things together which in turn helps us in modelling faster. We are adding recipe and model specs heer to the model.

volcano_wf <- workflow() %>% 
    add_recipe(recipe = volcano_rec) %>% 
    add_model(rf_spec)

Lets train the model

We will train the model now using bootstrap resamples. Verbose argument is not necessary as in but it helps us in seeing the progress while random forest algorithm converges.

volcano_res <- fit_resamples(volcano_wf, 
              resamples = volcano_boot,
              control = control_resamples(save_pred = T, 
                                          verbose = T))

Explore results

Let’s see how the model performs on the resampled data. Although area under the curve is ~79% still model doesn’t have a good accuracy. It fares slightly better than a base model.

volcano_res %>% 
    collect_metrics()
## # A tibble: 2 x 5
##   .metric  .estimator  mean     n std_err
##   <chr>    <chr>      <dbl> <int>   <dbl>
## 1 accuracy multiclass 0.648    25 0.00506
## 2 roc_auc  hand_till  0.788    25 0.00414

Lets have a look at the confusion metrics to see how model predicts different classes.

volcano_res %>% 
    collect_predictions() %>% 
    conf_mat(volcano_type, .pred_class)
##                Truth
## Prediction      Others Shield Stratovolcano
##   Others          1960    344           853
##   Shield           289    583           222
##   Stratovolcano   1236    193          3238

We will also like to see the positive predicted value which is a useful measure in multinomial classification models.

volcano_res %>% 
    collect_predictions() %>% 
    ppv(volcano_type, .pred_class)
## # A tibble: 1 x 3
##   .metric .estimator .estimate
##   <chr>   <chr>          <dbl>
## 1 ppv     macro          0.616

We can check the ppv across various bootstraps to see how it behaves in different data samples. It doesn’t change much, which indicates that model is not too sensitive to the new data.

volcano_res %>% 
    collect_predictions() %>% 
    group_by(id) %>% 
    ppv(volcano_type, .pred_class)
## # A tibble: 25 x 4
##    id          .metric .estimator .estimate
##    <chr>       <chr>   <chr>          <dbl>
##  1 Bootstrap01 ppv     macro          0.610
##  2 Bootstrap02 ppv     macro          0.622
##  3 Bootstrap03 ppv     macro          0.621
##  4 Bootstrap04 ppv     macro          0.638
##  5 Bootstrap05 ppv     macro          0.616
##  6 Bootstrap06 ppv     macro          0.641
##  7 Bootstrap07 ppv     macro          0.578
##  8 Bootstrap08 ppv     macro          0.671
##  9 Bootstrap09 ppv     macro          0.672
## 10 Bootstrap10 ppv     macro          0.616
## # ... with 15 more rows

It will also be interesting to see which variable is of more importance for the model to correctly classify volcano eruptions. Let’s plot a variable importance plot.

library(vip)
rf_spec %>%
    set_engine("ranger", importance = "permutation") %>%
    fit(
        volcano_type ~ .,
        data = juice(volcano_prep) %>% select(-volcano_number) %>% janitor::clean_names()
    ) %>% 
    vip(geom = "point")

VIP Plot

Let’s combine the predictions into a dataframe

volcano_pred <- volcano_res %>% 
    collect_predictions() %>% 
    mutate(correct = .pred_class == volcano_type) %>% 
    left_join(volcano_df %>% 
                  mutate(.row = row_number()))

Let’s check on the world map, how many of the valcanos can our model classify correctly.

The map shows, how it fares in terms of predictions by plotting our predictions again on a world map.

ggplot() +
    geom_map(data = world, map = world,
             aes(x = long, y = lat, map_id = region), color = "white", fill = "gray50", alpha = 0.5) +
    stat_summary_hex(data = volcano_pred, aes(longitude, latitude, z = as.integer(correct)), 
                     fun = "mean", alpha = 0.8, bins = 60) +
    labs(x = NULL, y = NULL,
         title = "On Average how accurate were our predictions across the world",
         fill = "Percent correct") +
    scale_x_continuous(labels = NULL) +
    scale_y_continuous(labels = NULL) +
    scale_fill_gradient(high = "cyan3", labels = scales::percent)

Volcano Pred Plot

Okay. So in this post, we used volcanic data to build a multinomial classification model to see if on the basis of demographic data and other information about a volcano, how correctly can we predict the type of the eruption. We also built a use case around multiclass classification and used some performance metrics to evaluate the model effectiveness. I hope this helps. Thank you so much for reading. See you again in the next post..!!

Gaurav Sharma
Gaurav Sharma
Data Enthusiast | Engineer | INTJ

My research interests include Manufacturing Analytics, Industry 4.0 and Factory Digitization.

Related