Energizing Change: Electric Vehicle Rise in Switzerland

Author

Pierre Vernet, Alexis Jeanneret, Hugo Troendle, Urs Hurni

Published

December 22, 2023

1 Introduction

1.1 Executive Summary

This study explores electric vehicle (EV) adoption trends in Switzerland, investigating factors like regional differences, demographic influences, and comparisons with France. It uses diverse datasets, including vehicle registrations, oil prices, demographics, Google trends, and political affiliations. Key findings include a rise in EV registrations, variations in adoption rates across regions and demographics, and the influence of charging station availability. The analysis also highlights the role of political dynamics in EV adoption. Limitations include the lack of detailed pricing data and the unexplored impact of marketing and government subsidies. Future research could delve into these aspects for a more nuanced understanding.

1.2 Background and Motivation

In an era marked by unprecedented environmental challenges, our world is at a critical juncture where sustainable practices are no longer an option but a necessity. Climate change, driven by anthropogenic activities, poses an imminent threat to the delicate ecological balance. Recognizing the urgency of this global issue, we are compelled to investigate the specific context of Switzerland, the country we live in, and which has a lot of financial capacities to change rapidly.

The automotive sector plays a pivotal role in shaping the a sustainable world, and our focus on electric vehicle adoption in Switzerland reflects a strategic choice to comprehend the intricate interplay of factors driving this transition.

The motivations underlying our research stem from a multifaceted perspective. Firstly, the automotive industry is undergoing a transformative shift globally, with electric vehicles emerging as a key solution to reduce carbon emissions. By narrowing our focus to Switzerland, we aim to provide nuanced insights into the factors influencing EV adoption, offering a unique perspective on the socio-economic, environmentally stable and financially comfortable Swiss context. The adoption of electric vehicles provides an insight into people’s inclination to make efforts towards sustainability. This emphasizes the crucial role that electric vehicle adoption plays in evaluating society’s commitment to sustainable practices.

Secondly, our commitment to environmental sustainability extends beyond a generic interest. The urgency of addressing climate change has never been more palpable, and our research seeks to contribute to the discussion on sustainable mobility.

Moreover, we believe the future is encapsulated in our dedication to understanding the dynamics of environmentally conscious consumer behavior. By unraveling the complexities of EV adoption in Switzerland over the years (2005-2022), we aim to offer valuable insights that could guide policymakers, businesses, and consumers towards more informed and sustainable choices.

We believe that by comprehensively analyzing the impact of external factors on the adoption of electric vehicles in Switzerland, we can contribute meaningfully to the ongoing global efforts towards a more sustainable and resilient future. Through this project, we aspire to inspire positive change, foster innovation, and advocate for a paradigm shift towards eco-friendly practices in the automotive sector and beyond.

1.4 Reseach Questions

1. Based on past electric vehicle adoption trends in Switzerland, can we forecast future adoption rates and pinpoint times of significant increases or decreases correlated with major events or policy changes?
1. Are there differences in adoption rate within the different regions in Switzerland? And are there different buying behaviors displayed by the demographic segments within Switzerland ?
1. How has the growth of electric vehicles evolved in comparison to other countries such as France, and what factors might account for the differences in their evolution ?
1. To what extent does the evolution in the availability of charging stations exert an influence on the adoption of electric vehicles in Switzerland?

2 Data

2.1 Raw Datasets

First, we had to do a check for Large Data: If any of our datasets are particularly large (like the swiss_vehicle), we used a more efficient data reading librariy data.table (with fread) to speed up the data loading process.

We also added a common file path prefix (“../data/”) to make the code cleaner and easier to change for any usage in the future.

Code

library(data.table)
file_path <- "../data/"

google_trends_files <- paste0(file_path, "googletrends_", c("auto-elettrica", "elektro-auto", "elektrofahrzeug", "elektromobil", "eletric-car", "EV", "vehicule-electrique", "voiture-electrique"), "_2005-2022.csv")
google_trends_data <- lapply(google_trends_files, fread)
vehicle_data_2005_2008 <- fread(paste0(file_path, "road_vehicle_CH_2005-2008.csv"))
vehicle_data_2009_2022 <- fread(paste0(file_path, "road_vehicle_CH_2009-2022.csv"))
oil_prices_data <- fread(paste0(file_path, "BrentOilPrices.csv"))
demographic_data <- fread(paste0(file_path, "demographic.csv"))
charging_station <- fread(paste0(file_path, "charging_station.csv"))
france_v <- read_excel(paste0(file_path, "parc_vp_france_2022.xlsx"), sheet = 2)
political_data_sheets_prep <- excel_sheets(paste0(file_path, "political_data.xlsx"))
df_swisspop_2022 <- read_excel(paste0(file_path, "swiss_pop.xlsx"), sheet = 1)
df_swisspop_2021 <- read_excel(paste0(file_path, "swiss_pop.xlsx"), sheet = 2)
df_swisspop_2020 <- read_excel(paste0(file_path, "swiss_pop.xlsx"), sheet = 3)
df_swisspop_2019 <- read_excel(paste0(file_path, "swiss_pop.xlsx"), sheet = 4)
df_swisspop_2018 <- read_excel(paste0(file_path, "swiss_pop.xlsx"), sheet = 5)
charge_ch_fr <- fread(paste0(file_path,"df_charging_points_CH_FR.csv"))

For this phase of the project, our approach involves presenting diverse data sets in a table form. To achieve this, we used the kable function from the kableExtra package, resulting in a table that displays each variable along with its corresponding definition and also added some bootstrap_otions for different styles.

2.1.1 Swiss Vehicle Registration Dataset

This data set explains the new registrations of road vehicles by vehicle group, canton, vehicle type, fuel, month and year in Switzerland. This data set is essential to evaluate the EVs’ evolution within our chosen location.

Code

# Create a tibble with variable descriptions
variable_table_vehicle <- tibble(
  Variable = c("Canton", "Vehicle Group / Type", "Fuel", "Month", "2009-2022"),
  Description = c(
    "The region in Switzerland.",
    "Type or group of the vehicle.",
    "Type of fuel used by the vehicle.",
    "The month of the data.",
    "Number of vehicles for each respective year. Each with its own column"
  )
)

# Display the table using kableExtra
variable_table_vehicle %>%
  kbl() %>%
  kable_styling(position = "center", bootstrap_options = c("striped", "bordered", "hover"))

Variable	Description
Canton	The region in Switzerland.
Vehicle Group / Type	Type or group of the vehicle.
Fuel	Type of fuel used by the vehicle.
Month	The month of the data.
2009-2022	Number of vehicles for each respective year. Each with its own column

Source - bfs.admin.ch

2.1.2 Oil price Dataset

This data set explains the evolution of the price of brent oil over time. From this data set, we will be able to obtain precious information to compute our linear regressions concerning oil price and EVs adoptions.

Code

# Create a tibble with variable descriptions
variable_table <- tibble(
  Variable = c("Date", "Price"),
  Description = c(
    "The date when the price was recorded. In a day-month-year format.",
    "The price of Brent Oil on the given date."
  )
)

# Display the table using kableExtra
variable_table %>%
  kbl() %>%
  kable_styling(position = "center", bootstrap_options = c("striped", "bordered", "hover", "condensed"))

Variable	Description
Date	The date when the price was recorded. In a day-month-year format.
Price	The price of Brent Oil on the given date.

Source - Kaggle.com

2.1.3 Swiss Demographics Dataset

While this dataset contains a substantial amount of information, only a fraction of it will be pertinent to our research. It explains the evolution of the Swiss population over time through variables given below.

Code

# Create a tibble with variable descriptions
variable_table <- tibble(
  Variable = c(
    "Year", "Citizenship (category)", "Sex", "Age", "Population on 1 January", 
    "Live birth", "Death", "Natural change", 
    "Immigration incl. change of population type", "Emigration", 
    "Net migration incl. change of population type", 
    "Change of population type", "Acquisition of Swiss citizenship", 
    "Gender change in the civil register (entry)", 
    "Gender change in the civil register (exit)", "Statistical adjustment", 
    "Population on 31 December", "Population change"
  ),
  Description = c(
    "The year of the demographic data.",
    "The categorization of the citizenship status.",
    "Gender category.",
    "Age category.",
    "Population count at the beginning of the year.",
    "Number of births in the year.",
    "Number of deaths in the year.",
    "Change in the population due to births and deaths.",
    "Number of immigrants, including change of population type.",
    "Number of emigrations.",
    "Net migration count, including change of population type.",
    "Change in the categorization of the population.",
    "Number of individuals who acquired Swiss citizenship.",
    "Number of gender changes registered (entry).",
    "Number of gender changes registered (exit).",
    "Adjustments made to the data for accuracy.",
    "Population count at the end of the year.",
    "Change in population over the year."
  )
)

# Display the table using kableExtra
variable_table %>%
  kbl() %>%
  kable_styling(position = "center", bootstrap_options = c("striped", "bordered", "hover", "condensed"))

Variable	Description
Year	The year of the demographic data.
Citizenship (category)	The categorization of the citizenship status.
Sex	Gender category.
Age	Age category.
Population on 1 January	Population count at the beginning of the year.
Live birth	Number of births in the year.
Death	Number of deaths in the year.
Natural change	Change in the population due to births and deaths.
Immigration incl. change of population type	Number of immigrants, including change of population type.
Emigration	Number of emigrations.
Net migration incl. change of population type	Net migration count, including change of population type.
Change of population type	Change in the categorization of the population.
Acquisition of Swiss citizenship	Number of individuals who acquired Swiss citizenship.
Gender change in the civil register (entry)	Number of gender changes registered (entry).
Gender change in the civil register (exit)	Number of gender changes registered (exit).
Statistical adjustment	Adjustments made to the data for accuracy.
Population on 31 December	Population count at the end of the year.
Population change	Change in population over the year.

Source - bfs.admin.ch

2.1.4 Google trend data

This Data set shows the evolution of Google trends on 8 different terms. We chose these terms in all 3 different main Swiss national languages (French, Italian and German) to make sure we get the most accurate data possible. Those variables will help us highlight potential tendencies/trends.

Code

# Create a tibble with variable descriptions for Google Trends data
variable_table_google_trend <- tibble(
  Variable = c("Month", 
                "Search Term: Voiture électrique", 
                "Search Term: Véhicule électrique", 
                "Search Term: EV", 
                "Search Term: Electric Car", 
                "Search Term: Elektromobil", 
                "Search Term: Elektrofahrzeug", 
                "Search Term: Elektro Auto", 
                "Search Term: Auto Elettrica"),
  Description = c(
    "The month of the data",
    "Google Trends data for 'Voiture électrique: (Suisse)'",
    "Google Trends data for 'Véhicule électrique: (Suisse)'",
    "Google Trends data for 'EV: (Suisse)'",
    "Google Trends data for 'Electric car: (Suisse)'",
    "Google Trends data for 'Elektromobil: (Suisse)'",
    "Google Trends data for 'Elektrofahrzeug: (Suisse)'",
    "Google Trends data for 'Elektro Auto: (Suisse)'",
    "Google Trends data for 'Auto Elettrica: (Suisse)'"
  )
)

# Display the table using kableExtra
variable_table_google_trend %>%
  kbl() %>%
  kable_styling(position = "center", bootstrap_options = c("striped", "bordered", "hover", "condensed"))

Variable	Description
Month	The month of the data
Search Term: Voiture électrique	Google Trends data for 'Voiture électrique: (Suisse)'
Search Term: Véhicule électrique	Google Trends data for 'Véhicule électrique: (Suisse)'
Search Term: EV	Google Trends data for 'EV: (Suisse)'
Search Term: Electric Car	Google Trends data for 'Electric car: (Suisse)'
Search Term: Elektromobil	Google Trends data for 'Elektromobil: (Suisse)'
Search Term: Elektrofahrzeug	Google Trends data for 'Elektrofahrzeug: (Suisse)'
Search Term: Elektro Auto	Google Trends data for 'Elektro Auto: (Suisse)'
Search Term: Auto Elettrica	Google Trends data for 'Auto Elettrica: (Suisse)'

Sources for Google Trends Data: - Voiture électrique Trends - Véhicule électrique Trends - EV Trends - Electric Car Trends - Elektromobil Trends - Elektrofahrzeug Trends - Elektro Auto Trends - Auto Elettrica Trends

2.1.5 French vehicle registration data

We decided to add this data set to help us compare with another country. Indeed, France is an adjacent country which might imply potential similarities in the outcome. This data set is made of variables such as the Date of the French vehicle registration or whether it is from a professional user or not. The information we are interested in concern the number of registered vehicles in the country by propulsion method over time.

Code

# Create a tibble with variable descriptions for vehicle categorization data
variable_table <- tibble(
  Variable = c("Date", "Ensemble des voitures particulieres", "Professionnel", "Particulier", "Crit'Air"),
  Description = c(
    "The date, in year format, in which the observations are registered",
    "The sum of vehicle in every groups",
    "Category 'Professional' of vehicle",
    "Category 'Passenger Car' of vehicle",
    "Categorizes vehicles by emissions with from best to worst classes"
  )
)

# Display the table using kableExtra
variable_table %>%
  kbl() %>%
  kable_styling(position = "center", bootstrap_options = c("striped", "bordered", "hover", "condensed"))

Variable	Description
Date	The date, in year format, in which the observations are registered
Ensemble des voitures particulieres	The sum of vehicle in every groups
Professionnel	Category 'Professional' of vehicle
Particulier	Category 'Passenger Car' of vehicle
Crit'Air	Categorizes vehicles by emissions with from best to worst classes

Source - Auto Elettrica Trends

2.1.6 Charging Stations

This data set is meant to determine the number of charging stations per canton, per charging power, per month, per plug type in Switzerland since November 2020. It is important to mention that a station can have multiple charging points. These values will help us to obtain more information about the Swiss adaption to EVs and their potential effects on EVs adoption by the Swiss population

Code

# Create a tibble with variable descriptions 
variable_table_charging <- tibble(
  Variable = c("year",
                "month",
                "stations_CH_count",
                "stations_XY_count",
                "locations_CH_count",
                "locations_XY_count",
                "plugs_CH_count",
                "plugs_XY_count",
                "chargingPower_CH_sum",
                "chargingPower_XY_sum",
                "chargingPower_CH_count",
                "chargingPower_XY_count",
                "chargingPower_10kW_count",
                "chargingPower_21kW_count",
                "chargingPower_42kW_count"),
  Description = c(
    "Year of the key figures collection.",
    "Month of the key figures collection. These are collected daily and published monthly.",
    "Number of available charging stations in Switzerland. A station can have multiple charging points.",
    "Number of charging stations per canton. A station can have multiple charging points. This attribute is available for all cantons, XY representing the official canton abbreviation.",
    "Number of stations in Switzerland. A station can have multiple charging points.",
    "Number of stations per canton. A station can have multiple charging points. This attribute is available for all cantons, XY representing the official canton abbreviation.",
    "Number of charging plugs in Switzerland.",
    "Number of charging plugs per plug type in Switzerland. This attribute is available for all plug types, XY corresponding to the plug type according to the Open Intercharge Protocol (OICP) version 2.2 (see below).",
    "Total maximum charging power in kilowatts of recharge stations in Switzerland. Only charging points with known power are considered (see chargingPower_CH_count).",
    "Total maximum charging power in kilowatts of recharge stations in a canton. Only charging points with known power are considered (see chargingPower_XY_count).",
    "Number of recharge stations in Switzerland with known maximum charging power and for which the total maximum charging power has been taken into account (chargingPower_CH_sum).",
    "Number of recharge stations per canton with known maximum charging power and for which the total maximum charging power has been taken into account (chargingPower_XY_sum).",
    "Number of recharge stations in Switzerland with known maximum charging power delivering a maximum charging power of 10 kW.",
    "Number of recharge stations in Switzerland with known maximum charging power ranging from over 10 kW to 21 kW.",
    "Number of recharge stations in Switzerland with known maximum charging power ranging from over 21 kW to 42 kW."
  )
)

# Display the table using kableExtra
variable_table_charging %>%
  kbl() %>%
  kable_styling(position = "center", bootstrap_options = c("striped", "bordered", "hover", "condensed"))

Variable	Description
year	Year of the key figures collection.
month	Month of the key figures collection. These are collected daily and published monthly.
stations_CH_count	Number of available charging stations in Switzerland. A station can have multiple charging points.
stations_XY_count	Number of charging stations per canton. A station can have multiple charging points. This attribute is available for all cantons, XY representing the official canton abbreviation.
locations_CH_count	Number of stations in Switzerland. A station can have multiple charging points.
locations_XY_count	Number of stations per canton. A station can have multiple charging points. This attribute is available for all cantons, XY representing the official canton abbreviation.
plugs_CH_count	Number of charging plugs in Switzerland.
plugs_XY_count	Number of charging plugs per plug type in Switzerland. This attribute is available for all plug types, XY corresponding to the plug type according to the Open Intercharge Protocol (OICP) version 2.2 (see below).
chargingPower_CH_sum	Total maximum charging power in kilowatts of recharge stations in Switzerland. Only charging points with known power are considered (see chargingPower_CH_count).
chargingPower_XY_sum	Total maximum charging power in kilowatts of recharge stations in a canton. Only charging points with known power are considered (see chargingPower_XY_count).
chargingPower_CH_count	Number of recharge stations in Switzerland with known maximum charging power and for which the total maximum charging power has been taken into account (chargingPower_CH_sum).
chargingPower_XY_count	Number of recharge stations per canton with known maximum charging power and for which the total maximum charging power has been taken into account (chargingPower_XY_sum).
chargingPower_10kW_count	Number of recharge stations in Switzerland with known maximum charging power delivering a maximum charging power of 10 kW.
chargingPower_21kW_count	Number of recharge stations in Switzerland with known maximum charging power ranging from over 10 kW to 21 kW.
chargingPower_42kW_count	Number of recharge stations in Switzerland with known maximum charging power ranging from over 21 kW to 42 kW.

Source - IEA.org

2.1.7 Complementary charging station

The below data set is supposed to be used as a complement of the charging station data set presented in 2.1.6. One value represents the total number of available charging point for a given region, year and powertrain. Moreover, this file is composed of both Swiss and French datas. Indeed, those elements will help us answering our Research Question treating about the comparison between French and Swiss EV adoption as well as giving us some more information concerning EV adoption by the Swiss population.

Code

# Create a tibble with variable descriptions 
variable_charging_station_v2 <- tibble(
  Variable = c("Region", "Category", "Parameter", "Mode","Powertrain", "Unit","Value"),
  Description = c(
    "Either Switzerland or France",
    "One category: Historical",
    "One parameter: EV charging point",
    "One mode: EV",
    "Either Publicly available fast, or Publicly availabe slow",
    "One unit: charging point",
    "The total number of available charging point for a given region, year and powertrain"
  )
)

# Display the table using kableExtra
variable_charging_station_v2 %>%
  kbl() %>%
  kable_styling(position = "center", bootstrap_options = c("striped", "bordered", "hover"))

Variable	Description
Region	Either Switzerland or France
Category	One category: Historical
Parameter	One parameter: EV charging point
Mode	One mode: EV
Powertrain	Either Publicly available fast, or Publicly availabe slow
Unit	One unit: charging point
Value	The total number of available charging point for a given region, year and powertrain

Source - bfs.admin.ch

2.1.8 Political strength per canton

This dataset is composed of the name and political strength of the major Swiss political parties per Canton, in a given year from 1971 to 2023.

While this data set would seem to be irrelevant to some, we decided to include it to our paper in order to add the political influence dimension to our analysis. We believe that political affiliation might have a link with the EV adoption in Switzerland.

Code

# Create a tibble with variable descriptions 
variable_table_politic <- tibble(
  Variable = c("Years 1971-2023", "Canton", "Political parties", "Taux de participation"),
  Description = c(
    "The years were elections to the Conseil National took place from 1971 to 2023",
    "The name of the canton in Switzerland",
    "The name of all political parties in Switzerland and their strength in a given Canton, (Canton's total = 100%), in a given   year",
    "The participation rate to the elections of the Conseil National in a given year"
  )
)

# Display the table using kableExtra
variable_table_politic %>%
  kbl() %>%
  kable_styling(position = "center", bootstrap_options = c("striped", "bordered", "hover"))

Variable	Description
Years 1971-2023	The years were elections to the Conseil National took place from 1971 to 2023
Canton	The name of the canton in Switzerland
Political parties	The name of all political parties in Switzerland and their strength in a given Canton, (Canton's total = 100%), in a given year
Taux de participation	The participation rate to the elections of the Conseil National in a given year

Source - bfs.admin.ch

2.1.9 Swiss Population

The below data file contains the total population of each canton given by age groups, sex, citizenship, marital status and typology. We mainly focused on the variable Age Group to catch the potential influence of being part of one of those groups to the Swiss EV adoption.

Code

# Create a tibble with variable descriptions for the Swiss population dataset
variable_table_swiss_pop <- tibble(
  Variable = c("Canton", "Total Population", "Age Groups", "Sex", "Citizenship", "Marital Status", "Typology: Area with urban character"),
  Description = c(
    "The name of the canton in Switzerland.",
    "Total number of permanent residents in the canton.",
    "Population broken down into age groups: 0–19, 20–64, 65 and over.",
    "Population broken down by sex: Male and Female.",
    "Population broken down by citizenship: Swiss and Foreigner.",
    "Population broken down by marital status: Single, Married, Widowed, Divorced, Unmarried, In a registered partnership, Partnership dissolved.",
    "Typology of the area based on urban character: Urban core, Area influenced by urban cores, Area beyond urban influence."
  )
)

# Display the table using kableExtra
variable_table_swiss_pop %>%
  kbl() %>%
  kable_styling(position = "center", bootstrap_options = c("striped", "bordered", "hover"))

Variable	Description
Canton	The name of the canton in Switzerland.
Total Population	Total number of permanent residents in the canton.
Age Groups	Population broken down into age groups: 0–19, 20–64, 65 and over.
Sex	Population broken down by sex: Male and Female.
Citizenship	Population broken down by citizenship: Swiss and Foreigner.
Marital Status	Population broken down by marital status: Single, Married, Widowed, Divorced, Unmarried, In a registered partnership, Partnership dissolved.
Typology: Area with urban character	Typology of the area based on urban character: Urban core, Area influenced by urban cores, Area beyond urban influence.

Source - bfe.admin.ch

2.2 Data Wrangling

2.2.1 Data Wrangling : Oil clean

Initially, while attempting to format the dataset, we discovered the generation of numerous NAs. Upon delving deeper into the data, we identified two distinct date formats: the first being [DD-MMM-YYYY] and the second [MMM DD, YYYY]. Consequently, the initial date formatting was ineffective. Therefore, we undertook the task of matching the two tables with the same date format.

Another issue we encountered was to present the date in a standardized form across all our dataset to perform time series analyses. For that, we had to change the abbreviated months names (i.e. Jan, Feb, etc.) into numbers.

Code

# Format 1: DD-MMM-YYYY (e.g., 15-Apr-2020)
oil_df_1 <- oil_prices_data[1:8360,] %>%
  separate(Date, into = c("Day", "Month", "Year"), sep = "-") %>%
  mutate(Date = dmy(paste(Day, Month, Year)))

# Format 2: MMM DD, YYYY (e.g., Apr 22, 2020)
# Handling the separator with space and comma
oil_df_2 <- oil_prices_data[8361:nrow(oil_prices_data),] %>%
  separate(Date, into = c("Month", "Day", "Year"), sep = " ", extra = "merge") %>%
  mutate(Day = word(Day, 1),  # Extracts just the day part
         Date = mdy(paste(Month, Day, Year)))

# Merge and filter for dates after 2005
df_oil <- rbind(oil_df_1, oil_df_2) %>%
  filter(year(Date) >= 2005) %>%
  select(Date, Price)

# Display cleaned data
reactable(
  df_oil,  # Use your oil prices dataframe
  columns = list(
    Date = colDef(
      align = "center",  # Center align the date column
      minWidth = 150
    ),
    Price = colDef(
      align = "right",  # Right align the price column
      style = function(value) {
        # Apply conditional formatting based on the price value
        if (value > 120) {
          color <- "green"  # Green for high prices
        } else if (value < 30) {
          color <- "red"    # Red for low prices
        } else {
          color <- "black"  # Default color for other values
        }
        list(color = color)
      }
    )
  ),
  highlight = TRUE,  # Highlight rows on hover
  defaultPageSize = 10,  # Display 10 rows per page
  paginationType = "numbers",  # Use numbers for page navigation
  searchable = TRUE,  # Make the table searchable
  sortable = TRUE,  # Allow sorting
  resizable = TRUE  # Allow column resizing
)

2.2.2 Data Wrangling : Google trend clean

Comprehending Google Trends information involves understanding the contextualization of the provided figures. The given data is standardized which limits the range of information we can obtain from it.

The crucial element for effective narration lies in utilizing the normalized Trends data. This normalization indicates that when observing search interest trends over time, it is interpreted as a ratio relative to all searches conducted on all topics on Google during that specific period and location. Likewise, when assessing regional search interest for a specific topic, it is construed as the search interest for that particular topic in a given region relative to all searches on all topics on Google in that same place and time.

Source

To clean and process eight Google Trends datasets related to different search terms, we created a function process_dataset() that reads each CSV file, removes date indices, excludes the first row, checks for missing values, renames columns, converts types, and standardizes the date format. Then, we iterated through each dataset, merged them based on the “Date” column, summed values for each search term per date, and normalized these values between 1 and 100.

Code

dataset_names <- c(
  "googletrends_auto-elettrica_2005-2022.csv",
  "googletrends_elektro-auto_2005-2022.csv",
  "googletrends_elektrofahrzeug_2005-2022.csv",
  "googletrends_elektromobil_2005-2022.csv",
  "googletrends_eletric-car_2005-2022.csv",
  "googletrends_EV_2005-2022.csv",
  "googletrends_vehicule-electrique_2005-2022.csv",
  "googletrends_voiture-electrique_2005-2022.csv"
)

# Function to process each dataset
process_dataset <- function(file_path) {
  # Read the dataset
  google_trends_data <- read.csv(file_path)

  # Remove the dates from the index
  google_trends_data <- google_trends_data |> rownames_to_column(var = "Date")

  # Remove the first row using slice
  google_trends_data <- slice(google_trends_data, -1)

  # Check for NA
  cat("NA count for", file_path, ": ", sum(is.na(google_trends_data$Date)), "\n")
  cat("NA count for SearchCount in", file_path, ": ", sum(is.na(google_trends_data$Catégorie...Toutes.catégories)), "\n")

  # Rename col
  colnames(google_trends_data)[2] <- "SearchCount"

  # Convert to numeric
  google_trends_data$SearchCount <- as.numeric(google_trends_data$SearchCount)

  # Convert the column to date with the desired format
  google_trends_data$Date <- as.Date(paste(google_trends_data$Date, "01", sep = "-"))

  return(google_trends_data)
}

# Process each dataset and store in a list
processed_datasets <- list()

for (dataset_name in dataset_names) {
  file_path_gt <- file.path("../data", dataset_name)
  processed_dataset <- process_dataset(file_path_gt)
  processed_datasets[[dataset_name]] <- processed_dataset
}
#> NA count for ../data/googletrends_auto-elettrica_2005-2022.csv :  0 
#> NA count for SearchCount in ../data/googletrends_auto-elettrica_2005-2022.csv :  0 
#> NA count for ../data/googletrends_elektro-auto_2005-2022.csv :  0 
#> NA count for SearchCount in ../data/googletrends_elektro-auto_2005-2022.csv :  0 
#> NA count for ../data/googletrends_elektrofahrzeug_2005-2022.csv :  0 
#> NA count for SearchCount in ../data/googletrends_elektrofahrzeug_2005-2022.csv :  0 
#> NA count for ../data/googletrends_elektromobil_2005-2022.csv :  0 
#> NA count for SearchCount in ../data/googletrends_elektromobil_2005-2022.csv :  0 
#> NA count for ../data/googletrends_eletric-car_2005-2022.csv :  0 
#> NA count for SearchCount in ../data/googletrends_eletric-car_2005-2022.csv :  0 
#> NA count for ../data/googletrends_EV_2005-2022.csv :  0 
#> NA count for SearchCount in ../data/googletrends_EV_2005-2022.csv :  0 
#> NA count for ../data/googletrends_vehicule-electrique_2005-2022.csv :  0 
#> NA count for SearchCount in ../data/googletrends_vehicule-electrique_2005-2022.csv :  0 
#> NA count for ../data/googletrends_voiture-electrique_2005-2022.csv :  0 
#> NA count for SearchCount in ../data/googletrends_voiture-electrique_2005-2022.csv :  0

# Merge datasets based on the "Date" column
merged_data <- reduce(processed_datasets, left_join, by = "Date")

# Sum the values for each search term for a given date
df_gtrends <- merged_data %>%
  rowwise() %>%
  mutate(Sum_SearchCount = sum(c_across(starts_with("SearchCount"))))

# Normalize the Sum_SearchCount values between 1 and 100
df_gtrends$SearchRatio <- df_gtrends$Sum_SearchCount / 8

# Print the result
df_gtrends <- df_gtrends[,-(2:10)]

#display cleaned data
reactable(
  df_gtrends,
  columns = list(
    Date = colDef(
      align = "center",  # Center align the date column
      minWidth = 150
    ),
    SearchRatio = colDef(
      align = "right",  # Right align the SearchRatio column
      style = function(value) {
        if (value > 45) {
          color <- "green"  # Green for values greater than 50
        } else if (value < 5) {
          color <- "red"    # Red for values less than 5
        } else {
          color <- "black"  # Default color for other values
        }
        list(color = color)
      }
    )
  ),
  highlight = TRUE,  # Highlight rows on hover
  defaultPageSize = 10,  # Display 10 rows per page
  paginationType = "numbers",  # Use numbers for page navigation
  searchable = TRUE,  # Make the table searchable
  sortable = TRUE,  # Allow sorting
  resizable = TRUE  # Allow column resizing
)

2.2.3 Cleaning of demographic_data

The main issue with this data set was to transform it in a way that makes it useful for our analysis. We first had to remove all rows containing category summaries in the middle of the data.

We then formulated the following hypothesizes:

Male and Female present the same attitude towards electric vehicles
Swiss and Foreign citizen living in Switzerland present the same attitude towards electric vehicles
People who are not allowed to drive do not have any impact on electric vehicles use

We then decided to do the following cleaning:

Only kept the “Total” for the Male-Female category
Only kept the “Total” for the Swiss-Foreign category
Only kept people aged between 18 and 99. (People aged over 99 account for 0.02% of the population)

Then, instead of keeping a raw data set with all ages between 18 and 99, we decided to group them in widely accepted age segments, namely Generation Z, Millennials, Generation X and Baby boomers. This categorization will help us to analyse whether different generations have a different approach to their mobility or not.

Code

# Clean and process the demographic data
df_demographic <- demographic_data %>%
  filter(Year >= 2005, Year <= 2022) %>%
  filter(str_detect(Sex, "total"), str_detect(`Citizenship (category)`, "total")) %>%
  mutate(Age_clean = gsub(" years", "", Age)) %>%
  filter(!str_detect(Age_clean, "total")) %>%
  mutate(Age_num = as.numeric(Age_clean)) %>%
  filter(!is.na(Age_num), Age_num >= 18, Age_num <= 98) %>%
  select(Year, Age = Age_num, `Population on 1 January`) %>%
  mutate(
    Generation = cut(Age, breaks = c(17, 26, 42, 58, Inf), labels = c("Generation Z", "Millennials", "Generation X", "Baby Boomers"), include.lowest = TRUE),
    Year = ymd(paste(Year, "01", "01", sep = "-")) # Converting Year to Date format
  ) %>%
  group_by(Year, Generation) %>%
  summarise(Population = sum(`Population on 1 January`, na.rm = TRUE)) %>%
  pivot_wider(names_from = Generation, values_from = Population)
#> Warning: There was 1 warning in `mutate()`.
#> i In argument: `Age_num = as.numeric(Age_clean)`.
#> Caused by warning:
#> ! NAs introduced by coercion

reactable(
  df_demographic, 
  columns = list(
    Year = colDef(
      align = "center",  # Center align the year column
      minWidth = 100
    ),
    `Generation Z` = colDef(
      align = "right",  # Right align the population column
      style = function(value) {
        color <- ifelse(value > 860000, "green", "black")  # Green for population over 1 million
        list(color = color)
      }
    ),
    Millennials = colDef(
      align = "right",
      style = function(value) {
        color <- ifelse(value > 1900000, "green", "black")
        list(color = color)
      }
    ),
    `Generation X` = colDef(
      align = "right",
      style = function(value) {
        color <- ifelse(value > 2000000, "green", "black")
        list(color = color)
      }
    ),
    `Baby Boomers` = colDef(
      align = "right",
      style = function(value) {
        color <- ifelse(value > 2300000, "green", "black")
        list(color = color)
      }
    )
  ),
  highlight = TRUE,  # Highlight rows on hover
  defaultPageSize = 10,  # Display 10 rows per page
  paginationType = "numbers",  # Use numbers for page navigation
  searchable = TRUE,  # Make the table searchable
  sortable = TRUE,  # Allow sorting
  resizable = TRUE  # Allow column resizing
)

Code

str(df_demographic)
#> gropd_df [18 x 5] (S3: grouped_df/tbl_df/tbl/data.frame)
#>  $ Year        : Date[1:18], format: "2005-01-01" ...
#>  $ Generation Z: int [1:18] 788674 797102 803780 819198 835811 8482..
#>  $ Millennials : int [1:18] 1783064 1762564 1743133 1743763 1754815..
#>  $ Generation X: int [1:18] 1693753 1722516 1754075 1790725 1832671..
#>  $ Baby Boomers: int [1:18] 1764390 1800973 1839602 1876304 1914329..
#>  - attr(*, "groups")= tibble [18 x 2] (S3: tbl_df/tbl/data.frame)
#>   ..$ Year : Date[1:18], format: "2005-01-01" ...
#>   ..$ .rows: list<int> [1:18] 
#>   .. ..$ : int 1
#>   .. ..$ : int 2
#>   .. ..$ : int 3
#>   .. ..$ : int 4
#>   .. ..$ : int 5
#>   .. ..$ : int 6
#>   .. ..$ : int 7
#>   .. ..$ : int 8
#>   .. ..$ : int 9
#>   .. ..$ : int 10
#>   .. ..$ : int 11
#>   .. ..$ : int 12
#>   .. ..$ : int 13
#>   .. ..$ : int 14
#>   .. ..$ : int 15
#>   .. ..$ : int 16
#>   .. ..$ : int 17
#>   .. ..$ : int 18
#>   .. ..@ ptype: int(0) 
#>   ..- attr(*, ".drop")= logi TRUE

2.2.4 Data Wrangling : Swiss vehicle clean

Code

# Define a function to process vehicle data, excluding the first row as it's not actual data
process_vehicle_data <- function(vehicle_data, start_year, end_year) {
  # the first row is a header or metadata and should be excluded
  vehicle_data <- vehicle_data[-1, ]
  
  col_names <- c("Canton", "VehicleGroupType", "Fuel", "Month", paste0("X", start_year:end_year))
  names(vehicle_data) <- col_names
  
  vehicle_data %>%
    filter(!str_detect(trimws(VehicleGroupType), "^>")) %>%
    rename(Location = Canton, VehicleType = VehicleGroupType) %>%
    mutate(VehicleType = str_remove(VehicleType, "^\\.\\.\\.\\s*")) %>%
    pivot_longer(cols = all_of(paste0("X", start_year:end_year)), names_to = "Year", values_to = "Count") %>%
    mutate(Year = as.numeric(str_remove(Year, "X")),
           MonthNum = match(Month, month.name),
           Date = as.Date(paste(Year, ifelse(is.na(MonthNum), 1, MonthNum), "01", sep = "-"), format = "%Y-%m-%d")) %>%
    select(-Month, -Year, -MonthNum)
}

# Process 2005 to 2008 and 2009 to 2022 data
v_2005_2008 <- process_vehicle_data(vehicle_data_2005_2008, 2005, 2008)
v_2009_2022 <- process_vehicle_data(vehicle_data_2009_2022, 2009, 2022)

# Merge and arrange data
df_v <- bind_rows(v_2005_2008, v_2009_2022) %>%
  arrange(Date)

# Canton and Fuel Type Standardization
standard_names <- c(
  "Switzerland" = "Switzerland", "Zürich" = "ZH", "Bern" = "BE", "Luzern" = "LU",
  "Uri" = "UR", "Schwyz" = "SZ", "Obwalden" = "OW", "Nidwalden" = "NW",
  "Glarus" = "GL", "Zug" = "ZG", "Fribourg" = "FR", "Solothurn" = "SO",
  "Basel-Stadt" = "BS", "Basel-Landschaft" = "BL", "Schaffhausen" = "SH",
  "Appenzell Ausserrhoden" = "AR", "Appenzell Innerrhoden" = "AI", 
  "St. Gallen" = "SG", "Graubünden" = "GR", "Aargau" = "AG", 
  "Thurgau" = "TG", "Ticino" = "TI", "Vaud" = "VD", "Valais" = "VS",
  "Neuchâtel" = "NE", "Genève" = "GE", "Jura" = "JU", "Confederation" = "Confederation"
)

df_v <- df_v %>%
  mutate(Location = iconv(Location, from = "latin1", to = "UTF-8")) %>%
  mutate(Location = map_chr(str_split(Location, " / "), ~ .x[1])) %>%
  mutate(Location = standard_names[Location],
         Fuel = case_when(
           Fuel %in% c("Petrol-electricity: conventional hybrid", "Diesel-electricity: conventional hybrid") ~ "Conventional hybrid",
           Fuel %in% c("Petrol-electricity: plug-in hybrid", "Diesel-electricity: plug-in hybrid") ~ "Plug-in hybrid",
           Fuel == "Gas (monovalent and bivalent)" ~ "Gas",
           TRUE ~ Fuel))

# Count number of vehicles for a particular year
#vehicle_count_2022 <- df_v %>%
#  filter(Location == "Switzerland", year(Date) == 2023, VehicleType == "Passenger car") %>%
#  summarize(TotalCount = sum(Count))

Redundancy

The dataset, representing new vehicle registrations in Switzerland from 2005 onwards, exhibited redundancy in its structure. For example, the data contained both main categories (like ‘> Passenger cars’) and their respective sub-categories (‘… Passenger car’ and ‘.. Heavy passenger car’). On examination, it was observed that the counts under the main categories were simply aggregates of the counts of their sub-categories. Such redundancy could lead to double counting in analytical processes.

We addressed this issue by removing main categories that were aggregates of their sub-categories, preventing potential double counting in analyses. For two subsets of data (2005 to 2008 and 2009 to 2022), We filtered out main categories, cleaned and standardized the column names, reshaped the data set from wide to long format to facilitate analysis, and extracted and formatted the year information for each observation. This process allowed for a more granular and consistent data set for further analysis and interpretation.

Merging and Joining

As the vehicle data was split across multiple files, it was necessary to combine them for a holistic view.

We used bind_rowsto create df_v, then sorted the dataset by year to ensure a structured chronological view of the data.

Rows with 0

Retaining rows with a Count of 0 ensures the completeness of our dataset, indicating periods when specific vehicle types or fuel categories had no registrations. This not only provides a holistic view of vehicle adoption trends over time but also aids in generating continuous time series visualizations without gaps, offering a true representation of the data.

Code

# Display cleaned data
reactable(head(df_v, 500), sortable = TRUE, searchable = TRUE)

2.2.5 Data wrangling French vehicle clean

This dataset will be used to compare our data in Switzerland with a bordering country (FR)

It is important to note that this data set contains all registered vehicles in a given year (vs. all new registered cars for our Swiss data) We will only keep the vehicles used for personal use (not the ones for professional use), to match our Swiss dataset

We decided to remove some anecdotal observations (i.e Hydrogen, Unknown, etc.) as their production / use is marginal. Moreover, we have decided to regroup the different kind of hybrid motors into two Hybrid categories. The 2 categories are the plugable hybrid vehicle HR (petrol or diesel) and the non-plugable hybrid vehicles HNR (petrol or diesel). This was done to match our Swiss data set.

The main issue with this data set was to set it in a way where it is comparable to our Swiss cars data set. Indeed, this dataset presents the total of all cars in France for a given year and type, while our Swiss data set accounts for new registration of cars for a given year and type.

To be able to compare two data sets, we decided to format the French data set in the following way:

Remove categories that are not relevant for our analysis, and widely under-represented in the population (i.e Gaz, hydrogen)
Focus on the private use of cars (vs. professional), because the decision to use a certain type of cars for professional reasons are often explained by factors hard to capture in our analysis (i.e bundle deals, greenwashing). We therefore hypothesized that private use of a certain type of car better gives insights on population view.
The match our swiss data set (new registration per year), we computed the difference of total cars registered in France per years. Having only the data between 2011 and 2022, we have “lost” the insights for 2011. Indeed, when creating the deltas (difference) columns, 2011 deltas could not be computed (no data for 2010), we therefore decided to delete that year from our data

Code

# Only taking the private cars from this data and transforming the data
df_v_fr <- france_v[25:37, ] %>%
  t() %>%
  as_tibble(.name_repair = "minimal") %>%
  setNames(.[1, ]) %>%
  slice(-1) %>%
  mutate(Year = 2011:2022) %>%
  select(-c("Particulier", "Gaz", "Gaz HNR", "Gaz HR", "Hydrogène et autre ZE", "Inconnu")) %>%
  mutate(across(-Year, ~ floor(as.numeric(.)))) %>%
  mutate(
    Conventional_Hybrid = as.numeric(`Diesel HNR`) + as.numeric(`Essence HNR`),
    Plug_in_Hybrid = as.numeric(`Diesel HR`) + as.numeric(`Essence HR`),
    across(c(Diesel, Essence, Conventional_Hybrid, Plug_in_Hybrid, Electrique), as.numeric),
    Diesel_delta = Diesel - lag(Diesel),
    Essence_delta = Essence - lag(Essence),
    Conventional_Hybrid_delta = Conventional_Hybrid - lag(Conventional_Hybrid),
    Plug_in_Hybrid_delta = Plug_in_Hybrid - lag(Plug_in_Hybrid),
    Electrique_delta = Electrique - lag(Electrique)
  ) %>%
  filter(!is.na(Diesel_delta)) %>%
  select(Date = Year, Diesel, Diesel_delta, Essence, Essence_delta, Conventional_Hybrid, Conventional_Hybrid_delta, Plug_in_Hybrid, Plug_in_Hybrid_delta, Electrique, Electrique_delta) %>%
  mutate(Date = as.Date(paste(Date, "-01-01", sep = ""), format = "%Y-%m-%d"))

# Display cleaned data
reactable(head(df_v_fr, 100), sortable = TRUE, searchable = TRUE)

2.2.6 Availability of charging stations

This data set will be used to explore and analyze the effects of charging station availability on EVs adoption.

However, two important points are to note.

We were unable to find data before November 2020
Charging station availability and EV in market are likely strongly reciprocally correlated.

Moreover, it is important to keep standardized dates and locations

We merged ‘year’ and ‘month’ columns to create a ‘Date’ column in the format YYYY-MM-01, subsequently reorganizing the dataframe to position Date as the first column and removing the original year and month columns.

Code

# Combine 'year' and 'month' columns to create a new 'Date' column
charging_station$Date <- as.Date(paste(charging_station$year, charging_station$month, "01", sep = "-"), format = "%Y-%m-%d")

# Rearrange columns with 'Date' as the first column and drop 'year' and 'month'
df_charging_station <- charging_station %>%
  select(Date, everything()) %>%
  select(-year, -month)

# Display cleaned data
reactable(df_charging_station,
          sortable = TRUE, 
          searchable = TRUE)

2.2.7 Availability of charging station in France and Switzerland

This data set is used to complement the prior charging station data set.

This data covers period running from 2012 to 2023. The data concern both Switzerland and France.

Code

# Only selecting necessary columns
charge_ch_fr <- charge_ch_fr %>%
  select(c("year", "region", "powertrain", "value"))

# Setting year as.Date
charge_ch_fr$year <- as.Date(
  paste(charge_ch_fr$year,"-01-01", sep = ""), format = "%Y-%m-%d")

df_charge_number_CH <- charge_ch_fr %>%
  filter(region == "Switzerland")

df_charge_number_FR <- charge_ch_fr %>%
  filter(region == "France")

# Display cleaned data
reactable(charge_ch_fr,
          sortable = TRUE, 
          searchable = TRUE)

2.2.8 Political strength per canton

This data set will be used to explore and analyze the effects of political convictions on EVs adoption (per Canton).

The two main difficulties with this data set were the following:

Create a process (loop) general enough to clean all the Cantons at the same time, but specific enough for it to work.
The Swiss political parties’ names have evolved quite a bit over the years.

To use this data effectively in our analysis. We will group them in 5 categories.

The 5 categories respond to the following question: “What is the political party’s stance on sustainability / ecological measures?”

Against
Slightly Against
Neutral
Slightly in Favour
In Favour

We assigned each political parties in one of those categories based on their program.

Code

# We import each different sheets (one per canton) into a data set
political_data_sheets <- lapply(setdiff(excel_sheets("../data/political_data.xlsx"), "Contenu"), 
                                function(sheet) {
                                  p_data <- suppressMessages(read_excel("../data/political_data.xlsx", sheet = sheet))
                                  return(p_data)
                                })

sheet_names <- setdiff(political_data_sheets_prep, "Contenu")

named_data_list <- setNames(political_data_sheets, sheet_names)

# Now to the cleaning part
for (i in seq_along(named_data_list)) {
  # setting one dataset we work on
  current_dataset <- named_data_list[[i]]
  
  # where is "taux de participation"
  index_to_keep <- which(current_dataset[[1]] == "Taux de participation")[1]
  
  # keep only rows until "taux de participation" and delete the first one "force des partis"
  named_data_list[[i]] <- current_dataset[2:index_to_keep-2, ]
  
  # remove columns 2 and 3 (they are all NAs)
  named_data_list[[i]] <- named_data_list[[i]][ ,-c(2,3)]
  
  # transposing the data sets to have years in a single columns
  transposed_data <- t(named_data_list[[i]])
  named_data_list[[i]] <- as_tibble(transposed_data[-1, ])
  
  # set column names and change date as.Date
  colnames(named_data_list[[i]]) <- transposed_data[1, ]
  
  named_data_list[[i]][[2]] <- as.Date(
  paste(named_data_list[[i]][[2]], "-01-01", sep = ""), format = "%Y-%m-%d")
  
  # find the columns that are only NAs and remove them
  named_data_list[[i]] <- named_data_list[[i]][, colSums(!is.na(named_data_list[[i]])) > 0, drop = FALSE]
  
  # Removing all non-numeric values
  named_data_list[[i]][, -1] <- apply(named_data_list[[i]][, -1], 2, function(x) as.numeric(gsub("[^0-9.]", "", x)))
  
  # Setting the first column name as "Date"
  colnames(named_data_list[[i]])[1] <- "Date"
  
  # Removing the rows before 01.01.1999
  threshold_date <- as.Date("1999-01-01")
  named_data_list[[i]] <- named_data_list[[i]][named_data_list[[i]]$Date >= threshold_date, ]
}
#> Warning: The `x` argument of `as_tibble.matrix()` must have unique column
#> names if `.name_repair` is omitted as of tibble 2.0.0.
#> i Using compatibility `.name_repair`.
#> Warning in FUN(newX[, i], ...): NAs introduced by coercion

# Creating a map for the political parties stance on sustainability
sustainability_mapping <- c(
  "Against" = c("MCG (MCR)","PBD",  "PBD 1",    "PBD 2",    "UDC"),
  "Slightly Against" = c("Lega","PDC",  "PDC 1",    "PDC 2",    "PLR",  "PLS",  "UDF"),
  "Neutral" = c("Adl",  "DS",   "PdL",  "POCH", "PSL",  "PST",  "Rép.", "Sol.","Separ.", "Autres"),
  "Slightly in Favour" = c("Il Centro", "Il Centro 1",  "Le Centre 1","Le Centre",  "Le Centre 2",  "PCS","PPD",    "PPD 1"),
  "In Favour" = c("AVF",    "AVF 1",    "PEV", "PS",    "PSA",  "PVL","VERDI",  "VERDI 2","VERT-E-S  ", "VERT-E-S 2",   "VERT-E-S 3")
)

# Looping through the data sets the change the names of the parties by their stance
for (i in seq_along(named_data_list)){
  current_dataset <- named_data_list[[i]]
  
  for (party_name in names(current_dataset)[-1]) {
    stance <- sapply(sustainability_mapping, function(x) party_name %in% x)
    stance <- names(stance)[which(stance)]
    
    if (length(stance) > 0) {
      selected_columns <- intersect(c(party_name, stance), colnames(current_dataset))
      
      current_dataset[[stance]] <- rowSums(current_dataset[selected_columns], na.rm = TRUE)
      current_dataset[[party_name]] <- NULL
    }
  }
  named_data_list[[i]] <- current_dataset
}

# Creating my 5 columns
categories <- c("Against", "Slightly Against", "Neutral", "Slightly in Favour", "In Favour")

# Aggregating the information of the data sets into these 5 categories
for (i in seq_along(named_data_list)) {
  current_dataset <- named_data_list[[i]]
  
  # Extract the Date column
  result_dataset <- current_dataset[, "Date", drop = FALSE]
  
  # Loop through each category and aggregate values
  for (category in categories) {
    matching_columns <- grep(paste0("^", category, "\\d*$"), colnames(current_dataset), value = TRUE)
    result_dataset[[category]] <- rowSums(current_dataset[matching_columns], na.rm = TRUE)
  }
  
  named_data_list[[i]] <- result_dataset
}

# Storing the data sets into a list:
list_politic <- list()

for (i in seq_along(named_data_list)) {
  tibble_name <- paste0("politic_", sheet_names[i])
  list_politic[[tibble_name]] <- named_data_list[[i]]
}

# Correcting a NA in politic_AI
new_dates <- list_politic[["politic_VD"]]$Date
list_politic[["politic_AI"]]$Date <- new_dates

# Creating a data set per year for Switzerland
political_combined_data <- bind_rows(list_politic, .id = "Canton")
political_combined_data$Year <- as.integer(format(political_combined_data$Date, "%Y"))
political_combined_data <- political_combined_data[, -which(names(political_combined_data) == "Date")]

political_summarized_data <- political_combined_data %>%
  group_by(Year, Canton) %>%
  summarize(
    Against = sum(Against),
    `Slightly Against` = sum(`Slightly Against`),
    Neutral = sum(Neutral),
    `Slightly in Favour` = sum(`Slightly in Favour`),
    `In Favour` = sum(`In Favour`)
  )

yearly_political_datasets <- list()
unique_years <- unique(political_combined_data$Year)
for (year in unique_years) {
  year_political_dataset <- political_combined_data %>% filter(Year == year)
  yearly_political_datasets[[as.character(year)]] <- year_political_dataset
}

# now accessible via ' political_yearly_data$Year '
political_combined_data <- political_combined_data %>%
  mutate(
    Canton = sub("politic_", "", Canton),  # Remove 'politic_' prefix
    Year = ymd(paste(Year, "01", "01"))    # Convert Year to date type
  )

political_combined_data <- political_combined_data %>%
  mutate(
    KANTONSNUM = case_when(
      Canton == "ZH" ~ 1,
      Canton == "BE" ~ 2,
      Canton == "LU" ~ 3,
      Canton == "UR" ~ 4,
      Canton == "SZ" ~ 5,
      Canton == "OW" ~ 6,
      Canton == "NW" ~ 7,
      Canton == "GL" ~ 8,
      Canton == "ZG" ~ 9,
      Canton == "FR" ~ 10,
      Canton == "SO" ~ 11,
      Canton == "BS" ~ 12,
      Canton == "BL" ~ 13,
      Canton == "SH" ~ 14,
      Canton == "AR" ~ 15,
      Canton == "AI" ~ 16,
      Canton == "SG" ~ 17,
      Canton == "GR" ~ 18,
      Canton == "AG" ~ 19,
      Canton == "TG" ~ 20,
      Canton == "TI" ~ 21,
      Canton == "VD" ~ 22,
      Canton == "VS" ~ 23,
      Canton == "NE" ~ 24,
      Canton == "GE" ~ 25,
      Canton == "JU" ~ 26
    )
  )

# Display cleaned data
reactable(political_combined_data,
          sortable = TRUE, 
          searchable = TRUE)

Code

str(political_combined_data)
#> tibble [182 x 8] (S3: tbl_df/tbl/data.frame)
#>  $ Canton            : chr [1:182] "ZH" "ZH" "ZH" "ZH" ...
#>  $ Against           : num [1:182] 32.5 33.4 33.9 35.1 34.3 ...
#>  $ Slightly Against  : num [1:182] 24.9 23.7 22.8 18.8 21.6 ...
#>  $ Neutral           : num [1:182] 7.46 12.36 9.09 2.64 2.8 ...
#>  $ Slightly in Favour: num [1:182] 0.218 0 0.147 0.157 0 ...
#>  $ In Favour         : num [1:182] 35 39.7 42 43.2 41.3 ...
#>  $ Year              : Date[1:182], format: "1999-01-01" ...
#>  $ KANTONSNUM        : num [1:182] 1 1 1 1 1 1 1 2 2 2 ...

2.2.9 Swiss Population

We chose to use the population data from the year 2022 to ensure the most current and relevant demographic context, providing a contemporary snapshot that aligns closely with the latest trends in electric vehicle registrations.

The relevant data starts from row 5, canton names are in the first column, and the population figures for 2022 are in the second column.

Code

df_swisspop_2022 <- df_swisspop_2022 %>%
  slice(-1:-4) %>%  # Remove the first 4 rows
  select(Canton = 1, TotalPopulation = 2)  # Select only the canton names and population figures

# Remove rows with NAs in the Canton column
df_swisspop_2022 <- df_swisspop_2022 %>%
  filter(!is.na(Canton))

# Map Canton names to abbreviations
df_swisspop_2022 <- df_swisspop_2022 %>%
  mutate(CantonAbbreviation = case_when(
    Canton == "Zurich" ~ "ZH",
    Canton == "Bern" ~ "BE",
    Canton == "Lucerne" ~ "LU",
    Canton == "Uri" ~ "UR",
    Canton == "Schwyz" ~ "SZ",
    Canton == "Obwalden" ~ "OW",
    Canton == "Nidwalden" ~ "NW",
    Canton == "Glarus" ~ "GL",
    Canton == "Zug" ~ "ZG",
    Canton == "Fribourg" ~ "FR",
    Canton == "Solothurn" ~ "SO",
    Canton == "Basel-Stadt" ~ "BS",
    Canton == "Basel-Landschaft" ~ "BL",
    Canton == "Schaffhausen" ~ "SH",
    Canton == "Appenzell A. Rh." ~ "AR",
    Canton == "Appenzell I. Rh." ~ "AI",
    Canton == "St. Gallen" ~ "SG",
    Canton == "Graubünden" ~ "GR",
    Canton == "Aargau" ~ "AG",
    Canton == "Thurgau" ~ "TG",
    Canton == "Ticino" ~ "TI",
    Canton == "Vaud" ~ "VD",
    Canton == "Valais" ~ "VS",
    Canton == "Neuchâtel" ~ "NE",
    Canton == "Geneva" ~ "GE",
    Canton == "Jura" ~ "JU",
    TRUE ~ NA_character_  # For unrecognized cantons
  ))

# Map Canton names to KANTONSNUM
df_swisspop_2022 <- df_swisspop_2022 %>%
  mutate(KANTONSNUM = case_when(
    Canton == "Graubünden" ~ 18,
    Canton == "Bern" ~ 2,
    Canton == "Valais" ~ 23,
    Canton == "Vaud" ~ 22,
    Canton == "Ticino" ~ 21,
    Canton == "St. Gallen" ~ 17,
    Canton == "Zurich" ~ 1,
    Canton == "Fribourg" ~ 10,
    Canton == "Lucerne" ~ 3,
    Canton == "Aargau" ~ 19,
    Canton == "Uri" ~ 4,
    Canton == "Thurgau" ~ 20,
    Canton == "Schwyz" ~ 5,
    Canton == "Jura" ~ 26,
    Canton == "Neuchâtel" ~ 24,
    Canton == "Solothurn" ~ 11,
    Canton == "Glarus" ~ 8,
    Canton == "Basel-Landschaft" ~ 13,
    Canton == "Obwalden" ~ 6,
    Canton == "Nidwalden" ~ 7,
    Canton == "Geneva" ~ 25,
    Canton == "Schaffhausen" ~ 14,
    Canton == "Appenzell A. Rh." ~ 15,
    Canton == "Zug" ~ 9,
    Canton == "Appenzell I. Rh." ~ 16,
    Canton == "Basel-Stadt" ~ 12,
    TRUE ~ NA_integer_  # For unrecognized cantons
  ))


df_swisspop_2022$TotalPopulation <- as.integer(df_swisspop_2022$TotalPopulation)
# Filter out rows where CantonAbbreviation is NA
df_swisspop_2022 <- df_swisspop_2022 %>%
  filter(!is.na(CantonAbbreviation))

# Display cleaned data
reactable(df_swisspop_2022,
          sortable = TRUE, 
          searchable = TRUE)

3 Exploratory data analysis

3.1 Switzerland

3.1.1 seasonality

Code

#creating dataset for the three seasonality graphs
passenger_cars_processed <- df_v %>%
  filter(VehicleType == "Passenger car") %>%
  mutate(YearMonth = floor_date(Date, "month")) %>%
  group_by(YearMonth) %>%
  summarise(Count = sum(Count, na.rm = TRUE)) %>%
  ungroup() %>%
  mutate(Year = year(YearMonth), 
         Month = factor(month(YearMonth), levels = 1:12, labels = month.abb)) %>%
  arrange(Year, Month)

# Ensure the YearMonth is in Date format
passenger_cars_processed$YearMonth <- as.Date(passenger_cars_processed$YearMonth, format = "%Y-%m-%d")

# Calculate a smoothed series using a rolling mean
passenger_cars_processed$Smoothed <- rollmean(passenger_cars_processed$Count, k = 12, fill = NA)

# Create an xts object with both the original and smoothed counts
df_xts <- xts(passenger_cars_processed[, c("Count", "Smoothed")], order.by = passenger_cars_processed$YearMonth)

# Plot using dygraphs
p_seaso1 <- dygraph(df_xts, main = "Passenger Car Adoption Over Time in Switzerland", width = "600px", height = "400px") %>%
  dySeries("Count", label = "Number of Passenger Cars Registered", color = "#24918d") %>%
  dySeries("Smoothed", label = "Smoothed Trend", color = "#2f114a") %>%
  dyOptions(stackedGraph = FALSE) %>%
  dyRangeSelector(height = 20)

# Print the dygraph to display it
p_seaso1

The purple line represents a smoothed trend, indicating an initial increase in car registrations until around 2014-2015, followed by a gradual decline. The blue line shows the actual number of cars registered, with significant variability. We will call this variation seasonality which is better represented in the following graph.

Code

# Plotting the data with ggplot2, showing the trend within each year
p_seaso_2 <- ggplot(passenger_cars_processed, aes(x = Month, y = Count, group = Year, color = as.factor(Year))) +
  geom_smooth(se = FALSE, method = "loess", span = 0.5, size = 0.7) +
  labs(title = "Monthly Passenger Car Registrations by Year",
       x = "Month",
       y = "Number of Passenger Cars Registered",
       color = "Year") +
  theme_minimal() +
  scale_color_viridis_d() +
  theme(legend.position = "bottom", axis.text.x = element_text(angle = 45, hjust = 1))
#> Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
#> i Please use `linewidth` instead.

# Convert to an interactive plotly object
interactive_plot_seaso_2 <- ggplotly(p_seaso_2, width = 600, height = 400)

# Adjust plotly settings 
interactive_plot_seaso_2 <- interactive_plot_seaso_2 %>%
  layout(margin = list(l = 40, r = 10, b = 40, t = 40), # Adjust margins
         legend = list(orientation = "h", x = 0, xanchor = "left", y = -0.2)) # Adjust legend position

# Display the interactive plot
interactive_plot_seaso_2

This pattern suggests a seasonal trend with a mid-year peak and a year-end increase. 2020 reacts differently than other years. We suggest that it is probably related to Covid policies.

Code

# Plotting the data with ggplot2, showing the trend within each year
p_seaso_3 <- ggplot(passenger_cars_processed, aes(x = Month, y = Count, group = Year, color = as.factor(Year))) +
  geom_line() +
  facet_wrap(~ Year, scales = "free_y") +  # Facet by year with free y scales
  labs(title = "Seasonal Trends in Passenger Car Registrations",
       x = "Month",
       y = "Number of Passenger Cars Registered") +
  theme_minimal() +
  scale_color_viridis_d(guide = FALSE) +  # Use viridis color scale and remove the guide/legend
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) + # Rotate x-axis labels by 45 degrees
  theme(axis.text.x = element_blank(), # This will remove the month labels
        axis.text.y = element_blank(), # This will remove the month labels
        axis.ticks.x = element_blank(), # This will remove the ticks on the x-axis
        legend.position = "none") # Remove the legend to clean up the plot

# Convert to an interactive plotly object
interactive_plot_seaso_3 <- ggplotly(p_seaso_3, width = 600, height = 400) %>%
  layout(xaxis = list(tickmode = "array",
                      tickvals = 1:12,
                      ticktext = month.abb))

# Display the interactive plot
interactive_plot_seaso_3

This version of the graphs helps us visualizing the evolution of registration for each year individually. As mentioned above, 2020 is the only year which show a decreasing peak instead of an increasing one as it is presented for all other years.

3.1.2 Vehicule Registration by Fuel time over time

Code

# Define color palette for each fuel type using Viridis
fuel_colors <- c(
  "Diesel" = viridis(5)[1], 
  "Electricity" = viridis(5)[2], 
  "Conventional hybrid" = viridis(5)[3], 
  "Plug-in hybrid" = viridis(5)[4], 
  "Petrol" = viridis(5)[5]
)

# Filter df_v for specific fuel types and vehicle type
filtered_df <- df_v %>%
  filter(Fuel %in% c("Petrol", "Diesel", "Conventional hybrid", "Plug-in hybrid", "Electricity") &
         VehicleType == "Passenger car")

# Group by Date and Fuel type, and summarize the count
fuel_type_trends <- filtered_df %>%
  group_by(Date, Fuel) %>%
  summarize(Count = sum(Count, na.rm = TRUE), .groups = 'drop')

# Plotting the trends over time by fuel type
p_fuel_over_time <- ggplot(fuel_type_trends, aes(x = Date, y = Count, color = Fuel)) +
  geom_line(alpha = 0.3, aes(color = Fuel)) +  # Use Viridis colors for lines
  scale_color_manual(values = fuel_colors) +   # Apply the defined color palette
  geom_smooth(aes(group = Fuel), se = FALSE, method = "loess", span = 0.1) +  # Add smoothed lines per fuel type
  labs(title = "Vehicle Registrations by Fuel Type Over Time",
       x = "Date",
       y = "Number of Vehicles Registered") +
  theme_minimal() +
  theme(legend.position = "bottom")  # Adjust legend position to the bottom

# Convert to an interactive plotly object
interactive_plot_fuel_over_time <- ggplotly(p_fuel_over_time, width = 600, height = 400)

# Adjust plotly settings 
interactive_plot_fuel_over_time <- interactive_plot_fuel_over_time %>%
  layout(legend = list(orientation = "h", x = 0, xanchor = "left", y = -0.2))
interactive_plot_fuel_over_time

The graph above shows the vehicle registrations in Switzerland by fuel type from 2005 to 2023.

On the one hand, it highlights the recent decrease in new registrations of thermic vehicles since 2017. On the other hand, the three EVs named as Conventional and Plug-in hybrids along with Electricity ones are gradually increasing since approximately the same period. (In fact, fully electric cars and Conventional hybrids have even reached a higher level of vehicle registration in the past years, spotlighting the EV tendency in recent years.)

3.1.3 Availability of Charging station

3.1.3.1 Availability of Charging station in Switzerland

Code

# Reshape the data into a wide format
df_wide <- df_charge_number_CH %>%
  spread(key = powertrain, value = value)

# Ensure the 'date' column is of type Date
df_wide$year <- as.Date(df_wide$year)

# Convert to xts object for dygraphs
xts_data <- xts(df_wide[, -1], order.by = df_wide$year)

# Create the dygraph with options for improved readability
dygraph(xts_data, main = "Available Charging Stations in Switzerland", width = "600px", height = "400px") %>%
  dySeries("Publicly available fast", label = "Fast Charging", color = "#2f114a") %>%
  dySeries("Publicly available slow", label = "Slow Charging", color = "#24918d") %>%
  dyOptions(strokeWidth = 1.5, fillGraph = TRUE) %>%
  dyLegend(show = "always") %>%
  dyRangeSelector(height = 30) # Adds a range selector for zooming in and out

The graph above points out the evolution of charging station in Switzerland since 2012. As we can see, their numbers seem to be in an exponential increase since 2017. The main point of this graph is to put the spotlight on the fact that both publicly fast and publicly slow are continuously increasing. Furthermore, the sum of the two evolution lines represents the total number of publicly available charging spots in Switzerland.

3.1.3.2 Availability of Charging station in France

Code

# Reshape the data into a wide format
df_wide <- df_charge_number_FR %>%
  spread(key = powertrain, value = value)

# Ensure the 'date' column is of type Date
df_wide$year <- as.Date(df_wide$year)

# Convert to xts object for dygraphs
xts_data <- xts(df_wide[, -1], order.by = df_wide$year)

# Create the dygraph with options for improved readability
dygraph(xts_data, main = "Available Charging Stations in France", width = "600px", height = "400px") %>%
  dySeries("Publicly available fast", label = "Fast Charging", color = "#2f114a") %>%
  dySeries("Publicly available slow", label = "Slow Charging", color = "#24918d") %>%
  dyOptions(strokeWidth = 1.5, fillGraph = TRUE) %>%
  dyLegend(show = "always") %>%
  dyRangeSelector(height = 20) # Adds a range selector for zooming in and out

The chart above follows the same tendency except that it now represents the French publicly available charging spots. Again, the most important aspect of this chart is the fact that both publicly fast and publicly slow are continuously increasing. The tendency seems to be quite similar with the Swiss evolution with the exception that the exponential increase stars from 2020 while the Swiss adoption already started in 2017. From 2016 to 2020, France seems to have a linear increase in the number of charging stations publicly available.

The last two charts allow us to have a better general idea of the evolution of charging stations both in Switzerland and in France, which will be quite useful for answering our two last research questions.

3.1.4 Map

3.1.4.1 Count of Electricity car Registration for all years per cantons

swissBOUNDARIES3D_1_4_TLM_KANTONSGEBIET.shp contains the boundaries of the cantons of Switzerland.

The choice to display the sum of electric vehicle registrations over all years on the map was made to provide a comprehensive historical perspective, highlighting the total adoption of electric vehicles in each Swiss canton since the beginning of the dataset’s timeframe. This approach effectively captures the cumulative impact of electric vehicle adoption across the country.

The process involved filtering vehicle registration data for electric passenger cars and summing up the total registrations for each Swiss canton. This data was then combined with population figures and geographic boundaries to calculate electric vehicle (EV) registrations per capita, providing a standardized comparison of EV adoption across cantons.

Code

detach("package:xts", unload = TRUE)
# Read in the shapefile for Swiss cantons
swiss_cantons <- st_read("../data/CH_map/swissBOUNDARIES3D_1_4_TLM_KANTONSGEBIET.shp")
#> Reading layer `swissBOUNDARIES3D_1_4_TLM_KANTONSGEBIET' from data source `C:\Users\UrsHu\Pillars\Learn\Academic\Master\Semestre 1\Data Science\0_group_project\dsfba_project\data\CH_map\swissBOUNDARIES3D_1_4_TLM_KANTONSGEBIET.shp' 
#>   using driver `ESRI Shapefile'
#> Simple feature collection with 50 features and 20 fields
#> Geometry type: POLYGON
#> Dimension:     XYZ
#> Bounding box:  xmin: 2490000 ymin: 1080000 xmax: 2830000 ymax: 1300000
#> z_range:       zmin: 193 zmax: 4610
#> Projected CRS: CH1903+ / LV95 + LN02 height
# Define canton abbreviations for matching
abbreviation_values <- c("ZH", "BE", "LU", "UR", "SZ", "OW", "NW", "GL", "ZG", "FR", "SO", "BS", "BL", "SH", "AR", "AI", "SG", "GR", "AG", "TG", "TI", "VD", "VS", "NE", "GE", "JU")

# Prepare the EV data with sum over all years
df_v_map <- df_v %>%
  filter(!Location %in% c("Switzerland", "Confederation"), 
         Fuel == "Electricity", VehicleType == "Passenger car") %>%
  mutate(KANTONSNUM = match(Location, abbreviation_values)) %>%
  group_by(KANTONSNUM) %>%
  summarize(TotalEV = sum(Count), .groups = 'drop')

# Merge EV data with population data
df_v_map <- left_join(df_v_map, df_swisspop_2022, by = c("KANTONSNUM" = "KANTONSNUM"))

str(df_v_map)
#> tibble [26 x 5] (S3: tbl_df/tbl/data.frame)
#>  $ KANTONSNUM        : num [1:26] 1 2 3 4 5 6 7 8 9 10 ...
#>  $ TotalEV           : int [1:26] 29541 11652 5365 325 2960 551 690..
#>  $ Canton            : chr [1:26] "Zurich" "Bern" "Lucerne" "Uri" ...
#>  $ TotalPopulation   : int [1:26] 1579967 1051437 424851 37317 1649..
#>  $ CantonAbbreviation: chr [1:26] "ZH" "BE" "LU" "UR" ...
# Calculate EV registrations per capita
df_v_map <- df_v_map %>%
  mutate(EV_per_Capita = TotalEV / TotalPopulation)

# Merge with shapefile data
map_data <- left_join(swiss_cantons, df_v_map, by = "KANTONSNUM")

# Ensure the geometries are valid and the CRS is set to WGS 84
# Check if 'map_data' is already an sf object
if (!inherits(map_data, "sf")) {
  map_data_sf <- st_as_sf(map_data, wkt = "geometry")
} else {
  map_data_sf <- map_data
}

# Ensure the geometries are valid and the CRS is set
map_data_sf <- st_make_valid(map_data_sf)
# Reproject the data to WGS 84 (EPSG:4326)
map_data_sf <- st_transform(map_data_sf, crs = 4326)


# Create color palettes for the 'Total' and 'EV_per_Capita' columns
color_palette_total <- colorNumeric(palette = "viridis", domain = map_data_sf$TotalEV)
color_palette_per_capita <- colorNumeric(palette = "viridis", domain = map_data_sf$EV_per_Capita)



# Create the leaflet maps
leaflet_map_total <- leaflet(map_data_sf) %>%
  addProviderTiles(providers$CartoDB.Positron) %>%
  addPolygons(
    fillColor = ~color_palette_total(TotalEV),
    weight = 1,
    color = "#FFFFFF",
    fillOpacity = 0.7,
    popup = ~paste(NAME, "<br>Total EV Registrations: ", TotalEV)
  ) %>%
  addLegend(
    pal = color_palette_total, 
    values = ~TotalEV, 
    opacity = 0.7, 
    title = "Total EV <br> Registrations",
    position = "topright"
  )
#second one
leaflet_map_per_capita <- leaflet(map_data_sf) %>%
  addProviderTiles(providers$CartoDB.Positron) %>%
  addPolygons(
    fillColor = ~color_palette_per_capita(EV_per_Capita),
    weight = 1,
    color = "#FFFFFF",
    fillOpacity = 0.7,
    popup = ~paste(NAME, "<br>EV Registrations per Capita: ", 
                   round(EV_per_Capita, 3))
  ) %>%
  addLegend(
    pal = color_palette_per_capita, 
    values = ~EV_per_Capita, 
    opacity = 0.7, 
    title = "EV Registrations <br> per Capita",
    position = "topright"
  )

# Print the maps to view them
leaflet_map_total

3.1.4.2 Count of Electricity car Registration for all years per cantons Standardized

For example, Zurich (ZH) has a relatively lower EV_per_Capita value (0.01870) despite a high total number of EV registrations (29,541), due to its large population (1,579,967). In contrast, Zug (ZG) shows a higher EV_per_Capita (0.04666) with fewer EV registrations (6,120) but a much smaller population (131,164), indicating a greater adoption rate when adjusted for population size.

The limitation of this approach is that it considers the total population, not accounting for the segment of the population that is of driving age or interested in vehicle ownership, which could further refine the EV adoption rates.

Code

leaflet_map_per_capita

3.2 Google Trend

Code

library(xts)
# Convert to xts object for dygraphs
xts_data <- xts(df_gtrends[, -1], order.by = df_gtrends$Date)

# Calculate rolling mean with a width of 12 
roll_mean <- rollapply(xts_data, width = 5, FUN = mean, by.column = TRUE, align = "right", fill = NA)

# Create the dygraph 
dygraph(roll_mean, main = "Google Search About EV in Switzerland", width = "600px", height = "400px") %>%
  dySeries("SearchRatio", label = "Search Ratio", color = "#24918d") %>%
  dyOptions(strokeWidth = 1.5, fillGraph = TRUE) %>%
  dyLegend(show = "always") %>%
  dyRangeSelector(height = 20) # Adds a range selector for zooming in and out

Another interesting point to look out is the Google search about EV engine in Switzerland. As we can see, it seems to have skyrocketed since 2016. This allows us to get an overview of the Swiss population interest concerning this topic and comfort us towards our initial predictions.

3.3 Oil

The chart below represents the oil price evolution through the last two decades. We can observe a quite high degree of volatility concerning the oil valuation. Nevertheless, it seems important to highlight that this value has greatly increased since the past 3 years.

The fig.show='animate' option tells Quarto to render the plot as an animation.

Code


# Create a ggplot object with your data
p <- ggplot(df_oil, aes(x = Date, y = Price, group = 1)) +
  geom_line(color = "#24918d", size = 1) +
  labs(x = "Date", y = "Price", title = "Oil Price Over Time")

# Animate the plot with gganimate, revealing the line over time
animated_plot <- p +
  transition_reveal(Date)

# Render the animation
animate(animated_plot, renderer = gganimate::gifski_renderer(), width = 600, height = 400, res = 96)

3.4 Demographics

Code

# Define color palette for each generation
generation_colors <- c(
  "Generation Z" = viridis(4)[1],
  "Millennials" = viridis(4)[2],
  "Generation X" = viridis(4)[3],
  "Baby Boomers" = viridis(4)[4]
)

# Pivot the data
demographic_data_long <- df_demographic %>%
  pivot_longer(
    cols = c('Generation Z', 'Millennials', 'Generation X', 'Baby Boomers'),
    names_to = "Generation",
    values_to = "Population"
  )

# Plotting the data with ggplot2
p_demog <- ggplot(demographic_data_long, aes(x = Year, y = Population, color = Generation)) +
  geom_line(size = 1) +
  labs(title = "Demographic Trends in Switzerland",
       x = "Year",
       y = "Population") +
  theme_minimal() +
  theme(legend.position = "bottom") +
  scale_color_manual(values = generation_colors)

# Convert the ggplot object to an interactive plotly object
interactive_plot_demog <- ggplotly(p_demog, width = 600, height = 400)

# Adjust plotly settings 
interactive_plot_demog <- interactive_plot_demog %>%
  layout(legend = list(orientation = "h", x = 0, xanchor = "left", y = -0.2))

interactive_plot_demog

This graph concentrates on the demographic trends’ evolution for each segment of age above mentioned throughout the time. It points out the important overall rise of individuals living in Switzerland over the past two decades. It is quite interesting to note that Baby Boomers demographic evolution is still linearly increasing while the Generation Z’s one seems to be moderately flat.

3.5 French vehicles

3.5.1 Total vehicles evolution France

Code

# Define color palette for each fuel type
fuel_colors <- c(
  "Diesel" = viridis(5)[1], 
  "Electrique" = viridis(5)[2], 
  "Conventional_Hybrid" = viridis(5)[3], 
  "Plug_in_Hybrid" = viridis(5)[4], 
  "Essence" = viridis(5)[5]
)

# Reshape data to long format
long_registration_data <- df_v_fr %>%
  select(Date, Diesel, Electrique, Conventional_Hybrid, Plug_in_Hybrid, Essence) %>%
  gather(key = "Fuel_Type", value = "Count", -Date)

# Create ggplot
p <- ggplot(long_registration_data, aes(x = Date, y = log(Count), color = Fuel_Type)) +
  geom_line(size = 1) +
  scale_color_manual(values = fuel_colors) +
  labs(x = "Date", y = "Log-Scale Count", color = "Fuel Type") +
  theme_minimal()

# Convert to interactive plot and adjust legend
ggplotly(p, width = 600, height = 400) %>%
  layout(legend = list(orientation = 'h', x = 0.5, xanchor = 'center', y = -0.15))

The graph above represents the French total number of vehicles from 2012 to 2022. While thermic motors (diesel and essence) convey the impression of a flat growth, both electric and hybrid vehicles seem to follow an increasing slope. It is important to mention that the log-scale count linearizes our results which allows us to compare the two growths. Indeed, without the log-scale the actual numbers of thermic and electric would still be too far away to be compared.

3.5.2 Deltas evolution

Code

fuel_colors <- c(
  "Diesel" = viridis(5)[1], 
  "Electricity" = viridis(5)[2], 
  "Conventional hybrid" = viridis(5)[3], 
  "Plug-in hybrid" = viridis(5)[4], 
  "Petrol" = viridis(5)[5]
)
# Reshape data to long format
long_df_v_fr <- df_v_fr %>%
  select(Date, Diesel_delta, Essence_delta, Conventional_Hybrid_delta, Plug_in_Hybrid_delta, Electrique_delta) %>%
  gather(key = "Fuel_Type", value = "Delta", -Date)

# Map Fuel_Type values to desired names
long_df_v_fr <- long_df_v_fr %>%
  mutate(Fuel_Type = case_when(
    Fuel_Type == "Diesel_delta" ~ "Diesel",
    Fuel_Type == "Essence_delta" ~ "Petrol",
    Fuel_Type == "Conventional_Hybrid_delta" ~ "Conventional hybrid",
    Fuel_Type == "Plug_in_Hybrid_delta" ~ "Plug-in hybrid",
    Fuel_Type == "Electrique_delta" ~ "Electricity",
    TRUE ~ Fuel_Type  # Keep the original value if none of the conditions match
  ))

# Create ggplot
p <- ggplot(long_df_v_fr, aes(x = Date, y = Delta, color = Fuel_Type)) +
  geom_line(size = 1) +
  labs(title = "Evolution of cars registered in France over the years by fuel type (Deltas)",
       x = "Years",
       y = "Value",
       color = "Fuel Category") +
  theme_minimal() +
  scale_color_manual(values = fuel_colors)

# Convert to interactive plot and adjust legend
ggplotly(p, width = 600, height = 400) %>%
  layout(legend = list(orientation = 'h', x = 0.5, xanchor = 'center', y = -0.3))

Now, we will consider a graph showing the evolution of cars registered in France from 2012 to 2022. It is quite different from the last one because it shows the evolution of registered cars in France instead of the total number which allows us to take out a potential tendency in the recent car purchases, neglecting the actual composition of the French automotive fleet. On the one hand, Diesel cars new registrations evolution is fiercely dropping down while petrol ones seem to be surviving for the moment. On the other hand, both electric and hybrid cars gradually increase since 2020 and reached the same evolution rate as petrol cars.

3.6 Political Parties

Here, we have a cluster plot of the Swiss Cantons according to their political stance on sustainability in 1999 and in 2023

1999:

Code

# Let's start with 1999
political_data_1999 <- political_combined_data[,-8] %>%
  filter(Year == as.Date("1999-01-01")) %>%
  select(c("Canton", "Against", "Slightly Against", "Neutral", "Slightly in Favour", "In Favour"))

# We will use the K-Means method
# We start by looking for the right amount of clusters
fviz_nbclust(political_data_1999[,-1], kmeans, method = "wss") +
  geom_vline(xintercept = 7, linetype = 2, color = 'red')

# We can see that 7 clusters seems to be the choice
# Changing my tibble
pol_cantons_1999 <- as.data.frame(political_data_1999)
rownames(pol_cantons_1999) <- pol_cantons_1999$Canton
pol_cantons_1999 <- pol_cantons_1999[,-1]

# Fit k-means with 7 clusters
km.res <- kmeans(pol_cantons_1999, 7, nstart = 26)

cluster_plot <- fviz_cluster(km.res, data = pol_cantons_1999) +
  ggtitle("Cluster Analysis of Cantons' stance on sustainability 1999") +
  scale_color_viridis_d() + # Apply viridis palette to points
  scale_fill_viridis_d()    # Apply viridis palette to cluster polygons

# Display the plot
cluster_plot


# Then the PCA:
pca_cantons <- prcomp(pol_cantons_1999)

pca_plot <- fviz_pca(pca_cantons,
                     col.ind = as.factor(km.res$cluster), # This assigns a color to each cluster
                     label = "all",
                     repel = TRUE) +
  scale_color_viridis_d() +  # This applies the viridis palette to the colors
  ggtitle("PCA Plot for each Canton") +
  coord_cartesian(xlim = c(-100, 100), ylim = c(-100, 100)) +
  theme(legend.title = element_blank()) # Optionally remove the legend title


pca_plot

#Now the PCA for the variables
pca_variables <- prcomp(t(pol_cantons_1999))

# Plot variables
fviz_pca_var(pca_variables, col.var = "contrib", 
             gradient.cols = viridis(3),
             repel = TRUE) +
  ggtitle("PCA Plot for Variables")

The last four charts represent the clustering of the Swiss cantons about their political view concerning the environment in 1999.

On these plots, GL seem to have an outlier behaviour. Upon further exploring we can see that this is explained by the fact that, in 1999, this canton was represented at 85,7% by the “Parti socialiste Suisse”, namely a political party in the “In Favour” category. This strong influence of a single political party explains the atypical position of this Canton in our Cluster plot.

2023:

Code

# Now for 2023
political_data_2023 <- political_combined_data[,-8] %>%
  filter(Year == as.Date("2023-01-01")) %>%
  select(c("Canton", "Against", "Slightly Against", "Neutral", "Slightly in Favour", "In Favour"))

# Changing my tibble
pol_canton_2023 <- as.data.frame(political_data_2023)
row.names(pol_canton_2023) <- pol_canton_2023$Canton
pol_canton_2023 <- pol_canton_2023[,-1]

# How many clusters?
fviz_nbclust(pol_canton_2023, kmeans, method = "wss") +
  geom_vline(xintercept = 6, linetype = 2, color = 'red')

# We can see that 6 clusters seems to be the choice

# Fit k-means with 6 clusters
km.res_2023 <- kmeans(pol_canton_2023, 6, nstart = 26)
fviz_cluster(km.res_2023, data = pol_canton_2023) +
  ggtitle("Cluster Analysis of Cantons' stance on sustainability 2023") +
  scale_color_viridis_d() + # Apply viridis palette to points
  scale_fill_viridis_d()    # Apply viridis palette to cluster polygons

#Then the PCA plot
pca_cantons <- prcomp(pol_canton_2023)
pca_plot <- fviz_pca(pca_cantons,
                     col.ind = as.factor(km.res$cluster),
                     label = "all",
                     repel = TRUE) +
  ggtitle("PCA Plot for each Canton") +
  scale_color_viridis_d() +  # This applies the viridis palette to the colors
  pca_plot + coord_cartesian(xlim = c(-100, 100), ylim = c(-100, 100))

#Now the PCA for the variables
pca_variables <- prcomp(t(pol_canton_2023))

# Plot variables
fviz_pca_var(pca_variables, col.var = "contrib",
             gradient.cols = viridis(3),
             repel = TRUE) +
  ggtitle("PCA Plot for Variables")

The last four graphs represent the clustering of the Cantons according to their political stance in 2023.

From 1999 to 2023, we can point out that we went from 7 to 6 optimal numbers of optimal clusters, which shows more homogeneity concerning this topic along the Swiss Cantons through the time.

First, we can observe that GL has now joined other cantons in the Cluster 1. Indeed, their political representation is far more balanced in 2023 than it was in 1999. It is interesting to see that VD, NE and GE seem to follow the same tendencies both in 1999 and 2023, and are represented by Dimension 1 and 2 by roughly the same amounts.

Another interesting observation is that NW and UR are their own cluster both in 1999 and 2023, as these two geographical neighbors likely show the same political inclination. Indeed, in 1999, both NW and UR had an overwhelming majority of “Slightly Against” (90.4% and 81.7% respectively). In 2023, these two cantons are displaying some of the strongest “Against” tendencies. We can see in the PCA plot for variables (both 1999 and 2023) that these two cantons have a strong contribution to the principal components and therefore explain a large portion of the variance.

One last thing we would like to mention is how the cantons are clustered in a way that is close to their regional/geographical distribution in Switzerland (especially in the 2023 cluster plot). With more populated cantons such as GE or ZH in a cluster, more rural cantons such as VS or GR in another cluster, and then historically conservative cantons such as AR and OW sharing a cluster.

3.7 Swiss vs France

3.7.1 Electric vs Hybrid vs Petrol

Here we compare and visualize specific vehicle fuel types between Swiss and French datasets.

The highlighted line are for Switzerland, whereas the others are for France. We standardized the counts in both the Swiss and French datasets for comparison purposes.

Please click on the legend of the graph to show more comparison between other types of fuel.

Code

# Filtering Swiss data for specific fuel types
swiss_specific_fuel <- df_v %>%
  filter(Fuel %in% c("Diesel", "Electricity", "Conventional hybrid", "Plug-in hybrid", "Petrol")) %>%
  filter(Location == 'Switzerland') |>
  filter(VehicleType == 'Passenger car') |>
  filter(Date > as.Date('2012-01-01')) |>
  filter(Date < as.Date('2021-12-31'))

# Selecting equivalent columns from the French dataset
french_specific_fuel <- df_v_fr %>%
  select(Date, Diesel_delta, Essence_delta, Conventional_Hybrid_delta, Plug_in_Hybrid_delta, Electrique_delta) # Adjust column names accordingly

# Reshape French dataset to long format for easier plotting
french_specific_fuel_long <- french_specific_fuel %>%
  pivot_longer(cols = -Date, names_to = "Fuel", values_to = "Count")

# Standardize counts in each dataset
swiss_specific_fuel <- swiss_specific_fuel %>%
  mutate(Count = scale(Count))

french_specific_fuel_long <- french_specific_fuel_long %>%
  mutate(Count = scale(Count))

# Rename the 'Fuel' column in the French dataset
french_specific_fuel_long <- french_specific_fuel_long %>%
  mutate(Fuel = case_when(
    Fuel == "Diesel_delta" ~ "Diesel",
    Fuel == "Essence_delta" ~ "Petrol",
    Fuel == "Conventional_Hybrid_delta" ~ "Conventional hybrid",
    Fuel == "Plug_in_Hybrid_delta" ~ "Plug-in hybrid",
    Fuel == "Electrique_delta" ~ "Electricity"
  ))

swiss_specific_fuel$Date <- as.Date(swiss_specific_fuel$Date)
french_specific_fuel_long$Date <- as.Date(french_specific_fuel_long$Date)

# Define color palette for each fuel type
fuel_colors <- c(
  "Diesel" = viridis(5)[1], 
  "Electricity" = viridis(5)[2], 
  "Conventional hybrid" = viridis(5)[3], 
  "Plug-in hybrid" = viridis(5)[4], 
  "Petrol" = viridis(5)[5]
)
# Create the ggplot
p <- ggplot() +
  geom_line(data = french_specific_fuel_long, aes(x = Date, y = Count, color = Fuel), size = 1, alpha = 0.4) +
  geom_line(data = swiss_specific_fuel, aes(x = Date, y = Count, color = Fuel), size = 1) +
  scale_color_manual(values = fuel_colors) +
  labs(x = "Date", y = "Standardized Count", color = "Fuel Type") +
  theme_minimal() +
  scale_x_date() # This will handle the dates on the x-axis

# Convert to interactive plot using plotly
interactive_plot <- ggplotly(p, tooltip = c("x", "y", "color"), width = 600, height = 600)

# Use style() to set the visibility of the lines
for (i in 1:length(interactive_plot$x$data)) {
  # Assuming the name of the trace includes the fuel type
  if (grepl("Electricity", interactive_plot$x$data[[i]]$name)) {
    interactive_plot$x$data[[i]]$visible <- TRUE
  } else {
    interactive_plot$x$data[[i]]$visible <- 'legendonly'
  }
}

# Adjust the layout of the interactive plot
p_ch_vs_fr1 <- interactive_plot %>%
  layout(legend = list(orientation = "h", x = 0.5, xanchor = "center", y = -0.2)) # Adjust the legend position

# Print the plot
p_ch_vs_fr1

We can see what seems to be a good correlation between the different results. We can observe that electric and hybrid cars follows roughly the same trajectory while thermic ones seem to follow a decreasing slope, both for France and Switzerland.

Code

p <- ggplot() +
  geom_smooth(data = swiss_specific_fuel, aes(x = Date, y = Count, color = Fuel), 
              method = "loess", se = FALSE, size = 1.5) +
  geom_line(data = french_specific_fuel_long, aes(x = Date, y = Count, color = Fuel), 
            alpha = 0.4, size = 0.8) +
  scale_color_manual(values = fuel_colors, labels = c("Switzerland", "France")) +
  labs(x = "Date", y = "Standardized Count") +
  theme_minimal() +
  theme(legend.position = "bottom",
        strip.background = element_blank(),
        strip.text.x = element_text(size = 10, angle = 0),
        axis.text.x = element_blank(),
        axis.text.y = element_blank(),
        axis.title.x = element_text(size = 12, margin = margin(t = 10)),
        axis.title.y = element_text(size = 12, margin = margin(r = 10))) +
  facet_wrap(~Fuel, scales = 'free_y', ncol = 1)

# Convert to interactive plot
p_ch_vs_fr2 <- ggplotly(p, width = 600, height = 800, tooltip = c("x", "y", "color"))
interactive_plot <- interactive_plot %>%
  layout(legend = list(orientation = "h", x = 0, xanchor = "left", y = -0.2))
p_ch_vs_fr2

The chart above solely represents a zoomed view of the previous one, which highlights the same intuitions as the prior one.

3.8 EV and Google Trends

All in all, the Google trend seems to be in adequation with the actual rise in EVs around Switzerland.

Code


df_v_electric <- df_v |>
  filter(VehicleType == "Passenger car") |>
  filter(Fuel == 'Electricity')
df_electric_vehicles_agg <- df_v_electric %>% group_by(Date) %>% summarize(Count = sum(Count))

# Merge datasets
merged_df <- merge(df_electric_vehicles_agg, df_gtrends, by = "Date")

# Calculate the ratio for the secondary axis
max_values <- merged_df |> summarize(max_count = max(Count, na.rm = TRUE), max_search = max(SearchRatio, na.rm = TRUE))
ratio <- max_values$max_count / max_values$max_search

# Add the ratio-adjusted SearchRatio to the merged dataset
merged_df <- merged_df |> mutate(AdjustedSearchRatio = SearchRatio * ratio)

# Pick colors from the viridis palette
colors <- viridis(2.1)

# Plotting with smoothing and color changes using viridis colors
p <- ggplot(merged_df, aes(x = Date)) +
  geom_bar(aes(y = AdjustedSearchRatio, fill = "Google Trends"), stat = "identity") +
  geom_smooth(aes(y = Count, color = "Electric Vehicles Smoothed"), method = "loess", span = 0.2) +
  scale_y_continuous(
    "Number of Electric Vehicles",
    sec.axis = sec_axis(~ . / ratio, name = "Google Trends")
  ) +
  labs(title = "Comparison of Electric Vehicle Rise and Google Trends Over Time",
       x = "Date", fill = "Legend", color = "Legend") +
  scale_fill_manual(values = c("Google Trends" = colors[2.1])) + # Use first viridis color for bars
  scale_color_manual(values = c("Electric Vehicles Smoothed" = colors[1.5])) + # Use second viridis color for line
  theme_minimal() +
  theme(legend.position = "bottom")

# Convert to interactive plot
interactive_plot <- ggplotly(p, tooltip = "text", width = 600, height = 400) %>%
  layout(legend = list(orientation = "h", x = 0.5, xanchor = "center", y = -0.3))

interactive_plot

3.9 EV and Oil Price

The results of the graphic comparing oil price and EVs evolution comforts us in our choice of including oil price as an explicative variable. Indeed, the tendency is quite similar since 2020. In fact, the previous years refer to a period where EVs were not as commercialized as today. We are, however, aware that many other variables exist that explain the rise in both oil prices and electric vehicle (EV) adoption over time.

Code

# Resample data to monthly frequency and calculate mean oil price
df_oil_monthly <- df_oil %>% 
                  mutate(Date = as.Date(format(Date, "%Y-%m-01"))) %>%
                  group_by(Date) %>% 
                  summarize(Price = mean(Price), .groups = 'drop')

# Resample electric vehicles data to monthly frequency and sum counts
df_electric_vehicles_monthly <- df_electric_vehicles_agg %>% 
                               mutate(Date = as.Date(format(Date, "%Y-%m-01"))) %>%
                               group_by(Date) %>% 
                               summarize(Count = sum(Count), .groups = 'drop')

# Merge datasets
df_merged <- full_join(df_electric_vehicles_monthly, df_oil_monthly, by = "Date")

# Calculate the ratio for the secondary axis
max_values <- df_merged |> summarize(max_count = max(Count, na.rm = TRUE), max_price = max(Price, na.rm = TRUE))
ratio <- max_values$max_count / max_values$max_price

# Add the ratio-adjusted Price to the merged dataset
df_merged <- df_merged |> mutate(AdjustedPrice = Price * ratio)

# Create the viridis color palette
viridis_colors <- viridis_pal()(2.1)

# Plotting with smoothing and color changes
p <- ggplot(df_merged, aes(x = Date)) +
  geom_smooth(aes(y = Count, color = "Electric Vehicles Smoothed"), method = "loess", span = 0.2) +
  geom_line(aes(y = AdjustedPrice, color = "Oil Price")) +
  scale_y_continuous(
    "Number of Electric Vehicles",
    sec.axis = sec_axis(~ . / ratio, name = "Oil Price")
  ) +
  labs(title = "Comparison of Electric Vehicle Rise and Oil Prices Over Time",
       x = "Date", color = "Legend") +
  scale_color_manual(values = c("Electric Vehicles Smoothed" = viridis_colors[1], "Oil Price" = viridis_colors[2])) +
  theme_minimal() +
  theme(legend.position = "bottom")

# Convert to interactive plot
interactive_plot <- ggplotly(p, width = 600, height = 400) %>%
  layout(legend = list(orientation = "h", x = 0.5, xanchor = "center", y = -0.3))

interactive_plot

4 Analysis

4.1 RQ1 - Based on past electric vehicle adoption trends in Switzerland, can we forecast future adoption rates and pinpoint times of significant increases or decreases correlated with major events or policy changes?

Here is the process we used to compute those 4 regressions.

The first step was to choose dependent variables such as sales or EV registrations:

1 . Sales Data :

The main advantage of this regressing with this variable is that it directly reflects market demand and consumer purchasing behaviour. Nonetheless, it can be influenced by short-term factors such as promotions or subsidies, which might not indicate long-term adoption trends.

Registration Data :

The benefits we can obtain from including this variable is that it represents actual additions to the vehicle population and can be more indicative of long-term trends in EV adoption. On the other hand, it might lag behind sales data, as registration occurs post-purchase and can be influenced by administrative processes.

As we are interested in long-term trends in EV adoption and usage, registration data is concluded to be our final choice.

We then performed some analyse using linear regression and multivariate regression.

To evaluate the quality of our models we then reviewed the results and checking weather the key assumptions of the regressions that are:

Linearity: if relationships between variables are linear The linear assumption tells us if the relationship between the independent and dependent variables is a straight line. This makes the model simpler and easier to interpret.

We will diagnose this assumptions through a scatter plot of observed vs. predicted values, looking for a straight line pattern; if the points closely follow a straight line, it suggests a linear relationship.

Normal distribution: if the errors are normally distributed In linear regression, the normal assumption refers to the idea that the errors or residuals, which represent the deviations between observed and predicted values, follow a normal (Gaussian) distribution.

We will diagnose this assumptions through an histogram of residuals which will allows us to see if the distribution of residuals resembles a normal distribution.

Homoscedasticity: if the errors have constant variance A key indicator of homoscedasticity is a random spread of residuals across all levels of fitted values. If the residuals display a pattern, it suggests heteroscedasticity, which means the variance of residuals changes across fitted values, violating the assumption of homoscedasticity.

We will diagnose this assumptions through a scatter plot of residuals vs. fitted values, looking for an even spread (constant variance) of the residuals across different levels of the fitted values.

Multicollinearity : if independent variables are highly correlated with one and another (only for multivariable regression)

The VIF (Variance Inflation Factor) values will indicate the level of multicollinearity. When variables in a model are too similar to each other, it can ruin the accuracy of our predictions.

We will diagnose this through the VIF measure. A rule of thumb is that if VIF is greater than 5 or 10, it indicates high multicollinearity. A high VIF indicates that a predictor variable is highly correlated with other predictors in the model, potentially causing instability in coefficient estimates.

If one of those assumptions doesn’t hold, the model might not accurately capture the relationship, leading to incorrect predictions and conclusions.

Finally, we concluded by indicating the relevance of the results taking into account the quality of the model.

4.1.1. Simple Linear Regression

We started with the simple linear regression models to understand how well a single independent variable can predict EV adoption rates.

4.1.1.1 With Oil

We started with oil price as a independent variable.

Linear models were fitted using the lm() function

Note: the decision to remove rows with missing values was justified as the number of missing values in the Price column was relatively small (28 out of a larger dataset) and couldn’t be reliably imputed, ensuring that the analysis was performed on a more complete and consistent dataset.

Code

# Select the columns to keep in df_v_electric
df_v_electric <- df_v_electric %>%
  select(Location, Count, Date)

# Merge df_oil_monthly with df_v_electric based on the common 'Date' column
df_merged <- merge(df_v_electric, df_oil_monthly, by = "Date", all.x = TRUE)

# Remove rows with missing values in the 'Price' column
df_merged <- df_merged[complete.cases(df_merged), ]

# Splitting the data into training and testing sets
set.seed(123) # for reproducibility
split_index <- sample(1:nrow(df_merged), 0.8 * nrow(df_merged))
train_data <- df_merged[split_index, ]
test_data <- df_merged[-split_index, ]

# Fitting the linear model
mod1_lin <- lm(Count ~ Price, data = train_data)

# Summary of the model to view coefficients and statistics
summary(mod1_lin)
#> 
#> Call:
#> lm(formula = Count ~ Price, data = train_data)
#> 
#> Residuals:
#>    Min     1Q Median     3Q    Max 
#>    -40    -40    -39    -27   5514 
#> 
#> Coefficients:
#>             Estimate Std. Error t value Pr(>|t|)    
#> (Intercept) 40.42528   10.10153    4.00  6.4e-05 ***
#> Price       -0.00542    0.12620   -0.04     0.97    
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 222 on 4814 degrees of freedom
#> Multiple R-squared:  3.83e-07,   Adjusted R-squared:  -0.000207 
#> F-statistic: 0.00184 on 1 and 4814 DF,  p-value: 0.966

# Predicting with the test data
predictions <- predict(mod1_lin, test_data)
# Calculating residuals (difference between actual and predicted values)
residuals <- test_data$Count - predictions

# Calculating RMSE
rmse <- sqrt(mean(residuals^2))
print(paste("RMSE:", rmse))
#> [1] "RMSE: 247.396521283555"
# Creating a summary table for the linear model
table_regression <- tbl_regression(mod1_lin)

4.1.1.1.1 Model Diagnostic

4.1.1.1.1.2 Results

Code

table_regression

Characteristic	Beta	95% CI¹	p-value
Price	-0.01	-0.25, 0.24	>0.9
¹ CI = Confidence Interval

The linear regression analysis results indicate that the Price variable has a very weak or negligible effect on explaining the variation in the Count variable. Specifically:

The coefficient for Price is very close to zero (-0.00542), indicating that there is almost no linear relationship between Price and Count.
The p-value for Price (0.97) is much greater than the common significance level of 0.05. This high p-value suggests that Price is not statistically significant in predicting Count, as it fails to reject the null hypothesis that the coefficient is zero.
The R-squared value is very low (3.83e-07), which indicates that the linear regression model solely explains a negligible fraction of the variance related to the variable Count.
The Root Mean Squared Error (RMSE) is relatively high (247), suggesting that the model’s predictions have a substantial amount of error.

4.1.1.1.1.1.2 Linearity

Code

plot_data <- data.frame(
  Observed = test_data$Count,
  Predicted = predictions
)

# Create ggplot object for observed vs predicted values
p <- ggplot(plot_data, aes(x = Predicted, y = Observed)) +
  geom_point(aes(color = Observed), alpha = 0.6) +  # Use color to represent the Observed values
  geom_abline(intercept = 0, slope = 1, linetype = "dashed", color = "red") +  # Line y=x for reference
  scale_color_viridis(option = "D", direction = 1) +  # Consistent color scheme with the previous plots
  theme_minimal() +
  theme(text = element_text(size = 14), legend.position = "bottom") +  # Consistent theme and text size
  labs(title = "Observed vs. Predicted", x = "Predicted Values", y = "Observed Values", color = "Observed Count") +
  expand_limits(x = 0, y = 0)  # Ensure that the plot starts at 0

# Convert to a plotly interactive plot for consistency with the previous plots
ggplotly(p, tooltip = "text", width = 600, height = 400)

The graph indicates a potential issue with the linearity assumption, as the observed values for low predicted values are more spread out and not evenly distributed along a straight line, suggesting the model may not be an appropriate fit for the data.

4.1.1.1.1.1.3 Homoscedasticity

Code

# Create a data frame for plotting
plot_data <- data.frame(
  Fitted = predictions,
  Residuals = residuals
)

# Create ggplot object
p <- ggplot(plot_data, aes(x = Fitted, y = Residuals)) +
  geom_hline(yintercept = 0, linetype = "dashed") +
  geom_point(aes(color = Residuals), alpha = 0.6) +
  scale_color_viridis(option = "D", direction = 1) +
  theme_minimal() +
  theme(text = element_text(size = 14), legend.position = "bottom") +
  labs(title = "Residuals vs. Fitted", x = "Fitted Values", y = "Residuals", color = "Residual Size")+
  expand_limits(y = -1000)  # Extend y-axis

# Convert to a plotly interactive plot
ggplotly(p, width = 600, height = 400)

This plot shows a pattern where the majority of residuals are clustered around a narrow range of fitted values, with a few extreme residuals. This could indicate that the model is not capturing the variability in the data. We also observe the presence of extreme outliers or influential points that are affecting the model fit.

Note : The outliers represent rare but possible scenarios within the dataset, providing valuable insights into the extremes of the observed phenomenon, and should therefore be retained for a comprehensive analysis of the data’s behavior.

4.1.1.1.1.1.4 Normality

Code

# Histogram of Residuals
p <- ggplot(plot_data, aes(x = Residuals, fill = ..count..)) +
  geom_histogram(binwidth = 250, alpha = 0.6) +  # Wider bins for clearer view
  scale_fill_viridis_c(option = "D", begin = 0.3, end = 0.7, direction = -1) +  # Single color scale
  theme_minimal() +
  theme(text = element_text(size = 14)) +  # Larger text for better readability
  labs(title = "Histogram of Residuals", x = "Residuals", y = "Frequency", fill = "Frequency")
ggplotly(p, width = 600, height = 400 )
#> Warning: The dot-dot notation (`..count..`) was deprecated in ggplot2 3.4.0.
#> i Please use `after_stat(count)` instead.
#> i The deprecated feature was likely used in the ggplot2 package.
#>   Please report the issue at
#>   <https://github.com/tidyverse/ggplot2/issues>.

This plot shows that the residuals are not normally distributed. Instead, there is a large spike near zero and a few large residuals, confirming that there are outliers and that the model is not appropriate for the data.

4.1.1.1.3 Conclusion

In summary, based on these results and the diagnostic graphs, it appears that the Price variable does not have a meaningful impact on predicting the Count variable, and the linear regression model is not suitable for explaining the relationship between these two variables. Further exploration of the data and potentially considering other factors or modeling approaches may be necessary to improve predictive accuracy.

The diagnostic plots obtained from our regression analysis indicate possible issues with linearity, normality, and homoscedasticity, highlighting the need to reconsider the model or explore alternative methods.

4.1.1.2. With google trends

We then continue with the google trends, SearchRatio, as an independent variable.

Code

# Merge the two datasets based on the common Date column
df_merged <- merge(df_merged, df_gtrends, by = "Date")
# Check for missing values in df_merged
#sum(is.na(df_merged))
# Check for missing values in each column of df_merged
#colSums(is.na(df_merged))

# Splitting the data into training and testing sets
set.seed(123) # for reproducibility
split_index <- sample(1:nrow(df_merged), 0.8 * nrow(df_merged))
train_data <- df_merged[split_index, ]
test_data <- df_merged[-split_index, ]

# Fitting the linear model
mod2_lin <- lm(Count ~ SearchRatio, data = train_data)

# Summary of the model to view coefficients and statistics
summary(mod2_lin)
#> 
#> Call:
#> lm(formula = Count ~ SearchRatio, data = train_data)
#> 
#> Residuals:
#>    Min     1Q Median     3Q    Max 
#>   -252    -39    -13      6   5425 
#> 
#> Coefficients:
#>             Estimate Std. Error t value Pr(>|t|)    
#> (Intercept)  -82.043      7.139   -11.5   <2e-16 ***
#> SearchRatio    5.957      0.314    19.0   <2e-16 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 214 on 4814 degrees of freedom
#> Multiple R-squared:  0.0695, Adjusted R-squared:  0.0693 
#> F-statistic:  359 on 1 and 4814 DF,  p-value: <2e-16

# Predicting with the test data
predictions <- predict(mod2_lin, test_data)

# Calculating residuals (difference between actual and predicted values)
residuals <- test_data$Count - predictions

# Calculating RMSE
rmse <- sqrt(mean(residuals^2))
print(paste("RMSE:", rmse))
#> [1] "RMSE: 240.599640776535"

# Creating a summary table for the linear model
table_regression2 <- tbl_regression(mod2_lin)

4.1.2.1 Model Diagnostic

4.1.1.2.1.1. Results

Code

# Viewing the table
table_regression2

Characteristic	Beta	95% CI¹	p-value
SearchRatio	6.0	5.3, 6.6	<0.001
¹ CI = Confidence Interval

The linear regression analysis results indicate that the SearchRatio variable has a small but significant effect on explaining the variation in the Count variable.

The coefficient for SearchRatio is 5.957, indicating that there is a positive linear relationship between SearchRatio and Count. For each unit increase in SearchRatio, the Count increases by an average of 5.957 units.
The p-value for SearchRatio is less than 2e-16, which is far below the common significance level of 0.05. This extremely low p-value strongly suggests that SearchRatio is statistically significant in predicting Count, allowing us to reject the null hypothesis that the coefficient is zero.
The R-squared value is 0.0695, which indicates that the linear regression model explains about 6.95% of the variance in the Count variable. This is a modest amount, indicating that while SearchRatio does contribute to explaining Count, other factors are also likely to play a significant role.
The Root Mean Squared Error (RMSE) is 240.6, suggesting that the model’s predictions have a moderate amount of error. This figure should be evaluated in the context of the magnitude of Count values to determine its practical significance.

4.1.2.1.1.2 Linearity

Code

plot_data <- data.frame(
  Observed = test_data$Count,
  Predicted = predictions
)

# Create ggplot object for observed vs predicted values
p <- ggplot(plot_data, aes(x = Predicted, y = Observed)) +
  geom_point(aes(color = Observed), alpha = 0.6) +  # Use color to represent the Observed values
  geom_abline(intercept = 0, slope = 1, linetype = "dashed", color = "red") +  # Line y=x for reference
  scale_color_viridis(option = "D", direction = 1) +  # Consistent color scheme with the previous plots
  theme_minimal() +
  theme(text = element_text(size = 14), legend.position = "bottom") +  # Consistent theme and text size
  labs(title = "Observed vs. Predicted", x = "Predicted Values", y = "Observed Values", color = "Observed Count") +
  expand_limits(x = 0, y = 0)  # Ensure that the plot starts at 0

# Convert to a plotly interactive plot for consistency with the previous plots
ggplotly(p, tooltip = "text", width = 600, height = 400)

This graph shows that for a wide range of predicted values, the observed values are consistently low, suggesting that the model may not be effectively capturing the variance in the observed data, and there could be an issue with both the linearity and precision of the model’s predictions.

4.1.2.1.1.3. Homoscedasticity

Code

# Create a data frame for plotting
plot_data <- data.frame(
  Fitted = predictions,
  Residuals = residuals
)

# Create ggplot object for Residuals vs. Fitted
p_residuals_fitted <- ggplot(plot_data, aes(x = Fitted, y = Residuals)) +
  geom_hline(yintercept = 0, linetype = "dashed", color = 'red') +
  geom_point(aes(color = Residuals), alpha = 0.6) +
  scale_color_viridis(option = "D", direction = 1) +
  theme_minimal() +
  theme(text = element_text(size = 14), legend.position = "bottom") +
  labs(title = "Residuals vs. Fitted", x = "Fitted Values", y = "Residuals", color = "Residual Size") +
  expand_limits(y = -1000)  # Extend y-axis

# Convert to a plotly interactive plot
ggplotly(p_residuals_fitted, width = 600, height = 400)

The residual plots suggest that there may be non-linearity in the relationship between SearchRatio and Count or the presence of outliers, as indicated by the large residuals for higher fitted values.

4.1.2.1.1.4 Normality

Code

# Create ggplot object for Histogram of Residuals
p_histogram_residuals <- ggplot(plot_data, aes(x = Residuals, fill = ..count..)) +
  geom_histogram(binwidth = 250, alpha = 0.6) +  # Adjust binwidth as needed
  scale_fill_viridis_c(option = "D", begin = 0.3, end = 0.7, direction = -1) +
  theme_minimal() +
  theme(text = element_text(size = 14)) +
  labs(title = "Histogram of Residuals", x = "Residuals", y = "Frequency", fill = "Frequency")

# Convert to a plotly interactive plot
ggplotly(p_histogram_residuals, width = 600, height = 400)

As for the other linear regression, the histogram shows a right-skewed distribution with a long tail, indicating and non-normal distribution and the presence of outliers.

4.1.1.2.2 Conclusion

Overall, while there is a statistically significant relationship between SearchRatio and Count, the model’s R-squared value suggests that other factors not included in the model may also influence the Count variable. One might want to explore additional variables along with more complex models to improve predictive accuracy.

The diagnostic plots from our regression analysis further suggest potential violations of linearity, normality, and homoscedasticity, which may necessitate reevaluation of the model or the use of alternative methods.

4.1.1.3 Multivariable Regression

Given the limitations of simple linear models in explaining our dependent variable, EV adoption, likely due to the intricate interplay of various contributing factors, we will now transition to a more comprehensive approach using a multiple linear regression model.

4.1.1.1.3.1 With Demographic groups, Oil Price and Google Trend

We first explored the combined influence of Demographic groups, Oil Price and Google Trend as predictors.

Code

# Merge the data frames on the 'Year' column
df_merged <- merge(df_merged, df_demographic, by.x = "Date", by.y = "Year")

# Splitting the data into training and testing sets
set.seed(123) # for reproducibility
split_index <- sample(1:nrow(df_merged), 0.8 * nrow(df_merged))
train_data <- df_merged[split_index, ]
test_data <- df_merged[-split_index, ]

# Fitting the multivariable linear model
mod1_multi <- lm(Count ~ SearchRatio + Price + `Generation Z` + Millennials + `Generation X` + `Baby Boomers`, data = df_merged)

# Summary of the model to view coefficients and statistics
summary(mod1_multi)
#> 
#> Call:
#> lm(formula = Count ~ SearchRatio + Price + `Generation Z` + Millennials + 
#>     `Generation X` + `Baby Boomers`, data = df_merged)
#> 
#> Residuals:
#>    Min     1Q Median     3Q    Max 
#> -132.8  -13.9   -4.8    3.7 2068.2 
#> 
#> Coefficients:
#>                 Estimate Std. Error t value Pr(>|t|)  
#> (Intercept)     1.32e+02   1.06e+03    0.12    0.901  
#> SearchRatio     1.04e+00   1.33e+00    0.78    0.437  
#> Price           4.10e-02   2.83e-01    0.14    0.885  
#> `Generation Z`  1.38e-03   1.80e-03    0.77    0.442  
#> Millennials    -3.64e-04   4.96e-04   -0.73    0.464  
#> `Generation X` -1.02e-03   7.95e-04   -1.28    0.201  
#> `Baby Boomers`  6.32e-04   2.66e-04    2.38    0.018 *
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 117 on 497 degrees of freedom
#> Multiple R-squared:  0.0924, Adjusted R-squared:  0.0815 
#> F-statistic: 8.43 on 6 and 497 DF,  p-value: 9.9e-09

# Predicting with the test data
predictions <- predict(mod1_multi, test_data)

# Evaluating the model
# Calculating R-squared value
r_squared <- cor(test_data$Count, predictions)^2
cat("R-squared: ", r_squared, "\n")
#> R-squared:  0.365

# Calculating RMSE (Root Mean Squared Error)
rmse <- sqrt(mean((predictions - test_data$Count)^2))
cat("RMSE: ", rmse, "\n")
#> RMSE:  35.6

# Create a summary table for the linear model
table_regression_mod1_multi <- tbl_regression(mod1_multi)

# Print the table
table_regression_mod1_multi

Characteristic	Beta	95% CI¹	p-value
SearchRatio	1.0	-1.6, 3.7	0.4
Price	0.04	-0.52, 0.60	0.9
Generation Z	0.00	0.00, 0.00	0.4
Millennials	0.00	0.00, 0.00	0.5
Generation X	0.00	0.00, 0.00	0.2
Baby Boomers	0.00	0.00, 0.00	0.018
¹ CI = Confidence Interval

4.1.1.3.1.1 Model Diagnostic

In multivariable regression, we are going to check the same key assumptions of the model as for the linear regression. However, there’s an additional need to check for multicollinearity among predictors.

4.1.1.3.1.2 Results

The multilinear regression analysis results indicate that most of the independent variables have a weak or negligible effect on explaining the variation in the Count variable.

The coefficient for SearchRatio is slightly above zero (1.04e+00), but with a p-value of 0.437, it suggests that SearchRatio is not statistically significant in predicting EV adoption.
The coefficient for Price is also very close to zero (4.10e-02), and its p-value of 0.885 reinforces the conclusion that Price does not significantly contribute to the model.
The generational variables (Generation Z, Millennials, Generation X) have very small coefficients (ranging from -1.02e-03 to 1.38e-03) and p-values well above the significance level, indicating that these, too, are not statistically significant predictors of EV adoption, with the exception of Baby Boomers.
Baby Boomers is the only variable that shows a statistically significant relationship with Count (coefficient = 6.32e-04, p-value = 0.018)
The R-squared value of the model is low (0.0924), meaning it only explains about 9.24% of the variance in the EV adoption variable, which is not substantially improved when considering the adjusted R-squared (0.0815).
The Root Mean Squared Error (RMSE) is 35.6, which suggests that the model’s predictions have a moderate amount of error relative to the magnitude of EV adoption

4.1.1.3.1.3 Linearity

Code

plot_data <- data.frame(
  Observed = test_data$Count,
  Predicted = predictions
)

# Create ggplot object for observed vs predicted values
p <- ggplot(plot_data, aes(x = Predicted, y = Observed)) +
  geom_point(aes(color = Observed), alpha = 0.6) +  # Use color to represent the Observed values
  geom_abline(intercept = 0, slope = 1, linetype = "dashed", color = "red") +  # Line y=x for reference
  scale_color_viridis(option = "D", direction = 1) +  # Consistent color scheme with the previous plots
  theme_minimal() +
  theme(text = element_text(size = 14), legend.position = "bottom") +  # Consistent theme and text size
  labs(title = "Observed vs. Predicted", x = "Predicted Values", y = "Observed Values", color = "Observed Count") +
  expand_limits(x = 0, y = 0)  # Ensure that the plot starts at 0

# Convert to a plotly interactive plot for consistency with the previous plots
ggplotly(p, tooltip = "text", width = 600, height = 400)

The scatter plot shows many points along the lower left, suggesting the model may underpredict for higher actual values, as evidenced by points falling above the red dashed line, which represents perfect prediction. The color intensity indicates fewer occurrences of higher values.

4.1.1.3.1.4 Homoscedasticity

Code

residuals <- test_data$Count - predictions

# Create a data frame for plotting
plot_data <- data.frame(
  Fitted = predictions,
  Residuals = residuals
)

# Create ggplot object for Residuals vs. Fitted
p_residuals_fitted <- ggplot(plot_data, aes(x = Fitted, y = Residuals)) +
  geom_hline(yintercept = 0, linetype = "dashed", color = 'red') +
  geom_point(aes(color = Residuals), alpha = 0.6) +
  scale_color_viridis(option = "D", direction = -1) +
  theme_minimal() +
  labs(title = "Residuals vs. Fitted - Multivariable Model", x = "Fitted Values", y = "Residuals")

# Convert to a plotly interactive plot
ggplotly(p_residuals_fitted, width = 600, height = 400)

This plot indicates some non-random patterns in the residuals. The spread of residuals is not consistent across the range of fitted values, suggesting potential issues with homoscedasticity and possibly with the linearity assumption. The model may not be capturing all the systematic information present in the data.

4.1.1.3.1.5 Normality

Code

plot_data <- data.frame(Residuals = residuals)

# Create ggplot object for Histogram of Residuals
p_histogram_residuals <- ggplot(plot_data, aes(x = Residuals, fill = ..count..)) +
  geom_histogram(binwidth = 20, alpha = 0.7) +  # Adjust binwidth as necessary
  scale_fill_viridis_c(option = "D", begin = 0.3, end = 0.7, direction = -1) +  # Single color scale
  theme_minimal() +
  labs(title = "Histogram of Residuals", x = "Residuals", y = "Frequency", fill = "Frequency")

# Print the plot
ggplotly( p_histogram_residuals, width = 600, height = 400)

This histogram indicates a distribution of residuals that is not symmetric and shows a peak at the center. This central peak is a good sign, as it often indicates that the model’s predictions are on average correct, but the spread and potential skewness of the data still need to be assessed for normality.

4.1.1.3.1.6 Multicollinearity

Code

vif_values <- vif(mod1_multi)

# Print VIF values
print(vif_values)
#>    SearchRatio          Price `Generation Z`    Millennials 
#>           7.10           1.76          88.11          48.53 
#> `Generation X` `Baby Boomers` 
#>         289.98         110.94

# Interpret VIF values
# A rule of thumb is that if VIF is greater than 5 or 10, it indicates high multicollinearity.
high_vif <- vif_values[vif_values > 5]
if(length(high_vif) > 0){
  print("The following variables have a high VIF indicating potential multicollinearity:")
  print(high_vif)
} else {
  print("No evidence of concerning multicollinearity among the predictor variables.")
}
#> [1] "The following variables have a high VIF indicating potential multicollinearity:"
#>    SearchRatio `Generation Z`    Millennials `Generation X` 
#>            7.1           88.1           48.5          290.0 
#> `Baby Boomers` 
#>          110.9

vif_values <- c(SearchRatio = 7.10, Price = 1.76, `Generation Z` = 88.11, 
                Millennials = 48.53, `Generation X` = 289.98, `Baby Boomers` = 110.94)

# Create a data frame from the VIF values
vif_df <- data.frame(
  Variable = names(vif_values),
  VIF = unname(vif_values),
  Multicollinearity_Interpretation = c(
    "Moderate to high, indicating potential concerns.",
    "Low, suggesting little to no multicollinearity.",
    "Very high, indicating significant multicollinearity.",
    "High, indicating significant multicollinearity.",
    "Very high, indicating significant multicollinearity.",
    "Very high, indicating significant multicollinearity."
  )
)

# Generate the summary table
vif_table <- kable(vif_df, "html", booktabs = TRUE) %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"), full_width = F) %>%
  column_spec(1, bold = TRUE, width = "15em") %>%
  column_spec(2, width = "5em") %>%
  column_spec(3, width = "30em")

# Print the table to view it in R Markdown or R HTML output
vif_table

Variable	VIF	Multicollinearity_Interpretation
SearchRatio	7.10	Moderate to high, indicating potential concerns.
Price	1.76	Low, suggesting little to no multicollinearity.
Generation Z	88.11	Very high, indicating significant multicollinearity.
Millennials	48.53	High, indicating significant multicollinearity.
Generation X	289.98	Very high, indicating significant multicollinearity.
Baby Boomers	110.94	Very high, indicating significant multicollinearity.

Not surprinsingly, the VIF was high among the generational variables, which should be addressed to improve the model’s interpretability and reliability.

SearchRatio also demonstrate a high VIF, even though it’s not as extreme as for the generational variables.

4.1.1.3.2. Conclusion

Code

effects <- as.data.frame(fortify(mod1_multi))

# Create ggplot object for 'Baby Boomers' Effect
p_baby_boomers_effect <- ggplot(effects, aes_string(x = "`Baby Boomers`", y = ".fitted")) +
  geom_point(alpha = 0.6) +
  geom_smooth(method = "lm", se = FALSE, color = 'red', linetype = "dashed") +
  theme_minimal() +
  labs(title = "Baby Boomers Effect on EV Adoption", x = "Baby Boomers", y = "Fitted Values")
#> Warning: `aes_string()` was deprecated in ggplot2 3.0.0.
#> i Please use tidy evaluation idioms with `aes()`.
#> i See also `vignette("ggplot2-in-packages")` for more information.

# Convert to a plotly interactive plot
ggplotly(p_baby_boomers_effect, width = 600, height = 400)

This multivariate linear regression model does not appear to have a strong explanatory power, as indicated by the low adjusted R-squared value and the relatively high RMSE. Additionally, only the Baby Boomers variable shows statistical significance, as the graph demonstrate, in predicting the Count, while other variables do not appear to be significant in this context. However, given the high VIF this result should be interpreted with caution. Refining the model or exploring other variables to improve its predictive performance might be suggestible.

4.1.1.3.2 Adding Political Parties

As the initial results were inconclusive, we opted to incorporate the Political Parties dataset and perform a backward integration to determine the most suitable model for our prediction.

Code

# Remove 'Confederation' from df_merged
df_multi_reg <- df_merged[df_merged$Location != 'Confederation', ]

# Step 2: Aggregate data for 'Switzerland' in politic data, For each year, sum the values of the 26 cantons to create a combined observation labeled "Switzerland". and then divide by 26
political_combined_data$Year <- year(ymd(political_combined_data$Year))
df_politics <- political_combined_data %>%
  group_by(Year) %>%
  summarise(across(c(`Against`, `Slightly Against`, `Neutral`, `Slightly in Favour`, `In Favour`), sum, na.rm = TRUE)) %>%
  mutate(Canton = 'Switzerland')
#> Warning: There was 1 warning in `summarise()`.
#> i In argument: `across(...)`.
#> i In group 1: `Year = 1999`.
#> Caused by warning:
#> ! The `...` argument of `across()` is deprecated as of dplyr 1.1.0.
#> Supply arguments directly to `.fns` through an anonymous function
#> instead.
#> 
#>   # Previously
#>   across(a:b, mean, na.rm = TRUE)
#> 
#>   # Now
#>   across(a:b, \(x) mean(x, na.rm = TRUE))

# Append the new 'Switzerland' rows to df_politics
df_politics <- bind_rows(political_combined_data, df_politics)

# List of column names you want to divide by 26
columns_to_divide <- c("Against", "Slightly Against", "Neutral", "Slightly in Favour", "In Favour")

df_politics <- df_politics %>%
  mutate_at(columns_to_divide, list(~./26))

# Step 3: Align 'Date' in df_multi_reg to 'Year' in df_politics 
df_multi_reg$Year <- year(ymd(df_multi_reg$Date))
df_multi_reg <- select(df_multi_reg, -Date)

#Merge the datasets on the aligned 'Location/Canton' and 'Date/Year' columns.
df_politics$Location <- df_politics$Canton 

# Step 4: Merge the datasets on 'Canton' and 'Year'
df_multi_reg <- merge(df_multi_reg, df_politics, by = c("Location", "Year"), all = FALSE)

# Reorder columns in a data frame
df_multi_reg <- df_multi_reg %>%
  select(Count, everything())

df_multi_reg <- df_multi_reg %>% select(-Canton, -KANTONSNUM)
any(is.na(df_multi_reg))
#> [1] FALSE

# Perform regression analysis without 'Location'
full_model <- lm(Count ~ Year + Price + SearchRatio + `Generation Z` + Millennials + `Generation X` + `Baby Boomers` + Against + `Slightly Against` + Neutral + `Slightly in Favour` + `In Favour`, data = df_multi_reg)

# Perform backward elimination
reduced_model <- step(full_model, direction = "backward")
#> Start:  AIC=715
#> Count ~ Year + Price + SearchRatio + `Generation Z` + Millennials + 
#>     `Generation X` + `Baby Boomers` + Against + `Slightly Against` + 
#>     Neutral + `Slightly in Favour` + `In Favour`
#> 
#> 
#> Step:  AIC=715
#> Count ~ Year + Price + SearchRatio + `Generation Z` + Millennials + 
#>     `Generation X` + Against + `Slightly Against` + Neutral + 
#>     `Slightly in Favour` + `In Favour`
#> 
#> 
#> Step:  AIC=715
#> Count ~ Year + Price + SearchRatio + `Generation Z` + Millennials + 
#>     Against + `Slightly Against` + Neutral + `Slightly in Favour` + 
#>     `In Favour`
#> 
#> 
#> Step:  AIC=715
#> Count ~ Year + Price + SearchRatio + `Generation Z` + Against + 
#>     `Slightly Against` + Neutral + `Slightly in Favour` + `In Favour`
#> 
#> 
#> Step:  AIC=715
#> Count ~ Year + Price + SearchRatio + Against + `Slightly Against` + 
#>     Neutral + `Slightly in Favour` + `In Favour`
#> 
#>                        Df Sum of Sq   RSS AIC
#> - Against               1       101 68999 714
#> - `Slightly in Favour`  1       112 69010 714
#> - Year                  1       537 69435 714
#> <none>                              68898 715
#> - SearchRatio           1      1555 70453 716
#> - Price                 1      1668 70566 716
#> - Neutral               1      6034 74932 723
#> - `Slightly Against`    1     17063 85961 737
#> - `In Favour`           1     22929 91828 745
#> 
#> Step:  AIC=714
#> Count ~ Year + Price + SearchRatio + `Slightly Against` + Neutral + 
#>     `Slightly in Favour` + `In Favour`
#> 
#>                        Df Sum of Sq    RSS AIC
#> - `Slightly in Favour`  1        78  69077 712
#> - Year                  1       463  69462 712
#> <none>                               68999 714
#> - SearchRatio           1      1459  70459 714
#> - Price                 1      1624  70624 714
#> - Neutral               1      6289  75289 721
#> - `Slightly Against`    1     17073  86073 736
#> - `In Favour`           1     66523 135523 785
#> 
#> Step:  AIC=712
#> Count ~ Year + Price + SearchRatio + `Slightly Against` + Neutral + 
#>     `In Favour`
#> 
#>                      Df Sum of Sq    RSS AIC
#> - Year                1       478  69556 711
#> <none>                             69077 712
#> - SearchRatio         1      1484  70561 712
#> - Price               1      1696  70773 712
#> - Neutral             1      6457  75534 719
#> - `Slightly Against`  1     18225  87303 735
#> - `In Favour`         1     66469 135547 783
#> 
#> Step:  AIC=711
#> Count ~ Price + SearchRatio + `Slightly Against` + Neutral + 
#>     `In Favour`
#> 
#>                      Df Sum of Sq    RSS AIC
#> - Price               1      1257  70813 710
#> <none>                             69556 711
#> - SearchRatio         1      3651  73207 714
#> - Neutral             1      6957  76513 719
#> - `Slightly Against`  1     17771  87326 733
#> - `In Favour`         1     65991 135547 781
#> 
#> Step:  AIC=710
#> Count ~ SearchRatio + `Slightly Against` + Neutral + `In Favour`
#> 
#>                      Df Sum of Sq    RSS AIC
#> <none>                             70813 710
#> - SearchRatio         1      4023  74836 714
#> - Neutral             1      6722  77535 718
#> - `Slightly Against`  1     17769  88582 733
#> - `In Favour`         1     65625 136438 779

# View the summary of the reduced model
summary(reduced_model)
#> 
#> Call:
#> lm(formula = Count ~ SearchRatio + `Slightly Against` + Neutral + 
#>     `In Favour`, data = df_multi_reg)
#> 
#> Residuals:
#>     Min      1Q  Median      3Q     Max 
#> -111.61  -10.32   -2.99    8.03  153.23 
#> 
#> Coefficients:
#>                    Estimate Std. Error t value Pr(>|t|)    
#> (Intercept)         -12.858      6.017   -2.14   0.0350 *  
#> SearchRatio           0.827      0.342    2.42   0.0173 *  
#> `Slightly Against`  -10.300      2.026   -5.08  1.7e-06 ***
#> Neutral             -18.494      5.915   -3.13   0.0023 ** 
#> `In Favour`          20.121      2.059    9.77  2.4e-16 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 26.2 on 103 degrees of freedom
#> Multiple R-squared:  0.699,  Adjusted R-squared:  0.687 
#> F-statistic: 59.7 on 4 and 103 DF,  p-value: <2e-16

# Create a summary table for the linear model
table_regression_mod2_multi <- tbl_regression(reduced_model)

4.1.1.3.2.1 Model Diagnosis

4.1.1.3.2.1.1 Results

Code

# Print the table
table_regression_mod2_multi

Characteristic	Beta	95% CI¹	p-value
SearchRatio	0.83	0.15, 1.5	0.017
Slightly Against	-10	-14, -6.3	<0.001
Neutral	-18	-30, -6.8	0.002
In Favour	20	16, 24	<0.001
¹ CI = Confidence Interval

Starting Model: The initial model includes a wide range of predictor variables: Year, Price, SearchRatio, various generational groups (Generation Z, Millennials, Generation X,Baby Boomers), and several categories of opinions (Against, Slightly Against, Neutral, Slightly in Favour, In Favour).
Stepwise Process: The stepwise procedure iteratively removes the least significant variables based on their contribution to the model (using the Akaike Information Criterion, AIC, as a guide). A lower AIC indicates a better model fit with respect to the number of variables.
Model Refinement: As we progress through the steps, we observe the removal of several variables like Baby Boomers, Millennials, and Generation X. This suggests that these variables were not significantly contributing to the explanation of variations in Count.
Final Model: The last step shows the model with the variables: SearchRatio, Slightly Against, Neutral, and In Favour. This model has an AIC of 710, which is lower than the starting AIC of 715, indicating a more efficient model.
Significance of Remaining Variables: In the final model, all the variables are considered significant contributors. If any were to be removed, it would result in a higher AIC, indicating a less optimal model.
Interpretation of Variables:
- SearchRatio, Slightly Against, Neutral, and In Favour are significant predictors for Count.
- The absence of demographic variables in the final model suggests that the adoption count may not be strongly related to these demographic factors, or their effect is captured by other variables.
- The presence of opinion-related variables (like In Favour) indicates a possible correlation between public opinion and the count of the dependent variable (possibly related to EV adoption or similar context).
Cautions: While stepwise regression is useful for variable selection, it can sometimes lead to overfitting or neglecting important variables that don’t show strong individual effects but are important in combination with others. Hence, the results should be interpreted with caution, and further analysis (like checking for multicollinearity, interactions between variables, etc.) is recommended to validate the findings.

4.1.1.3.2.1.2 Linearity

Code

# Step 1: Create predictions
df_multi_reg$Predicted <- predict(reduced_model, newdata = df_multi_reg)

# Step 2: Combine predicted and observed values
plot_data <- data.frame(Observed = df_multi_reg$Count, Predicted = df_multi_reg$Predicted)

# Step 3: Plot using ggplot2
p <- ggplot(plot_data, aes(x = Predicted, y = Observed)) +
  geom_point(aes(color = Observed), alpha = 0.6) +
  geom_abline(intercept = 0, slope = 1, linetype = "dashed", color = "red") +
  scale_color_viridis(option = "D", direction = 1) +
  theme_minimal() +
  theme(text = element_text(size = 14), legend.position = "bottom") +
  labs(title = "Observed vs. Predicted", x = "Predicted Values", y = "Observed Values", color = "Observed Count") +
  expand_limits(x = 0, y = 0)

# Step 4: Convert to a plotly interactive plot
ggplotly(p, tooltip = "text", width = 600, height = 400)

This graph shows that for lower predicted values, the observed and predicted values are closer, but as values increase, the predictions become less accurate. This suggest that the model’s predictions are not consistent across the range of data, especially at higher values. This could indicate potential model limitations or the influence of outliers or leverage points at higher values.

4.1.1.3.2.1.3 Homoscedasticity

Code


# Step 1: Calculate residuals and get fitted values
df_multi_reg$Residuals <- residuals(reduced_model)
df_multi_reg$Fitted <- fitted(reduced_model)

# Step 2: Prepare data for plotting
plot_data <- data.frame(Fitted = df_multi_reg$Fitted, Residuals = df_multi_reg$Residuals)

# Step 3: Create ggplot object for Residuals vs. Fitted
p_residuals_fitted <- ggplot(plot_data, aes(x = Fitted, y = Residuals)) +
  geom_hline(yintercept = 0, linetype = "dashed", color = 'red') +
  geom_point(aes(color = Residuals), alpha = 0.6) +
  scale_color_viridis(option = "D", direction = -1) +
  theme_minimal() +
  labs(title = "Residuals vs. Fitted - Multivariable Model", x = "Fitted Values", y = "Residuals")

# Step 4: Convert to a plotly interactive plot
ggplotly(p_residuals_fitted, width = 600, height = 400)

This graph shows a pattern where residuals spread out as the fitted values increase. The presence of a few large residuals suggests potential outliers or influential points. This pattern suggests that the model may not be capturing all the complexities of the data.

4.1.1.3.2.1.4 Normality

Code


# Step 1: Calculate residuals
residuals <- residuals(reduced_model)

# Step 2: Prepare data for plotting
plot_data <- data.frame(Residuals = residuals)

# Step 3: Create ggplot object for Histogram of Residuals
p_histogram_residuals <- ggplot(plot_data, aes(x = Residuals, fill = ..count..)) +
  geom_histogram(binwidth = 20, alpha = 0.7) +  # Adjust binwidth as necessary
  scale_fill_viridis_c(option = "D", begin = 0.3, end = 0.7, direction = -1) +  # Single color scale
  theme_minimal() +
  labs(title = "Histogram of Residuals", x = "Residuals", y = "Frequency", fill = "Frequency")

# Step 4: Convert to an interactive plot
ggplotly(p_histogram_residuals, width = 600, height = 400)

The histogram displays a distribution of residuals with a significant peak in the center, suggesting that many of the residuals are small and close to zero, which is a good sign for model accuracy. However, there are also some residuals far from zero, indicating potential outliers or model misspecification. The presence of outliers can distort the overall distribution, making it appear less normal and potentially impacting the model’s predictive performance.

4.1.1.3.2.1.5 Multicollinearity

Code

vif_values <- vif(reduced_model)

# Interpret VIF values
# A rule of thumb is that if VIF is greater than 5 or 10, it indicates high multicollinearity.
high_vif <- vif_values[vif_values > 5]
if(length(high_vif) > 0){
  print("The following variables have a high VIF indicating potential multicollinearity:")
  print(high_vif)
} else {
  print("No evidence of concerning multicollinearity among the predictor variables.")
}
#> [1] "The following variables have a high VIF indicating potential multicollinearity:"
#> `Slightly Against`            Neutral        `In Favour` 
#>              23.99               5.46              19.32

vif_values <- c(SearchRatio = 1.08, `Slightly Against`= 23.99, Neutral = 5.46, 
                `In Favour` = 19.36)

# Create a data frame from the VIF values
vif_df <- data.frame(
  Variable = names(vif_values),
  VIF = unname(vif_values),
  Multicollinearity_Interpretation = c(
    "No evidence of concerning multicollinearity among the predictor variables.",
    "Very high, indicating significant multicollinearity.",
    "High, indicating significant multicollinearity.",
    "Very high, indicating significant multicollinearity."
  )
)

# Generate the summary table
vif_table <- kable(vif_df, "html", booktabs = TRUE) %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"), full_width = F) %>%
  column_spec(1, bold = TRUE, width = "15em") %>%
  column_spec(2, width = "5em") %>%
  column_spec(3, width = "30em")

vif_table

Variable	VIF	Multicollinearity_Interpretation
SearchRatio	1.08	No evidence of concerning multicollinearity among the predictor variables.
Slightly Against	23.99	Very high, indicating significant multicollinearity.
Neutral	5.46	High, indicating significant multicollinearity.
In Favour	19.36	Very high, indicating significant multicollinearity.

The table indicates that Slightly Against and In Favour have very high Variance Inflation Factors (VIFs) of 23.99 and 19.32 respectively, pointing to significant multicollinearity issues, which means they share a lot of information and could be problematic for the regression analysis. Neutral has a moderate to high VIF, which might also be a concern. SearchRatio has a low VIF, suggesting it doesn’t have multicollinearity problems with other variables.

4.1.1.3.2.2 Conclusion

Based on the regression results and the graphs, it seems that the final model with variables SearchRatio, Slightly Against, Neutral, and In Favour explains a substantial part of the variance in the dependent variable (Multiple R-squared: 0.699). The residuals show some potential outliers, but the histogram suggests most errors are small. While the model appears statistically significant and explains much of the variability, caution should be exercised due to potential outliers and the patterns observed in the residuals, which could indicate model misspecification or the need for transformation of variables.

Significant multicollinearity in the model, especially with slightly Against and In Favour, casts doubt on the reliability of the regression results and suggests the need for model adjustments or alternative analysis methods.

When considering whether to exclude outliers from our analysis, we should proceed with caution. Outliers shouldn’t be discarded just because they deviate from the norm; doing so might make our model too specific to the data we’ve used, and it might not perform well with new data. It’s important to understand why these data points are outliers:

If they’re caused by mistakes in data entry or measurement errors, it’s appropriate to either fix or remove them.
If they represent a part of the population that our model doesn’t adequately explain, this could indicate the need for a more sophisticated model.
If these outliers are genuine extreme values that naturally occur within the dataset, it’s essential to keep them. Removing valid data points can skew the results.

Given these considerations, we’ve decided to retain these outliers in our dataset. By keeping them, we ensure that our analysis accounts for the full spectrum of data, including those extreme but valid variations, thereby aiming for a more comprehensive and accurate model.

Overall Conclusion of our regressions

In the analysis of Swiss EV adoption trends, the final model with ‘SearchRatio’, ‘Slightly Against’, ‘Neutral’, and ‘In Favour’ showed significant results despite multicollinearity and outlier concerns. For future studies, considering nonlinear models, machine learning techniques, or incorporating time series analysis could provide deeper insights, especially for long-term trends and event-related impacts. These alternative approaches might better capture complex dynamics and interactions among variables, enhancing the model’s predictive power and robustness.

4.2 RQ2 - Are there differences in adoption rate within the different regions in Switzerland? And are there different buying behaviors displayed by the demographic segments within Switzerland?

EV Adoption per Capita

From the initial data:

In 2005, the EV adoption rate per capita was approximately 0.000004, which means there were about 4 EVs per million people. By 2009, this rate increased to approximately 18 EVs per million people.

This approach will give us a general sense of EV adoption in relation to the overall population but won’t provide regional demographic granularity.

The merged data now includes the total population for each year in Switzerland and the total count of electric vehicles (EVs) for those years. We have also calculated the EV adoption rate per capita, which gives us an insight into how EV adoption scales with the population size.

These figures show a growing trend in EV adoption in relation to the population size, albeit the numbers are still quite small relative to the total population.

The trend shows a gradual increase in EV adoption relative to the population size, indicating a growing acceptance and usage of electric vehicles in Switzerland during this period.

Code

df_demo <- df_demographic
df_ev <- df_v_electric
# Convert Date and Year to Date type
df_ev$Date <- as.Date(df_ev$Date)
df_demo$Year <- as.Date(df_demo$Year)

# Summing up the population for each year
df_demo$total_population <- rowSums(df_demo[,c("Generation Z", "Millennials", "Generation X", "Baby Boomers")])

# Aggregating EV data by year
df_ev_yearly <- df_ev %>%
  group_by(Year = as.Date(format(Date, "%Y-01-01"))) %>%
  summarize(total_ev = sum(Count))

# Merging the datasets
merged_data <- merge(df_ev_yearly, df_demo, by = "Year")

# Calculating EV adoption per capita
merged_data$ev_per_capita <- merged_data$total_ev / merged_data$total_population

# Creating a ggplot object with your data
p <- ggplot(merged_data, aes(x = Year, y = ev_per_capita, group = 1)) +
  geom_line(color = "#24918d", size = 1) +
  labs(title = "EV Adoption Per Capita Over Time in Switzerland",
       x = "Year",
       y = "EV Adoption Per Capita")

# Animate the plot with gganimate, revealing the line over time
animated_plot <- p +
  transition_reveal(Year)
# Render the animation
animate(animated_plot, renderer = gganimate::gifski_renderer(), width = 600, height = 400, res = 96)

Correlation of EV registration with age groups

The correlation matrix below shows the relationships between the proportions of different generational groups (Generation Z, Millennials, Generation X, Baby Boomers) and the EV adoption rate per capita in Switzerland. The heatmap provides the following insights:

The correlation coefficients indicate the strength and direction of the relationship between each pair of variables. Positive values suggest a positive correlation (as one increases, so does the other), while negative values suggest an inverse relationship.

Code

# Data Preparation
df_ev$Date <- as.Date(df_ev$Date)
df_demo$Year <- as.Date(df_demo$Year)

# Summing up the population for each year
df_demo$total_population <- rowSums(df_demo[,c("Generation Z", "Millennials", "Generation X", "Baby Boomers")])

# Calculate proportions
df_demo$prop_gen_z <- df_demo$`Generation Z` / df_demo$total_population
df_demo$prop_millennials <- df_demo$Millennials / df_demo$total_population
df_demo$prop_gen_x <- df_demo$`Generation X` / df_demo$total_population
df_demo$prop_boomers <- df_demo$`Baby Boomers` / df_demo$total_population

# Aggregating EV data by year
df_ev_yearly <- df_ev %>%
  group_by(Year = as.Date(format(Date, "%Y-01-01"))) %>%
  summarize(total_ev = sum(Count))

# Merging the datasets
merged_data <- merge(df_ev_yearly, df_demo, by = "Year")

# Calculating EV adoption per capita
merged_data$ev_per_capita <- merged_data$total_ev / merged_data$total_population

# Correlation Matrix
correlation_matrix <- cor(merged_data[,c("prop_gen_z", "prop_millennials", "prop_gen_x", "prop_boomers", "ev_per_capita")])

# Melting the correlation matrix for ggplot
melted_correlation_matrix <- melt(correlation_matrix)
#> Warning in melt(correlation_matrix): The melt generic in data.table
#> has been passed a matrix and will attempt to redirect to the
#> relevant reshape2 method; please note that reshape2 is deprecated,
#> and this redirection is now deprecated as well. To continue using
#> melt methods from reshape2 while both libraries are attached, e.g.
#> melt.list, you can prepend the namespace like
#> reshape2::melt(correlation_matrix). In the next version, this
#> warning will become an error.

# Modify the variable names by removing 'prop_' and replacing '_' with ' '
melted_correlation_matrix$Var1 <- gsub("prop_", "", melted_correlation_matrix$Var1)
melted_correlation_matrix$Var2 <- gsub("prop_", "", melted_correlation_matrix$Var2)
melted_correlation_matrix$Var1 <- gsub("_", " ", melted_correlation_matrix$Var1)
melted_correlation_matrix$Var2 <- gsub("_", " ", melted_correlation_matrix$Var2)

# Choose a single color from the viridis palette
# Here, we pick a color at the midpoint of the palette; 
# 0 would be the start and 1 would be the end of the palette.
single_viridis_color <- viridis_pal()(1)

# Create a gradient of shades based on the single color
viridis_shades <- colorRampPalette(c("#24918d", single_viridis_color))

# Creating the heatmap with the single shade gradient
p <- ggplot(melted_correlation_matrix, aes(Var1, Var2, fill = rescale(value, c(0, 1)))) +
  geom_tile() +
  geom_text(aes(label = sprintf("%.2f", value)), color = "white", size = 4) +
  scale_fill_gradientn(colors = viridis_shades(100), name = "Correlation") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  labs(x = "", y = "", title = "Correlation Heatmap")

# Convert to interactive plot
interactive_plot <- ggplotly(p, tooltip = c("label", "fill"), width = 600, height = 500)

interactive_plot

Rural vs Urban

We merged two datasets, one listing Swiss cantons as urban or rural df_cantons and the other tracking monthly electric vehicle counts df_ev, filtered out non-canton data, grouped by urban/rural typology and month, then calculated a 12-month rolling average to smooth out variances, and finally created an interactive time-series graph using dygraphs to visually compare EV trends between urban and rural areas in Switzerland.

Source - bfs.admin.ch

Code

# Text input
text <- "Canton\tUrban-Rural Typology
Aargau\tUrban
Appenzell Ausserrhoden\tRural
Appenzell Innerrhoden\tRural
Basel-Landschaft\tUrban
Basel-Stadt\tUrban
Bern\tUrban
Freiburg\tUrban
Genève\tUrban
Glarus\tRural
Graubünden\tRural
Jura\tRural
Luzern\tUrban
Neuchâtel\tUrban
Nidwalden\tRural
Obwalden\tRural
St. Gallen\tUrban
Schaffhausen\tUrban
Schwyz\tRural
Solothurn\tUrban
Thurgau\tUrban
Ticino\tUrban
Uri\tRural
Valais\tRural
Vaud\tUrban
Zug\tUrban
Zürich\tUrban"

# Split the text into lines
lines <- strsplit(text, "\n")[[1]]

# Split each line into Canton and Typology
data <- strsplit(lines, "\t")

# Extract Cantons and Typologies into separate vectors
cantons <- sapply(data, function(x) x[1])
typologies <- sapply(data, function(x) x[2])

# Remove the header
cantons <- cantons[-1]
typologies <- typologies[-1]

# Create a data frame
dataset <- data.frame(Canton = cantons, Typology = typologies)

# Update the 'Canton' column with abbreviations
df_cantons <- dataset %>%
  mutate(Canton = case_when(
    Canton == "Zürich" ~ "ZH",
    Canton == "Bern" ~ "BE",
    Canton == "Luzerne" ~ "LU",
    Canton == "Uri" ~ "UR",
    Canton == "Schwyz" ~ "SZ",
    Canton == "Obwalden" ~ "OW",
    Canton == "Nidwalden" ~ "NW",
    Canton == "Glarus" ~ "GL",
    Canton == "Zug" ~ "ZG",
    Canton == "Freibourg" ~ "FR",
    Canton == "Solothurn" ~ "SO",
    Canton == "Basel-Stadt" ~ "BS",
    Canton == "Basel-Landschaft" ~ "BL",
    Canton == "Schaffhausen" ~ "SH",
    Canton == "Appenzell Ausserrhoden" ~ "AR",
    Canton == "Appenzell Innerrhoden" ~ "AI",  
    Canton == "St. Gallen" ~ "SG",
    Canton == "Graubünden" ~ "GR",
    Canton == "Aargau" ~ "AG",
    Canton == "Thurgau" ~ "TG",
    Canton == "Ticino" ~ "TI",
    Canton == "Vaud" ~ "VD",
    Canton == "Valais" ~ "VS",
    Canton == "Neuchâtel" ~ "NE",
    Canton == "Genève" ~ "GE", 
    Canton == "Jura" ~ "JU",
    TRUE ~ Canton # Default case to keep original names for unmatched entries
  ))

# Filter out 'Confederation' and 'Switzerland' from df_ev before merging
df_ev_filtered <- df_ev %>%
  filter(!Location %in% c("Confederation", "Switzerland"))

# Convert Date to Date format in df_ev_filtered
df_ev_filtered$Date <- as.Date(df_ev_filtered$Date)

# Perform the merge
df_merged <- inner_join(df_ev_filtered, df_cantons, by = c("Location" = "Canton"))

# Create a sequence of all months present in the data
all_months <- seq(from = min(df_merged$Date), to = max(df_merged$Date), by = "1 month")

# Group by Typology and Date (monthly)
df_grouped <- df_merged %>%
  group_by(Typology, Month = floor_date(Date, "month")) %>%
  summarise(Total_EV_Count = sum(Count), .groups = 'drop')

# Make sure that we have all months for each Typology
df_complete <- df_grouped %>%
  complete(Month = all_months, Typology, fill = list(Total_EV_Count = 0))

# Spread the data into separate columns for each Typology
df_wide <- df_complete %>%
  pivot_wider(names_from = Typology, values_from = Total_EV_Count)

# Replace NA with 0 if there are any left after pivot_wider
df_wide[is.na(df_wide)] <- 0

# Convert to xts object for dygraphs
xts_data <- xts(df_wide[, -1], order.by = df_wide$Month)

# Calculate rolling mean
roll_mean <- rollapply(xts_data, width = 12, FUN = mean, by.column = TRUE, align = "right", fill = NA)

# Create the dygraph with 
dygraph(roll_mean, main = "Urban vs Rural EV Adoption (12-Month Rolling Mean)", width = "600px", height = "400px") %>%
  dySeries("Urban", label = "Urban", color = "#2f114a", fillGraph = TRUE) %>%
  dySeries("Rural", label = "Rural", color = "#24918d", fillGraph = TRUE) %>%
  dyOptions(strokeWidth = 1.5) %>%
  dyLegend(show = "always") %>%
  dyRangeSelector(height = 20) # Adds a range selector for zooming

The graph displays a clear upward trend in electric vehicle adoption over time in both urban and rural areas of Switzerland, with urban areas consistently showing higher counts. This could be due to factors such as better charging infrastructure, higher population density, or more environmental awareness in urban regions.

4.3 RQ3 - How has the growth of electric vehicles evolved in comparison to other countries such as France, and what factors might account for the differences in their evolution ?

The EDA section (3.7) already helps us answering this question involving quite strong arguments. Nevertheless, we decided to find out more about those differences in evolution by comparing the increase in charging station spots both in France and Switzerland. To do so, we computed a bar chart in order to improve the visualisation of this juxtaposition.

Code

data <- charge_ch_fr 

# Convert year to Date format and then extract the year
data$year <- as.Date(paste0(data$year, "-01-01"))
data$year <- format(data$year, "%Y")

# Sum the values by year and region
data_summarized <- data %>%
  group_by(year, region) %>%
  summarize(total_value = sum(value))

viridis_colors <- viridis(2.1)
# Create the ggplot
p <- ggplot(data_summarized, aes(x = year, y = total_value, fill = region)) +
  geom_bar(stat = "identity", position = "dodge") +
  scale_fill_manual(values = viridis_colors) +  # Adjusted custom colors
  labs(title = "Total Availability of Charging Stations in France vs Switzerland",
       x = "Year",
       y = "Total Charging Stations") +
  theme_minimal() +
  theme(plot.title = element_text(hjust = 0.5), # Center the plot title
        legend.title = element_blank()) # Remove the legend title

# Convert to interactive plot using plotly
p_interactive <- ggplotly(p)
p_interactive <- ggplotly(p, width = 600, height = 600, tooltip = c("x", "y", "color"))
p_interactive <- p_interactive %>%
  layout(legend = list(orientation = "h", x = 0, xanchor = "left", y = -0.2))
p_interactive

The results show that since 2015, France has exponentially increased its number of stations while Switzerland seems to be gradually adapting. It is important to take the respective surface and demography differences of the two countries in consideration.Even with that in mind, Switzerland seems to be late in terms of adoption of charging station spots in the last decade. Overall, an increasing number of charging stations in both countries rhymes with an overall EV adoption for the two historical friends.

4.4 RQ4 - To what extent does the evolution in the availability of charging stations exert an influence on the adoption of electric vehicles in Switzerland?

To evaluate this question in the best conditions, we decided to compute some empirical plots.

Code

# First, let's merge the df_v and df_charge_number_CH data sets, and we will look at Fuel: Electricity

df_v_electric_total_ch <- df_v %>%
  filter(Fuel == "Electricity", VehicleType == "Passenger car", Location == c("Switzerland","Confederation")) %>%
  select(Date, Count)

sum_by_year <- df_v_electric_total_ch %>%
  group_by(Year = lubridate::year(Date)) %>%
  summarise(Total_Count = sum(Count))


# Convert year to a common format for merging
sum_by_year <- sum_by_year %>%
  mutate(year = as.Date(paste0(Year, "-01-01")))

# Merge the datasets based on the "year" column
merged_v_charge <- left_join(sum_by_year, df_charge_number_CH, by = c("year" = "year"))

# cleaning merged data set
merged_v_charge <- merged_v_charge %>%
  filter(Year > "2011") %>%
  select(Year, Total_Count, powertrain, value)

names(merged_v_charge)[names(merged_v_charge) == "Total_count"] <- "EVs"
colnames(merged_v_charge)[colnames(merged_v_charge) == "value"] <- "Charging station"

# Summing Powertrain together
merged_v_charge <- merged_v_charge %>%
  group_by(Year, Total_Count) %>%
  summarise(Count = sum(`Charging station`))

# Checking the correlation
corr_charge_ev <- cor(merged_v_charge$Total_Count, merged_v_charge$Count)

# their correlation is 0.957, almost perfectly correlated (no suprise here)

# Checking for lagged correlation

lags_to_explore <- 1:3

lagged_correlation <- function(data, lag) {
  data %>%
    mutate(Count_Lagged = lag(Count, n = lag, default = NA)) %>%
    summarise(Correlation = cor(Total_Count, Count_Lagged, use = "complete.obs"))
}

# Calculate lagged correlations for each lag
lagged_correlations_df <- data.frame(Lag = lags_to_explore) %>%
  rowwise() %>%
  mutate(Correlation = lagged_correlation(merged_v_charge[, -1], Lag)$Correlation)

# Print the results
print("Original Correlation:")
#> [1] "Original Correlation:"
print(corr_charge_ev)
#> [1] 0.957

print("Lagged Correlations:")
#> [1] "Lagged Correlations:"
print(lagged_correlations_df)
#> # A tibble: 3 x 2
#> # Rowwise: 
#>     Lag Correlation
#>   <int>       <dbl>
#> 1     1       0.943
#> 2     2       0.910
#> 3     3       0.885

# Create a table that displays the correlations

# Putting together year-on-year correlation and lagged correlation
new_row <- tibble(Lag = 0, Correlation = corr_charge_ev)
ev_charging_correlation_df <- bind_rows(new_row, lagged_correlations_df)
ev_charging_correlation_df <- arrange(ev_charging_correlation_df, Lag)


correlation_table <- tibble(
  Variable = ev_charging_correlation_df[,1],
  Description = ev_charging_correlation_df[,2]
)

kable_correlation <- tibble(
  Lag = c("2022 & 2022", "2022 & 2021", "2022 & 2020", "2022 & 2019"),
  Correlations = c("0.957", "0.943", "0.910", "0.885")
)

# Display the table using kableExtra
kable_correlation %>%
  kbl() %>%
  kable_styling(position = "center", bootstrap_options = c("striped", "bordered", "hover"))

Lag	Correlations
2022 & 2022	0.957
2022 & 2021	0.943
2022 & 2020	0.910
2022 & 2019	0.885

Code



# Displaying a table for the report:


# Now we formulate the following Hypothesis

# H0: new charging station increase EV adoption vs. H1: new charging station does not increase EV adoption

# Check these hypotheses with a simple linear regression
linear_charging <- lm(Total_Count ~ Count, data = merged_v_charge)

# Poisson Test
# poisson_model <- glm(Total_Count ~ Count, family = poisson, data = merged_v_charge)

# Set up the layout using mfrow
par(mfrow = c(1, 2))  # 1 row, 2 columns

# Plotting for Simple Linear Regression
plot(linear_charging, 1, main = "LM Residuals vs Fitted")
plot(linear_charging, 2, main = "LM Normal Q-Q Plot")
plot(linear_charging, 3, main = "LM Scale-Location Plot")
plot(linear_charging, 5, main = "LM Residuals vs Leverage")

# Plotting for Poisson Regression
#plot(poisson_model, which = 1, main = "Poisson Residuals vs Fitted")
#plot(poisson_model, which = 2, main = "Poisson Normal Q-Q Plot")
#plot(poisson_model, which = 3, main = "Poisson Scale-Location Plot")
#plot(poisson_model, which = 5, main = "Poisson Residuals vs Leverage")

# Printing
summary(linear_charging)
#> 
#> Call:
#> lm(formula = Total_Count ~ Count, data = merged_v_charge)
#> 
#> Residuals:
#>    Min     1Q Median     3Q    Max 
#>  -6059  -2966  -1298   3627   4916 
#> 
#> Coefficients:
#>              Estimate Std. Error t value Pr(>|t|)    
#> (Intercept) -5340.399   2122.178   -2.52    0.033 *  
#> Count           3.341      0.338    9.88  3.9e-06 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 4150 on 9 degrees of freedom
#> Multiple R-squared:  0.916,  Adjusted R-squared:  0.906 
#> F-statistic: 97.7 on 1 and 9 DF,  p-value: 3.94e-06
# summary(poisson_model)
# Reset the layout
par(mfrow = c(1, 1))

The year-on-year correlation is the highest, the lagged correlation diminishes. One possible interpretation is that the adoption of EVs is extremely reactive to the availability of new charging stations. However, it is almost certainly not the case, rather, the availability of charging stations and the new registration of EVs go hand-in-hand, and the availability of new charging station does not create a demand of new EVs by itself. Correlation does not imply causation, while we see a relationship, we can’t conclude that charging stations directly cause changes in electric vehicle adoption only with a Correlation analysis

To further explore this research question, we used both a linear regression and a Poisson-test.However, given the assumptions needed for a relevant Poisson-test, and the fact that we could not ensure these were respected with our data, we decided not to include it in our final report, although the results were “statistically significant”

For our linear regression. We find evidence of statistically significant relationship between the count of available charging station and the count of electric vehicles registered. Our Prediction variable/coefficient is 3.34 and the p-val < 0.005, namely, for each additional charging station available, 3.34 new EVs would be adopted according to this model. However, it is important to remind ourselves that these variables have a bidirectional / mutual influence, beyond the scope of what our analysis shows. The relationship is not strictly unidirectional and therefore, it is hard to conclude anything without further domain-knowledge and context-specific information

5 Conclusion

Limitations and future work

In the current study, a significant limitation is the lack of comparative data regarding the pricing of electric vehicles (EVs) versus thermal vehicles. This price comparison is a crucial factor that very likely influences consumer preferences and purchase decisions. The acquisition of detailed data in this area represents a substantial project in its own right. It could however be very interesting to compare it with our results to get a wider understanding of buying choices. Furthermore, it could help to get a more precise regression and more explanatory factors to predict the evolution of electric vehicles in the future.

Additionally, the analysis did not account for pollution taxes, an omission that merits attention. Pollution taxes are expected to increase in alignment with environmental objectives, potentially altering the actual cost of fuel at gas stations. While the current data, tied to oil prices, retains relevance, incorporating pollution tax implications could yield a more accurate reflection of real-world costs.

Another unexplored dimension of this study is the marketing impact on EV adoption in Switzerland. The quantification of marketing effects presents a considerable challenge, as does the lack of available data. Developing a measurement scale and undertaking research in this area would be beneficial but requires extensive time and financial resources.

Finally, the study did not investigate the influence of governmental subsidies on EV adoption due to data granularity constraints. The data we used extends only to the canton level, while all the subsidies are given at a commune level. We therefore face a problem we cannot solve with our data. Future studies with access to more granular data could provide valuable insights into the impact of these subsidies on EV adoption rates.

Summary of findings

Our analysis numerically confirmed what we already believed; that electric vehicles are getting implemented for the long run and it is a switch in the propulsion systems of vehicles in Switzerland. Overall, the new registered vehicles are decreasing compared to the peak which occurred around 2015. Registrations of electric, pluggable and non-pluggable vehicles are on the rise, while those of diesel and petrol vehicles are diminishing.

In terms of regional variation, the adoption of EVs is most pronounced in the canton of Zurich, attributable to its large population. Additionally, the canton of Zug, when demographically adjusted for population size, shows a high rate of EV adoption, likely influenced by the population’s affinity for new technologies and their financial prosperity.

A pattern of seasonality has been shown when overlaying the different years in one graph. This result could gain interest if a drastic shifts in climate change occurs in the next few years. In the situation of Siberian or tropical weather conditions, the number of registered vehicles will probably be different, and a link to our seasonality results could be identified.

The study further examined the impact of political dynamics in Switzerland, revealing significant influences from various political parties, including those that are neutral. Therefore, political Neutrality does not rhyme with Environmental neutrality.

To develop a regression model, we considered factors such as oil prices, the availability of charging stations, the evolution of political parties, Google search trends, and demographic data. However, the model failed to show satisfactory results, possibly due to the limitations identified earlier in our report.

Overall, our analysis did not give groundbreaking results. It did empirically validate what was previously considered as common knowledge. Our findings offer factual and numerical confirmation of these assumptions.

1 Introduction

1.1 Executive Summary

1.2 Background and Motivation

1.3 Related Work

1.4 Reseach Questions

2 Data

2.1 Raw Datasets

2.1.1 Swiss Vehicle Registration Dataset

2.1.2 Oil price Dataset

2.1.3 Swiss Demographics Dataset

2.1.4 Google trend data

2.1.5 French vehicle registration data

2.1.6 Charging Stations

2.1.7 Complementary charging station

2.1.8 Political strength per canton

2.1.9 Swiss Population

2.2 Data Wrangling

2.2.1 Data Wrangling : Oil clean

2.2.2 Data Wrangling : Google trend clean

2.2.3 Cleaning of demographic_data

2.2.4 Data Wrangling : Swiss vehicle clean

Redundancy

Merging and Joining

Refinement and Vehicle Classification Insights

Rows with 0

2.2.5 Data wrangling French vehicle clean

2.2.6 Availability of charging stations

2.2.7 Availability of charging station in France and Switzerland

2.2.8 Political strength per canton

2.2.9 Swiss Population

3 Exploratory data analysis

3.1 Switzerland

3.1.1 seasonality

3.1.2 Vehicule Registration by Fuel time over time

3.1.3 Availability of Charging station

3.1.3.1 Availability of Charging station in Switzerland

3.1.3.2 Availability of Charging station in France

3.1.4 Map

3.1.4.1 Count of Electricity car Registration for all years per cantons

3.1.4.2 Count of Electricity car Registration for all years per cantons Standardized

3.2 Google Trend

3.3 Oil

3.4 Demographics

3.5 French vehicles

3.5.1 Total vehicles evolution France

3.5.2 Deltas evolution

3.6 Political Parties

3.7 Swiss vs France

3.7.1 Electric vs Hybrid vs Petrol

3.8 EV and Google Trends

3.9 EV and Oil Price

4 Analysis

4.1 RQ1 - Based on past electric vehicle adoption trends in Switzerland, can we forecast future adoption rates and pinpoint times of significant increases or decreases correlated with major events or policy changes?

4.1.1. Simple Linear Regression

4.1.1.1 With Oil

4.1.1.1.1 Model Diagnostic

4.1.1.1.1.2 Results

4.1.1.1.1.1.2 Linearity

4.1.1.1.1.1.3 Homoscedasticity

4.1.1.1.1.1.4 Normality

4.1.1.1.3 Conclusion

4.1.1.2. With google trends

4.1.2.1 Model Diagnostic

4.1.1.2.1.1. Results

4.1.2.1.1.2 Linearity

4.1.2.1.1.3. Homoscedasticity

4.1.2.1.1.4 Normality

4.1.1.2.2 Conclusion

4.1.1.3 Multivariable Regression

4.1.1.1.3.1 With Demographic groups, Oil Price and Google Trend

4.1.1.3.1.1 Model Diagnostic

4.1.1.3.1.2 Results

4.1.1.3.1.3 Linearity

4.1.1.3.1.4 Homoscedasticity

4.1.1.3.1.5 Normality

4.1.1.3.1.6 Multicollinearity

4.1.1.3.2. Conclusion

4.1.1.3.2 Adding Political Parties

4.1.1.3.2.1 Model Diagnosis

4.1.1.3.2.1.1 Results