1 Exploration & Visualization

1.1 Objectives

The main objectives of this project is to predict :

The overnight stays of the visitors in Vaud, from October 2023 until December 2024.
The overnight stays of visitors from Philippines to Zürich, from October 2023 until December 2024.

1.2 Cleaning & Wrangling

Check the appendix for the initial dataset

Code

# Load the data in folder data named Dataset_tourism.xlsx)
tourism_data <- readxl::read_xlsx(here("data/Dataset_tourism.xlsx"))

#removing value 'Herkunftsland - Total' in column 'Herkunftsland' as it is just the total
#tourism_data <- tourism_data %>% filter(Herkunftsland != "Herkunftsland - Total")
#print unique values in month column
unique(tourism_data$Monat)
#>  [1] "Januar"    "Februar"   "März"      "April"     "Mai"      
#>  [6] "Juni"      "Juli"      "August"    "September" "Oktober"  
#> [11] "November"  "Dezember"
# change ' [1] "Januar"    "Februar"   "März"      "April"     "Mai"       "Juni"      "Juli"      "August" "September" "Oktober"   "November"  "Dezember" into english month'
tourism_data$Monat <- tourism_data$Monat %>% recode_factor(
  "Januar" = "January",
  "Februar" = "February",
  "März" = "March",
  "April" = "April",
  "Mai" = "May",
  "Juni" = "June",
  "Juli" = "July",
  "August" = "August",
  "September" = "September",
  "Oktober" = "October",
  "November" = "November",
  "Dezember" = "December"
)
#add date type column for plotting purposes
tourism_data <- tourism_data %>% mutate(Date = dmy(paste("01", Monat, Jahr)))
# filtering out 'Herkunftsland - Total' in column 'Herkunftsland' as it is just the total
tourism_data_no_total <- tourism_data %>% filter(Herkunftsland != "Herkunftsland - Total")
#check for NAN
sum(is.na(tourism_data_no_total))
#> [1] 51395
#analyse the NAN values, where are they
(tourism_data_no_total %>% filter(is.na(value)))
#> # A tibble: 51,395 x 6
#>    Herkunftsland                  Kanton Monat Jahr  value Date      
#>    <chr>                          <chr>  <fct> <chr> <dbl> <date>    
#>  1 Malta                          Schwe~ Janu~ 2005     NA 2005-01-01
#>  2 Zypern                         Schwe~ Janu~ 2005     NA 2005-01-01
#>  3 Mexiko                         Schwe~ Janu~ 2005     NA 2005-01-01
#>  4 Übriges Zentralamerika, Karib~ Schwe~ Janu~ 2005     NA 2005-01-01
#>  5 Bahrain                        Schwe~ Janu~ 2005     NA 2005-01-01
#>  6 Katar                          Schwe~ Janu~ 2005     NA 2005-01-01
#>  7 Kuwait                         Schwe~ Janu~ 2005     NA 2005-01-01
#>  8 Australien                     Schwe~ Janu~ 2005     NA 2005-01-01
#>  9 Neuseeland, Ozeanien           Schwe~ Janu~ 2005     NA 2005-01-01
#> 10 Oman                           Schwe~ Janu~ 2005     NA 2005-01-01
#> # i 51,385 more rows

1.2.1 Tourism Data - Vaud

Given the two objectives of the project, we are going to filter the initial dataset by the “Kanton” column Vaud and the “Herkunftsland” column by “Herkunftsland - Total.”

Code

# Filter by canton Vaud 
tourism_vaud <- tourism_data %>% filter(Kanton == "Vaud")
#check for NAN
sum(is.na(tourism_vaud))
#> [1] 1869

1.2.2 Tourism Data - Zurich and Philippines

The data Zurich contained the total number of overnight stays in Zurich by tourists from different countries. We are interested in the number of overnight stays by tourists from the Philippines. See appendix for more details on the table.

Thus, we are filtering the “Kanton” column and the ‘Herkunftsland’ column, keeping “Zurich” and “Philippinen” for the country of origin.

Code

#filter column 'Kanton' for Zurich
tourism_data_zurich <- tourism_data_no_total %>% filter(Kanton == "Zürich")
#check for NAN
sum(is.na(tourism_data_zurich))
#> [1] 1869
#analyse the NAN values, where are they
tourism_data_zurich %>% filter(is.na(value))
#> # A tibble: 1,869 x 6
#>    Herkunftsland                  Kanton Monat Jahr  value Date      
#>    <chr>                          <chr>  <fct> <chr> <dbl> <date>    
#>  1 Malta                          Zürich Janu~ 2005     NA 2005-01-01
#>  2 Zypern                         Zürich Janu~ 2005     NA 2005-01-01
#>  3 Mexiko                         Zürich Janu~ 2005     NA 2005-01-01
#>  4 Übriges Zentralamerika, Karib~ Zürich Janu~ 2005     NA 2005-01-01
#>  5 Bahrain                        Zürich Janu~ 2005     NA 2005-01-01
#>  6 Katar                          Zürich Janu~ 2005     NA 2005-01-01
#>  7 Kuwait                         Zürich Janu~ 2005     NA 2005-01-01
#>  8 Australien                     Zürich Janu~ 2005     NA 2005-01-01
#>  9 Neuseeland, Ozeanien           Zürich Janu~ 2005     NA 2005-01-01
#> 10 Oman                           Zürich Janu~ 2005     NA 2005-01-01
#> # i 1,859 more rows

tourism_data_zurich_philippines <- tourism_data_zurich %>% filter(Herkunftsland == "Philippinen")
#show table using reactable
table_zh_ph <- reactable(tourism_data_zurich_philippines, 
          searchable = TRUE,
          defaultPageSize = 5,
          sortable = TRUE,
          filterable = TRUE,
          pagination = TRUE,
          bordered = TRUE,
          highlight = TRUE,
          striped = TRUE,
          compact = TRUE,
          showPageSizeOptions = TRUE,
          showSortable = TRUE
          )

Filtering for Philippinen solved the problem of missing data we had with all countries of origin. The overnight stays are all included throughout the period.

2 EDA - Vaud

2.1 Visitors in Vaud

Graphical view of total number of tourists in canton Vaud :

Code

tourism_vaud_total <- tourism_vaud %>%
  filter(Herkunftsland == 'Herkunftsland - Total') %>%
  select(-c(Herkunftsland, Kanton, Monat, Jahr))

# Create the ggplot object with viridis color palette
plot_vaud_total <- tourism_vaud_total %>%
  ggplot(aes(x = Date, y = value)) +
  geom_line(color = viridis(1)) +  # Use viridis color palette for a single line
  labs(x = "Date", y = "Number of tourists", title = "Total number of tourists in canton Vaud") + 
  theme_minimal()

# Convert to an interactive plotly object with specified width and height
interactive_plot_total <- ggplotly(plot_vaud_total, width = 600, height = 300)

# Adjust plotly settings
interactive_plot_total %>%
  layout(
    margin = list(l = 60, r = 60, b = 60, t = 80),
    showlegend = FALSE  # Remove legend if any
  )

The plot shows key trends in tourist visits to the canton of Vaud. There are noticeable dips in overnight stays at the year’s start, typically due to holidays, with a significant drop in 2020 due to the COVID pandemic. Mainly, tourists in canton Vaud are from Switzerland, followed by France (see appendix 8.2). Swiss tourists show a rising trend excluding the pandemic period. Overall, the data displays a seasonal pattern with fewer tourists at year-end and a peak during summer. Future forecasts should consider these seasonal variations and changing trends.

Code

# Convert data to a time series object
vaud_ts <- tourism_vaud_total %>%
  arrange(Date) %>%
   # Filtre pour enlever les valeurs NA dans 'Date'
  filter(!is.na(Date)) %>%
  # Ensure data is complete and monthly
  complete(Date = seq.Date(min(Date, na.rm = TRUE), max(Date, na.rm = TRUE), by = "month")) %>%
  replace_na(list(value = 0)) %>%  # Replace NA values if there are any
  # Create a time series object
  with(ts(value, frequency = 12, start = decimal_date(min(Date, na.rm = TRUE))))

# Perform STL decomposition
stl_vd <- stl(vaud_ts, s.window = "periodic") %>% 
   autoplot() +
  labs(x = "Date", y = "Number of tourists", title = "STL Decomposition for number of tourists in canton Vaud") +
  theme_minimal()

The decomposition analysis unveils significant trends in the data: a clear upward trajectory until 2020 followed by a sharp decline due to pandemic-induced travel restrictions, alongside evident monthly seasonality marked by regular fluctuations. Notably, the residual component remains stable until 2020, after which a slight uptick in volatility suggests potential external influences not captured by trend and seasonality.

3 EDA - Zurich

As for Vaud, the most frequent visitors to Zurich are Swiss. Germany and United States are the two main foreign countries to visit Zurich. This can be explained by the fact that the canton of Zurich is closer to Germany and therefore easier to reach. The same applies to France with the canton of Vaud. The yellow curve represents the Philippines. The curve is flat and shows a considerably small number of trips from this country over the period. There is a drastic fall in 2020 caused by COVID-19. The pandemic has had a significant impact on the tourism industry worldwide. At first glance, there are regular seasonal peaks for most countries which also suggest the presence of seasonality in tourism in the canton of Zurich. check the Appendix for more details for Overall Zurichs Visitors graph

3.1 Zurich and Philippines Visitors

This graph shows only visitors from the Philippines, as this is the country of interest in our analysis.

Code

##### EDA ZH ALL #####
# Preparing the data
#removing value in column 'Herkunftsland' as it is just the whole of Switzerland
data <- tourism_data_zurich %>%
  filter(!is.na(value)) %>%  # Removing rows with NA values in the 'value' column
  mutate(Monat = month(Date, label = TRUE, abbr = TRUE),  # Extract month 
         Jahr = year(Date)) %>%  # Extract year from Date
  group_by(Herkunftsland, Date) %>%  # Group by country and date
  summarise(Trips = sum(value), .groups = 'drop')  # Summing up trips for each country per date

p <- ggplot(data, aes(x = Date, y = Trips, group = Herkunftsland,
                      color = Herkunftsland == "Philippinen",
                      text = paste("Country:", Herkunftsland, "<br>Trips:", Trips))) +  # Added text for tooltip
  geom_line(show.legend = FALSE) +
    scale_color_viridis_d() +  # Use viridis color palette
  labs(title = "Number of Trips from Each Country to Zurich",
       x = "Date",
       y = "Number of Trips") +
  theme_minimal()

# Convert to an interactive plotly object
interactive_plot <- ggplotly(p, tooltip = "text", width = 600, height = 600)

# Adjust plotly settings 
interactive_plot_zh_all <- interactive_plot %>%
  layout(
    margin = list(l = 60, r = 60, b = 60, t = 80),  # Adjust margins
    showlegend = FALSE  # Show legend
  )

##### EDA ZH PHILIPPINES #####
# Use tourism_data_zurich_philippines data to plot the values in y axis and Date in x axis
p <- ggplot(tourism_data_zurich_philippines, aes(x = Date, y = value)) +
  geom_line(color = viridis(1)) +  # Use viridis color palette for a single line
  labs(title = "Number of Trips from Philippines to Zurich",
       x = "Date",
       y = "Number of Trips") +
  theme_minimal()

# Convert to an interactive plotly object with specified width and height
interactive_plot <- ggplotly(p, width = 600, height = 300)

# Adjust plotly settings
interactive_plot <- interactive_plot %>%
  layout(
    showlegend = FALSE  # Remove legend if any
  )

# Display the interactive plot
interactive_plot

Code

# Convert data to a time series object
tourism_ts <- tourism_data_zurich_philippines %>%
  arrange(Date) %>%
  # Ensure data is complete and monthly
  complete(Date = seq.Date(min(Date), max(Date), by = "month")) %>%
  replace_na(list(value = 0)) %>%  # Replace NA values if there are any
  # Create a time series object
  with(ts(log(value), frequency = 12, start = decimal_date(min(Date))))

# Perform STL decomposition on the transformed data
tourism_stl <- stl(tourism_ts, s.window = "periodic")

# Plotting the results (transforming back by exponentiating)
stl_zh_m <- autoplot(tourism_stl) +
  labs(x = "Date", y = "Number of tourists", title = "Multiplicative STL Decomposition for number of tourists from Philippines to Zurich") +
  theme_minimal()

# Convert data to a time series object
tourism_ts <- tourism_data_zurich_philippines %>%
  arrange(Date) %>%
  # Ensure data is complete and monthly
  complete(Date = seq.Date(min(Date), max(Date), by = "month")) %>%
  replace_na(list(value = 0)) %>%  # Replace NA values if there are any
  # Create a time series object
  with(ts(value, frequency = 12, start = decimal_date(min(Date))))

# Perform STL decomposition
stl_zh <- stl(tourism_ts, s.window = "periodic") %>% 
  autoplot() +
  labs(x = "Date", y = "Number of tourists", title = "STL Decomposition for number of tourists from Philippines to Zurich") +
  theme_minimal()

# several chart per month to see the seasonality
seaso_plot_zh_2 <- ggsubseriesplot(tourism_ts) + ylab("Number of tourists") + xlab("Month") + ggtitle("Seasonal subseries plot")

#debug
#better to use gg_subseries to see the seasonality
#tourism_ts %>% gg_subseries(value) + ylab("Number of tourists") + xlab("Month") + ggtitle("Seasonal subseries plot")

4 Modelling

In this section, we will apply forecasting techniques to model tourism data for Zurich, aiming to predict future trends.

4.1 Total number of visitors in Vaud

We will forecast the total number of visitors in Vaud over a 15-month horizon using ETS, ARIMA, and TSLM models.

4.1.1 ETS model

The ETS (Error, Trend, Seasonality) model is used due to its simplicity and reliability. We employ the ETS “AAA” model, which considers additive error, trend, and seasonality components. The forecast follows trends and seasonality but shows high uncertainty.

Code

ets_vaud <- ets(vaud_ts, model = "AAA")
forecast_ets_vaud <- forecast(ets_vaud, h = 15) 

# Calculate the accuracy of the training set
metrics_ets_vaud <- ets_vaud %>% accuracy() %>%
  kable("html", caption = "ETS AAA Metrics") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"))

# Extract and print the AIC
aic_ets_vaud <- AIC(ets_vaud)

4.1.2 ARIMA model

The automatic ARIMA (AutoRegressive Integrated Moving Average) model selects optimal parameters to capture data patterns. Initially, ARIMA(5,0,0)(0,1,1) was chosen automatically, reflecting the seasonal patterns and a slight downward trend due to the 2020 drop, however, currently, there are no evident reasons for a decrease in the number of visitors. Adjusting to ARIMA(5,1,0)(0,1,1) improves trend capture and aligns forecasts with observed patterns. The model`s explicitly was changed using a differencing order of 1 in the non-seasonal component.

Code

# Convert your time series data to a tsibble object
vaud_tsibble <- as_tsibble(vaud_ts)

# Fit an ARIMA model
fit_arima_vaud <- vaud_tsibble %>%
  model(
    ARIMA_auto = ARIMA(value),
    ARIMA_dif = ARIMA(value ~ pdq(5, 1, 0) + PDQ(0, 1, 1))
  )

# Generate forecasts for the next 15 months
forecast_arima_vaud <- fit_arima_vaud %>%
  forecast(h = 15)

# Plot the forecasts along with the historical data
plot_arima_vaud <- forecast_arima_vaud %>%
  autoplot(vaud_tsibble, alpha = 0.3) +
  labs(title = "ARIMA Forecast of Tourists in Vaud",
       x = "Date",
       y = "Number of Tourists") +
  guides(colour = guide_legend(title = "Forecast"))

# Calculate the accuracy of the training set
metrics_arima_vaud <- fit_arima_vaud %>% accuracy() %>%
  kable("html", caption = "ARIMA Metrics") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"))

# Extract and print the AIC for each model
aic_arima_vaud <- fit_arima_vaud %>%
  glance() %>%
  select(.model, AIC)

4.1.3 TSLM

In light of the significant drop in tourist numbers caused by the COVID-19 pandemic in 2019, it is crucial to address this anomaly effectively in our forecasting methodologies. The COVID-19 pandemic significantly impacted tourist numbers. We incorporate the Stringency Index from the OxCGRT dataset (taken from Our World In Data) into our Time Series Linear Regression (TSLM) model. Assuming no restrictions for the next 15 months, we set the stringency index to 0, aiming for an optimistic forecast reflecting pre-pandemic travel levels.

Code

stringency_df <- read.csv(here("data/stringency_index.csv"))

stringency_switzerland <- filter(stringency_df, location == "Switzerland")

stringency_switzerland$date <- as.Date(format(dmy(stringency_switzerland$date), "%Y-%m-01"))

# Aggregate to monthly average, ensuring date format is maintained
stringency_switzerland <- stringency_switzerland %>%
  group_by(date) %>%
  summarize(avg_stringency_index = mean(stringency_index, na.rm = TRUE))

# Rename date column 
tourism_vaud_total <- tourism_vaud_total %>%
  rename(date = Date)

# Then merge with Switzerland data
merged_vaud <- merge(tourism_vaud_total, stringency_switzerland, by = "date", all.x = TRUE)

# Replace NA values in avg_stringency_index with 0 if necessary
merged_vaud <- merged_vaud %>%
  mutate(avg_stringency_index = replace_na(avg_stringency_index, 0))

# Convert merged_vaud data frame to tsibble object
merged_vaud_tsibble <- merged_vaud %>%
  as_tsibble(index = date)

# Define the model
Model_vaud_stringency <- merged_vaud_tsibble %>%
  model(
    TSLM_vaud_str = TSLM(value ~ trend() + season() + avg_stringency_index)
  )

# Generate future dates
future_dates <- seq.Date(from = as.Date("2023-10-01"), by = "month", length.out = 15)

# Create future data with avg_stringency_index set to 0 for 15 months
future_data <- tibble(
  date = future_dates,
  avg_stringency_index = 0
) %>%
  as_tsibble(index = date)

# Forecast
forecast_vaud_stringency <- Model_vaud_stringency %>%
  forecast(new_data = future_data)

# Calculate the accuracy of the training set
metrics_tslm_vaud <- Model_vaud_stringency %>% accuracy() %>%
  kable("html", caption = "TSLM Metrics") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"))

# Extract and print the AIC for the TSLM model
aic_tslm_vaud <- Model_vaud_stringency %>%
  glance() %>%
  filter(.model == "TSLM_vaud_str") %>%
  pull(AIC)

4.2 Zurich and Philippines

In this section, we focus on the number of visitors from the Philippines to Zurich. We use the Naive model as a benchmark for ETS and ARIMA models to forecast visitors over a 15-month horizon.

4.2.0.1 ETS

We run a Naive model as a benchmark and a AAA model for the ETS model to compare against a multiplicative model. see appendix for the plots and metrics of those 2 models

Code

###### NAIVE part #####
#convert tourism_ts to tsibble
tourism_ts <- tourism_ts %>% as_tsibble()
# Fit a naive model
fit <- tourism_ts %>%
  model(NAIVE = NAIVE(value))
# Forecast the next 2 years periods
forecast <- fit %>%
  forecast(h = 15)
# Plot the forecasts along with the historical data, and make the colors of the forecast a bit more transparent for distinguishably purposes
plot_naive <- forecast %>%
  autoplot(tourism_ts, alpha = 0.5) +
  labs(title = "Forecast of tourists from Philippines to Zurich",
       x = "Date",
       y = "Number of tourists") + guides(colour = guide_legend(title = "Forecast"))

metrics_naive <- fit %>% accuracy()
# Display accuracy metrics in an HTML table
metrics_naive <- metrics_naive %>%
  kable("html", caption = "Naive Model Metrics") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"))

###### ETS PART AAA #####

# Fit an ETS model
# Adjusting the model parameters according to the characteristics of the data
# Here "A" means additive error, "N" means no trend, and "N" means no seasonality
# change these if needed
fit <- tourism_ts %>%
  model(ETS_M_seaso = ETS(value ~ error("A") + trend("A") + season("A")))
# Forecast the next 2 years periods
forecast <- fit %>%
  forecast(h = 15)
# Plot the forecasts along with the historical data, and make the colors of the forecast a bit more transparent for distinguishably purposes
plot_ets_AAA <- forecast %>%
  autoplot(tourism_ts, alpha = 0.5) +
  labs(title = "Forecast of tourists from Philippines to Zurich",
       x = "Date",
       y = "Number of tourists") + guides(colour = guide_legend(title = "Forecast"))

# Calculate the accuracy of the training set
metrics_ets_AAA <- fit %>% accuracy() %>%
  kable("html", caption = "ETS AAA Metrics") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"))

##### ETS ####
fit <- tourism_ts %>%
  model(ETS_M_seaso_Ad = ETS(value ~ error("A") + trend("Ad") + season("M")))

# Forecast the next 2 years periods
forecast <- fit %>%
  forecast(h = 15, level = 95)

# Extract the forecasts for the ETS_M_seaso_Ad model
forecast_ETS_M_seaso_Ad <- forecast %>%
  filter(.model == "ETS_M_seaso_Ad") %>%
  mutate(
    # Extract mean forecasts
    ETS_M_seaso_Ad_Forecast = .mean,
    # Extract 95% confidence intervals
    `95%` = hilo(value, level = 95)
  ) %>%
  # Unpack the 95% confidence intervals
  unpack_hilo(`95%`, names_sep = "_")

# Now, rename columns as needed
forecast_ETS_M_seaso_Ad <- forecast_ETS_M_seaso_Ad %>%
  rename(
    Date = index,
    ETS_Lo_95 = `95%_lower`,
    ETS_Hi_95 = `95%_upper`
  )

# Select the relevant columns
forecast_ETS_M_seaso_Ad <- forecast_ETS_M_seaso_Ad %>%
  select(Date, ETS_M_seaso_Ad_Forecast, ETS_Lo_95, ETS_Hi_95)

# comparing several model
fit <- tourism_ts %>%
  model(
        ETS_M_seaso = ETS(value ~ error("A") + trend("A") + season("M")), #multiplicative seasonality
        ETS_M_seaso_Ad = ETS(value ~ error("A") + trend("Ad") + season("M")), #dampted trend
  )

# Forecast the next 2 years periods
forecast <- fit %>%
  forecast(h = 15)

# Plot the forecasts along with the historical data, and make the colors of the forecast a bit more transparent for distinguishably purposes
plot_ets_AAM_AAdM <- forecast %>%
  autoplot(tourism_ts, level = 90, color = "blue", alpha = 0.5) +
  labs(title = "Forecast of tourists from Philippines to Zurich",
       x = "Date",
       y = "Number of tourists") + guides(colour = guide_legend(title = "Forecast"))

# Extract AIC values
aic_values <- fit %>%
  glance() %>%
  select(.model, AIC)

# Print the AIC values
print(aic_values)
#> # A tibble: 2 x 2
#>   .model           AIC
#>   <chr>          <dbl>
#> 1 ETS_M_seaso    3256.
#> 2 ETS_M_seaso_Ad 3195.

# Display AIC values with forecast metrics
metrics <- fit %>% accuracy()
metrics_with_aic <- metrics %>%
  left_join(aic_values, by = ".model")

# Display metrics with AIC in an HTML table
metrics_ets <- metrics_with_aic %>%
  kable("html", caption = "Model Accuracy Metrics with AIC Values") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"))

Next, we forecast tourists using ETS AAM and AAdM models. The damped model performs better on most metrics. see the appendix for a vizualisation and metrics of the 2 models

We observe that the damped model is better on almost all metrics.

4.2.0.2 ARIMA

We also ran three ARIMA models for comparison, but they underperformed compared to ETS models. see the appendix for the vizualisation of the 3 models

Code

# Fit an automatic ARIMA model
fit_arima <- tourism_ts %>%
  model(ARIMA_auto = ARIMA(value),
        ARIMA_auto_seasonal = ARIMA(value~season()),
        ARIMA_auto_stepwise = ARIMA(value, stepwise = FALSE, approximation = FALSE))

# Forecast the next 15 months
forecast_arima <- fit_arima %>%
  forecast(h = 15)

# Plot the forecasts along with the historical data
plot_arima_zh <- forecast_arima %>%
  autoplot(tourism_ts, alpha = 0.3) +
  labs(title = "ARIMA Forecast of Tourists from the Philippines to Zurich",
       x = "Date",
       y = "Number of Tourists") +
  guides(colour = guide_legend(title = "Forecast"))

You can also check the appendix for the metrics of the ARIMA models

Code

# Extract AIC values
aic_values <- fit_arima %>%
  glance() %>%
  select(.model, AIC)

# Calculate the accuracy of the training set
metrics_arima <- fit_arima %>% accuracy()

# Combine accuracy metrics with AIC values
metrics_arima <- metrics_arima %>%
  left_join(aic_values, by = ".model")

# Display accuracy metrics with AIC values in an HTML table
metrics_arima_table <- metrics_arima %>%
  kable("html", caption = "Model Accuracy Metrics with AIC Values") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"))

Code

# Convert data to a time series object
tourism_ts <- tourism_data_zurich_philippines
#rename Date as date
names(tourism_ts)[names(tourism_ts) == "Date"] <- "date"
  
#read .csv with stringency index
stringency_df <- read.csv(here("data/stringency_index.csv"))

# Filter data by location
stringency_philippines <- filter(stringency_df, location == "Philippines")
stringency_switzerland <- filter(stringency_df, location == "Switzerland")

# Convert dates and set them to the first day of the month
stringency_philippines$date <- as.Date(format(dmy(stringency_philippines$date), "%Y-%m-01"))
stringency_switzerland$date <- as.Date(format(dmy(stringency_switzerland$date), "%Y-%m-01"))

# Aggregate to monthly average, ensuring date format is maintained
stringency_philippines <- stringency_philippines %>%
  group_by(date) %>%
  summarize(avg_stringency_index = mean(stringency_index, na.rm = TRUE))

stringency_switzerland <- stringency_switzerland %>%
  group_by(date) %>%
  summarize(avg_stringency_index = mean(stringency_index, na.rm = TRUE))

# Merge with Philippines data first
merged_data_philippines <- merge(tourism_ts, stringency_philippines, by = "date", all.x = TRUE)

# Then merge with Switzerland data
merged_data <- merge(merged_data_philippines, stringency_switzerland, by = "date", all.x = TRUE, suffixes = c("_PH", "_SW"))

# Replace NA values in avg_stringency_index with 0 if necessary
merged_data$avg_stringency_index_PH[is.na(merged_data$avg_stringency_index_PH)] <- 0
merged_data$avg_stringency_index_SW[is.na(merged_data$avg_stringency_index_SW)] <- 0


# Create a ggplot of the stringency index
stringency_zh <- ggplot(merged_data, aes(x = date, y = avg_stringency_index_PH, color = "Philippines")) +
  geom_line() +
  geom_line(aes(y = avg_stringency_index_SW, color = "Switzerland")) +
  labs(title = "Stringency Index in the Philippines and Switzerland",
       x = "Date",
       y = "Stringency Index") +
  scale_color_manual(values = c("#3C5B6F", "darkred"),
                     labels = c("Philippines", "Switzerland"))

Code

#show merge data using reactable
stringency_table <- reactable(merged_data, 
          searchable = TRUE,
          defaultPageSize = 5,
          sortable = TRUE,
          filterable = TRUE,
          pagination = TRUE,
          bordered = TRUE,
          highlight = TRUE,
          striped = TRUE,
          compact = TRUE,
          showPageSizeOptions = TRUE,
          showSortable = TRUE
          )

4.2.0.3 ARIMAX

We conduct a multiple regression analysis with the number of tourists as the dependent variable and the average stringency index for Switzerland and the Philippines as explanatory variables. The results show counter-intuitive impacts: increased stringency in the Philippines is associated with more tourists, while increased stringency in Switzerland leads to fewer tourists. Only 9.3% of the variance is explained, indicating other influencing factors. We apply an ARIMAX model, fixing seasonality as TRUE and stepwise and approximation as FALSE for a thorough search. See the appendix for ARIMAX model visualizations and metrics

Code

# Transform to a time series object with frequency 12 (monthly data)
tourism_ts <- ts(merged_data$value, frequency = 12, start = c(2005, 1))  # Adjust the start time as per your data

# Ensure exogenous variables have the same length and frequency
exog_data <- cbind(merged_data$avg_stringency_index_PH, merged_data$avg_stringency_index_SW)

# Check if lengths of tourism_ts and exog_data match
if (length(tourism_ts) == nrow(exog_data)) {
  # Fit an ARIMAX model
  model <- auto.arima(tourism_ts, xreg = exog_data, seasonal = TRUE, stepwise = FALSE, approximation = FALSE)
  
  # Summary of the model
  summary(model)
  
  # Forecast the next 15 months, setting future exogenous variables to 0
  future_exog <- matrix(0, nrow = 15, ncol = 2)
  
  forecast_values <- forecast(model, xreg = future_exog, h = 15)
  
  # Convert the forecast to a data frame
  forecast_arimax <- as.data.frame(forecast_values)
  
  # Rename the columns for clarity
  colnames(forecast_arimax) <- c("Point Forecast", "Lo 80", "Hi 80",
                                 "Lo 95", "Hi 95")
  forecast_arimax$Date <- seq(as.Date("2023-10-01"), by = "month",
                              length.out = 15)
  # Plot the forecast along with the actual data using autoplot from the forecast package
  plot_forecast_arimax <- autoplot(forecast_values, series = "Forecast") +
    autolayer(tourism_ts, series = "Actual Data") +
    labs(title = "ARIMAX Forecast of Tourists from the Philippines to Zurich",
         x = "Date",
         y = "Number of Tourists") +
    guides(colour = guide_legend(title = "Data Type"))
  
    # Calculate evaluation metrics on the training data
  residuals <- residuals(model)
  mae <- mean(abs(residuals))
  mape <- mean(abs(residuals / tourism_ts)) * 100
  rmse <- sqrt(mean(residuals^2))
  aic <- AIC(model)
  bic <- BIC(model)
  
  # Print evaluation metrics
  cat("Evaluation Metrics:\n")
  cat("MAE:", mae, "\n")
  cat("MAPE:", mape, "\n")
  cat("RMSE:", rmse, "\n")
  cat("AIC:", aic, "\n")
  cat("BIC:", bic, "\n")
} else {
  stop("Length of tourism_ts and exog_data do not match. Please check the data.")
}
#> Evaluation Metrics:
#> MAE: 65 
#> MAPE: 74.9 
#> RMSE: 105 
#> AIC: 2744 
#> BIC: 2774

Code

#show metric in html
model_metrics <- data.frame(
  Model = "ARIMAX",
  MAE = mae,
  MAPE = mape,
  RMSE = rmse,
  AIC = aic,
  BIC = bic
)

#show html table with metrics
metrics_arimax <- model_metrics %>%
  kable("html", caption = "Model Metrics") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"))

Overall, ARIMA_auto_stepwise balances RMSE and AIC well, while ARIMAX excels in MAE and MAPE. Depending on the prioritized metric, either ARIMAX or ARIMA_auto_stepwise is preferred.

See the appendix for additional ideas we considered

5 Model Selection

5.1 Vaud

In the model selection process, we compare the performance of different forecasting models using various metrics to determine which model provides the most accurate and reliable forecasts for the number of tourists in Vaud.

Code

# Create the data frame for metrics
data.frame(
  model = c("ETS Vaud", "ARIMA_auto", "ARIMA_dif", "TSLM Vaud"),
  RMSE = c(8790, 8757, 8654, 22053),
  MAE = c(5328, 5170, 5004, 19487),
  MAPE = c(9.41, 9.11, 9.04, 26.1),
  MASE = c(0.458, 0.445, 0.431, 0.553),
  AIC = c(5339, 4508, 4485, 4521)
) %>%
  kable("html", caption = "Model Metrics for Vaud") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"))

Model Metrics for Vaud
model	RMSE	MAE	MAPE	MASE	AIC
ETS Vaud	8790	5328	9.41	0.458	5339
ARIMA_auto	8757	5170	9.11	0.445	4508
ARIMA_dif	8654	5004	9.04	0.431	4485
TSLM Vaud	22053	19487	26.10	0.553	4521

When comparing forecasting models, we prioritize metrics like AIC, RMSE, MAE, and MAPE, where lower values indicate better accuracy. ARIMA (5,1,0)(0,1,1) consistently achieves the lowest AIC, RMSE, and MAE, and also performs well on MASE, indicating high relative accuracy. Therefore, ARIMA (5,1,0)(0,1,1) is the preferred model for forecasting tourist numbers in Vaud. Its MAPE of 9.04% suggests that, on average, the model’s forecasts deviate from actual values by 9.04%, indicating reasonable performance with some room for improvement.

Code

# Filter forecasts for ARIMA_dif model
forecast_arima_dif <- forecast_arima_vaud %>%
  filter(.model == "ARIMA_dif")

# Extract month and year from the index
forecast_arima_dif$Month <- substr(forecast_arima_dif$index, 6, 8)
forecast_arima_dif$Year <- substr(forecast_arima_dif$index, 1, 4)

# Create a Date column in "Year Month" format
forecast_arima_dif$Date <- paste(forecast_arima_dif$Year, forecast_arima_dif$Month)

# Remove Month and Year columns if needed
forecast_arima_dif <- subset(forecast_arima_dif, select = -c(Month, Year))

# Extract mean and variance from the "value" column
mean_variance <- str_extract_all(forecast_arima_dif$value, "[0-9.]+")
#> Warning in stri_extract_all_regex(string, pattern, simplify =
#> simplify, : argument is not an atomic vector; coercing

# Extract mean and variance values
mean_values <- as.numeric(sapply(mean_variance, function(x) x[1]))
variance_values <- as.numeric(sapply(mean_variance, function(x) x[2]))

# Calculate standard deviations
standard_deviations <- sqrt(variance_values)

# Calculate lower and upper confidence intervals
forecast_arima_dif$Lower_CI <- mean_values - 1.96 * standard_deviations
forecast_arima_dif$Upper_CI <- mean_values + 1.96 * standard_deviations

# Create reactable table with selected columns
forecast_vaud_finaltable <- reactable(
  forecast_arima_dif[, c("Date", ".mean", "Lower_CI", "Upper_CI")],
  columns = list(
    Date = colDef(name = "Date"),
    Forecast_value = colDef(name = ".mean"),
    Lower_CI = colDef(name = "Lower CI"),
    Upper_CI = colDef(name = "Upper CI")
  ),
  defaultColDef = colDef(format = colFormat(digits = 2))
)

5.2 Zurich and Philippines

The performance measures of each of the models we have predicted help us to select the model that best fits our data, i.e. the model that best fits between goodness of fit and complexity.

Code

data.frame(
  model = c("NAIVE", "ETS_M_seaso", "ETS_M_seaso_Ad", "ARIMA_auto", "ARIMA_auto_seasonal", "ARIMA_auto_stepwise", "ARIMAX"),
  RMSE = c(135, 85.7, 74.7, 113, 105, 104, 65),
  MAE = c(79.2, 55.3, 48.9, 67.7, 70.4, 67.6, 74.9),
  MAPE = c(48.3, 102, 105, 123, 146, 143, 105),
  AIC = c("-", 3256, 3195, 2771, 2751, 2738, 2744)
) %>%
  kable("html", caption = "Zurich-Philippines Metric Table") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"))

Zurich-Philippines Metric Table
model	RMSE	MAE	MAPE	AIC
NAIVE	135.0	79.2	48.3	-
ETS_M_seaso	85.7	55.3	102.0	3256
ETS_M_seaso_Ad	74.7	48.9	105.0	3195
ARIMA_auto	113.0	67.7	123.0	2771
ARIMA_auto_seasonal	105.0	70.4	146.0	2751
ARIMA_auto_stepwise	104.0	67.6	143.0	2738
ARIMAX	65.0	74.9	105.0	2744

The project’s main objective is to accurately predict the number of visitors, focusing on the best MAE and RMSE values. The ETS_M_seaso_Ad model has the lowest MAE and RMSE, making it the preferred choice for precise predictions essential for decision-making in Zurich’s tourism sector. However, ARIMAX performs better in AIC and MAPE. Combining both forecasts could offer a more robust prediction, providing increased robustness, reduced bias, and improved overall performance through a ‘forecasting ensemble’ approach.

Check the Appendix for the forecasted values

Code

##### ARIMAX #####
# Rename the columns for clarity
colnames(forecast_arimax) <- c("ARIMAX_Forecast", "ARIMAX_Lo_80", "ARIMAX_Hi_80", "ARIMAX_Lo_95", "ARIMAX_Hi_95", "Date")

# Convert the 'Date' column to Date type for merging
forecast_arimax$Date <- as.Date(forecast_arimax$Date)

forecast_ETS_M_seaso_Ad$Date <- as.Date(forecast_ETS_M_seaso_Ad$Date)
forecast_ETS_M_seaso_Ad
#> # A tibble: 15 x 4
#>    Date       ETS_M_seaso_Ad_Forecast ETS_Lo_95 ETS_Hi_95
#>    <date>                       <dbl>     <dbl>     <dbl>
#>  1 2023-10-01                   1126.      973.     1279.
#>  2 2023-11-01                    879.      718.     1045.
#>  3 2023-12-01                   1571.     1351.     1790.
#>  4 2024-01-01                    458.      299.      616.
#>  5 2024-02-01                    417.      249.      586.
#>  6 2024-03-01                    526.      322.      731.
#>  7 2024-04-01                   1098.      751.     1448.
#>  8 2024-05-01                   1092.      730.     1465.
#>  9 2024-06-01                   1002.      653.     1366.
#> 10 2024-07-01                    863.      537.     1184.
#> 11 2024-08-01                    730.      429.     1034.
#> 12 2024-09-01                    917.      542.     1286.
#> 13 2024-10-01                   1177.      706.     1662.
#> 14 2024-11-01                    916.      528.     1295.
#> 15 2024-12-01                   1632.      944.     2291.

# Merge the two data frames based on the Date
combined_forecast <- merge(forecast_arimax, forecast_ETS_M_seaso_Ad, by = "Date")

# Calculate the average forecast and average confidence intervals
combined_forecast <- combined_forecast %>%
  mutate(
    Avg_Forecast = (ARIMAX_Forecast + ETS_M_seaso_Ad_Forecast) / 2,
    Avg_Lo_95 = (ARIMAX_Lo_95 + ETS_Lo_95) / 2,
    Avg_Hi_95 = (ARIMAX_Hi_95 + ETS_Hi_95) / 2
  )

# Select the relevant columns
final_forecast <- combined_forecast %>%
  select(Date, Avg_Forecast, Avg_Lo_95, Avg_Hi_95)


# Display the combined table using reactable
final_forecast_table_zh <- reactable(
  final_forecast,
  columns = list(
    Date = colDef(name = "Date"),
    Avg_Forecast = colDef(
      name = "Average Forecast",
      format = colFormat(digits = 2)
    ),
    Avg_Lo_95 = colDef(
      name = "Average 95% Lower CI",
      format = colFormat(digits = 2)
    ),
    Avg_Hi_95 = colDef(
      name = "Average 95% Upper CI",
      format = colFormat(digits = 2)
    )
  ),
  defaultPageSize = 5,
  highlight = TRUE,
  striped = TRUE,
  bordered = TRUE,
  filterable = TRUE,
  searchable = TRUE,
  sortable = TRUE,
  resizable = TRUE
)

#save final forecast in a .csv
#write.csv(final_forecast, "../data/final_forecast.csv", row.names=FALSE)

We can see that there is a significant difference between the two models. The differences (delta) are sometimes very high, for example -351, indicating that the two models capture different seasonal components or trends. Combining the two and taking their average would provide robust and balanced predictions.

6 Conclusion

In conclusion, this analysis highlighted the significant impact of global events on tourism in Vaud and Zurich. Seasonal patterns and the COVID-19 pandemic notably influenced tourist behavior. Forecasting models like ETS and ARIMA effectively captured these patterns, providing reliable predictions for future tourist flows. This data is crucial for strategic planning and decision-making in these cantons. We recommend ongoing monitoring and adaptive forecasting to account for unforeseen global impacts, ensuring sustainable tourism growth in Vaud and Zurich.

7 Appendix

7.1 General Overview

As the dataset provided is in German, we have translated the data in English to make it more intuitive and understandable for everyone. Then, we created a new ‘Date’ column, year-month-day, which corresponds to the correct format to be able to make predictions.

Here are the first 1000 instances of the raw data :

Code

#show data using reactable only showing the first 100 rows
reactable(head(tourism_data_no_total, 1000), 
          searchable = TRUE,
          defaultPageSize = 5,
          sortable = TRUE,
          filterable = TRUE,
          pagination = TRUE,
          bordered = TRUE,
          highlight = TRUE,
          striped = TRUE,
          compact = TRUE,
          showPageSizeOptions = TRUE,
          showSortable = TRUE
          )

7.2 Vaud - Tourism Data

Code

tourism_vaud
#> # A tibble: 17,550 x 6
#>    Herkunftsland          Kanton Monat   Jahr  value Date      
#>    <chr>                  <chr>  <fct>   <chr> <dbl> <date>    
#>  1 Herkunftsland - Total  Vaud   January 2005  57942 2005-01-01
#>  2 Schweiz                Vaud   January 2005  27853 2005-01-01
#>  3 Baltische Staaten      Vaud   January 2005    103 2005-01-01
#>  4 Deutschland            Vaud   January 2005   3052 2005-01-01
#>  5 Frankreich             Vaud   January 2005   7667 2005-01-01
#>  6 Italien                Vaud   January 2005   2154 2005-01-01
#>  7 Österreich             Vaud   January 2005    192 2005-01-01
#>  8 Vereinigtes Königreich Vaud   January 2005   3695 2005-01-01
#>  9 Irland                 Vaud   January 2005    118 2005-01-01
#> 10 Niederlande            Vaud   January 2005   1365 2005-01-01
#> # i 17,540 more rows

7.3 VD - Plot visitors from countries

Code

# Create the ggplot object
plot_vaud <- tourism_vaud %>% 
  filter(Herkunftsland != 'Herkunftsland - Total') %>%
  ggplot(aes(x = Date, y = value, group = Herkunftsland, color = Herkunftsland,
             text = paste("Country:", Herkunftsland, "Trips:", value))) +  # Added text for tooltip
  geom_line(show.legend = FALSE) + 
  scale_color_viridis_d() +  # Use viridis color palette
  labs(title = "Number of visitors from Each Country to Vaud",
       x = "Date",
       y = "Number of Trips") +
  theme_minimal()

# Convert to an interactive plotly object with specified width and height
interactive_plot <- ggplotly(plot_vaud, tooltip = "text", width = 600, height = 400)


# Adjust plotly settings 
interactive_plot %>%
  layout(
    margin = list(l = 60, r = 60, b = 60, t = 80),
    showlegend = FALSE # Remove legend
  )

7.4 Vd - STL

Code

stl_vd

7.5 VD - ETS forecast

Code

forecast_ets_vaud <- forecast(ets_vaud, h = 15) %>% plot(main = "Forecast of visitors in Vaud", xlab = "Date", ylab = "Number of visitors")

7.6 VD - ARIMA forecast

Code

plot_arima_vaud

7.7 VD - TSLM forecast

Code

forecast_vaud_stringency %>%
  autoplot(merged_vaud_tsibble) +
  labs(title = "Tourist Forecast for Canton Vaud with Stringency Index = 0", 
       y = "Number of Tourists", x = "Date") +
  guides(colour = guide_legend(title = "Forecast"))

7.8 ZH - Tourism Data

We filtered the “Kanton” column to keep only the canton of Zurich.

Here are the first 1000 instances of the raw data :

Code

#show the data in a table using reactable
reactable(head(tourism_data_zurich, 1000), 
          searchable = TRUE,
          defaultPageSize = 5,
          sortable = TRUE,
          filterable = TRUE,
          pagination = TRUE,
          bordered = TRUE,
          highlight = TRUE,
          striped = TRUE,
          compact = TRUE,
          showPageSizeOptions = TRUE,
          showSortable = TRUE
          )

Code

table_zh_ph

There are 1869 missing values for the two sub-datasets. These missing values come from the ‘value’ column, creating gaps in the time series. We’ll see later how we’re going to process them to do modelling.

7.9 ZH -Zurich and All visiting countries

The graph shows the monthly number of overnight stays in Zurich from tourists of different countries.

Code

# Display the interactive plot
interactive_plot_zh_all

7.10 ZH - STL

The additive time series decomposition of the monthly overnight stays for tourists coming from the Philippines to the canton of Zurich shows:

Code

stl_zh

The multiplicative time series decomposition of the monthly overnight stays for tourists coming from the Philippines to the canton of Zurich shows:

Code

stl_zh_m

7.11 ZH - Seasonality

Seasonal sub-series plot permit to better visualize the monthly fluctuations of each year, from 2005 to 2023.

Code

# Plot the seasonality in one chart
ggseasonplot(tourism_ts, year.labels = TRUE, year.labels.left = TRUE) +
  scale_color_viridis_d() +
  theme_minimal()

Code

seaso_plot_zh_2

We clearly observe here that the seasonality is not constant over time.

The months of May to July and October seem to have visitor peaks, which may indicate a high tourist season during this period. As we saw before, this is due to their calendar. The years 2022 and 2023 show a significant increase in visitor numbers compared with previous years. In particular, the months from May to October 2022 and 2023 show much higher values. This growth may be due to several factors, such as a post-pandemic recovery in travel or specific initiatives that have attracted more tourists.

7.12 ZH - Naive Forecast

The graph shows the historical trend in the number of tourists from the Philippines to Zurich and the forecasts for the next 15 months using the naive model. We took the graph representing the total number of tourists coming from the Philippines to Zurich.

Code

plot_naive

This naive model predicts that future values will be equal to the last observed value in the time series. It does not take into account the past events like the pandemic and assumes here that the levels observed after this extreme fall will remain unchanged. The model does not take into account trends or seasonality neither, which are very present in our case. It’s a simplified approach.

The blue areas represent the 80% (darker) and 95% (lighter) confidence intervals of the forecasts. The wider the interval, the greater the uncertainty about the long-term forecasts which is the case here.

We can see here the metrics of the naive model :

Code

metrics_naive

Naive Model Metrics
.model	.type	ME	RMSE	MAE	MPE	MAPE	MASE	RMSSE	ACF1
NAIVE	Training	3.44	135	79.2	-18.8	48.3	0.698	0.629	-0.3

7.13 ZH - ETS AAA model

Code

plot_ets_AAA

Clearly see here that the confidence interval is too big, almost like a naive forecast, therefore as suspected we will move to a multiplicative seasonality model.

We see here the metrics of the ETS AAA model :

Code

metrics_ets_AAA

ETS AAA Metrics
.model	.type	ME	RMSE	MAE	MPE	MAPE	MASE	RMSSE	ACF1
ETS_M_seaso	Training	5.5	108	73	-61.7	124	0.643	0.503	0.034

7.14 ZH - ETS M model

Code

plot_ets_AAM_AAdM

Code

metrics_ets

Model Accuracy Metrics with AIC Values
.model	.type	ME	RMSE	MAE	MPE	MAPE	MASE	RMSSE	ACF1	AIC
ETS_M_seaso	Training	2.93	85.7	55.3	-81.5	102	0.487	0.399	0.040	3256
ETS_M_seaso_Ad	Training	2.06	74.7	48.9	-78.7	105	0.431	0.348	0.181	3195

7.15 ZH - ARIMA Models

We apply three ARIMA models and determine which one gives us the best accury in term of prediction :

An ARIMA_auto, which will automatically determine the optimal values of the ARIMA parameters (p, d, q) using AIC criteria.
An ARIMA_auto_seasonal, which works like the ARIMA_auto but takes into account the monthly seasonality of our data.
ARIMA_auto_stepwise, which performs an exhaustive search over all possible combinations of parameters by disabling the stepwise and approximation options. It increases the accuracy of the model and do not take seasonal components into account.

Code

plot_arima_zh

From the graph we can see that the ARIMA_auto_seasonal model gives more negative predictions over the next 15 months than the other two models. The ARIMA_auto_stepwise model predicts the highest values while the ARIMA_auto model lies between the two prediction curves.

Here are the metrics of the ARIMA model :

Code

metrics_arima_table

Model Accuracy Metrics with AIC Values
.model	.type	ME	RMSE	MAE	MPE	MAPE	MASE	RMSSE	ACF1	AIC
ARIMA_auto	Training	6.74	113	67.7	-68.7	123	0.597	0.526	-0.024	2771
ARIMA_auto_seasonal	Training	4.47	105	70.4	-72.8	146	0.620	0.489	0.027	2751
ARIMA_auto_stepwise	Training	5.80	104	67.6	-72.2	143	0.596	0.485	-0.005	2738

According to the table of metrics, the ARIMA_auto_stepwise model appears to be the best choice with the lowest RMSE, MAE and AIC values. This indicates greater accuracy and better model fit despite the computation time involved with this type of model. If computing resources permit, this time cost is recouped by greater accuracy and therefore better fit. ARIMA_auto_seasonal is very close to the stepwise model in terms of metrics but shows good model accuracy.

7.16 ZH - Stringency Index

The OxCGRT provides data on government responses to COVID-19, including a Stringency Index that measures the severity of lockdown measures. The Stringency Index is calculated based on nine key metrics: School closures, Workplace closures, Cancellation of public events, Restrictions on public gatherings, Closures of public transport, Stay-at-home requirements, International travel controls, etc… Each metric is scored from 0 to 100, with higher scores indicating stricter responses. The overall Stringency Index is the average score of these metrics, reflecting the strictest sub-regional policies if they vary.

source - Our World In Data

Code

stringency_zh
stringency_table

7.17 ZH - Multiple Regression Analysis

Simple yet effective, if the relationships between the covariates and the dependent variable (tourist numbers) are linear.

Here is a summary of the effect of the variables on the number of tourists from the Philippines to Zurich :

Code

# Fit a multiple regression model
model <- lm(value ~ avg_stringency_index_PH + avg_stringency_index_SW, data = merged_data)

# Summary of the model
summary(model)
#> 
#> Call:
#> lm(formula = value ~ avg_stringency_index_PH + avg_stringency_index_SW, 
#>     data = merged_data)
#> 
#> Residuals:
#>    Min     1Q Median     3Q    Max 
#> -318.9 -157.3  -69.3   93.7  873.7 
#> 
#> Coefficients:
#>                         Estimate Std. Error t value Pr(>|t|)    
#> (Intercept)               246.32      15.85   15.54   <2e-16 ***
#> avg_stringency_index_PH     5.87       2.55    2.30   0.0221 *  
#> avg_stringency_index_SW   -12.19       3.73   -3.26   0.0013 ** 
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 220 on 222 degrees of freedom
#> Multiple R-squared:  0.0931, Adjusted R-squared:  0.0849 
#> F-statistic: 11.4 on 2 and 222 DF,  p-value: 1.95e-05

# Forecast the next 24 months
forecast_values <- predict(model, newdata = merged_data)

#us gtsummary to show the summary of the model
model %>%
  gtsummary::tbl_regression()

Characteristic	Beta	95% CI¹	p-value
avg_stringency_index_PH	5.9	0.85, 11	0.022
avg_stringency_index_SW	-12	-20, -4.8	0.001
¹ CI = Confidence Interval

We observe that the p-value are lower for Switzerland than for Philippines. This means that government stringency index in Switzerland has a more significant impact on the number of tourists than in the Philippines. Indeed, the beta of -12 for Switzerland and 5.9 for Philippines shows that the number of tourists in Switzerland decreases by 12 for each unit of stringency index, which seems logical. However the number of tourists in the Philippines increases by 5.9 for each unit of stringency index, which is counter-intuitive. This could be due to the fact that the stringency index is not the only factor that influences the number of tourists from the Philippines to Zurich.

We can observe here the metrics:

Code

#create a table with the metrics of the model and show it as an html
model_metrics <- model %>%
  broom::glance()

# show html table with metrics
model_metrics %>%
  kable("html", caption = "Model Metrics") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"))

Model Metrics
r.squared	adj.r.squared	sigma	statistic	p.value	df	logLik	AIC	BIC	deviance	df.residual	nobs
0.093	0.085	220	11.4	0	2	-1531	3070	3084	10734309	222	225

The way higher AIC shows that this model is not the best.

7.18 ZH - Covid

The COVID-19 pandemic is considered a Black Swan event due to its unpredictable nature and global impact. The question is how can you incorporate this event in model predictions ? Adjust forecasts: include covariates that capture the effects of the pandemic. This allows us to adjust our forecasts to take account of the impact of COVID-19. For example, you could include a covariate that captures the effect of closures or travel restrictions on tourism data. Scenario analysis: run scenario simulations to analyse the potential impact of different COVID-19 scenarios on our forecasts. By analysing the different possible outcomes, you can better prepare for the uncertainties associated with the pandemic. Sensitivity analysis: assess the sensitivity of your forecasts to changes in assumptions or key parameters. By performing a sensitivity analysis, you can identify the factors that have the greatest impact on your forecasts and assess the robustness of your models.

7.19 ZH - ARIMAX Model

Code

plot_forecast_arimax

Here are the metrics of the ARIMAX model with the metrics of the ARIMA model for comparison :

Code

metrics_arimax

Model Metrics
Model	MAE	MAPE	RMSE	AIC	BIC
ARIMAX	65	74.9	105	2744	2774

Code

metrics_arima_table

Model Accuracy Metrics with AIC Values
.model	.type	ME	RMSE	MAE	MPE	MAPE	MASE	RMSSE	ACF1	AIC
ARIMA_auto	Training	6.74	113	67.7	-68.7	123	0.597	0.526	-0.024	2771
ARIMA_auto_seasonal	Training	4.47	105	70.4	-72.8	146	0.620	0.489	0.027	2751
ARIMA_auto_stepwise	Training	5.80	104	67.6	-72.2	143	0.596	0.485	-0.005	2738

7.20 ZH - Other Ideas

Other ideas for taking into account the impact of Covid-19 in our prediction models would be: - Create a dummy variable that takes the value 1 during periods of confinement and severe restrictions, and 0 otherwise. This models the direct impact of the pandemic on tourist numbers. If there are missing values, the ARIMA method can be used to interpolate the missing values on the basis of the patterns observed in the available data with the use fill_gaps(). - Implement additional exogenous variables in our ARIMAX model, such as the number of Covid-19 deaths and/or the vaccination rate. - Jump regression models and GARCH models are powerful tools for analysing time series when there are sudden changes or periods of high volatility such as during the Covid-19 pandemic. - Apply machine learning techniques to capture non-linear and complex patterns such as random forest or gradient boosting or recurrent neural networks RNN.

7.21 ZH - Final Forecast Table

Code

final_forecast_table_zh