# Problem

Can we track hurricanes based on past data?

## Applied Skills

MCMC, Metropolis Algorithm, Bayesian Linear Regression

# Introduction

Hurricanes are tropical cyclones whose maximum sustained wind speeds reach at least 74 mph. Hurricane season runs from early June through the end of November, but hurricanes have been known to occur outside of this window. These natural phenomena are both amazing climatological events and destructive forces to be reckoned with, as was seen with Hurricane Maria in 2017. The ability to predict their trajectories is thus an important task not just for the climatologist, but for the statistician.

This report explores the use of Markov chain Monte Carlo (MCMC) methods to predict both hurricane trajectory and damages, using a data set on 356 hurricanes in the North Atlantic area since 1989 and another set covering the damages and costs associated with a smaller subset of 46 hurricanes. This was the final project for Topics in Advanced Statistical Computing and was a solo effort.

# Proposed Models

## Trajectory

For the $$i^{th}$$ hurricane, let $$Y_{i,1}(t), Y_{i,2}(t), Y_{i,3}(t)$$ denote the latitude, longitude and wind speed at time $$t$$. The prescribed model for predicting a hurricane’s trajectory is autoregressive in $$Y_{i,j}$$: $Y_{i,j}(t+6) = \mu_{i,j}(t) + \rho_jY_{i,j}(t) + \epsilon_{i,j}$

where $$\mu_{i,j}$$ is the functional mean of the time and type of the hurricane and $$\rho_j$$ is the autoregressive correlation. The $$\epsilon_{i,j}$$ follow a multivariate normal distribution with mean 0 and covariance matrix $$\Sigma$$.

The functional mean is given by: $\mu_{i,j}(t) = \beta_{0,j} + x_{i,1}\beta_{1,j} + x_{i,2}\beta_{2,j} + x_{i,3}\beta_{3,j} + \sum^3_{k=1} \beta_{3+k,j}\Delta_{i,k}(t-6)$

where $$x_{i,1}$$ is the day of the year, $$x_{i,2}$$ is the year, $$x_{i,3}$$ is the nature of the hurricane, and $$\Delta_{i,k}$$ is the change in latitude, longitude, or wind speed (for $$k = 1, 2, 3$$ respectively) between $$t-6$$ and $$t$$.

We will use the following priors:

- $$\pi(\bf{\beta})$$ are jointly normal with mean 0 and variance $$diag(1,p)$$
- $$\pi(\rho_j)$$ follow a truncated normal $$N_{[0,1]}(0.5, 1/5)$$
- $$\pi(\Sigma^{-1})$$ follows a $$Wishart(3, diag(0.1,3))$$
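As a quick sanity check, these priors are straightforward to sample in R. The sketch below is my own illustration using base R only: the inverse-CDF draw stands in for `msm::rtnorm`, and the dimension 21 matches the trajectory model's $$\beta$$ count.

```r
# Quick sanity draws from the priors above (a sketch, not part of the
# original fitting code).
set.seed(42)

# pi(beta): jointly normal, mean 0, variance diag(1, p)
p <- 21
beta.draw <- rnorm(p, mean = 0, sd = 1)

# pi(rho_j): truncated normal N_[0,1](0.5, 1/5), drawn via the inverse CDF
s <- sqrt(1/5)
u <- runif(3, pnorm(0, 0.5, s), pnorm(1, 0.5, s))
rho.draw <- qnorm(u, mean = 0.5, sd = s)

# pi(Sigma^-1): Wishart(3, diag(0.1, 3)), using stats::rWishart
prec.draw <- rWishart(1, df = 3, Sigma = diag(0.1, 3))[, , 1]
```

Plotting histograms of repeated draws like these is a cheap way to confirm the priors are as diffuse as intended.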

## Damages & Deaths

The second task, concerning damages and deaths, was left as a more open-ended question. Using the insights from the trajectory problem, I chose to model it in a similar way. The focus of this problem was to figure out which predictors contributed to hurricane damages and deaths, and which of those predictors were significant.

Whereas the trajectory problem simultaneously modeled 3 outcomes, this problem has only 2. Our model will thus predict $$(W_{i,1},W_{i,2})$$, where $$j=1$$ represents damages and $$j=2$$ represents deaths, and each $$W_{i,j}$$ is a linear function of the predictors $$X_i$$:

$W_{i,j} = X_i^T\beta_j + \epsilon_{i,j}$

where the errors $$\epsilon_{i,j}$$ follow a multivariate normal distribution with covariance $$\Sigma$$. I used the same priors from the trajectory problem, adjusted to this problem's dimensions.

The available predictors in the data set are: year, month, hurricane nature, max speed, mean speed, max pressure, mean pressure, hours, total affected population, percent affected that are poor and percent affected that are in the USA.

# Data Cleaning

## Trajectory

The original dataset only contained the date and time of each observation as a single string, so all of the time variables had to be extracted from it. To reduce the number of unknown $$\beta$$, I converted nature into an ordinal variable. tidyverse and maps are brought in for plotting, while MCMCpack, msm and matrixcalc provide convenient functions for our priors.

```r
library(tidyverse)
library(msm)
library(MCMCpack)
library(matrixcalc)
library(maps)

# Cleaning and adding needed columns to the dataset
raw = read_csv("./hurricane356.csv") %>%
  mutate(
    date = as.Date(substring(time, 2, 9), format = "%y-%m-%d"),
    year.start = as.Date(paste(substring(time, 2, 3), "-01-01", sep = ""),
                         format = "%y-%m-%d"),
    d.since.start = date - year.start,
    hour = as.numeric(substring(time, 11, 12)),
    Nature = factor(Nature, levels = c("NR", "DS", "ET", "SS", "TS")),
    Nature = as.numeric(Nature)
  ) %>%
  rename(Year = Season, wspeed = Wind.kt) %>%
  dplyr::select(ID, Year, d.since.start, hour, Nature,
                Latitude, Longitude, wspeed) %>%
  janitor::clean_names()
```

```r
# Add in lag columns for each of the columns
hurricanes = NULL
dist.hurricanes = as.character(unique(raw$id))

for (i in 1:length(dist.hurricanes)) {
  df = raw %>% filter(id == dist.hurricanes[i])
  df = df %>%
    mutate(
      latitude.lag1 = lag(latitude, 1),
      longitude.lag1 = lag(longitude, 1),
      wspeed.lag1 = lag(wspeed, 1),
      latitude.diff = latitude - latitude.lag1,
      longitude.diff = longitude - longitude.lag1,
      wspeed.diff = wspeed - wspeed.lag1
    ) %>%
    filter(!is.na(latitude.diff)) %>%
    dplyr::select(-longitude.lag1, -latitude.lag1, -wspeed.lag1)
  hurricanes = bind_rows(hurricanes, df)
}
```

To assess the usefulness of the resulting model, I used data from 80% of the hurricanes as a training set and the rest as a test set.

```r
# Split the distinct hurricanes into training and test groups
train.idx = sample(1:length(dist.hurricanes),
                   size = 0.8 * length(dist.hurricanes),
                   replace = FALSE)
train.names = dist.hurricanes[train.idx]

# Divide up the data according to the hurricane names
train.data = hurricanes %>%
  filter(id %in% train.names) %>%
  mutate(intercept = 1) %>%
  dplyr::select(intercept, d_since_start:nature,
                latitude.diff:wspeed.diff, latitude:wspeed)
train.resp = hurricanes %>%
  filter(id %in% train.names) %>%
  dplyr::select(latitude:wspeed)
test.data = hurricanes %>%
  filter(!(id %in% train.names)) %>%
  mutate(intercept = 1) %>%
  dplyr::select(intercept, d_since_start:nature,
                latitude.diff:wspeed.diff, latitude:wspeed)
test.resp = hurricanes %>%
  filter(!(id %in% train.names)) %>%
  dplyr::select(latitude:wspeed)
```

## Damages & Deaths

For this dataset, damage was recoded as a numeric variable, and nature was again recoded as an ordinal variable.
```r
# Bring in the data and clean it
outcomes = read_csv("./hurricaneoutcome2.csv") %>%
  rename(year = Season, hurricane_id = HurricanID) %>%
  mutate(
    # Strip the leading "$" and parse damages as a number
    Damage = as.numeric(substr(Damage, 2, nchar(Damage))),
    Nature = factor(Nature, levels = c("NR", "DS", "ET", "SS", "TS")),
    Nature = as.numeric(Nature),
    intercept = 1
  ) %>%
  dplyr::select(-Month, -Meanspeed, -Meanpressure) %>%
  dplyr::select(hurricane_id, Damage, Deaths, everything()) %>%
  janitor::clean_names()
```

I also wanted to see how this model would fare with unseen data, so I did a training-test split here as well. The dataset for outcomes was much smaller than the trajectory data, so I opted for a 70%:30% split.

```r
# Split the outcomes into training and test sets
train.idx = sample(1:nrow(outcomes), size = 0.7 * nrow(outcomes),
                   replace = FALSE)

# Put the data into a form that is compatible with matrix multiplication
train.data = outcomes[train.idx,] %>%
  dplyr::select(intercept, year, nature, maxspeed, maxpressure,
                hours, total_pop, percent_poor, percent_usa) %>%
  data.matrix(.)
train.resp = outcomes[train.idx,] %>%
  dplyr::select(damage, deaths) %>%
  as.matrix(.)
test.data = outcomes[-train.idx,] %>%
  dplyr::select(intercept, year, nature, maxspeed, maxpressure,
                hours, total_pop, percent_poor, percent_usa) %>%
  data.matrix(.)
test.resp = outcomes[-train.idx,] %>%
  dplyr::select(damage, deaths) %>%
  as.matrix(.)
```

# Methodology

For both models, I estimated the posterior distributions of the parameters using the Metropolis algorithm. For the $$\beta$$ and $$\rho$$ coefficients, I used a uniform proposal distribution. For $$\Sigma$$, I used a Wishart proposal. I implemented a component-wise version of the algorithm so that I could tune the proposal window for each parameter individually.
## Helper functions

The following are some helper functions I wrote to calculate the log of the (unnormalized) posterior density:

```r
rho.logprior = function(rhos) {
  # Prior for the autoregressive correlations rho
  prob = dtnorm(rhos, mean = 0.5, sd = sqrt(0.2), lower = 0, upper = 1)
  return(sum(log(prob)))
}

sigma.logprior = function(sigma) {
  # Prior for the inverse of the error covariance matrix
  sigma = matrix(sigma, ncol = 3)
  prob = dwish(solve(sigma), 3, diag(0.1, 3))
  return(log(prob))
}

beta.logprior = function(betas) {
  # Priors for all the beta coefficients
  probs = dnorm(betas, mean = 0, sd = 1)
  return(sum(log(probs)))
}

loglikelihood = function(resp, covariates, betas, sigma, rhos) {
  # Likelihood for the given linear model: latitude, longitude & wind
  # speed form a trivariate normal centered at X %*% coef.mat
  y = as.matrix(resp)
  X = data.matrix(covariates)
  betas = matrix(betas, ncol = 3)
  sigma = matrix(sigma, ncol = 3)
  rho.mat = rhos * diag(1, 3)
  coef.mat = rbind(betas, rho.mat)
  diff = y - (X %*% coef.mat)
  loglik = as_tibble(diff) %>%
    transmute(
      lik = pmap(
        list(x = longitude, y = latitude, z = wspeed),
        .f = function(x, y, z) {
          -0.5 * t(c(x, y, z)) %*% solve(sigma) %*% c(x, y, z)
        })
    ) %>%
    unnest(.)
  # The -n/2 * log|Sigma| term matters when Sigma itself is proposed,
  # since it does not cancel in the Metropolis ratio
  return(sum(loglik) - 0.5 * nrow(y) * log(det(sigma)))
}

lpost = function(resp, covariates, rhos, sigma, betas) {
  # Posterior is proportional to the product of priors & likelihood
  loglik = loglikelihood(resp, covariates, betas, sigma, rhos)
  rho.lp = rho.logprior(rhos)
  sigma.lp = sigma.logprior(sigma)
  beta.lp = beta.logprior(betas)
  return(rho.lp + sigma.lp + beta.lp + loglik)
}
```

Understanding the likelihood was the biggest challenge of this problem. After pondering it for a long time, I realized that the models are just linear functions of our covariates, and we know that the residuals follow a multivariate normal distribution, so the likelihood can be derived from these two facts.
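Written out explicitly (my own notational sketch, not from the original assignment): let $$B$$ be the matrix stacking the $$\beta$$ coefficients on top of $$diag(\rho_1, \rho_2, \rho_3)$$, the same arrangement as `coef.mat` in the code, and let $$r_i(t) = Y_i(t+6) - X_i(t)B$$ be the residual row for one observation. Since the residuals are trivariate normal with covariance $$\Sigma$$, the log-likelihood over the $$n$$ training rows is

$\log L(\beta, \rho, \Sigma) = -\frac{n}{2}\log|\Sigma| - \frac{1}{2}\sum_{i,t} r_i(t)\,\Sigma^{-1}\,r_i(t)^T + \text{const}$

and the log-posterior adds the log-priors for $$\beta$$, $$\rho$$, and $$\Sigma^{-1}$$ to this sum. The additive constant safely drops out of every Metropolis ratio, but the determinant term does not cancel when $$\Sigma$$ itself is being proposed.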
## Implementing component-wise Metropolis

Since the two problems have similar implementations, I’ve only included the code for the trajectory algorithm here. The proposal windows below were settled on after much trial and error. I chose to do 10,000 iterations to make sure that I had an ample number of samples left after the burn-in.

```r
# Implementing the component-wise Metropolis algorithm
nreps = 10000
start.betas = rep(0, 21)
start.rhos = c(0.9, 0.9, 0.9)
start.sigma = c(10, 0, 0, 0, 10, 0, 0, 0, 10)
beta.windows = c(1, 0.01, 0.5, 0.1, 0.5, 1, 1,
                 0.5, 0.1, 1, 0.01, 0.5, 1, 1,
                 1, 0.1, 0.1, 0.001, 0.5, 0.5, 0.5)
rho.windows = c(0.05, 0.001, 0.001)
sigma.window = 250

res.betas = beta.mat = start.betas
res.rhos = rho.mat = start.rhos
res.sigma = sigma.mat = start.sigma

for (i in 1:nreps) {
  p = length(res.betas)
  q = length(res.rhos)

  # Iterate through all of the betas first
  for (j in 1:p) {
    prop.betas = res.betas
    prop.betas[j] = res.betas[j] + (runif(1) - 0.5) * 2 * beta.windows[j]
    num = lpost(train.resp, train.data, res.rhos, res.sigma, prop.betas)
    denom = lpost(train.resp, train.data, res.rhos, res.sigma, res.betas)
    if (log(runif(1)) < num - denom) {
      res.betas = prop.betas
    }
  }

  # Now do all of the rhos
  for (k in 1:q) {
    prop.rhos = res.rhos
    prop.rhos[k] = res.rhos[k] + (runif(1) - 0.5) * 2 * rho.windows[k]
    num = lpost(train.resp, train.data, prop.rhos, res.sigma, res.betas)
    denom = lpost(train.resp, train.data, res.rhos, res.sigma, res.betas)
    if (log(runif(1)) < num - denom) {
      res.rhos = prop.rhos
    }
  }

  # Finally the sigma (flattened to a vector so it stacks cleanly below)
  prop.sigma = as.vector(rWishart(1, sigma.window,
                                  matrix(res.sigma, ncol = 3) / sigma.window))
  num = lpost(train.resp, train.data, res.rhos, prop.sigma, res.betas)
  denom = lpost(train.resp, train.data, res.rhos, res.sigma, res.betas)
  if (log(runif(1)) < num - denom) {
    res.sigma = prop.sigma
  }

  # Record all of the states from this iteration
  beta.mat = rbind(beta.mat, res.betas)
  rho.mat = rbind(rho.mat, res.rhos)
  sigma.mat = rbind(sigma.mat, res.sigma)
  print(paste("Iteration", i, "done"))
}

colnames(beta.mat) = c(paste("beta", 0:6, rep(1, 7), sep = ""),
                       paste("beta", 0:6, rep(2, 7), sep = ""),
                       paste("beta", 0:6, rep(3, 7), sep = ""))
colnames(rho.mat) = c("p1", "p2", "p3")
colnames(sigma.mat) = c("11", "12", "13", "21", "22", "23",
                        "31", "32", "33")
```

## Running diagnostics

I ran these checks whenever I finished a run of the algorithm. I needed to make sure that the acceptance probabilities for all 25 of the unknown parameters weren’t too small, ideally above 30%. Because I had so many iterations, I felt that I could tolerate higher-than-typical acceptance probabilities. After checking the acceptance probabilities, I looked at the trace plots of each of the coefficients to see if they converged stably. I later found that some actually didn’t converge.

```r
# The proportion of unique values in a chain approximates its acceptance rate
print("Beta uniqueness:")
matrix(apply(beta.mat, 2, function(x) length(unique(x)) / nreps), ncol = 3)
print("Rho uniqueness:")
matrix(apply(rho.mat, 2, function(x) length(unique(x)) / nreps), ncol = 3)
print("Sigma uniqueness:")
sigs = as_tibble(sigma.mat) %>%
  transmute(
    vals = paste(`11`, `12`, `13`, `21`, `22`, `23`, `31`, `32`, `33`)
  )
round(length(unique(sigs$vals)) / length(sigs$vals), 4)

# Plot the traces
for (l in 1:ncol(beta.mat)) {
  plot(1:nreps, beta.mat[-1, l], type = "o", main = colnames(beta.mat)[l])
}
for (m in 1:ncol(rho.mat)) {
  plot(1:nreps, rho.mat[-1, m], type = "o", main = colnames(rho.mat)[m])
}
```

## Deriving the model

I used the first 5,000 iterations as my burn-in and kept the latter half as the posterior sample. To get point estimates for the model, I took the mean of each posterior distribution. With these coefficients in hand, I calculated predictions for the test dataset and looked at how the model performed.
```r
# Burn the chains
beta.afterburn = beta.mat[5001:nrow(beta.mat),]
rho.afterburn = rho.mat[5001:nrow(rho.mat),]

# Calculate the posterior means for each coefficient
post.beta.means = apply(beta.afterburn, 2, mean)
post.rho.means = apply(rho.afterburn, 2, mean)
post.beta.mat = matrix(post.beta.means, ncol = 3)
post.rho.mat = post.rho.means * diag(1, 3)
post.mat = rbind(post.beta.mat, post.rho.mat)

# Calculate the fitted values and errors
train.fit = data.matrix(train.data) %*% post.mat
train.error = train.resp - train.fit
test.fit = data.matrix(test.data) %*% post.mat
test.error = test.resp - test.fit

# Calculate the mean squared error for each response
overall.lat.mse.train = sum(train.error[,1]^2) / nrow(train.error)
overall.lon.mse.train = sum(train.error[,2]^2) / nrow(train.error)
overall.ws.mse.train = sum(train.error[,3]^2) / nrow(train.error)
overall.lat.mse = sum(test.error[,1]^2) / nrow(test.error)
overall.lon.mse = sum(test.error[,2]^2) / nrow(test.error)
overall.ws.mse = sum(test.error[,3]^2) / nrow(test.error)
```

# Findings

## Trajectory

This model performed pretty well! This project was interesting because hurricanes actually turn and twist, so I was curious to see how well a linear model with autoregressive terms could recreate this. The code below helped me visualize the hurricanes on the map.

```r
# Attach the predictions and errors to each test hurricane
test.hurricanes = hurricanes %>%
  filter(!(id %in% train.names)) %>%
  dplyr::select(id)
hurricane.preds = test.data %>%
  dplyr::select(latitude, longitude, wspeed) %>%
  mutate(
    id = test.hurricanes$id,
    lat.pred = test.fit[,1],
    lon.pred = test.fit[,2],
    ws.pred = test.fit[,3],
    lat.sse = test.error[,1],
    lon.sse = test.error[,2],
    ws.sse = test.error[,3]
  )
```

```r
# Finding the best and worst predicted hurricanes
extremes = hurricane.preds %>%
  group_by(id) %>%
  summarise(
    n = n(),
    lat.mse = sum(lat.sse^2) / n(),
    lon.mse = sum(lon.sse^2) / n(),
    ws.mse = sum(ws.sse^2) / n()
  )

# Plot Hurricane Alberto (2000): actual path vs. predictions
ggplot(data = hurricane.preds) +
  geom_polygon(data = map_data(map = 'world'),
               aes(x = long, y = lat, group = group)) +
  stat_summary_2d(data = subset(hurricane.preds, id == "ALBERTO.2000"),
                  aes(x = longitude, y = latitude, z = wspeed),
                  fun = median, binwidth = c(1, 1),
                  show.legend = TRUE, alpha = 0.75, color = "green") +
  stat_summary_2d(data = subset(hurricane.preds, id == "ALBERTO.2000"),
                  aes(x = lon.pred, y = lat.pred, z = ws.pred),
                  fun = median, binwidth = c(1, 1),
                  show.legend = TRUE, alpha = 0.75, color = "red")
```

Hurricane Alberto of 2000 was the longest-lasting hurricane in my test dataset. The plot shows the actual path of the hurricane (in green) and my model’s predictions (in red). I definitely wouldn’t use this version to track actual hurricanes, but it’s a good proof of concept at least.

## Damages & Deaths

The damages model was fine; the deaths model was… not so fine. The tables below lay out the estimated coefficients and their 95% quantile intervals.

Table 2: Model estimates and 95% quantile intervals for damages

| Coefficient | Estimate | 2.5% Quantile | 97.5% Quantile |
|---|---|---|---|
| Intercept | -0.02800 | -1.9600 | 1.86000 |
| Year | 0.02200 | -0.0855 | 0.11200 |
| Nature | -0.03790 | -1.9200 | 1.91000 |
| Max Speed | 0.23200 | -0.1650 | 0.65400 |
| Max Pressure | -0.06450 | -0.2550 | 0.15000 |
| Hours | -0.01230 | -0.1090 | 0.07700 |
| Total Population | 0.00001 | 0.0000 | 0.00002 |
| % Poor | -0.25400 | -0.6980 | 0.16200 |
| % USA | 0.07580 | -0.1100 | 0.25800 |

Table 3: Model estimates and 95% quantile intervals for deaths

| Coefficient | Estimate | 2.5% Quantile | 97.5% Quantile |
|---|---|---|---|
| Intercept | 0.0356 | -1.84000 | 2.05000 |
| Year | 0.6200 | -0.68600 | 1.74000 |
| Nature | -0.0135 | -2.03000 | 2.03000 |
| Max Speed | 0.0571 | -1.90000 | 1.96000 |
| Max Pressure | 0.2490 | -1.66000 | 2.07000 |
| Hours | 0.1870 | -1.66000 | 2.18000 |
| Total Population | -0.0008 | -0.00237 | 0.00092 |
| % Poor | 0.2040 | -1.70000 | 2.17000 |
| % USA | -0.0235 | -2.05000 | 1.93000 |

According to the resulting model, increases in year, max speed, total population and percent USA are associated with higher damages, while nature, max pressure, hours, and percent poor are associated with lower damages. That said, the 95% quantile intervals for nearly all of the damage coefficients contain zero, so few of these effects can be deemed significant; I surmise that the estimates are imprecise due to the small sample size. The deaths model fared even worse: every one of its quantile intervals comfortably contains zero, perhaps indicative of poor model specification.
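For reference, intervals like those in Tables 2 and 3 come straight from the post-burn-in draws. The sketch below uses a small simulated stand-in for the real `beta.afterburn` matrix, so the numbers are illustrative only:

```r
# Sketch: computing 95% quantile (credible) intervals from posterior draws.
# `beta.afterburn` here is a simulated stand-in for the real matrix of
# retained Metropolis samples (one column per coefficient).
set.seed(1)
beta.afterburn <- cbind(intercept = rnorm(5000, mean = 0,   sd = 1),
                        year      = rnorm(5000, mean = 0.6, sd = 0.2))

# Posterior mean plus the 2.5% and 97.5% quantiles for each coefficient
post.summary <- t(apply(beta.afterburn, 2, function(draws) {
  c(est = mean(draws), quantile(draws, probs = c(0.025, 0.975)))
}))

# A coefficient is flagged "significant" when its interval excludes zero
significant <- post.summary[, "2.5%"] > 0 | post.summary[, "97.5%"] < 0
```

The same `apply` call over the retained $$\rho$$ and $$\Sigma$$ draws yields their intervals as well.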

# Conclusions

I’d call the score 2-1 in terms of successes. The algorithm runs take forever, so I did my best to work within the limited time while still trying to create a good model.

## Personal Takeaways

This project was easily the hardest one I’ve ever undertaken, across my undergrad and Master’s courses. Of the 2 weeks we had to finish the project, about 10 days were dedicated to reading and research to better understand the processes involved. Although I was disappointed that my model for damages and deaths did not perform well, I remember the triumph of seeing my trajectory predictions following the actual paths of the hurricanes.

This project also expanded my brain. In my biostatistics classes, we always have examples from health fields, so it blew my mind to see statistics in other contexts.

Copyright © 2019 Christian B. Pascual. All rights reserved. This site is hosted by Github Pages and is built on R Markdown.