Can we track hurricanes based on past data?

MCMC, Metropolis Algorithm, Bayesian Linear Regression

Hurricanes are tropical cyclones whose maximum sustained wind speeds reach at least 74 mph. Hurricane season runs from early June through the end of November, though hurricanes have been known to occur outside this window. These natural phenomena are both remarkable climatological events and destructive forces to be reckoned with, as Hurricane Maria demonstrated in 2017. Predicting their trajectories is therefore an important task not just for the climatologist, but for the statistician.

This report explores the use of Markov chain Monte Carlo (MCMC) methods to predict both hurricane trajectory and damages, using a data set on 356 hurricanes in the North Atlantic since 1989 and a second set covering damages and costs for a smaller subset of 46 hurricanes. This was the final project for Topics in Advanced Statistical Computing and was a solo effort.

For the \(i^{th}\) hurricane, let \(Y_{i,1}(t), Y_{i,2}(t), Y_{i,3}(t)\) denote the latitude, longitude, and wind speed at time \(t\). The prescribed model for predicting a hurricane’s trajectory is autoregressive in \(Y_{i,j}\): \[ Y_{i,j}(t+6) = \mu_{i,j}(t) + \rho_jY_{i,j}(t) + \epsilon_{i,j} \]

where \(\mu_{i,j}(t)\) is a functional mean depending on time and the characteristics of the hurricane, and \(\rho_j\) is the autoregressive correlation. The errors \(\epsilon_{i,j}\) follow a multivariate normal distribution with mean 0 and covariance matrix \(\Sigma\).

The functional mean is given by: \[ \mu_{i,j}(t) = \beta_{0,j} + x_{i,1}\beta_{1,j} + x_{i,2}\beta_{2,j} + x_{i,3}\beta_{3,j} + \sum^3_{k=1} \beta_{3+k,j}\Delta_{i,k}(t-6) \]

where \(x_{i,1}\) is the day of the year, \(x_{i,2}\) is the year, \(x_{i,3}\) is the nature of the hurricane, and \(\Delta_{i,k}\) is the change in latitude, longitude, and wind speed between \(t-6\) and \(t\).
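To make the autoregressive structure concrete, here is a minimal Python sketch (an illustration only, with made-up coefficient values, not the fitted model) that simulates one 6-hour step for a single outcome of a single hurricane:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical coefficients for one outcome j (say, latitude):
# (intercept, day-of-year, year, nature, delta-lat, delta-lon, delta-wspeed)
beta = np.array([0.5, 0.001, 0.0, 0.1, 0.8, 0.1, 0.01])
rho = 0.95    # autoregressive correlation
sigma = 0.5   # error standard deviation (scalar here for simplicity)

def step(y_t, x, delta):
    """One 6-hour update: Y(t+6) = mu(t) + rho * Y(t) + eps."""
    covariates = np.concatenate(([1.0], x, delta))  # intercept, x_{i,1..3}, deltas
    mu = covariates @ beta
    eps = rng.normal(0.0, sigma)
    return mu + rho * y_t + eps

# Example: latitude 25.0 on day 240 of year 2000, nature code 5,
# recent changes (lat, lon, wspeed) = (0.3, -0.5, 5.0)
y_next = step(25.0, x=np.array([240.0, 2000.0, 5.0]),
              delta=np.array([0.3, -0.5, 5.0]))
print(round(y_next, 2))
```

The deterministic part pulls the next latitude toward \(\mu + \rho Y(t)\), and the noise term supplies the spread modeled by \(\Sigma\).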

We will use the following priors:

- \(\pi(\boldsymbol{\beta})\): jointly normal with mean 0 and covariance \(diag(1,p)\), the \(p \times p\) identity matrix
- \(\pi(\rho_j)\): each follows a truncated normal \(N_{[0,1]}(0.5, 1/5)\)
- \(\pi(\Sigma^{-1})\): follows a \(Wishart(3, diag(0.1,3))\) distribution
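As a sanity check on these priors, a rough Python equivalent using `scipy.stats` (an illustration, not the report's R code) might look like:

```python
import numpy as np
from scipy.stats import norm, truncnorm, wishart

def beta_logprior(betas):
    # Independent N(0, 1) priors on every regression coefficient
    return norm.logpdf(betas, loc=0.0, scale=1.0).sum()

def rho_logprior(rhos):
    # Truncated normal on [0, 1] with mean 0.5 and variance 1/5
    sd = np.sqrt(0.2)
    a, b = (0.0 - 0.5) / sd, (1.0 - 0.5) / sd  # bounds in standardized units
    return truncnorm.logpdf(rhos, a, b, loc=0.5, scale=sd).sum()

def siginv_logprior(sigma_inv):
    # Wishart(df = 3, scale = 0.1 * I_3) prior on the error precision matrix
    return wishart.logpdf(sigma_inv, df=3, scale=0.1 * np.eye(3))

print(beta_logprior(np.zeros(21)))
print(rho_logprior(np.array([0.5, 0.5, 0.5])))
print(siginv_logprior(0.1 * np.eye(3)))
```

Note that `scipy.stats.truncnorm` takes its truncation bounds in standardized units, which is an easy place to slip up when translating between languages.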

The second task, concerning damages and deaths, was left as a more open-ended question. Using insights from the trajectory problem, I chose to model it in a similar way. The goal was to determine which predictors contributed to hurricane damages and deaths, and which of those effects were significant.

Whereas the trajectory problem modeled 3 outcomes simultaneously, this problem has only 2. The predictive model for damages and deaths is represented by \((W_{i,1},W_{i,2})\), where \(j=1\) denotes damages and \(j=2\) denotes deaths, and \(W_{i,j}\) is a linear function of the predictors \(X_i\):

\[ W_{i,j} = X_i^T\beta_j + \epsilon_{i,j} \]

where the errors \(\epsilon_{i,j}\) follow a multivariate normal distribution with covariance \(\Sigma\). I used the same priors from the trajectory problem, adjusted to fit this setting.

The available predictors in the data set are: year, month, hurricane nature, max speed, mean speed, max pressure, mean pressure, hours, total affected population, percent of the affected population that is poor, and percent that is in the USA.

The original dataset only contained the date and time of the hurricane in string form, so all of the time variables had to be extracted from it. To reduce the number of unknown \(\beta\), I converted nature into an ordinal variable. `tidyverse` and `maps` are brought in for plotting; `MCMCpack`, `msm`, and `matrixcalc` provide convenient functions for our priors.

```
library(tidyverse)
library(msm)
library(MCMCpack)
library(matrixcalc)
library(maps)

# Cleaning and adding needed columns to the dataset
raw = read_csv("./hurricane356.csv") %>%
  mutate(
    date = as.Date(substring(time, 2, 9), format = "%y-%m-%d"),
    year.start = as.Date(paste(substring(time, 2, 3), "-01-01", sep = ""),
                         format = "%y-%m-%d"),
    d.since.start = date - year.start,
    hour = as.numeric(substring(time, 11, 12)),
    Nature = factor(Nature, levels = c("NR", "DS", "ET", "SS", "TS")),
    Nature = as.numeric(Nature)
  ) %>%
  rename(Year = Season, wspeed = Wind.kt) %>%
  dplyr::select(ID, Year, d.since.start, hour, Nature,
                Latitude, Longitude, wspeed) %>%
  janitor::clean_names()

# Add lag columns per hurricane so lags never cross hurricane boundaries
hurricanes = NULL
dist.hurricanes = as.character(unique(raw$id))
for (i in 1:length(dist.hurricanes)) {
  df = raw %>%
    filter(id == dist.hurricanes[i]) %>%
    mutate(
      latitude.lag1 = lag(latitude, 1),
      longitude.lag1 = lag(longitude, 1),
      wspeed.lag1 = lag(wspeed, 1),
      latitude.diff = latitude - latitude.lag1,
      longitude.diff = longitude - longitude.lag1,
      wspeed.diff = wspeed - wspeed.lag1
    ) %>%
    filter(!is.na(latitude.diff)) %>%
    dplyr::select(-longitude.lag1, -latitude.lag1, -wspeed.lag1)
  hurricanes = bind_rows(hurricanes, df)
}
```

To assess the usefulness of the resulting model, I used data from 70% of the hurricanes as a training set and the remaining 30% as a test set.

```
# Sample 70% of the distinct hurricanes for the training set
train.idx = sample(1:length(dist.hurricanes),
                   size = 0.7 * length(dist.hurricanes),
                   replace = FALSE)
train.names = dist.hurricanes[train.idx]

# Divide up the data according to the hurricane names
train.data = hurricanes %>%
  filter(id %in% train.names) %>%
  mutate(intercept = 1) %>%
  dplyr::select(intercept, d_since_start:nature,
                latitude.diff:wspeed.diff,
                latitude:wspeed)
train.resp = hurricanes %>%
  filter(id %in% train.names) %>%
  dplyr::select(latitude:wspeed)
test.data = hurricanes %>%
  filter(!(id %in% train.names)) %>%
  mutate(intercept = 1) %>%
  dplyr::select(intercept, d_since_start:nature,
                latitude.diff:wspeed.diff,
                latitude:wspeed)
test.resp = hurricanes %>%
  filter(!(id %in% train.names)) %>%
  dplyr::select(latitude:wspeed)
```

For this dataset, damage was recoded as a numeric variable, and nature was again recoded as an ordinal variable.

```
# Bring in the data and clean it
outcomes = read_csv("./hurricaneoutcome2.csv") %>%
  rename(year = Season, hurricane_id = HurricanID) %>%
  mutate(
    # Drop the leading currency symbol and parse damage as a number
    Damage = as.numeric(substr(Damage, 2, nchar(Damage))),
    Nature = factor(Nature, levels = c("NR", "DS", "ET", "SS", "TS")),
    Nature = as.numeric(Nature),
    intercept = 1
  ) %>%
  dplyr::select(-Month, -Meanspeed, -Meanpressure) %>%
  dplyr::select(hurricane_id, Damage, Deaths, everything()) %>%
  janitor::clean_names()
```

I also wanted to see how this model would fare with unseen data, so I did a training-test split here as well. The dataset for outcomes was much smaller than the trajectory data, so I opted for a 70%:30% split.

```
# Sample 70% of the hurricanes for the training set
train.idx = sample(1:nrow(outcomes),
                   size = 0.7 * nrow(outcomes),
                   replace = FALSE)

# Put the data into a form that is compatible with matrix multiplication
train.data = outcomes[train.idx, ] %>%
  dplyr::select(intercept, year, nature, maxspeed, maxpressure, hours,
                total_pop, percent_poor, percent_usa) %>%
  data.matrix(.)
train.resp = outcomes[train.idx, ] %>%
  dplyr::select(damage, deaths) %>%
  as.matrix(.)
test.data = outcomes[-train.idx, ] %>%
  dplyr::select(intercept, year, nature, maxspeed, maxpressure, hours,
                total_pop, percent_poor, percent_usa) %>%
  data.matrix(.)
test.resp = outcomes[-train.idx, ] %>%
  dplyr::select(damage, deaths) %>%
  as.matrix(.)
```

For both models, I estimated the posterior distributions of the parameters using the Metropolis algorithm. For the \(\beta\) and \(\rho\) coefficients, I used a uniform random-walk proposal; for \(\Sigma\), a Wishart proposal. I implemented a component-wise version of the algorithm so that I could tune the proposal window for each parameter individually.
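The component-wise scheme is easiest to see on a toy target. This Python sketch (an illustration of the technique, not the report's R implementation) updates each coordinate of a 2-D standard normal target in turn with a uniform random-walk proposal:

```python
import numpy as np

rng = np.random.default_rng(1)

def log_target(theta):
    # Toy target: two independent standard normals
    return -0.5 * np.sum(theta ** 2)

def componentwise_metropolis(nreps=5000, windows=(1.0, 1.0)):
    theta = np.zeros(2)
    chain = np.empty((nreps, 2))
    for i in range(nreps):
        for j, w in enumerate(windows):
            prop = theta.copy()
            # Uniform random-walk proposal on component j only
            prop[j] += (rng.uniform() - 0.5) * 2 * w
            # Accept with probability min(1, target ratio), on the log scale
            if np.log(rng.uniform()) < log_target(prop) - log_target(theta):
                theta = prop
        chain[i] = theta
    return chain

chain = componentwise_metropolis()
print(chain.mean(axis=0), chain.std(axis=0))
```

Each component gets its own window `w`, which is exactly what allows per-parameter tuning: a coordinate that mixes poorly can be given a different window without disturbing the others.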

The following are some helper functions I made to calculate the (log) of the posterior probability:

```
rho.logprior = function(rhos) {
  # Prior for the autoregressive correlations:
  # truncated N(0.5, 1/5) on [0, 1]
  prob = dtnorm(rhos, mean = 0.5, sd = sqrt(0.2), lower = 0, upper = 1)
  return(sum(log(prob)))
}

sigma.logprior = function(sigma) {
  # Prior for the inverse of the error covariance matrix:
  # Wishart(3, diag(0.1, 3))
  sigma = matrix(sigma, ncol = 3)
  prob = dwish(solve(sigma), 3, diag(0.1, 3))
  return(log(prob))
}

beta.logprior = function(betas) {
  # Independent N(0, 1) priors for all the beta coefficients
  probs = dnorm(betas, mean = 0, sd = 1)
  return(sum(log(probs)))
}

loglikelihood = function(resp, covariates, betas, sigma, rhos) {
  # Likelihood for the given linear model:
  # latitude, longitude & wind speed form a trivariate normal
  # centered at X %*% coef.mat
  y = as.matrix(resp)
  X = data.matrix(covariates)
  betas = matrix(betas, ncol = 3)
  sigma = matrix(sigma, ncol = 3)
  rho.mat = rhos * diag(1, 3)
  coef.mat = rbind(betas, rho.mat)
  diff = y - (X %*% coef.mat)
  sigma.inv = solve(sigma)
  # Quadratic form for each residual vector, with the columns of diff
  # (latitude, longitude, wspeed) in the same order as Sigma
  quad = rowSums((diff %*% sigma.inv) * diff)
  # Include the log-determinant term so that the likelihood remains
  # comparable across proposed values of Sigma
  loglik = -0.5 * nrow(diff) * log(det(sigma)) - 0.5 * sum(quad)
  return(loglik)
}

lpost = function(resp, covariates, rhos, sigma, betas) {
  # Log-posterior is the sum of the log-priors and the log-likelihood
  loglik = loglikelihood(resp, covariates, betas, sigma, rhos)
  rho.lp = rho.logprior(rhos)
  sigma.lp = sigma.logprior(sigma)
  beta.lp = beta.logprior(betas)
  return(rho.lp + sigma.lp + beta.lp + loglik)
}
```

Understanding the likelihood in this problem was the biggest challenge. After pondering it for a long time, I realized that the models are just linear functions of our covariates, and we know that the residuals follow a multivariate normal distribution, so the likelihood can be derived from these two facts.
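Concretely, writing \(r_i(t)\) for the residual vector \(\big(Y_{i,j}(t+6) - \mu_{i,j}(t) - \rho_j Y_{i,j}(t)\big)_{j=1}^{3}\), the log-likelihood over all \(n\) observed 6-hour transitions is, up to an additive constant,

\[ \ell(\beta, \rho, \Sigma) = -\frac{n}{2}\log|\Sigma| - \frac{1}{2}\sum_{i,t} r_i(t)^T \Sigma^{-1} r_i(t). \]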

Since the two problems have similar implementations, I’ve included only the code for the first problem’s algorithm here. The chosen proposal windows were the product of *much* trial and error. I ran 10,000 iterations to make sure that I had an ample sample after the burn-in.

```
# Implementing the component-wise Metropolis algorithm
nreps = 10000
start.betas = rep(0, 21)
start.rhos = c(0.9, 0.9, 0.9)
start.sigma = c(10, 0, 0, 0, 10, 0, 0, 0, 10)
beta.windows = c(1, 0.01, 0.5, 0.1, 0.5, 1, 1,
                 0.5, 0.1, 1, 0.01, 0.5, 1, 1,
                 1, 0.1, 0.1, 0.001, 0.5, 0.5, 0.5)
rho.windows = c(0.05, 0.001, 0.001)
sigma.window = 250

res.betas = beta.mat = start.betas
res.rhos = rho.mat = start.rhos
res.sigma = sigma.mat = start.sigma
for (i in 1:nreps) {
  p = length(res.betas)
  q = length(res.rhos)
  # Iterate through all of the betas first
  for (j in 1:p) {
    prop.betas = res.betas
    prop.betas[j] = res.betas[j] + (runif(1) - 0.5) * 2 * beta.windows[j]
    num = lpost(train.resp, train.data, res.rhos, res.sigma, prop.betas)
    denom = lpost(train.resp, train.data, res.rhos, res.sigma, res.betas)
    if (log(runif(1)) < num - denom) {
      res.betas = prop.betas
    }
  }
  # Now do all of the rhos
  for (k in 1:q) {
    prop.rhos = res.rhos
    prop.rhos[k] = res.rhos[k] + (runif(1) - 0.5) * 2 * rho.windows[k]
    num = lpost(train.resp, train.data, prop.rhos, res.sigma, res.betas)
    denom = lpost(train.resp, train.data, res.rhos, res.sigma, res.betas)
    if (log(runif(1)) < num - denom) {
      res.rhos = prop.rhos
    }
  }
  # Finally the sigma; rWishart returns a 3x3x1 array, so flatten it.
  # (NB: the Wishart proposal is not symmetric, so a full
  # Metropolis-Hastings correction would also include the proposal
  # densities; that correction is omitted here.)
  prop.sigma = as.vector(rWishart(1, sigma.window,
                                  matrix(res.sigma, ncol = 3) / sigma.window)[, , 1])
  num = lpost(train.resp, train.data, res.rhos, prop.sigma, res.betas)
  denom = lpost(train.resp, train.data, res.rhos, res.sigma, res.betas)
  if (log(runif(1)) < num - denom) {
    res.sigma = prop.sigma
  }
  # Record the state at the end of this iteration
  beta.mat = rbind(beta.mat, res.betas)
  rho.mat = rbind(rho.mat, res.rhos)
  sigma.mat = rbind(sigma.mat, res.sigma)
  if (i %% 100 == 0) print(paste("Iteration", i, "done"))
}

colnames(beta.mat) = c(paste("beta", 0:6, rep(1, 7), sep = ""),
                       paste("beta", 0:6, rep(2, 7), sep = ""),
                       paste("beta", 0:6, rep(3, 7), sep = ""))
colnames(rho.mat) = c("p1", "p2", "p3")
colnames(sigma.mat) = c("11", "12", "13", "21", "22", "23", "31", "32", "33")
```

I ran the following diagnostics after each full run of the algorithm. I needed to make sure that the acceptance rates for all 25 unknown parameters weren’t too small, ideally above 30%. Because I had so many iterations, I felt I could tolerate higher-than-typical acceptance rates.

After checking the acceptance rates, I also examined the trace plots of each coefficient to see whether they converged stably. I later found that some actually didn’t converge.

```
# Check the fraction of unique values for each parameter
# (a proxy for the acceptance rate)
print("Beta uniqueness:")
matrix(apply(beta.mat, 2, function(x) length(unique(x)) / nreps), ncol = 3)
print("Rho uniqueness:")
matrix(apply(rho.mat, 2, function(x) length(unique(x)) / nreps), ncol = 3)
print("Sigma uniqueness:")
sigs = as_tibble(sigma.mat) %>%
  transmute(
    vals = paste(`11`, `12`, `13`, `21`, `22`, `23`, `31`, `32`, `33`)
  )
round(length(unique(sigs$vals)) / length(sigs$vals), 4)

# Plot the traces
for (l in 1:ncol(beta.mat)) {
  plot(1:nreps, beta.mat[-1, l], type = "o", main = colnames(beta.mat)[l])
}
for (m in 1:ncol(rho.mat)) {
  plot(1:nreps, rho.mat[-1, m], type = "o", main = colnames(rho.mat)[m])
}
```

I used the first 5,000 iterations as my burn-in and treated the latter 5,000 as draws from the posterior distribution. For point estimates of the coefficients, I took the mean of each marginal posterior. With these coefficients in hand, I calculated predictions for the test dataset and looked at how the model performed.

```
# Discard the burn-in
beta.afterburn = beta.mat[5001:nrow(beta.mat), ]
rho.afterburn = rho.mat[5001:nrow(rho.mat), ]

# Calculate the posterior means for each coefficient
post.beta.means = apply(beta.afterburn, 2, mean)
post.rho.means = apply(rho.afterburn, 2, mean)
post.beta.mat = matrix(post.beta.means, ncol = 3)
post.rho.mat = post.rho.means * diag(1, 3)
post.mat = rbind(post.beta.mat, post.rho.mat)

# Calculate the training and test errors
train.fit = data.matrix(train.data) %*% post.mat
train.error = train.resp - train.fit
test.fit = data.matrix(test.data) %*% post.mat
test.error = test.resp - test.fit

# Mean squared error for each response
overall.lat.mse.train = mean(train.error[, 1]^2)
overall.lon.mse.train = mean(train.error[, 2]^2)
overall.ws.mse.train = mean(train.error[, 3]^2)
overall.lat.mse = mean(test.error[, 1]^2)
overall.lon.mse = mean(test.error[, 2]^2)
overall.ws.mse = mean(test.error[, 3]^2)
```

This model performed pretty well! This project was interesting because some hurricanes actually turn and twist, so I was curious to see how a linear model with autoregressive terms would recreate this. The code below helped me to visualize the hurricanes on the map.

```
# Looking at each hurricane
test.hurricanes = hurricanes %>%
  filter(!(id %in% train.names)) %>%
  dplyr::select(id)
hurricane.preds = test.data %>%
  dplyr::select(latitude, longitude, wspeed) %>%
  mutate(
    id = test.hurricanes$id,
    lat.pred = test.fit[, 1],
    lon.pred = test.fit[, 2],
    ws.pred = test.fit[, 3],
    lat.err = test.error[, 1],
    lon.err = test.error[, 2],
    ws.err = test.error[, 3]
  )

# Finding the best and worst predicted hurricanes
extremes = hurricane.preds %>%
  group_by(id) %>%
  summarise(
    n = n(),
    lat.mse = sum(lat.err^2) / n(),
    lon.mse = sum(lon.err^2) / n(),
    ws.mse = sum(ws.err^2) / n()
  )

# Plot the actual path (green) against the predicted path (red)
ggplot(data = hurricane.preds) +
  geom_polygon(data = map_data(map = 'world'),
               aes(x = long, y = lat, group = group)) +
  stat_summary_2d(data = subset(hurricane.preds, id == "ALBERTO.2000"),
                  aes(x = longitude, y = latitude, z = wspeed),
                  fun = median, binwidth = c(1, 1),
                  show.legend = TRUE, alpha = 0.75, color = "green") +
  stat_summary_2d(data = subset(hurricane.preds, id == "ALBERTO.2000"),
                  aes(x = lon.pred, y = lat.pred, z = ws.pred),
                  fun = median, binwidth = c(1, 1),
                  show.legend = TRUE, alpha = 0.75, color = "red")
```

Hurricane Alberto of 2000 was the longest lasting hurricane in my test dataset. Below is a view of the actual path of the hurricane (in green) and my model’s predictions (in red). I definitely wouldn’t use this version to track actual hurricanes, but it’s a good proof of concept at least.

The damages model was fine; the deaths model was… not so fine. The tables below lay out the estimated coefficients and their 95% quantile intervals.

Coefficient | Damage Est. | Damage 2.5% Quantile | Damage 97.5% Quantile
---|---|---|---
Intercept | -0.02800 | -1.9600 | 1.86000
Year | 0.02200 | -0.0855 | 0.11200
Nature | -0.03790 | -1.9200 | 1.91000
Max Speed | 0.23200 | -0.1650 | 0.65400
Max Pressure | -0.06450 | -0.2550 | 0.15000
Hours | -0.01230 | -0.1090 | 0.07700
Total Population | 0.00001 | 0.0000 | 0.00002
% Poor | -0.25400 | -0.6980 | 0.16200
% USA | 0.07580 | -0.1100 | 0.25800

Coefficient | Death Est. | Death 2.5% Quantile | Death 97.5% Quantile
---|---|---|---
Intercept | 0.0356 | -1.84000 | 2.05000
Year | 0.6200 | -0.68600 | 1.74000
Nature | -0.0135 | -2.03000 | 2.03000
Max Speed | 0.0571 | -1.90000 | 1.96000
Max Pressure | 0.2490 | -1.66000 | 2.07000
Hours | 0.1870 | -1.66000 | 2.18000
Total Population | -0.0008 | -0.00237 | 0.00092
% Poor | 0.2040 | -1.70000 | 2.17000
% USA | -0.0235 | -2.05000 | 1.93000
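The intervals above are simply the empirical 2.5% and 97.5% quantiles of each coefficient's post-burn-in chain. As a quick Python illustration (using simulated stand-in samples, not the actual chains) of how such an interval is computed and read:

```python
import numpy as np

rng = np.random.default_rng(2)
# Stand-in for one coefficient's post-burn-in Metropolis samples
samples = rng.normal(loc=0.2, scale=1.0, size=5000)

lo, hi = np.quantile(samples, [0.025, 0.975])
estimate = samples.mean()
# A coefficient is "significant" at the 5% level if its interval excludes zero
significant = bool(lo > 0 or hi < 0)
print(round(estimate, 3), round(lo, 3), round(hi, 3), significant)
```

A posterior mean near 0.2 with an interval spanning roughly (-1.8, 2.2) would read as a positive but insignificant effect, which is the pattern most of the coefficients above show.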

According to the point estimates, increases in year, max speed, total population, and percent USA are associated with higher damages, while nature, max pressure, hours, and percent poor are associated with lower damages. Judging by the 95% quantile intervals, however, nearly every damage coefficient’s interval contains zero, with total population the only one that comes close to excluding it. This lack of precision is unexpected, but I surmise the estimates suffer from the small sample size. The deaths model fared worse still: the estimated effects have mixed signs, and every interval comfortably contains zero, perhaps indicative of poor model specification.

I’d call the score 2-1 in terms of successes. The algorithm runs take forever, so I did my best to work within the limited time while still producing a good model.

This project was easily the hardest one I’ve ever undertaken across my undergrad and Master’s courses. Of the 2 weeks we had to finish it, about 10 days were dedicated to reading and research to better understand the processes involved. Although I was disappointed that my model for damages and deaths did not perform well, I remember the triumph of seeing my trajectory predictions follow the actual paths of the hurricanes.

This project also expanded my brain. In my biostatistics classes, we always have examples from health fields, so it blew my mind to see statistics in other contexts.