Can we recreate `glmnet`'s implementation of logistic-LASSO?

Optimization, Algorithms

The diagnosis of breast cancer at an early stage is an important goal of screening. While many different screening tests exist today, there is still room for improvement. A promising avenue for breast cancer detection is prediction based on breast cancer tissue images. For example, a study by Kim et al. demonstrated that regression models, specifically logistic-LASSO and stepwise logistic regression, could be helpful in diagnosing breast cancer.

This report seeks to validate their findings and use regression methods to predict breast cancer status from image data. The two techniques of interest are a logistic model fit with the Newton-Raphson algorithm and a logistic-LASSO model fit with coordinate descent. This project was a group effort for Topics in Advanced Statistical Computing; I was in charge of implementing the logistic-LASSO.

Our data set contains 569 observations and 33 columns. The response is a binary variable, diagnosis, that takes either M for malignant tissue or B for benign tissue; we code malignant as 1. There are 30 potential predictors derived from the image data, including the mean, standard deviation, and largest values of various features in the images. Notable features include cell nuclei radius, texture (derived from gray-scale values), and concavity.

LASSO is sensitive to the scale of the predictors, so the data was standardized. Many of the columns in the data are highly correlated with each other, so we chose a subset such that no two columns had a correlation higher than 0.7.

```
library(tidyverse)
library(glmnet)
library(modelr)
library(matrixcalc)

set.seed(8160)

# Standardize a column to mean 0, standard deviation 1
standardize = function(col) {
  (col - mean(col)) / sd(col)
}

# Standardize only the covariates, not the response
raw = read.csv(file = "breast-cancer-1.csv")
resp = raw %>%
  dplyr::select(diagnosis) %>%
  mutate(diagnosis = ifelse(diagnosis == "M", 1, 0))
standardized.data = raw %>%
  dplyr::select(radius_mean:fractal_dimension_worst) %>%
  map_df(.x = ., standardize)
data = cbind(resp, standardized.data)
y = as.matrix(data$diagnosis)

# Remove columns that have high multicollinearity with the others
X = data %>% select(-diagnosis, -perimeter_mean, -area_mean, -concave.points_mean,
                    -radius_worst, -area_se, -perimeter_worst, -area_worst,
                    -concave.points_worst, -texture_worst, -smoothness_worst,
                    -compactness_se, -compactness_mean, -compactness_worst,
                    -concavity_worst, -fractal_dimension_worst, -perimeter_se,
                    -concave.points_se, -fractal_dimension_se, -symmetry_worst)
cor.matrix = cor(X)

# Add back an intercept column for the matrix calculations
X = X %>% mutate(intercept = 1) %>%
  dplyr::select(intercept, everything())
```

For this model, we sought to minimize the objective function, derived from the quadratic Taylor approximation to the binomial log-likelihood function:

\[\mathop{min}_{(\beta_0, \mathbf{\beta_1})} \bigg( \frac{1}{2n} \sum^n_{i=1} \omega_i(z_i -\beta_0 - \mathbf{x}^T_i\mathbf{\beta_1})^2 + \lambda \sum^p_{j=1}|\beta_j| \bigg)\]

\(\omega_i\) is the working weight, \(z_i\) is the working response, \(\beta_0\) is the intercept and \(\mathbf{\beta_1}\) is the set of \(\beta\) coefficients.
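For reference, these working quantities follow the usual iteratively reweighted least squares (IRLS) construction. With current estimates \((\tilde{\beta}_0, \tilde{\mathbf{\beta}}_1)\),

\[\tilde{p}_i = \frac{e^{\tilde{\beta}_0 + \mathbf{x}_i^T\tilde{\mathbf{\beta}}_1}}{1 + e^{\tilde{\beta}_0 + \mathbf{x}_i^T\tilde{\mathbf{\beta}}_1}}, \qquad \omega_i = \tilde{p}_i(1 - \tilde{p}_i), \qquad z_i = \tilde{\beta}_0 + \mathbf{x}_i^T\tilde{\mathbf{\beta}}_1 + \frac{y_i - \tilde{p}_i}{\omega_i}\]

These are exactly the quantities computed by `calc.cur.p`, `calc.working.weights`, and `calc.working.resp` in the code below.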

This objective function was minimized using coordinate descent. The intercept term was not penalized. Each \(\beta_j\) was updated using the following equation:

\[\tilde{\beta_j} = \frac{S\big(\sum_i\omega_ix_{i,j}(z_i - \tilde{z}_i^{(-j)}),\ \gamma\big)}{\sum_i\omega_ix^2_{i,j}}\]

where \(S\) is the soft-threshold function, \(\tilde{z}_i^{(-j)}\) is the fitted working response computed without the contribution of \(\beta_j\), and \(\gamma\) is the threshold applied to all of the penalized coefficients.
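As a quick standalone sanity check (toy numbers, not from the data set), the scalar soft-threshold operator shrinks a coefficient toward zero by \(\gamma\) and sets it to exactly zero once \(|\beta| \le \gamma\):

```
# Standalone soft-threshold: shrink beta toward 0 by gamma,
# returning exactly 0 when |beta| <= gamma
soft = function(beta, gamma) {
  sign(beta) * max(abs(beta) - gamma, 0)
}
soft(1.5, 1)   # 0.5 : shrunk toward zero
soft(-1.5, 1)  # -0.5 : sign is preserved
soft(0.3, 1)   # 0 : inside the threshold, zeroed out exactly
```

This zeroing-out behavior is what lets LASSO perform variable selection.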

Our logistic-LASSO was considered to have converged when the Frobenius norm between the calculated \(\beta\) and the \(\beta\) from the previous iteration dropped below \(10^{-5}\). A constant vector of 0.001s was used to initialize the algorithm.

```
soft = function(beta, gamma) {
  ### Parameters:
  # beta  : the original coefficient beta from a regression
  # gamma : the desired threshold to limit the betas at
  # returns a single adjusted value of the original beta
  return(sign(beta) * max(abs(beta) - gamma, 0))
}

calc.cur.p = function(data, betas) {
  ### Parameters:
  # data  : the design matrix, including the intercept column (n x p matrix)
  # betas : all the beta coefficients, including the intercept (p x 1 array)
  # returns an n x 1 array of current probabilities evaluated at the given betas
  u = data %*% betas
  return(exp(u) / (1 + exp(u)))
}

calc.working.weights = function(p) {
  ### Parameters:
  # p : the working probabilities, calculated by calc.cur.p
  # returns an n x 1 array of working weights for the data
  # Guard against coefficient divergence: floor the weights when the
  # fitted probabilities approach 0 or 1
  close.to.1 = (1 - p) < 1e-5
  close.to.0 = p < 1e-5
  w = p * (1 - p)
  w[which(close.to.1)] = 1e-5
  w[which(close.to.0)] = 1e-5
  return(w)
}

calc.working.resp = function(data, resp, betas) {
  ### Parameters:
  # data  : the design matrix, including the intercept column (n x p matrix)
  # resp  : the response variable of the data set (n x 1 array)
  # betas : all the beta coefficients, including the intercept (p x 1 array)
  # returns an n x 1 array of working responses evaluated at the given betas
  p = calc.cur.p(data, betas)
  w = calc.working.weights(p)
  return((data %*% betas) + ((resp - p) / w))
}

calc.obj = function(betas, w, z, data, lambda) {
  ### Parameters:
  # betas  : all the beta coefficients (p x 1 array)
  # w      : the working weights
  # z      : the working responses
  # data   : the design matrix (n x p matrix)
  # lambda : the LASSO penalty parameter
  # returns the value of the penalized weighted least-squares objective
  LS = (2 * nrow(data))^(-1) * sum(w * (z - (data %*% betas))^2)
  beta.penalty = lambda * sum(abs(betas))
  return(LS + beta.penalty)
}

compile = function(data, resp, betas) {
  # Helper that bundles all the working quantities for coordinate descent
  p = calc.cur.p(data, betas)
  w = calc.working.weights(p)
  z = calc.working.resp(data, resp, betas)
  return(list(p = p, w = w, z = z))
}

calc.beta.norm = function(beta1, beta2) {
  ### Parameters:
  # beta1, beta2 : beta vectors to compare
  # returns the Frobenius norm of the difference between the two vectors
  return(norm(as.matrix(beta1 - beta2), "F"))
}
```

```
LogLASSO.CD = function(X, y, beta, lambda, tol = 1e-10, maxiter = 1000) {
  ### Parameters:
  # X       : design matrix, including an intercept column
  # y       : response variable (should be binary)
  # beta    : initial beta coefficients to start from
  # lambda  : penalty parameter for LASSO
  # tol     : convergence tolerance
  # maxiter : maximum number of iterations before stopping
  # returns a list with the coefficients, the iteration path, the iteration
  # count, and the final objective value
  X = as.matrix(X)
  names(beta) = colnames(X) # Assign the covariate names to the betas
  # Iteration setup
  j = 0
  work = compile(X, y, beta)
  obj = calc.obj(beta, work$w, work$z, X, lambda)
  diff.obj = Inf
  path = c(iter = j, obj = obj, beta)
  # Unpenalized update for the intercept
  beta[1] = sum(work$w * (work$z - X %*% beta)) / sum(work$w)
  while (j < maxiter && diff.obj > tol) {
    j = j + 1
    prev.beta = beta
    # Coordinate descent through all of the non-intercept betas
    for (k in 2:length(beta)) {
      work = compile(X, y, beta)
      z.rm.k = X[,-k] %*% beta[-k]
      val = sum(work$w * X[,k] * (work$z - z.rm.k))
      beta[k] = (1 / sum(work$w * X[,k]^2)) * soft(val, lambda)
    }
    # Recalculate the objective
    work = compile(X, y, beta)
    obj = calc.obj(beta, work$w, work$z, X, lambda)
    # Convergence check: Frobenius norm of the change in beta
    diff.obj = norm(as.matrix(beta - prev.beta), "F")
    # Append to the tracking matrix
    path = rbind(path, c(iter = j, obj = obj, beta))
  }
  return(list(
    path = as_tibble(path),
    coefficients = beta,
    iter = j,
    obj = obj
  ))
}
```

5-fold cross validation was used to find an optimal \(\lambda\) for the logistic-LASSO model. The optimal \(\lambda\) is defined as the value, among those tested, that minimizes the average test MSE across the five folds.
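The CV code itself is not shown here, but the loop structure can be sketched as below on simulated data. The fitting step is a crude stand-in (an unpenalized `glm()` fit followed by one-shot soft-thresholding of the slopes, so that \(\lambda\) actually affects the predictions); in our project this step would instead be a call to `LogLASSO.CD`. The names `lambdas`, `cv.mse`, and `best.lambda` are illustrative.

```
# A minimal, self-contained sketch of a 5-fold CV grid search over lambda
set.seed(8160)
n = 100; p = 3
X = matrix(rnorm(n * p), n, p)
y = rbinom(n, 1, plogis(X %*% c(1, -1, 0.5)))

folds = sample(rep(1:5, length.out = n))  # random fold labels 1..5
lambdas = c(0.01, 0.1, 0.5)

cv.mse = sapply(lambdas, function(lambda) {
  fold.mse = sapply(1:5, function(f) {
    train = folds != f
    # Stand-in fit: unpenalized logistic regression, then one-shot
    # soft-thresholding of the slopes (LogLASSO.CD would go here)
    fit = glm(y[train] ~ X[train, , drop = FALSE], family = binomial)
    b = coef(fit)
    b[-1] = sign(b[-1]) * pmax(abs(b[-1]) - lambda, 0)
    p.hat = plogis(cbind(1, X[!train, , drop = FALSE]) %*% b)
    mean((y[!train] - p.hat)^2)  # test MSE on the held-out fold
  })
  mean(fold.mse)  # average test MSE across the five folds
})
best.lambda = lambdas[which.min(cv.mse)]
```

The key design point is that the fold assignment is fixed once up front, so every \(\lambda\) is evaluated on the same five train/test splits.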

The estimated coefficients for each model are listed in Table 1. The estimates for the Newton-Raphson algorithm match what is estimated by the `glm` implementation of logistic regression. The estimates for the logistic-LASSO match up closely against the implementation in `glmnet`. Newton-Raphson reaches convergence much faster than the logistic-LASSO.

Figure 1 illustrates a path of solutions along increasing \(\lambda\)’s. The constant line represents the intercept since we chose not to penalize it in our implementation.

This report sought to explore and compare how two different models predict cancer malignancy from image data. Optimizing these models required the Newton-Raphson and coordinate descent algorithms in a logistic regression context. We used cross validation to choose the regularization parameter \(\lambda\) and to judge the predictive ability of both models. In the end, the logistic-LASSO produced a model that was more effective at predicting cancer malignancy in these data. Models like these have promise for improving healthcare through better diagnostics.

I definitely learned the pains of having to implement the algorithm by hand here. I only implemented the base parts of the algorithm and not any of the optimizations to speed it up. I have a newfound appreciation for professionally vetted code.