Photo by Markus Spiske on Unsplash

Problem

Can we recreate glmnet’s implementation of logistic-LASSO?

Applied Skills

Optimization, Algorithms

Introduction

The diagnosis of breast cancer at an early stage is an important goal of screening. While many different screening tests exist today, there is still room for improvement. A promising avenue for breast cancer detection is prediction based on breast cancer tissue images. For example, a study by Kim et al. demonstrated that regression models, namely the logistic-LASSO and stepwise logistic regression, could be helpful in diagnosing breast cancer.

This report seeks to validate their findings and use regression methods to predict breast cancer status from image data. The two techniques of interest are a logistic model fit with Newton-Raphson optimization and a logistic-LASSO fit with coordinate descent. This project was a group effort for Topics in Advanced Statistical Computing; I was in charge of implementing the logistic-LASSO.

Data Cleaning

Our data set contains 569 observations and 33 columns. The response is a binary variable, diagnosis, which takes the value M for malignant tissue or B for benign tissue; we code malignant as 1. There are 30 potential predictors derived from the image data, including the mean, standard deviation, and largest values of various features in the images. Notable features include cell nuclei radius, texture (derived from gray-scale values), and concavity.

LASSO is sensitive to the scale of the predictors, so the data were standardized. Many of the columns are highly correlated with one another, so we kept a subset of predictors such that no two retained columns had a pairwise correlation above 0.7.
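
As a rough sketch of that cleaning step (the file name is hypothetical and the caret::findCorrelation() call is one convenient stand-in for our actual filtering script, not the exact code we used):

library(tidyverse)
library(caret)

# Assumed raw data with the diagnosis column (M/B); the file name is hypothetical
bc = read_csv("breast_cancer_data.csv")

# Code malignant tissue as 1 and benign tissue as 0
y = as.numeric(bc$diagnosis == "M")

# Standardize the image-derived predictors, since LASSO is scale-sensitive
# (non-predictor columns, e.g. an id, would also be dropped here)
X = scale(select(bc, -diagnosis))

# Drop columns until no remaining pair has |correlation| above 0.7
drop.idx = findCorrelation(cor(X), cutoff = 0.7)
X = X[, -drop.idx]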

Proposed Model

The Logistic-LASSO

For this model, we sought to minimize the objective function, derived from the quadratic Taylor approximation to the binomial log-likelihood function:

\[\mathop{min}_{(\beta_0, \mathbf{\beta_1})} \bigg( \frac{1}{2n} \sum^n_{i=1} \omega_i(z_i -\beta_0 - \mathbf{x}^T_i\mathbf{\beta_1})^2 + \lambda \sum^p_{j=1}|\beta_j| \bigg)\]

\(\omega_i\) is the working weight, \(z_i\) is the working response, \(\beta_0\) is the intercept and \(\mathbf{\beta_1}\) is the set of \(\beta\) coefficients.
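
Concretely, with current estimates \((\tilde{\beta_0}, \tilde{\mathbf{\beta_1}})\), the working quantities take the usual IRLS form; these are exactly what the helper functions calc.cur.p, calc.working.weights and calc.working.resp below compute:

\[\tilde{p}_i = \frac{1}{1 + e^{-(\tilde{\beta_0} + \mathbf{x}^T_i\tilde{\mathbf{\beta_1}})}}, \qquad \omega_i = \tilde{p}_i(1 - \tilde{p}_i), \qquad z_i = \tilde{\beta_0} + \mathbf{x}^T_i\tilde{\mathbf{\beta_1}} + \frac{y_i - \tilde{p}_i}{\omega_i}\]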

This objective function was minimized using coordinate descent, and the intercept term was not penalized. Each \(\beta_j\) was updated in turn using the following equation:

\[\tilde{\beta_j} = \frac{S\big(\sum_i\omega_ix_{i,j}(z_i - \tilde{z}_i^{(-j)}), \gamma\big)}{\sum_i\omega_ix^2_{i,j}}\]

where \(S\) is the soft-threshold function, \(\tilde{z}_i^{(-j)}\) is the fitted working response with the contribution of \(\beta_j\) left out, and \(\gamma\) is the threshold, determined by the penalty \(\lambda\), applied to every penalized coefficient.

Our Logistic-LASSO defines convergence as the point when the Frobenius norm of the difference between the current \(\beta\) and the \(\beta\) from the previous iteration drops below \(10^{-5}\). A vector with every entry set to 0.001 is used to initialize the algorithm.

Helper functions

soft = function(beta, gamma) {
  ### Parameters:
  # beta : the original coefficient beta from a regression
  # gamma : the desired threshold to limit the betas at
  
  # returns a single adjusted value of the original beta
  
  return(sign(beta) * max(abs(beta) - gamma, 0))
}

calc.cur.p = function(data, betas) {
  ### Parameters:
  # data : the design matrix, including the intercept column (n x p matrix)
  # betas : all the beta coefficients, including the intercept (p x 1 array)
  
  # returns n x 1 array of current probabilities evaluated with the given betas
  
  u = data %*% betas
  return(exp(u) / (1 + exp(u)))
}

calc.working.weights = function(p) {
  ### Parameters:
  # p : the working probabilities, calculated by calc.cur.p
  
  # return n x 1 array of working weights for the data
  
  # Check for coefficient divergence, adjust for fitted probabilities 0 & 1
  close.to.1  = (1 - p) < 1e-5
  close.to.0 = p < 1e-5
  w = p * (1 - p)
  w[which(close.to.1)] = 1e-5
  w[which(close.to.0)] = 1e-5
  
  return(w)
}

calc.working.resp = function(data, resp, betas) {
  ### Parameters: 
  # data : the design matrix, including the intercept column (n x p matrix)
  # resp : the response variable of the dataset (n x 1 array)
  # betas : all the beta coefficients, including the intercept (p x 1 array)
  
  # returns n x 1 array of working responses evaluated with the given betas
  p = calc.cur.p(data, betas)
  w = calc.working.weights(p)
  return((data %*% betas) + ((resp - p) / w))
}

calc.obj = function(betas, w, z, data, lambda) {
  ### Parameters:
  # betas : all the beta coefficients, including the intercept (p x 1 array)
  # w : working weights, calculated by calc.working.weights (n x 1 array)
  # z : working responses, calculated by calc.working.resp (n x 1 array)
  # data : the design matrix, including the intercept column (n x p matrix)
  # lambda : the LASSO penalty parameter

  # returns the penalized weighted least squares objective for the given betas
  LS = (2 * nrow(data))^(-1) * sum(w * (z - (data %*% betas))^2)
  beta.penalty = lambda * sum(abs(betas))
  return(LS + beta.penalty)
}

compile = function(data, resp, betas) {
  # Helper that bundles the working probabilities, weights, and responses for the given betas
  p = calc.cur.p(data, betas)
  w = calc.working.weights(p)
  z = calc.working.resp(data, resp, betas)
  return(list(
    p = p,
    w = w,
    z = z
  ))
}

calc.beta.norm = function(beta1, beta2) {
  ### Parameters:
  # beta1, beta2 : beta vectors to compare
  
  # returns the Frobenius norm between two beta vectors
  return(norm(as.matrix(beta1 - beta2), "F"))
}
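
As a quick illustration of how the soft threshold produces sparsity (the numbers are arbitrary):

soft(0.8, 0.5)    # returns  0.3 : shrunk toward zero by the threshold
soft(-0.8, 0.5)   # returns -0.3 : the sign is preserved
soft(0.3, 0.5)    # returns  0.0 : coefficients below the threshold are zeroed out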

The Algorithm

LogLASSO.CD = function(X, y, beta, lambda, tol = 1e-10, maxiter = 1000) {
  ### Parameters:
  # X : design matrix, does include an intercept column                                                 
  # y : response variable (should be binary)                          
  # beta : initial values for the beta coefficients
  # lambda : constraining parameter for LASSO penalization            
  # tol : how precise should our convergence be                       
  # maxiter : how many iterations should be performed before stopping 
  
  # return a list containing the matrix of the coefficients and the iteration matrix
  
  # Convert data into matrix for easier manipulation
  X = as.matrix(X)
  names(beta) = colnames(X) # Assign original covariate names to the betas

  # Iteration setup
  j = 0
  work = compile(X, y, beta)
  obj = calc.obj(beta, work$w, work$z, X, lambda)
  diff.obj = diff.beta = Inf
  path = c(iter = j, obj = obj, beta)
  while (j < maxiter && diff.obj > tol) {
    
    j = j + 1
    prev.obj = obj
    prev.beta = beta
    
    # Update the unpenalized intercept using the current working weights and responses,
    # leaving its own contribution out of the partial residual
    work = compile(X, y, beta)
    beta[1] = sum(work$w * (work$z - X[,-1] %*% beta[-1])) / sum(work$w)
    
    # Coordinate descent through all of the penalized betas
    for (k in 2:length(beta)) {
      work = compile(X, y, beta)
      z.rm.k = X[,-k] %*% beta[-k]   # fitted working response with beta_k left out
      val = sum(work$w * X[,k] * (work$z - z.rm.k))
      beta[k] = (1 / sum(work$w * X[,k]^2)) * soft(val, lambda)
    }
    
    # Recalculate the objective
    work = compile(X, y, beta)
    obj = calc.obj(beta, work$w, work$z, X, lambda)
    
    # Convergence check: Frobenius norm of the change in the betas
    # (abs(prev.obj - obj) could be used instead to track the objective directly)
    diff.obj = norm(as.matrix(beta - prev.beta), "F")
    
    # Append it to tracking matrix
    path = rbind(path, c(iter = j, obj = obj, beta))
    } 
  
  return(list(
    path = tibble::as_tibble(path),  # as.tibble() is deprecated; requires the tibble package
    coefficients = beta,
    iter = j,
    obj = obj)
    )
  }
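
A single fit might then look like the following sketch, where X and y are the standardized predictors and the 0/1 response from the cleaning step, and the \(\lambda\) value is arbitrary:

# Prepend an intercept column and start every coefficient at 0.001
X.design = cbind(Intercept = 1, X)
beta.start = rep(0.001, ncol(X.design))

fit = LogLASSO.CD(X.design, y, beta.start, lambda = 0.05, tol = 1e-5)
fit$coefficients   # penalized estimates at this lambda
fit$iter           # number of coordinate descent sweeps performed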

Assessing predictive ability

5-fold cross-validation will be used to find an optimal \(\lambda\) value for the Logistic-LASSO model. The optimal \(\lambda\) is defined as the value that minimizes the average test MSE across the candidate \(\lambda\) values.
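
A bare-bones version of that search is sketched below; the \(\lambda\) grid is arbitrary, and the squared error between the held-out 0/1 labels and the predicted probabilities serves as the test MSE:

set.seed(1)
lambdas = exp(seq(log(1e-4), log(1), length.out = 30))
folds = sample(rep(1:5, length.out = nrow(X.design)))

cv.mse = sapply(lambdas, function(l) {
  fold.mse = sapply(1:5, function(f) {
    train = folds != f
    fit = LogLASSO.CD(X.design[train, ], y[train],
                      rep(0.001, ncol(X.design)), lambda = l)
    p.hat = calc.cur.p(X.design[!train, ], fit$coefficients)
    mean((y[!train] - p.hat)^2)
  })
  mean(fold.mse)
})

best.lambda = lambdas[which.min(cv.mse)]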

Findings

The estimated coefficients for each model are listed in Table 1. The estimates from the Newton-Raphson algorithm match those from the \(glm\) implementation of logistic regression. The estimates from the Logistic-LASSO closely match the \(glmnet\) implementation. Newton-Raphson reaches convergence much faster than the Logistic-LASSO.

Table 1: estimated model coefficients

Figure 1 illustrates the path of solutions as \(\lambda\) increases. The constant line represents the intercept, which we chose not to penalize in our implementation.

Figure 1: Coefficient paths by \(\lambda\) values
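
The coefficient paths in Figure 1 can be traced by refitting the model over the same grid of \(\lambda\)'s used above and stacking the coefficient vectors; reusing each solution as the warm start for the next \(\lambda\), as sketched below, keeps the sweep relatively cheap:

path.betas = matrix(NA, nrow = length(lambdas), ncol = ncol(X.design))
beta.warm = rep(0.001, ncol(X.design))

for (i in seq_along(lambdas)) {
  fit = LogLASSO.CD(X.design, y, beta.warm, lambda = lambdas[i])
  path.betas[i, ] = fit$coefficients
  beta.warm = fit$coefficients   # warm start for the next lambda
}

matplot(log(lambdas), path.betas, type = "l", lty = 1,
        xlab = expression(log(lambda)), ylab = "Coefficient estimate")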

Conclusion

This report sought to explore and compare how two different models are able to predict cancer malignancy given various image data. The optimization of these models requires the use of the Newton-Raphson and coordinate descent algorithms in a logistic regression context. We used cross validation to optimize the regularization parameter \(\lambda\) and to judge the predictive ability of both models. In the end, Logistic-LASSO produced a model that was more effective at predicting cancer malignancy in the data. These types of models have promise in improving healthcare through improved diagnostics.

Personal Takeaways

I definitely learned the pains of having to implement the algorithm by hand here. I only implemented the base parts of the algorithm and not any of the optimizations to speed it up. I have a newfound appreciation for professionally vetted code.
