
Problem

Can we recreate glmnet’s implementation of logistic-LASSO?

Applied Skills

Optimization, Algorithms

Introduction

The diagnosis of breast cancer at an early stage is an important goal of screening. While many different screening tests exist today, there is still room for improvement. A promising avenue for breast cancer detection is prediction based on breast tissue images. For example, a study by Kim et al. demonstrated that regression models, namely the Logistic-LASSO and stepwise logistic regression, could be helpful in diagnosing breast cancer.

This report seeks to validate their findings and use regression methods to predict breast cancer status from image data. The two techniques of interest will be a logistic model using the Newton-Raphson optimization and another using coordinate descent in a LASSO context. This project was a group effort for Topics in Advanced Statistical Computing. I was in charge of implementing logistic-LASSO.

Data Cleaning

Our data set contains 569 observations and 33 columns. The response is a binary variable, diagnosis, that takes the value M for malignant tissue or B for benign tissue; we code "malignant" as 1. There are 30 potential predictors derived from the image data, including the mean, standard deviation, and largest values of various image features. Notable features in the data set include cell nucleus radius, texture (derived from gray-scale values), and concavity.

LASSO is sensitive to the scale of the predictors, so the data was standardized. Many of the columns are highly correlated with each other, so we chose a subset of predictors such that no two retained columns had an absolute correlation above 0.7.
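As an illustration, a greedy filter of this kind can be sketched as follows. This is a Python/NumPy sketch rather than our actual R code; the function name and the exact tie-breaking (keep the earlier column) are my own choices.

```python
import numpy as np

def drop_correlated(X, threshold=0.7):
    """Greedily keep columns of X, skipping any column whose absolute
    correlation with an already-kept column exceeds the threshold."""
    corr = np.abs(np.corrcoef(X, rowvar=False))
    keep = []
    for j in range(X.shape[1]):
        # Keep column j only if it is not too correlated with kept columns.
        if all(corr[j, k] <= threshold for k in keep):
            keep.append(j)
    return keep
```

The returned indices can then be used to subset the standardized design matrix before fitting.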

Proposed Model

The Logistic-LASSO

For this model, we sought to minimize the following objective function, derived from the quadratic Taylor approximation to the binomial log-likelihood:

$\mathop{min}_{(\beta_0, \mathbf{\beta_1})} \bigg( \frac{1}{2n} \sum^n_{i=1} \omega_i(z_i -\beta_0 - \mathbf{x}^T_i\mathbf{\beta_1})^2 + \lambda \sum^p_{j=1}|\beta_j| \bigg)$

Here $$\omega_i$$ is the working weight, $$z_i$$ is the working response, $$\beta_0$$ is the intercept, and $$\mathbf{\beta_1}$$ is the vector of slope coefficients.
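For completeness, in the standard derivation (as in Friedman, Hastie, and Tibshirani's glmnet paper), the working weight and working response at the current estimates are

$\omega_i = \tilde{p}_i(1 - \tilde{p}_i), \qquad z_i = \tilde{\beta}_0 + \mathbf{x}^T_i\tilde{\mathbf{\beta}}_1 + \frac{y_i - \tilde{p}_i}{\tilde{p}_i(1 - \tilde{p}_i)}$

where $$\tilde{p}_i$$ is the currently estimated probability that observation $$i$$ is malignant.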

This objective function was minimized using coordinate descent, leaving the intercept unpenalized. Each $$\beta_j$$ was optimized using the following update:

$\tilde{\beta_j} = \frac{S\big(\sum_i\omega_ix_{i,j}(z_i - \tilde{y}_i^{(-j)}),\ \lambda\big)}{\sum_i\omega_ix^2_{i,j}}$

where $$S$$ is the soft-thresholding function, $$\tilde{y}_i^{(-j)}$$ is the fitted value for observation $$i$$ computed without the contribution of predictor $$j$$, and $$\lambda$$ is the threshold applied to every coefficient except the intercept.
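To make the update concrete, here is a minimal sketch of one coordinate-descent sweep in Python/NumPy. Our actual implementation was in R; the function names and the explicit $$1/n$$ normalization (which matches the objective above and only rescales $$\lambda$$) are my own.

```python
import numpy as np

def soft_threshold(rho, lam):
    """Soft-thresholding operator: S(rho, lam) = sign(rho) * max(|rho| - lam, 0)."""
    return np.sign(rho) * np.maximum(np.abs(rho) - lam, 0.0)

def coordinate_descent_pass(X, z, w, beta0, beta, lam):
    """One full sweep of weighted coordinate descent on the penalized
    weighted least-squares objective.
    X: (n, p) standardized predictors, z: working response, w: working weights."""
    n, p = X.shape
    # Unpenalized intercept: weighted average of the residual without it.
    beta0 = np.sum(w * (z - X @ beta)) / np.sum(w)
    for j in range(p):
        # Partial residual excluding predictor j's current contribution.
        r_j = z - beta0 - X @ beta + X[:, j] * beta[j]
        num = soft_threshold(np.sum(w * X[:, j] * r_j) / n, lam)
        beta[j] = num / (np.sum(w * X[:, j] ** 2) / n)
    return beta0, beta
```

Sweeps are repeated, recomputing the working weights and response between them, until the coefficients stop changing.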

Our Logistic-LASSO defined convergence to occur when the Frobenius norm of the difference between the calculated $$\beta$$ and the $$\beta$$ from the previous iteration drops below $$10^{-5}$$. The algorithm is initialized with every coefficient set to 0.001.
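In code, that stopping rule amounts to the following (a Python sketch; the tolerance matches the report):

```python
import numpy as np

def converged(beta_new, beta_old, tol=1e-5):
    """Declare convergence when the Frobenius (Euclidean) norm of the
    change in the coefficient vector drops below tol."""
    return np.linalg.norm(beta_new - beta_old) < tol
```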

Assessing predictive ability

5-fold cross validation will be used to find an optimal $$\lambda$$ value for the Logistic-LASSO model. The optimal $$\lambda$$ will be defined as the candidate value that minimizes the test MSE averaged across the five folds.
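The tuning loop can be sketched generically as follows (a Python sketch; `fit` and `predict` stand in for whatever model is being tuned, and the fold-splitting scheme is my own):

```python
import numpy as np

def kfold_indices(n, k=5, seed=1):
    """Shuffle 0..n-1 and split into k roughly equal folds of test indices."""
    rng = np.random.default_rng(seed)
    return np.array_split(rng.permutation(n), k)

def cv_mse(X, y, fit, predict, lambdas, k=5):
    """Return the held-out MSE, averaged over k folds, for each candidate lambda."""
    folds = kfold_indices(len(y), k)
    avg_mse = []
    for lam in lambdas:
        fold_mse = []
        for test_idx in folds:
            train = np.setdiff1d(np.arange(len(y)), test_idx)
            model = fit(X[train], y[train], lam)
            resid = y[test_idx] - predict(model, X[test_idx])
            fold_mse.append(np.mean(resid ** 2))
        avg_mse.append(np.mean(fold_mse))
    return np.array(avg_mse)
```

The chosen $$\lambda$$ is then `lambdas[np.argmin(cv_mse(...))]`; any model exposing a fit/predict pair can be plugged in.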

Findings

The estimated coefficients for each model are listed in Table 1. The estimates from our Newton-Raphson algorithm match those from the $$glm$$ implementation of logistic regression, and the estimates from our Logistic-LASSO closely match the $$glmnet$$ implementation. Newton-Raphson reached convergence much faster than the Logistic-LASSO.

Table 1: Estimated model coefficients

Figure 1 illustrates the solution paths as $$\lambda$$ increases. The constant line represents the intercept, which we chose not to penalize in our implementation.

Figure 1: Coefficient paths by $$\lambda$$ value

Conclusion

This report sought to explore and compare how two different models are able to predict cancer malignancy from image data. Optimizing these models required the Newton-Raphson and coordinate descent algorithms in a logistic regression context. We used cross validation to tune the regularization parameter $$\lambda$$ and to judge the predictive ability of both models. In the end, the Logistic-LASSO produced a model that was more effective at predicting cancer malignancy in the data. These types of models have promise in improving healthcare through improved diagnostics.

Personal Takeaways

I definitely learned the pains of having to implement the algorithm by hand here. I only implemented the base parts of the algorithm and not any of the optimizations to speed it up. I have a newfound appreciation for professionally vetted code.

Copyright © 2019 Christian B. Pascual. All rights reserved. This site is hosted by Github Pages and is built on R Markdown.