What’s a good way to have someone play around with logistic regression?

Applied Skills



Colectomies are surgical procedures that remove all of part of your large intestine. These surgeries are performed for various reasons, ranging from bowel obstruction to colon cancer. Over 250,000 colectomies are performed each year in the United States alone, representing an estimated 10% of the total volume of general surgeries. Given the prolific nature of the surgery, the rate of post-operation complication is astounding: the average rate of complication approached 30% in the last 10 years. 1

Project Motivation

For our project, we want to investigate what factors contribute to increasing or decreasing the risk of post-operative complication using a dataset on colectomies performed from 2014 - 2016 from multiple hospitals in Michigan. Using this data, we will perform a regression analysis to figure out which variables have an impact on post-surgery complication.

Project Roadmap

We needed to figure out if prior research has been done for this question. If risk factors have already been established, then they theoretically should truly contribute to our regression. Next, we figured out what our outcome could be since “complication” after a surgery can be defined in so many ways. After a proper literature review, we had a list of candidate factors for the regression. Before applying the regression, we specifically looked at some of these factors to see if there was an association with colectomy complications. Finally, with all this preparation, we entered these variables into an automatic procedure to get our model.

While we did a literature review to choose our covariates, we felt it would be interesting if we could enable others to create a modelling application to recreate what we did. This application would enable interested users to use a subset of the data to create their own models. They may choose any variables they might find useful or that we may have not considered in our model.

Data Characterization

In addition to tidyverse, we created a set of helper functions and variables that we’ve stored in utils.R to save coding space. Our data will initially be stored in the colectomies variable.

In its raw form, the data has 10868 rows and 992 columns. Each row corresponds to a single colectomy and an incredible amount of information, including a bevy of laboratory, disease, surgery and patient data.

Despite the seemingly sheer size of the data, it has its limitations. A brief glance at the data showed us that many of the columns are useless for our analysis; many columns contain mostly or only missing data. Before we can start variable selection for our model, we need to tidy up the dataset.

Data Cleaning

To start the cleaning, we want to remove all columns that contain more than 50% missing values (either blank cells or NA). We chose 50% since we’ve seen this cutoff used in other regression studies. Further adding to the clutter, some columns actually contain duplicate information from others. We found that these duplicate columns always start with flg or e.

Our function, is_mostly_intact, helps us identify columns that are more than 50% missing or blank. prettify_names helps us mass-change the column names in the dataset in a succinct way, leaving variables that are more informative to code with. After this purge, our reduced dataset now stands at 233 columns.

Data Cleaning

Many of the columns in the data appear to be numerical, but are in fact, categorical. We performed the retyping with our catfactory function in utils.R.

The outcome of interest we will focus on is surgical site infection (SSI). After researching more into colectomies, we found that infection was the one of the most common types of complication. We considered mortality in the beginning, but given its rarity, we decided to drop it. We believe that focusing on this aspect of post-operation will allow us to narrow down the scope of our analysis while allowing for the greatest breadth of “complication”.

Surgical site infection (SSI) is contained in 3 particular columns (postop_ssi_super, postop_ssi_deep and postop_ssi_organspace) in the dataset, which we compile into one summary variable: any_ssi.

Despite the heavy data reduction, too many variables remain to know to just toss into a stepwise regression. We performed a literature review to see what relationships have been established.

Literature Review

Thankfully, several factors have been identified as risk factors for complications in colorectal surgery. Kirchoff established many risk factors in a 2010 paper on the subject. The paper found that age, gender, prior surgery, obesity, nutritional status and body weight loss were patient-related factors. Factors that were related to the surgery itself included: open access to abdominal cavity, blood loss, surgical approach switches and length of operating time. 2

Ko et. al found that certain diseases were associated with increased risk of death post-colectomy, including venous thromboembolism (VTE), sepsis, acute myocardial infarction, pneumonia, respiratory failure and shock. 3

In another paper, Tang et. al focused on risk factors associated with surgical site infection, our outcome of interest. Dr. Tang identified ASA score, blood transfusion, drainage use, and sex. 4

With these papers in mind, we know it would be best to include these variables. We selected a few of these variables to visualize their relationship with SSI in the cleaned dataset.


SSI and insurance status

One variable we thought would have a relationship to SSI was insurance status. We believed that patients with little to no coverage would be forced to go to less experienced hospitals and experience more SSI. To confirm or deny this belief, we looked at how SSIs were distributed by insurance staus.

In the first table, we can see that most SSIs are in patients with Medicare, Medicare + Medicare Supplemental Plan/Medigap Insurance, or the uninsured. However, these groups also represent the greatest number of patients, so we can’t solely rely on the absolute SSI counts to tell us anything useful. We must look at the percentages instead.

In the second plot, we can see the percentage of SSI occurences out of the total number of colectomies performed based on insurance type. There is generally no distinction between the insurance types, with most of the percentages hovering between 6 to 12 percent. Noticeably, patients with private insurance had a much higher percentage of SSIs at above 25 percent while patients with other insurance had 0 percent. However, this is most likely an artifact of the small sample size of this type of insurance.

In the end, our hypotheses was incorrect; there was seemingly no relationship between insurance type and occurence of SSIs. This result indicates it may be a poor candidate for predicting SSI.

Association to each health condition

To see how often a condition is related to SSI, we use a heatmap to look for any particularly high rates of association. We also looked at death as an outcome to see if any of these diseases are associated with it.

Comparing number of death and SSI cases side by side with each health condition, we see a nearly symmetric heatmap. There are fewer cases under patients that actually got SSI or died from the operation, so it is difficult to conclude any direct relationship between health condition and surgery outcome. Intrestingly, we can see that there are a great number of successful cases for patients with sleep apnea, specific carbohydrate diet, and hypertension related ot SSI.

Our subanalyses demonstrate that many of the factors we found in the literature review are seemingly not associated with SSI. We believe that this may be due to differences in types of infection in our dataset and those studied in the papers. In light of our subanalyses, we will still include them in the regression analysis.

Regression Analysis

With our literature review and personal exploration of the data, we have a set of variables to use for our logistic regression. After much debate we decided to analyze a subset of 20 variables. These candidate variables have been shown to be associated with SSI, so we’ll use automatic procedures to further cut down on covariates and attempt to get to a more robust, parsimonious model.

The resulting stepwise regression results in a model with 23 covariates, with many of them stemming from categorical factors chosen. Out of these covariates, length_of_stay, surgical_approaches 2, 3 and 4, age, ASA level 3, admission to ICU, surgery time, and BMI remained statistically significant.


We are pleased to see that many of the variables that we found during our literature found themselves in the model and were statistically significant. However, many of the variables that were found to be risk factors in SSI did not end up in the final model or were not statistically significant. Given the large amount of covariates and categorical variables we had to consider, it is highly likely that many elements that would have been included in a more parsimonious model were excluded thanks to the noise introduced by starting with 20 candidates.

Looking at the model coefficients, we can comment on how each factor affects the odds of SSI. High BMI, admission to the ICU, ASA class 3, being a smoker and long surgery times have positive coefficients, meaning that the odds of SSI in patients with these characteristics are higher than those without. Older patients, length of stay and surgical approaches 2 and 3 correspond with negative coefficients, indicating that odds of SSI are reduced in patients with these qualities.

Although far from perfect, our model gives us some insight into what pre-surgical disease/traits contribute to SSI and supports some of the research done on this particular subject. We believe that our results can be improved by starting with more parsimonious models and building the stepwise regression from there.

We believe this analysis could be improved by focusing on particular subsets of patients who contract infections. We know that colectomies are performed for many different conditions, so different diseases may contribute differently to how patients may be infected.

Changes Mid-Report

During the course of the project, we were unsure whether we should use SSI or death as an outcome. As we started wrangling with the data and reviewing the literature, we realized that more research has been done on SSI. Furthermore, we also found out that death due to colectomy was a rare outcome. These two factors led us to focus on SSI in the subanalyses and regression.

One particularly contentious issue for our project was actually our Shiny application. Given that we had data from multiple hospitals in Michigan, we wanted to create a tool to recommend hospitals to patients given some location data. However, we quickly ran into a roadblock: there was no location information in our dataset. Since we were doing a regression analysis, we decided to opt for a more educational tool instead. Now our Shiny app just allows a user to create their own regression tool and see what the results are.

The Widget

We created a Shiny app that would enable a user to pick variables to include into a logistic regression. After picking factors, the user can see the magnitude of each of the coefficients. You can find the widget here. The widget has to load data, so it may take a while to work.

We also created another basic site to host a shorted version of the analysis


Copyright © 2019 Christian B. Pascual. All rights reserved. This site is hosted by Github Pages and is built on R Markdown.