Photo by Rostyslav Savchyn on Unsplash

Problem

Applied Skills

Bootstrapping, Parallel Processing

Introduction

Down syndrome is a genetic disorder caused by the presence of a third copy of chromosome 21 and is associated with delayed physical growth and intellectual disability. As a genetic disorder, Down syndrome changes the expression of several proteins within the body. Although screening tests for Down syndrome exist, more invasive procedures are needed to confirm a diagnosis. Thus, a non-invasive method to predict Down syndrome would be clinically useful.

We have data on the expression levels of 77 proteins/protein modifications produced in the cortex of 1080 mice. This data is a modified version of a dataset found on Kaggle. Some mice are controls and others are confirmed to have Down syndrome. The goal of this project is to build and train predictive models on this data. Since the number of candidate protein predictors is high, we compare two models for predicting Down syndrome: a logistic-LASSO and a bootstrap-smoothed version of the logistic-LASSO. This project was a group project for Topics in Advanced Statistical Computing. I was in charge of implementing the bootstrap smoothing, so I will focus on that part here.

Data Cleaning

For the most part, the data is already in a good state to be used by the model, so the only major cleaning step was to standardize it. A few mice and proteins have a lot of missing data, so they were removed. tidyverse is used for plotting and data manipulation, glmnet provides fast penalized logistic regression, and the other libraries support parallel processing.
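Below is a minimal sketch of what that cleaning step could look like. The file name, the outcome column (`class`), and the 20% missingness threshold are assumptions for illustration, not the exact choices used in the project.

```r
library(tidyverse)   # data manipulation and plotting
library(glmnet)      # fast penalized (LASSO) logistic regression
library(doParallel)  # parallel backend for the bootstrap loop

# Hypothetical file and column names; the outcome is assumed to be coded 0/1 in `class`
raw <- read_csv("mice_protein.csv")

# Drop proteins (columns) with a large share of missing values,
# then drop mice (rows) with any remaining missing values
col_miss  <- colMeans(is.na(select(raw, -class)))
keep_prot <- names(col_miss)[col_miss <= 0.20]

clean <- raw %>%
  select(class, all_of(keep_prot)) %>%
  drop_na() %>%
  mutate(across(-class, ~ as.numeric(scale(.))))  # standardize each protein
```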

Methodology

Efron [1] laid out a methodology for using bagging to create smoothed estimators, standard errors, and confidence intervals. We applied this methodology to the logistic-LASSO to create another predictive logistic model. A logistic-LASSO model was fit to each of 1000 bootstrap samples using the glmnet package, which we chose for computational speed. Each of these 1000 models forms a prediction, and the final prediction is made by majority vote.
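As a rough sketch of that bootstrap ensemble, continuing with the `clean` data frame from above (B, the lambda rule, and the number of cores are illustrative choices rather than the project's exact settings):

```r
x <- as.matrix(select(clean, -class))
y <- clean$class   # assumed coded 0 = control, 1 = Down syndrome
B <- 1000

registerDoParallel(cores = 4)
boot_fits <- foreach(b = seq_len(B), .packages = "glmnet") %dopar% {
  idx <- sample(nrow(x), replace = TRUE)          # resample mice with replacement
  cv  <- cv.glmnet(x[idx, ], y[idx], family = "binomial", alpha = 1)
  list(idx  = idx,
       coef = as.numeric(coef(cv, s = "lambda.min")),
       pred = as.numeric(predict(cv, newx = x, s = "lambda.min", type = "class")))
}

# Majority vote across the B models gives the final prediction for each mouse
pred_mat   <- sapply(boot_fits, `[[`, "pred")
final_pred <- as.integer(rowMeans(pred_mat) > 0.5)
```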

In order to create a smoothed estimate and confidence interval for each of the regression coefficients, each coefficient was averaged among the 1000 models per Efron’s bagging methodology.

\[ s(y) = \frac{1}{B}\sum^B_{i=1}t(y^*_i) \]

where \(t(y^*_i)\) represents an individual regression coefficient from the model fit to the \(i\)-th bootstrap sample. We calculated a smoothed standard error according to Efron's paper. The smoothed standard error is given by: \[ \widetilde{sd_B} = \bigg[ \sum^n_{j=1} \widetilde{cov_j}^2 \bigg]^{1/2} \]

where \(\widetilde{cov_j} = \frac{\sum^B_{i=1}(Y^*_{ij} - Y^*_{\cdot j})(t_i^* - t_\cdot^*)}{B}\) is the bootstrap covariance between \(Y^*_{ij}\) and \(t^*_{i}\). Here \(Y^*_{ij}\) is the number of times the original data point \(y_j\) appears in the bootstrap sample \(y^*_i\), \(Y^*_{\cdot j}\) is its average over the \(B\) samples, and \(t_\cdot^*\) is the smoothed bootstrap estimate, i.e. the average of the \(t^*_i\). With the smoothed standard error, we can calculate a smoothed confidence interval: \[ s(y) \pm 1.96 \times \widetilde{sd_B} \]
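A sketch of how these quantities could be computed from the bootstrap output above (the `boot_fits`, `x`, and `B` objects come from the earlier hypothetical sketch):

```r
n <- nrow(x)

# Y_star[i, j]: number of times original mouse j appears in bootstrap sample i
Y_star <- t(sapply(boot_fits, function(f) tabulate(f$idx, nbins = n)))

# t_star[i, k]: k-th coefficient (intercept first) from the i-th bootstrap model
t_star <- t(sapply(boot_fits, `[[`, "coef"))

smoothed_est <- colMeans(t_star)   # s(y): the bagged coefficients

# Efron's smoothed standard error for each coefficient
smoothed_sd <- sapply(seq_len(ncol(t_star)), function(k) {
  cov_j <- colMeans(sweep(Y_star, 2, colMeans(Y_star)) *
                      (t_star[, k] - smoothed_est[k]))
  sqrt(sum(cov_j^2))
})

ci_lower <- smoothed_est - 1.96 * smoothed_sd
ci_upper <- smoothed_est + 1.96 * smoothed_sd
```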

Efron notes that this smoothed confidence interval is narrower than the traditional bootstrap confidence interval. If an interval did not contain 0, we considered the protein associated with that coefficient to be significantly associated with Down syndrome. For comparison, we'll also calculate the bootstrap percentile intervals.
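The percentile intervals can be read straight off the same bootstrap coefficient draws (using the hypothetical `t_star` matrix from the previous sketch):

```r
# 95% percentile interval for each coefficient, taken across the B bootstrap fits
pct_lower <- apply(t_star, 2, quantile, probs = 0.025)
pct_upper <- apply(t_star, 2, quantile, probs = 0.975)
```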

Findings

Significant Associations

Our smoothed confidence interval approach found 16 proteins to be significantly associated with Down syndrome. There was about a 50-50 split in terms of whether a significantly associated protein increased or decreased the odds of a mouse having Down syndrome. Figure 1 illustrates the confidence intervals for each of the 77 proteins and an intercept term. Table 1 delineates which of the proteins in the data set were significant.

Figure 1: Smoothed confidence intervals

Table 1: Proteins and associated smoothed confidence intervals

When we matched these proteins against the features selected by LASSO, we found that all but 2 of them were selected when using the entire data set. This result makes sense since proteins with a true association to Down syndrome would also be more likely to be picked up in the feature selection.
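One way to carry out that comparison, continuing with the objects assumed in the earlier sketches, is to contrast a single LASSO fit on the full data with the proteins whose smoothed intervals exclude zero:

```r
full_cv  <- cv.glmnet(x, y, family = "binomial", alpha = 1)
full_sel <- which(as.numeric(coef(full_cv, s = "lambda.min")) != 0)

smooth_sig <- which(ci_lower > 0 | ci_upper < 0)   # smoothed CIs excluding zero
intersect(full_sel, smooth_sig)                    # flagged by both approaches
```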

While the smoothed confidence intervals found some significant proteins in the data set, they were contradicted by the percentile intervals. All of the percentile intervals contained zero, so none of the proteins were significant according to this rule. Figure 2 shows how the percentile intervals were distributed for each protein.

Figure 2: Percentile intervals

One thing I did expect was that our smoothed confidence intervals would actually come out a bit wider than the standard bootstrap intervals. This discrepancy comes from the number of bootstrap samples needed to achieve the shortened interval: Efron notes that ideally we would use all \(n^n\) bootstrap samples to calculate it, but we had neither the time nor the resources to do so. Figure 3 shows the smoothed confidence intervals (blue) against their standard counterparts (green).

Figure 3: Standard vs smoothed confidence intervals

Conclusion

Bootstrapping is extremely useful in situations where the distributions of the statistics in question are difficult to derive or lack closed forms. This report applies a helpful methodology for creating smoothed estimates and confidence intervals from Efron's paper on model selection [1]. We demonstrate that bootstrapping models can produce effective predictions and point out significant associations in the data.

References

[1] B. Efron. Model selection, estimation, and bootstrap smoothing. Technical Report No. 262, Stanford University, 2012.
