Can we figure out which common factors contribute the most to admission to Master’s programs?

Exploratory Data Analysis, Visualization, Linear Models

Applying to graduate programs is a harrowing process for any student. Between the statement of purpose, letters of recommendation, GRE and GPA, there are many factors that can influence the acceptance or rejection of a hopeful student. Given data on the application components, predicting a student’s chance of admission poses an interesting regression problem. From a student’s perspective, the ability to predict chances of admission would let applicants ground their expectations and plan for the future.

The *Graduate Admissions* dataset from Kaggle contains data on about 500 Indian students applying for Master’s programs in the United States. The response variable we hope to predict is `Chance of Admit`. The potential predictors are various Master’s application components converted into continuous or categorical form, including: GRE score, TOEFL score, university rating, statement of purpose strength, letter of recommendation strength, cumulative GPA and the presence of research experience. This project was a group effort, and I was in charge of the regularized models.

In its raw form, the dataset needs only minimal formatting before we can start modeling. GRE, CGPA and TOEFL are already continuous and don’t need coercing, but they will be centered and scaled. University rating, statement of purpose strength, and letter of recommendation strength look categorical, but we treat them as continuous for easier interpretation and to reduce the number of dummy predictors. The presence of research was recoded as a proper binary variable.

```
library(tidyverse)
library(glmnet) # LASSO
library(mgcv) # GAMs
library(modelr) # cross-validation help
library(earth) # MARS
library(caret) # cross-validation help
library(kableExtra)
library(corrplot)

# Loading in the dataset
admit.data = read.csv(file = "./admission_predict.csv") %>%
  janitor::clean_names() %>%
  dplyr::mutate(
    # Center and scale the continuous test scores and GPA
    gre.std = (gre_score - mean(gre_score)) / sd(gre_score),
    toefl.std = (toefl_score - mean(toefl_score)) / sd(toefl_score),
    cgpa.std = (cgpa - mean(cgpa)) / sd(cgpa),
    # Keep the rating-style variables numeric under clearer names
    uni.rating = university_rating,
    sop.strength = sop,
    lor.strength = lor
  ) %>%
  # Keep only the derived predictors, research, and the response
  dplyr::select(gre.std:lor.strength, research, chance_of_admit)
```
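As a quick check of the centering and scaling step, the sketch below compares the manual formula used above with base R's `scale()`. The five GRE values are made up for illustration and are not taken from the dataset.

```r
# Hypothetical raw GRE scores (illustrative values, not from the dataset)
gre_score <- c(337, 324, 316, 322, 314)

# Manual standardization, as in the mutate() call
manual <- (gre_score - mean(gre_score)) / sd(gre_score)

# scale() centers on the mean and divides by the sample standard deviation
builtin <- as.numeric(scale(gre_score))

# Both approaches should agree up to floating-point error
same <- isTRUE(all.equal(manual, builtin))
same

# A standardized variable has mean 0 and standard deviation 1
c(mean = mean(manual), sd = sd(manual))
```

Either form works; the manual version keeps the transformation visible inside the pipeline.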

This project seeks to build several regression models for the chance of admission, compare them, and hopefully recommend one for use by future college hopefuls.

Before starting the modeling, we explored our dataset to see how each of the predictors related to the response and to each other. Our findings here motivate the inclusion of some of the models in our report.

The correlation matrix above shows that all predictors are moderately to highly correlated with each other (\(\rho\) > 0.5). The highest correlations occur among `gre.std`, `toefl.std`, and `cgpa.std`. We expected this result since applicants with high GPAs are also more likely to score well on other tests. Including all three of `gre.std`, `toefl.std`, and `cgpa.std` as predictors may result in multicollinearity and inflated variance in the coefficient estimates. In response to this, we decided to explore how LASSO and ridge regression could curb the effect of this correlation and improve predictions.
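One way to quantify this concern is the variance inflation factor (VIF). The sketch below uses only base R on synthetic data: the three columns borrow the names of our standardized predictors, but the values are simulated from a hypothetical shared factor, so the numbers say nothing about the real dataset.

```r
# Synthetic stand-in for the three standardized scores (NOT the real data)
set.seed(1)
n <- 500
ability <- rnorm(n)  # hypothetical latent factor that drives all three scores
scores <- data.frame(
  gre.std   = as.numeric(scale(ability + rnorm(n, sd = 0.5))),
  toefl.std = as.numeric(scale(ability + rnorm(n, sd = 0.5))),
  cgpa.std  = as.numeric(scale(ability + rnorm(n, sd = 0.5)))
)

# Pairwise Pearson correlations between the predictors
round(cor(scores), 2)

# VIF for predictor j is 1 / (1 - R^2_j), where R^2_j comes from
# regressing predictor j on all of the other predictors
vif <- sapply(names(scores), function(v) {
  others <- setdiff(names(scores), v)
  fit <- lm(reformulate(others, response = v), data = scores)
  1 / (1 - summary(fit)$r.squared)
})
round(vif, 2)
```

A VIF of 1 means a predictor is uncorrelated with the others; larger values mean the variance of its coefficient estimate is inflated by that factor.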

The scatterplots above indicate a positive relationship between each predictor and `Chance of Admit`. We also saw that the range of our response only goes from about 40% to 95% in the dataset. Statement of purpose strength, letter of recommendation strength, and university rating all have approximately linear relationships with the chance of admission. With GPA, GRE, and TOEFL scores, we noticed a slight plateauing of admission chances at the higher end of the scores. Given this non-linearity in the data, we hope that a GAM and a MARS model will capture this subtlety and improve prediction.

We will attempt to predict `Chance of Admit` using a total of 5 models: 3 linear models (ordinary linear regression, ridge regression and LASSO) and 2 non-linear models (a generalized additive model and a MARS model). We will be using the implementations in `lm`, `glmnet`, `gam` and `earth` to do the modeling.

Our dataset only contains 7 predictors, so we will incorporate all of them into each of our 5 models. Each of these factors is requested in most Master’s program applications, so we will assume all will have an impact on predicting the chance of admission.

We plan to use ordinary linear regression as a *baseline* model to compare the other 4 models with. Although we suspect that all the variables will have an appreciable impact on the chance of admission, we still want to allow for the possibility that some predictors do not truly contribute. Furthermore, we found high correlation between many of our predictors in our exploratory data analysis, so we also want to adjust for potential inflation of our regression coefficients. Therefore, we plan to use ridge regression to shrink the coefficients and LASSO to both shrink them and perform variable selection, which should adjust for these findings and hopefully improve predictions.
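The difference between the two penalties can be sketched with `cv.glmnet`, where `alpha = 0` gives ridge and `alpha = 1` gives LASSO. The data here are simulated for illustration only: the `noise` column and all coefficient values are hypothetical, and the choice of `lambda.min` and 5 folds is an assumption of this sketch.

```r
library(glmnet)

# Simulated predictors (NOT the real data): three correlated scores plus
# one hypothetical column with no true effect on the response
set.seed(2)
n <- 500
ability <- rnorm(n)
x <- cbind(
  gre.std   = ability + rnorm(n, sd = 0.5),
  toefl.std = ability + rnorm(n, sd = 0.5),
  cgpa.std  = ability + rnorm(n, sd = 0.5),
  noise     = rnorm(n)
)
y <- 0.7 + 0.05 * x[, "cgpa.std"] + 0.03 * x[, "gre.std"] + rnorm(n, sd = 0.05)

# Choose lambda by 5-fold cross-validation for each penalty
ridge.cv <- cv.glmnet(x, y, alpha = 0, nfolds = 5)
lasso.cv <- cv.glmnet(x, y, alpha = 1, nfolds = 5)

# Coefficients at the lambda with the lowest cross-validated error
coefs <- cbind(
  ridge = as.matrix(coef(ridge.cv, s = "lambda.min"))[, 1],
  lasso = as.matrix(coef(lasso.cv, s = "lambda.min"))[, 1]
)
round(coefs, 4)
```

Ridge shrinks every coefficient toward zero but keeps all predictors in the model, while LASSO can set some coefficients exactly to zero, which is what makes it a variable selection method.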

In our exploratory data analysis, we also found that some of our predictors (CGPA and TOEFL) had a slightly nonlinear relationship with the chance of admission. We hope to use a GAM and a MARS model to better capture these nonlinear relationships and produce improved predictions as a result.
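As a rough illustration of how a GAM can pick up a plateau, the sketch below fits `mgcv::gam` with a smooth term on a simulated `cgpa.std`. The `tanh` plateau, the effect sizes, and the two-predictor formula are all assumptions made for this example, not our fitted model.

```r
library(mgcv)

# Simulated data (NOT the real data) with a hypothetical plateau at high CGPA
set.seed(3)
n <- 500
sim <- data.frame(
  cgpa.std = rnorm(n),
  research = rbinom(n, 1, 0.5)
)
sim$chance_of_admit <- 0.7 + 0.1 * tanh(sim$cgpa.std) +
  0.02 * sim$research + rnorm(n, sd = 0.05)

# s() requests a penalized smooth of cgpa.std; method = "GCV.Cp" chooses
# the smoothing parameter by generalized cross-validation
gam.fit <- gam(chance_of_admit ~ s(cgpa.std) + research,
               data = sim, method = "GCV.Cp")

# Effective degrees of freedom of the smooth; values near 1 indicate an
# essentially linear fit, larger values indicate curvature
smooth.edf <- summary(gam.fit)$edf
smooth.edf
```

Calling `plot(gam.fit)` draws the estimated smooth, which makes it easy to see whether the fitted curve flattens at the high end.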

We used `cv.glmnet` to find the optimal \(\lambda\) for both the ridge and LASSO models via 5-fold cross-validation. For the GAM, we found the optimal smoothing parameters via generalized cross-validation. In order to tune our MARS model, we created a tuning grid spanning from 1 to 3 degrees and from 1 to 50 retained terms. Then, we used `caret` to choose the best set from this grid.
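The MARS tuning step might look roughly like the sketch below, which uses `caret::train` with `method = "earth"` and its two tuning parameters, `degree` and `nprune`. The data are simulated, the grid is deliberately smaller than the one described above so the example runs quickly, and the 10-fold resampling inside `trainControl` is an assumption of this sketch.

```r
library(caret)
library(earth)

# Simulated data (NOT the real data); effect sizes are hypothetical
set.seed(4)
n <- 300
sim <- data.frame(
  gre.std  = rnorm(n),
  cgpa.std = rnorm(n),
  research = rbinom(n, 1, 0.5)
)
sim$chance_of_admit <- 0.7 + 0.03 * sim$gre.std + 0.08 * tanh(sim$cgpa.std) +
  0.02 * sim$research + rnorm(n, sd = 0.05)

# Every combination of interaction degree and number of retained terms
# (reduced grid for speed)
mars.grid <- expand.grid(degree = 1:3, nprune = 2:10)

# caret fits each grid point under cross-validation and keeps the
# combination with the lowest resampled RMSE
mars.tune <- train(
  x = sim[, c("gre.std", "cgpa.std", "research")],
  y = sim$chance_of_admit,
  method = "earth",
  tuneGrid = mars.grid,
  trControl = trainControl(method = "cv", number = 10)
)

mars.tune$bestTune
```

The selected `degree` and `nprune` can then be passed to `earth` directly to refit the final model on the full dataset.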

For each of the models, we used 10-fold cross-validation to evaluate the average test fold MSE as a measurement of predictive ability. Table 1 shows a comparison of the average test fold MSE for each of the 5 models.
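For the baseline linear model, the evaluation procedure can be sketched in base R as follows. The data are simulated stand-ins with hypothetical effect sizes, so the resulting MSE is not one of the values reported in Table 1.

```r
# Simulated data (NOT the real data); effect sizes are hypothetical
set.seed(5)
n <- 500
sim <- data.frame(gre.std = rnorm(n), cgpa.std = rnorm(n))
sim$chance_of_admit <- 0.7 + 0.03 * sim$gre.std + 0.06 * sim$cgpa.std +
  rnorm(n, sd = 0.06)

# Assign each row to one of k folds at random
k <- 10
fold <- sample(rep(1:k, length.out = n))

# For each fold: fit on the other k - 1 folds, then score on the held-out fold
fold.mse <- sapply(1:k, function(i) {
  train <- sim[fold != i, ]
  test  <- sim[fold == i, ]
  fit   <- lm(chance_of_admit ~ ., data = train)
  mean((test$chance_of_admit - predict(fit, newdata = test))^2)
})

# Average test-fold MSE, the quantity compared across models
avg.mse <- mean(fold.mse)
avg.mse
```

Reusing the same `fold` assignment for every model keeps the comparison fair, since each model is then scored on identical held-out rows.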

In our ordinary linear and LASSO models, CGPA, GRE score and research exert the strongest influence on a student’s chance of admission. Conversely, the strength of a student’s statement of purpose and the rating of the university have the least influence on the chance of admission. The ridge model resulted in coefficients that suggest that all variables affect the chance of admission on a similar scale; each of the variables increases the predicted chance of admission by about 1%. Table 2 summarizes the coefficients of the linear models.

The ridge model, GAM and MARS model were chosen to try to maximize predictive ability, but each of these models is limited in interpretability. Explaining what each coefficient means would be difficult compared to ordinary linear regression. Our data contained predictors that had approximately linear relationships with chance of admission, so we believe that our models were flexible enough to capture the complexity in the data.

We believe that all of our models are also limited by the range of responses in our dataset. The chance of admission only ranges from 40% to 95%, so we cannot guarantee that our models would predict well for students whose true chance of admission falls outside this range.

The LASSO model performed the best out of the 5 models, although all of them are comparable since they have similar average test fold MSEs. The difference between the best and worst models is on the \(10^{-5}\) scale, which is negligible in terms of predicting a student’s chance of admission. We would recommend using the LASSO model for its simplicity, interpretability, and prediction accuracy.

We predicted that the GAM would perform the best since it would better capture the plateauing seen in GRE and CGPA. However, these nonlinearities were still well approximated by the linear models. Our model suggests that hopeful Master’s applicants should focus on maximizing their GRE, GPA and research experience on their applications, rather than putting that effort into their statements of purpose. Our findings confirm conventional wisdom that GPA and GRE scores are the most efficient mechanism for universities to single out capable applicants from large student pools.

In this report, we’ve developed models to predict a student’s chance of admission to a Master’s program. We hope that our model will give graduate school applicants a better idea of what to expect in terms of their chances of admission given their specific metrics.