Alumni Donation Rate - A Case Study in Linear Regression

Introduction

Alumni donations are an important source of revenue for colleges and universities. Studying the effects of different variables on the donation could help predict a university’s revenue for a particular year. Also, understanding the factors that influence increases in the percentage of alumni who make a donation could help administrators implement policies that could lead to increased revenues.

For example, research shows that students who are more satisfied with their contact with teachers are more likely to graduate and give back to their alma mater. Another factor could be whether or not the university is public. The national ranking of the university could also influence the alumni to donate. Ivy league colleges have always had a high alumni donation rate and the state in which a school is located could affect the revenue as well.

This report quantifies the effect of the above-mentioned variables on the alumni donation rate. After exploratory analyses, a linear regression model is built and the flaws in the model are remedied with model diagnostic measures.

Variable Descriptions

Variables Description
school Name of the school
percent_of_classes_under_20 The percentage of classes with fewer than 20 students
student_faculty_ratio Ratio of students and faculty
alumni_giving_rate Percentage of alumni that made a donation to the university
private Whether or not the school is private
state State the school is located in
ranking US World News ranking for 2018

Univariate Analyses

The response variable of the model is alumni_giving_rate. Below is a summary of all the predictor variables.

percent_of_classes_under_20

The number of classes with less than 20 students varies from 29 to 77, averaging at 55.73 across all the schools.

The univariate regression model for this variables gives the following coefficients. With this low a p-value, we could expect this variable to influence the alumni giving rate.

##                               Estimate Std. Error   t value     Pr(>|t|)
## (Intercept)                 -7.3860676  6.5654723 -1.124986 2.664307e-01
## percent_of_classes_under_20  0.6577687  0.1147048  5.734448 7.228121e-07

The donation rate increases by 0.66 points for every unit increase in the class percentage.

private

There are 33 private schools and 15 public schools in the dataset.
As seen in the graph below and the coefficients of the regression model, it is evident that whether or not a school is private affects the donation it receives from the alumni.

##             Estimate Std. Error  t value     Pr(>|t|)
## (Intercept) 15.66667   2.540545 6.166655 1.628283e-07
## private1    19.78788   3.064013 6.458158 5.938532e-08

What the summary above means is that if a school is public, it receives an average 15.67 points of donation. But private schools get an additional 19.79 points.

ranking

There is clearly a linear relationship between the rankings and the alumni donations. The higher ranked schools are more likely to get donations from the alumni.
This is further evident from the univariate model coefficients below.

##               Estimate Std. Error   t value     Pr(>|t|)
## (Intercept) 43.1998639 2.90195144 14.886488 3.298133e-19
## ranking     -0.5422494 0.09647412 -5.620672 1.068157e-06

For every unit decrease in rank, the alumni donation increases on an average by 0.54

Modelling

Basic Model

Modelling with all the variables that were found to be influential on the response donation rate.

## 
## Call:
## lm(formula = alumni_giving_rate ~ student_faculty_ratio + percent_of_classes_under_20 + 
##     private + ranking, data = alumni)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -14.7557  -5.5338  -0.5836   4.7865  22.3672 
## 
## Coefficients:
##                             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                  51.8087    14.5213   3.568   0.0009 ***
## student_faculty_ratio        -1.2424     0.4911  -2.530   0.0152 *  
## percent_of_classes_under_20  -0.1100     0.1881  -0.585   0.5618    
## private1                      6.7788     5.1073   1.327   0.1914    
## ranking                      -0.2619     0.1119  -2.341   0.0240 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.632 on 43 degrees of freedom
## Multiple R-squared:  0.6227, Adjusted R-squared:  0.5876 
## F-statistic: 17.74 on 4 and 43 DF,  p-value: 1.138e-08

From the p-values of the coefficients, we observe that percent_of_classes_under_20 and private variables do not have a significant influence on the model. We observed the contrary during the univariate analyses. This can probably be explained by multicollinearity between the two variables. The differnece between the \(R^2\) and \(adjusted \: R^2\) also indicate the same reason. The \(MSE\) is observed to be 69.65.

##       student_faculty_ratio percent_of_classes_under_20 
##                    3.580330                    3.886828 
##                     private                     ranking 
##                    3.610477                    1.976081

But, on checking the Variance Inflation Factor, we do not observe any values above 10, disproving the multicollinearity theory. Hence, the two variables are ignored from the model for now and will be explored later.

Variable Selection

With the two variables removed, the summary of the new model is

## 
## Call:
## lm(formula = alumni_giving_rate ~ student_faculty_ratio + ranking, 
##     data = alumni)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -12.9086  -6.1546   0.2597   4.4597  21.2957 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           53.74558    3.24880  16.543  < 2e-16 ***
## student_faculty_ratio -1.55767    0.32534  -4.788 1.86e-05 ***
## ranking               -0.25291    0.09978  -2.535   0.0148 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.609 on 45 degrees of freedom
## Multiple R-squared:  0.6072, Adjusted R-squared:  0.5898 
## F-statistic: 34.79 on 2 and 45 DF,  p-value: 7.38e-10

Only significant variables remain in the model but we do not see improvement in the \(adjusted \: R^2\) at 0.59 and \(MSE\) at 72.5.

Residual Diagnostics

Checking the fit of the model with the following residual analysis plots.

model2Augment <- augment(model2) %>% mutate(row_num = 1:n())

# Plot of residuals against fitted values (non-constant variance and non-linearity)
p1 <- ggplot(model2Augment, aes(x = .fitted, y = .std.resid)) +
  geom_point(alpha = 0.3) +
  geom_hline(yintercept = 0, linetype = "dashed", color = "red2") +
  geom_hline(yintercept = c(-2, 2), linetype = "dotted") +
  xlab("Fitted value") +
  ylab("Standardized residuals") + 
  ggtitle("Non-constant variance & non-linearity test\nFitted values - equally spread around 0") + 
    theme(plot.title = element_text(size = 8), axis.title = element_text(size = 8))

# Plot of residuals against predictor variable (checking non-linearity).
p2 <- ggplot(model2Augment, aes(x = ranking, y = .std.resid)) +
  geom_point(alpha = 0.3) +
  geom_hline(yintercept = 0, linetype = "dashed", color = "red2") +
  geom_hline(yintercept = c(-2, 2), linetype = "dotted") +
  geom_smooth(color = "forestgreen", alpha = 0.1, se = FALSE) +
  ylab("Standardized residuals") + 
  ggtitle("Non-linearity test\nX follows a slight exponential curve") + 
    theme(plot.title = element_text(size = 8), axis.title = element_text(size = 8))

p3 <- ggplot(model2Augment, aes(x = student_faculty_ratio, y = .std.resid)) +
  geom_point(alpha = 0.3) +
  geom_hline(yintercept = 0, linetype = "dashed", color = "red2") +
  geom_hline(yintercept = c(-2, 2), linetype = "dotted") +
  geom_smooth(color = "forestgreen", alpha = 0.1, se = FALSE) +
  ylab("Standardized residuals") + 
  ggtitle("Non-linearity test\nX follows a curve") + 
    theme(plot.title = element_text(size = 8), axis.title = element_text(size = 8))

# Boxplot of residuals (outlying observations)
p4 <- ggplot(model2Augment, aes(y = .std.resid)) +
  geom_boxplot() +
  ggtitle("Outlying observations test\nOne outlier") + 
    theme(plot.title = element_text(size = 8), axis.title = element_text(size = 8))

# Normal probability Q-Q plot of residuals (non-normality).
p5 <- ggplot(model2Augment, aes(sample = .std.resid)) +
  geom_qq(alpha = 0.3) +
  geom_qq_line(linetype = "dashed", color = "red2") +
  xlab("Theoretical quantile") +
  ylab("Sample quantile") +
  ggtitle("Non-normality test\nThe residuals almost follow a normal distribution") + 
    theme(plot.title = element_text(size = 8), axis.title = element_text(size = 8))

# Plot of residuals against time or another sequence (non-independence)
p6 <- ggplot(model2Augment, aes(x = row_num, y = .std.resid)) +
  geom_point(alpha = 0.3) +
  geom_line() +
  geom_hline(yintercept = 0, linetype = "dashed", color = "red2") +
  xlab("Index") +
  ylab("Standardized residual") +
  ggtitle("Non-independance test\nNo residual pattern across row numbers") + 
    theme(plot.title = element_text(size = 8), axis.title = element_text(size = 8))

ggarrange(p1, p2, p3, p4, p5, p6, nrow = 2, ncol = 3)

The exponential curve on the ranking variable can be improved by applying a log transformation.

Log Transformation

The new model after the transformation.

## 
## Call:
## lm(formula = alumni_giving_rate ~ student_faculty_ratio + ranking2, 
##     data = alumni)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -14.3976  -5.7894   0.6323   4.2420  21.5566 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)            61.1314     3.9543  15.459  < 2e-16 ***
## student_faculty_ratio  -1.4228     0.3136  -4.538 4.21e-05 ***
## ranking2               -5.2561     1.5878  -3.310  0.00184 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.253 on 45 degrees of freedom
## Multiple R-squared:  0.639,  Adjusted R-squared:  0.623 
## F-statistic: 39.83 on 2 and 45 DF,  p-value: 1.103e-10

It’s observed that the \(adjusted \: R^2\) increases to 0.62 and \(MSE\) reduces to 66.63.

Residual Diagnostics

model4Augment <- augment(model4) %>% mutate(row_num = 1:n())

# Plot of residuals against fitted values (non-constant variance and non-linearity)
p1 <- ggplot(model4Augment, aes(x = .fitted, y = .std.resid)) +
  geom_point(alpha = 0.3) +
  geom_hline(yintercept = 0, linetype = "dashed", color = "red2") +
  geom_hline(yintercept = c(-2, 2), linetype = "dotted") +
  xlab("Fitted value") +
  ylab("Standardized residuals") + 
  ggtitle("Non-constant variance & non-linearity test\nFitted values - equally spread around 0") + 
    theme(plot.title = element_text(size = 8), axis.title = element_text(size = 8))

# Plot of residuals against predictor variable (checking non-linearity).
p2 <- ggplot(model4Augment, aes(x = ranking2, y = .std.resid)) +
  geom_point(alpha = 0.3) +
  geom_hline(yintercept = 0, linetype = "dashed", color = "red2") +
  geom_hline(yintercept = c(-2, 2), linetype = "dotted") +
  geom_smooth(color = "forestgreen", alpha = 0.1, se = FALSE) +
  ylab("Standardized residuals") + 
  ggtitle("Non-linearity test\nX not exponential anymore") + 
    theme(plot.title = element_text(size = 8), axis.title = element_text(size = 8))

p3 <- ggplot(model4Augment, aes(x = student_faculty_ratio, y = .std.resid)) +
  geom_point(alpha = 0.3) +
  geom_hline(yintercept = 0, linetype = "dashed", color = "red2") +
  geom_hline(yintercept = c(-2, 2), linetype = "dotted") +
  geom_smooth(color = "forestgreen", alpha = 0.1, se = FALSE) +
  ylab("Standardized residuals") + 
  ggtitle("Non-linearity test\nX follows a curve") + 
    theme(plot.title = element_text(size = 8), axis.title = element_text(size = 8))

# Boxplot of residuals (outlying observations)
p4 <- ggplot(model4Augment, aes(y = .std.resid)) +
  geom_boxplot() +
  ggtitle("Outlying observations test\nOne outlier") + 
    theme(plot.title = element_text(size = 8), axis.title = element_text(size = 8))

# Normal probability Q-Q plot of residuals (non-normality).
p5 <- ggplot(model4Augment, aes(sample = .std.resid)) +
  geom_qq(alpha = 0.3) +
  geom_qq_line(linetype = "dashed", color = "red2") +
  xlab("Theoretical quantile") +
  ylab("Sample quantile") +
  ggtitle("Non-normality test\nThe residuals almost follow a normal distribution") + 
    theme(plot.title = element_text(size = 8), axis.title = element_text(size = 8))

# Plot of residuals against time or another sequence (non-independence)
p6 <- ggplot(model4Augment, aes(x = row_num, y = .std.resid)) +
  geom_point(alpha = 0.3) +
  geom_line() +
  geom_hline(yintercept = 0, linetype = "dashed", color = "red2") +
  xlab("Index") +
  ylab("Standardized residual") +
  ggtitle("Non-independance test\nNo residual pattern across row numbers") + 
    theme(plot.title = element_text(size = 8), axis.title = element_text(size = 8))

ggarrange(p1, p2, p3, p4, p5, p6, nrow = 2, ncol = 3)

Box-Cox Transformation

The model can be further improved by applying a Box-Cox transformation on the response vairable. The new model is presented below.

## 
## Call:
## lm(formula = alumni_giving_rate2 ~ student_faculty_ratio + ranking2, 
##     data = alumni)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.5503 -1.1945  0.1262  0.8061  3.9072 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           14.44137    0.75593  19.104  < 2e-16 ***
## student_faculty_ratio -0.31443    0.05994  -5.246 4.04e-06 ***
## ranking2              -0.78227    0.30354  -2.577   0.0133 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.578 on 45 degrees of freedom
## Multiple R-squared:  0.6426, Adjusted R-squared:  0.6267 
## F-statistic: 40.45 on 2 and 45 DF,  p-value: 8.83e-11

The \(adjusted \: R^2\) increases further to 0.63 and \(MSE\) reduces significantly to 2.43.

Residual Diagnostics

Applying the same residual diagnostics on the new model.

model3Augment <- augment(model3) %>% mutate(row_num = 1:n())

# Plot of residuals against fitted values (non-constant variance and non-linearity)
p1 <- ggplot(model3Augment, aes(x = .fitted, y = .std.resid)) +
  geom_point(alpha = 0.3) +
  geom_hline(yintercept = 0, linetype = "dashed", color = "red2") +
  geom_hline(yintercept = c(-2, 2), linetype = "dotted") +
  xlab("Fitted value") +
  ylab("Standardized residuals") + 
  ggtitle("Non-constant variance & non-linearity test\nFitted values - equally spread around 0") + 
    theme(plot.title = element_text(size = 8), axis.title = element_text(size = 8))

# Plot of residuals against predictor variable (checking non-linearity).
p2 <- ggplot(model3Augment, aes(x = ranking2, y = .std.resid)) +
  geom_point(alpha = 0.3) +
  geom_hline(yintercept = 0, linetype = "dashed", color = "red2") +
  geom_hline(yintercept = c(-2, 2), linetype = "dotted") +
  geom_smooth(color = "forestgreen", alpha = 0.1, se = FALSE) +
  ylab("Standardized residuals") + 
  ggtitle("Non-linearity test\nX follows a slight exponential curve, but improved") + 
    theme(plot.title = element_text(size = 8), axis.title = element_text(size = 8))

p3 <- ggplot(model3Augment, aes(x = student_faculty_ratio, y = .std.resid)) +
  geom_point(alpha = 0.3) +
  geom_hline(yintercept = 0, linetype = "dashed", color = "red2") +
  geom_hline(yintercept = c(-2, 2), linetype = "dotted") +
  geom_smooth(color = "forestgreen", alpha = 0.1, se = FALSE) +
  ylab("Standardized residuals") + 
  ggtitle("Non-linearity test\nX still follows a curve, but improved") + 
    theme(plot.title = element_text(size = 8), axis.title = element_text(size = 8))

# Boxplot of residuals (outlying observations)
p4 <- ggplot(model3Augment, aes(y = .std.resid)) +
  geom_boxplot() +
  ggtitle("Outlying observations test\nOne outlier") + 
    theme(plot.title = element_text(size = 8), axis.title = element_text(size = 8))

# Normal probability Q-Q plot of residuals (non-normality).
p5 <- ggplot(model3Augment, aes(sample = .std.resid)) +
  geom_qq(alpha = 0.3) +
  geom_qq_line(linetype = "dashed", color = "red2") +
  xlab("Theoretical quantile") +
  ylab("Sample quantile") +
  ggtitle("Non-normality test\nThe residuals follow normal distribution much better") + 
    theme(plot.title = element_text(size = 8), axis.title = element_text(size = 8))

# Plot of residuals against time or another sequence (non-independence)
p6 <- ggplot(model3Augment, aes(x = row_num, y = .std.resid)) +
  geom_point(alpha = 0.3) +
  geom_line() +
  geom_hline(yintercept = 0, linetype = "dashed", color = "red2") +
  xlab("Index") +
  ylab("Standardized residual") +
  ggtitle("Non-independance test\nNo residual pattern across row numbers") + 
    theme(plot.title = element_text(size = 8), axis.title = element_text(size = 8))

ggarrange(p1, p2, p3, p4, p5, p6, nrow = 2, ncol = 3)

Variable Selection 2

Insert variable private

Going back to the variabels that were ignored, private is fit into the final model and found to improve.

## 
## Call:
## lm(formula = alumni_giving_rate2 ~ student_faculty_ratio + ranking2 + 
##     private, data = alumni)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.2235 -1.2151  0.0255  0.8514  3.1533 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           11.68523    1.47088   7.944 4.85e-10 ***
## student_faculty_ratio -0.17379    0.08707  -1.996  0.05214 .  
## ranking2              -0.80677    0.29216  -2.761  0.00836 ** 
## private1               1.75265    0.81309   2.156  0.03663 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.517 on 44 degrees of freedom
## Multiple R-squared:  0.6767, Adjusted R-squared:  0.6547 
## F-statistic:  30.7 on 3 and 44 DF,  p-value: 7.259e-11

The \(adjusted \: R^2\) increases again to 0.65 and \(MSE\) reduces further to 2.2.

But the significance of the student_faculty_ratio variable dips. The increase in \(adjusted \: R^2\) and \(MSE\) is not high enough to justify accepting the new model.

Insert variable percent_of_classes_under_20

This variable however does the opposite and decreases \(adjusted \: R^2\) and \(MSE\).

## 
## Call:
## lm(formula = alumni_giving_rate2 ~ student_faculty_ratio + ranking2 + 
##     percent_of_classes_under_20, data = alumni)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.5751 -1.1999  0.1094  0.8352  3.9101 
## 
## Coefficients:
##                              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                 13.739852   2.666998   5.152 5.83e-06 ***
## student_faculty_ratio       -0.300137   0.079863  -3.758 0.000501 ***
## ranking2                    -0.755347   0.321999  -2.346 0.023557 *  
## percent_of_classes_under_20  0.008209   0.029902   0.275 0.784957    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.594 on 44 degrees of freedom
## Multiple R-squared:  0.6432, Adjusted R-squared:  0.6189 
## F-statistic: 26.44 on 3 and 44 DF,  p-value: 6.213e-10

Summary

The final model shows that the alumni_giving_rate depends on the ranking and the student_faculty_ratio variables.

\[alumniGivingRate^.505 - 1.98 = 14.636 - (0.319*studentFacultyRatio) - (0.798*log(ranking))\]

The \(adjusted \: R^2\) explains 62.67 of the alumni_giving_rate with a low mean-squared error of 2.43.
The negative relationship of the two predictor variables can be explained by the negative linear relationship that was observed earlier.

Harish M