Alumni Donation Rate - A Case Study in Linear Regression

Introduction
Required Packages
Variable Descriptions
Univariate Analyses
Modelling
Summary

Introduction

Alumni donations are an important source of revenue for colleges and universities. Studying the effects of different variables on the donation could help predict a university’s revenue for a particular year. Also, understanding the factors that influence increases in the percentage of alumni who make a donation could help administrators implement policies that could lead to increased revenues.

For example, research shows that students who are more satisfied with their contact with teachers are more likely to graduate and give back to their alma mater. Another factor could be whether or not the university is public. The national ranking of the university could also influence the alumni to donate. Ivy league colleges have always had a high alumni donation rate and the state in which a school is located could affect the revenue as well.

This report quantifies the effect of the above-mentioned variables on the alumni donation rate. After exploratory analyses, a linear regression model is built and the flaws in the model are remedied with model diagnostic measures.

Required Packages

library(knitr)
library(tidyverse)
library(ggpubr)
library(broom)
library(DT)
library(car)

opts_chunk$set(message = FALSE, warning = FALSE)

The data that is being analyzed in this report contains donation records of 48 national universities. It comes from “America’s Best Colleges, Year 2000 Edition”. The data is appended with the state the school is in and it’s USNews ranking.

url <- "https://bgreenwell.github.io/uc-bana7052/data/alumni.csv"
alumni <- read.csv(url)
alumni$private <- as.factor(alumni$private)
alumni$state <- c("MA", "MA", "RI", "CA", "PA", "OH", "VA", "NY", "NY", "NH",
                  "NC", "GA", "DC", "MA", "MD", "PA", "MA", "NY", "IL", "PA",
                  "NJ", "TX", "CA", "MA", "LA", "CA", "CA", "CA", "CA", "CA",
                  "CA", "IL", "FL", "IL", "MI", "NC", "IN", "PA", "MN", "CA",
                  "TX", "VA", "WA", "WI", "TN", "NC", "MO", "CT")
alumni$state <- as.factor(alumni$state)
alumni$ranking <- c(38, 35, 14, 12, 25, 42, 38, 3, 16, 12,
                    8, 21, 22, 2, 10, 53, 3, 30, 10, 59,
                    1, 16, 27, 44, 22, 42, 38, 33, 19, 41,
                    30, 3, 35, 46, 27, 30, 18, 8, 33, 22,
                    49, 25, 59, 49, 14, 27, 19, 3)

Variable Descriptions

varDesc = c("Name of the school",
            "The percentage of classes with fewer than 20 students",
            "Ratio of students and faculty",
            "Percentage of alumni that made a donation to the university",
            "Whether or not the school is private",
            "State the school is located in",
            "US World News ranking for 2018")

colnames(alumni) %>% 
  as.data.frame() %>% select( "Variables" = 1) %>% 
  bind_cols(Description = varDesc) %>% kable()

Variables	Description
school	Name of the school
percent_of_classes_under_20	The percentage of classes with fewer than 20 students
student_faculty_ratio	Ratio of students and faculty
alumni_giving_rate	Percentage of alumni that made a donation to the university
private	Whether or not the school is private
state	State the school is located in
ranking	US World News ranking for 2018

Univariate Analyses

The response variable of the model is alumni_giving_rate. Below is a summary of all the predictor variables.

percent_of_classes_under_20

The number of classes with less than 20 students varies from 29 to 77, averaging at 55.73 across all the schools.

plot1 <- ggplot(alumni, aes(percent_of_classes_under_20)) + geom_histogram() + 
  ggtitle("Histogram of the variable") + 
    theme(plot.title = element_text(size = 8), axis.title = element_text(size = 8))

plot2 <- ggplot(alumni, aes(y = percent_of_classes_under_20)) + geom_boxplot() + 
  ggtitle("Boxplot to look for outliers") + 
    theme(plot.title = element_text(size = 8), axis.title = element_text(size = 8))

plot3 <- ggplot(alumni, aes(percent_of_classes_under_20, alumni_giving_rate)) + 
  geom_point() + geom_smooth(method = "lm", se = FALSE, color = "red") +
  ggtitle("Scatter plot relating the two variables") + 
    theme(plot.title = element_text(size = 8), axis.title = element_text(size = 8))

ggarrange(plot1, plot2, plot3, nrow = 1, ncol = 3)

The univariate regression model for this variables gives the following coefficients. With this low a p-value, we could expect this variable to influence the alumni giving rate.

lm(alumni_giving_rate ~ percent_of_classes_under_20, alumni) %>% 
  summary() %>% .$coefficients

##                               Estimate Std. Error   t value     Pr(>|t|)
## (Intercept)                 -7.3860676  6.5654723 -1.124986 2.664307e-01
## percent_of_classes_under_20  0.6577687  0.1147048  5.734448 7.228121e-07

The donation rate increases by 0.66 points for every unit increase in the class percentage.

student_faculty_ratio

The student_faculty ratio varies from 3 students to 1 faculty to 23 students to 1 faculty. The average is 11.54 students.

plot1 <- ggplot(alumni, aes(student_faculty_ratio)) + geom_histogram() + 
  ggtitle("Histogram of student vs faculty ratio") + 
    theme(plot.title = element_text(size = 8), axis.title = element_text(size = 8))

plot2 <- ggplot(alumni, aes(y = student_faculty_ratio)) + geom_boxplot() + 
  ggtitle("Boxplot to look for outliers") + 
    theme(plot.title = element_text(size = 8), axis.title = element_text(size = 8))

plot3 <- ggplot(alumni, aes(student_faculty_ratio, alumni_giving_rate)) + 
  geom_point() + geom_smooth(method = "lm", se = FALSE, color = "red") +
  ggtitle("Scatter plot relating the two variables") + 
    theme(plot.title = element_text(size = 8), axis.title = element_text(size = 8))

ggarrange(plot1, plot2, plot3, nrow = 1, ncol = 3)

The results of a univariate model is available below. Again, we can expect this variable to highly influence the predictor.

lm(alumni_giving_rate ~ student_faculty_ratio, alumni) %>% 
  summary() %>% .$coefficients

##                        Estimate Std. Error   t value     Pr(>|t|)
## (Intercept)           53.013827   3.421450 15.494548 7.058813e-20
## student_faculty_ratio -2.057155   0.273716 -7.515653 1.544232e-09

The donation rate decreases by 2.06 points for every unit increase in the faculty ratio.

private

There are 33 private schools and 15 public schools in the dataset.
As seen in the graph below and the coefficients of the regression model, it is evident that whether or not a school is private affects the donation it receives from the alumni.

plot1 <- ggplot(alumni, aes(private)) + geom_bar() + 
  ggtitle("Barchart showing the count of private schools") + 
    theme(plot.title = element_text(size = 8), axis.title = element_text(size = 8))


plot2 <- ggplot(alumni, aes(x = alumni_giving_rate, y = student_faculty_ratio, 
                            group = private, color = private)) + 
  geom_point() + geom_smooth(method = "lm", se = FALSE, color = "red") +
  ggtitle("The public and private schools have differing donation rates") + 
    theme(plot.title = element_text(size = 8), axis.title = element_text(size = 8))

ggarrange(plot1, plot2, nrow = 1, ncol = 2)

lm(alumni_giving_rate ~ private, alumni) %>% 
  summary() %>% .$coefficients

##             Estimate Std. Error  t value     Pr(>|t|)
## (Intercept) 15.66667   2.540545 6.166655 1.628283e-07
## private1    19.78788   3.064013 6.458158 5.938532e-08

What the summary above means is that if a school is public, it receives an average 15.67 points of donation. But private schools get an additional 19.79 points.

state

The data contains schools in 25 states. Although it is clear in the first graph that the average donation varies greatly across the states, the second graph shows that the number of records of data we have is too small to analyze this difference. Hence it wouldn’t be a good idea to include this variable in the final model.

plot1 <- alumni %>% group_by(state) %>% 
  summarise(avgDonation = mean(alumni_giving_rate)) %>% 
  ggplot(aes(state, avgDonation)) + geom_point() + 
  ggtitle("Distribution of averate donation across states") + 
    theme(plot.title = element_text(size = 8), axis.title = element_text(size = 8))

plot2 <- ggplot(alumni, aes(x = state, y = alumni_giving_rate, color = state)) +
  geom_point() + geom_smooth(method = "lm", se = FALSE, color = "red") +
  ggtitle("Donation rates grouped by state") + 
    theme(plot.title = element_text(size = 8), axis.title = element_text(size = 8),
          legend.position = "none")

ggarrange(plot1, plot2, nrow = 2, ncol = 1)

ranking

There is clearly a linear relationship between the rankings and the alumni donations. The higher ranked schools are more likely to get donations from the alumni.
This is further evident from the univariate model coefficients below.

plot1 <- alumni %>% arrange(by = ranking) %>% 
  ggplot(aes(x = 1:nrow(alumni), y = ranking)) + geom_point() + 
  ggtitle("The rankings of the 48 schools sorted in ascending order") + 
    theme(plot.title = element_text(size = 8), axis.title = element_text(size = 8))

plot2 <- ggplot(alumni, aes(ranking, alumni_giving_rate)) + 
  geom_point() + geom_smooth(method = "lm", se = FALSE, color = "red") +
  ggtitle("Scatter plot relating the two variables") + 
    theme(plot.title = element_text(size = 8), axis.title = element_text(size = 8))

ggarrange(plot1, plot2, nrow = 2, ncol = 1)

lm(alumni_giving_rate ~ ranking, alumni) %>% 
  summary() %>% .$coefficients

##               Estimate Std. Error   t value     Pr(>|t|)
## (Intercept) 43.1998639 2.90195144 14.886488 3.298133e-19
## ranking     -0.5422494 0.09647412 -5.620672 1.068157e-06

For every unit decrease in rank, the alumni donation increases on an average by 0.54

Modelling

Basic Model

Modelling with all the variables that were found to be influential on the response donation rate.

model1 <- lm(alumni_giving_rate ~ student_faculty_ratio + 
               percent_of_classes_under_20 + private + ranking, 
              data = alumni)  
model1 %>% summary()

## 
## Call:
## lm(formula = alumni_giving_rate ~ student_faculty_ratio + percent_of_classes_under_20 + 
##     private + ranking, data = alumni)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -14.7557  -5.5338  -0.5836   4.7865  22.3672 
## 
## Coefficients:
##                             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                  51.8087    14.5213   3.568   0.0009 ***
## student_faculty_ratio        -1.2424     0.4911  -2.530   0.0152 *  
## percent_of_classes_under_20  -0.1100     0.1881  -0.585   0.5618    
## private1                      6.7788     5.1073   1.327   0.1914    
## ranking                      -0.2619     0.1119  -2.341   0.0240 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.632 on 43 degrees of freedom
## Multiple R-squared:  0.6227, Adjusted R-squared:  0.5876 
## F-statistic: 17.74 on 4 and 43 DF,  p-value: 1.138e-08

From the p-values of the coefficients, we observe that percent_of_classes_under_20 and private variables do not have a significant influence on the model. We observed the contrary during the univariate analyses. This can probably be explained by multicollinearity between the two variables. The differnece between the \(R^2\) and \(adjusted \: R^2\) also indicate the same reason. The \(MSE\) is observed to be 69.65.

vif(model1)

##       student_faculty_ratio percent_of_classes_under_20 
##                    3.580330                    3.886828 
##                     private                     ranking 
##                    3.610477                    1.976081

But, on checking the Variance Inflation Factor, we do not observe any values above 10, disproving the multicollinearity theory. Hence, the two variables are ignored from the model for now and will be explored later.

Variable Selection

With the two variables removed, the summary of the new model is

model2 <- lm(alumni_giving_rate ~ student_faculty_ratio + ranking,
             data = alumni)  
summary(model2)

## 
## Call:
## lm(formula = alumni_giving_rate ~ student_faculty_ratio + ranking, 
##     data = alumni)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -12.9086  -6.1546   0.2597   4.4597  21.2957 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           53.74558    3.24880  16.543  < 2e-16 ***
## student_faculty_ratio -1.55767    0.32534  -4.788 1.86e-05 ***
## ranking               -0.25291    0.09978  -2.535   0.0148 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.609 on 45 degrees of freedom
## Multiple R-squared:  0.6072, Adjusted R-squared:  0.5898 
## F-statistic: 34.79 on 2 and 45 DF,  p-value: 7.38e-10

Only significant variables remain in the model but we do not see improvement in the \(adjusted \: R^2\) at 0.59 and \(MSE\) at 72.5.

Residual Diagnostics

Checking the fit of the model with the following residual analysis plots.

model2Augment <- augment(model2) %>% mutate(row_num = 1:n())

# Plot of residuals against fitted values (non-constant variance and non-linearity)
p1 <- ggplot(model2Augment, aes(x = .fitted, y = .std.resid)) +
  geom_point(alpha = 0.3) +
  geom_hline(yintercept = 0, linetype = "dashed", color = "red2") +
  geom_hline(yintercept = c(-2, 2), linetype = "dotted") +
  xlab("Fitted value") +
  ylab("Standardized residuals") + 
  ggtitle("Non-constant variance & non-linearity test\nFitted values - equally spread around 0") + 
    theme(plot.title = element_text(size = 8), axis.title = element_text(size = 8))

# Plot of residuals against predictor variable (checking non-linearity).
p2 <- ggplot(model2Augment, aes(x = ranking, y = .std.resid)) +
  geom_point(alpha = 0.3) +
  geom_hline(yintercept = 0, linetype = "dashed", color = "red2") +
  geom_hline(yintercept = c(-2, 2), linetype = "dotted") +
  geom_smooth(color = "forestgreen", alpha = 0.1, se = FALSE) +
  ylab("Standardized residuals") + 
  ggtitle("Non-linearity test\nX follows a slight exponential curve") + 
    theme(plot.title = element_text(size = 8), axis.title = element_text(size = 8))

p3 <- ggplot(model2Augment, aes(x = student_faculty_ratio, y = .std.resid)) +
  geom_point(alpha = 0.3) +
  geom_hline(yintercept = 0, linetype = "dashed", color = "red2") +
  geom_hline(yintercept = c(-2, 2), linetype = "dotted") +
  geom_smooth(color = "forestgreen", alpha = 0.1, se = FALSE) +
  ylab("Standardized residuals") + 
  ggtitle("Non-linearity test\nX follows a curve") + 
    theme(plot.title = element_text(size = 8), axis.title = element_text(size = 8))

# Boxplot of residuals (outlying observations)
p4 <- ggplot(model2Augment, aes(y = .std.resid)) +
  geom_boxplot() +
  ggtitle("Outlying observations test\nOne outlier") + 
    theme(plot.title = element_text(size = 8), axis.title = element_text(size = 8))

# Normal probability Q-Q plot of residuals (non-normality).
p5 <- ggplot(model2Augment, aes(sample = .std.resid)) +
  geom_qq(alpha = 0.3) +
  geom_qq_line(linetype = "dashed", color = "red2") +
  xlab("Theoretical quantile") +
  ylab("Sample quantile") +
  ggtitle("Non-normality test\nThe residuals almost follow a normal distribution") + 
    theme(plot.title = element_text(size = 8), axis.title = element_text(size = 8))

# Plot of residuals against time or another sequence (non-independence)
p6 <- ggplot(model2Augment, aes(x = row_num, y = .std.resid)) +
  geom_point(alpha = 0.3) +
  geom_line() +
  geom_hline(yintercept = 0, linetype = "dashed", color = "red2") +
  xlab("Index") +
  ylab("Standardized residual") +
  ggtitle("Non-independance test\nNo residual pattern across row numbers") + 
    theme(plot.title = element_text(size = 8), axis.title = element_text(size = 8))

ggarrange(p1, p2, p3, p4, p5, p6, nrow = 2, ncol = 3)

The exponential curve on the ranking variable can be improved by applying a log transformation.

Log Transformation

The new model after the transformation.

# Transforming ranking
alumni$ranking2 <- log(alumni$ranking)

model4 <- lm(alumni_giving_rate ~ student_faculty_ratio + ranking2,
             data = alumni)  

summary(model4)

## 
## Call:
## lm(formula = alumni_giving_rate ~ student_faculty_ratio + ranking2, 
##     data = alumni)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -14.3976  -5.7894   0.6323   4.2420  21.5566 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)            61.1314     3.9543  15.459  < 2e-16 ***
## student_faculty_ratio  -1.4228     0.3136  -4.538 4.21e-05 ***
## ranking2               -5.2561     1.5878  -3.310  0.00184 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.253 on 45 degrees of freedom
## Multiple R-squared:  0.639,  Adjusted R-squared:  0.623 
## F-statistic: 39.83 on 2 and 45 DF,  p-value: 1.103e-10

It’s observed that the \(adjusted \: R^2\) increases to 0.62 and \(MSE\) reduces to 66.63.

Residual Diagnostics

model4Augment <- augment(model4) %>% mutate(row_num = 1:n())

# Plot of residuals against fitted values (non-constant variance and non-linearity)
p1 <- ggplot(model4Augment, aes(x = .fitted, y = .std.resid)) +
  geom_point(alpha = 0.3) +
  geom_hline(yintercept = 0, linetype = "dashed", color = "red2") +
  geom_hline(yintercept = c(-2, 2), linetype = "dotted") +
  xlab("Fitted value") +
  ylab("Standardized residuals") + 
  ggtitle("Non-constant variance & non-linearity test\nFitted values - equally spread around 0") + 
    theme(plot.title = element_text(size = 8), axis.title = element_text(size = 8))

# Plot of residuals against predictor variable (checking non-linearity).
p2 <- ggplot(model4Augment, aes(x = ranking2, y = .std.resid)) +
  geom_point(alpha = 0.3) +
  geom_hline(yintercept = 0, linetype = "dashed", color = "red2") +
  geom_hline(yintercept = c(-2, 2), linetype = "dotted") +
  geom_smooth(color = "forestgreen", alpha = 0.1, se = FALSE) +
  ylab("Standardized residuals") + 
  ggtitle("Non-linearity test\nX not exponential anymore") + 
    theme(plot.title = element_text(size = 8), axis.title = element_text(size = 8))

p3 <- ggplot(model4Augment, aes(x = student_faculty_ratio, y = .std.resid)) +
  geom_point(alpha = 0.3) +
  geom_hline(yintercept = 0, linetype = "dashed", color = "red2") +
  geom_hline(yintercept = c(-2, 2), linetype = "dotted") +
  geom_smooth(color = "forestgreen", alpha = 0.1, se = FALSE) +
  ylab("Standardized residuals") + 
  ggtitle("Non-linearity test\nX follows a curve") + 
    theme(plot.title = element_text(size = 8), axis.title = element_text(size = 8))

# Boxplot of residuals (outlying observations)
p4 <- ggplot(model4Augment, aes(y = .std.resid)) +
  geom_boxplot() +
  ggtitle("Outlying observations test\nOne outlier") + 
    theme(plot.title = element_text(size = 8), axis.title = element_text(size = 8))

# Normal probability Q-Q plot of residuals (non-normality).
p5 <- ggplot(model4Augment, aes(sample = .std.resid)) +
  geom_qq(alpha = 0.3) +
  geom_qq_line(linetype = "dashed", color = "red2") +
  xlab("Theoretical quantile") +
  ylab("Sample quantile") +
  ggtitle("Non-normality test\nThe residuals almost follow a normal distribution") + 
    theme(plot.title = element_text(size = 8), axis.title = element_text(size = 8))

# Plot of residuals against time or another sequence (non-independence)
p6 <- ggplot(model4Augment, aes(x = row_num, y = .std.resid)) +
  geom_point(alpha = 0.3) +
  geom_line() +
  geom_hline(yintercept = 0, linetype = "dashed", color = "red2") +
  xlab("Index") +
  ylab("Standardized residual") +
  ggtitle("Non-independance test\nNo residual pattern across row numbers") + 
    theme(plot.title = element_text(size = 8), axis.title = element_text(size = 8))

ggarrange(p1, p2, p3, p4, p5, p6, nrow = 2, ncol = 3)

Box-Cox Transformation

The model can be further improved by applying a Box-Cox transformation on the response vairable. The new model is presented below.

# Calculating the lambda for the transformation
bc <- MASS::boxcox(model2, plotit = FALSE)
lambda <- bc$x[which.max(bc$y)]

# Applying the transformation
alumni$alumni_giving_rate2 <- (alumni$alumni_giving_rate ^ lambda - 1) / lambda

model3 <- lm(alumni_giving_rate2 ~ student_faculty_ratio + ranking2,
             data = alumni)  

summary(model3)

## 
## Call:
## lm(formula = alumni_giving_rate2 ~ student_faculty_ratio + ranking2, 
##     data = alumni)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.5503 -1.1945  0.1262  0.8061  3.9072 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           14.44137    0.75593  19.104  < 2e-16 ***
## student_faculty_ratio -0.31443    0.05994  -5.246 4.04e-06 ***
## ranking2              -0.78227    0.30354  -2.577   0.0133 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.578 on 45 degrees of freedom
## Multiple R-squared:  0.6426, Adjusted R-squared:  0.6267 
## F-statistic: 40.45 on 2 and 45 DF,  p-value: 8.83e-11

The \(adjusted \: R^2\) increases further to 0.63 and \(MSE\) reduces significantly to 2.43.

Residual Diagnostics

Applying the same residual diagnostics on the new model.

model3Augment <- augment(model3) %>% mutate(row_num = 1:n())

# Plot of residuals against fitted values (non-constant variance and non-linearity)
p1 <- ggplot(model3Augment, aes(x = .fitted, y = .std.resid)) +
  geom_point(alpha = 0.3) +
  geom_hline(yintercept = 0, linetype = "dashed", color = "red2") +
  geom_hline(yintercept = c(-2, 2), linetype = "dotted") +
  xlab("Fitted value") +
  ylab("Standardized residuals") + 
  ggtitle("Non-constant variance & non-linearity test\nFitted values - equally spread around 0") + 
    theme(plot.title = element_text(size = 8), axis.title = element_text(size = 8))

# Plot of residuals against predictor variable (checking non-linearity).
p2 <- ggplot(model3Augment, aes(x = ranking2, y = .std.resid)) +
  geom_point(alpha = 0.3) +
  geom_hline(yintercept = 0, linetype = "dashed", color = "red2") +
  geom_hline(yintercept = c(-2, 2), linetype = "dotted") +
  geom_smooth(color = "forestgreen", alpha = 0.1, se = FALSE) +
  ylab("Standardized residuals") + 
  ggtitle("Non-linearity test\nX follows a slight exponential curve, but improved") + 
    theme(plot.title = element_text(size = 8), axis.title = element_text(size = 8))

p3 <- ggplot(model3Augment, aes(x = student_faculty_ratio, y = .std.resid)) +
  geom_point(alpha = 0.3) +
  geom_hline(yintercept = 0, linetype = "dashed", color = "red2") +
  geom_hline(yintercept = c(-2, 2), linetype = "dotted") +
  geom_smooth(color = "forestgreen", alpha = 0.1, se = FALSE) +
  ylab("Standardized residuals") + 
  ggtitle("Non-linearity test\nX still follows a curve, but improved") + 
    theme(plot.title = element_text(size = 8), axis.title = element_text(size = 8))

# Boxplot of residuals (outlying observations)
p4 <- ggplot(model3Augment, aes(y = .std.resid)) +
  geom_boxplot() +
  ggtitle("Outlying observations test\nOne outlier") + 
    theme(plot.title = element_text(size = 8), axis.title = element_text(size = 8))

# Normal probability Q-Q plot of residuals (non-normality).
p5 <- ggplot(model3Augment, aes(sample = .std.resid)) +
  geom_qq(alpha = 0.3) +
  geom_qq_line(linetype = "dashed", color = "red2") +
  xlab("Theoretical quantile") +
  ylab("Sample quantile") +
  ggtitle("Non-normality test\nThe residuals follow normal distribution much better") + 
    theme(plot.title = element_text(size = 8), axis.title = element_text(size = 8))

# Plot of residuals against time or another sequence (non-independence)
p6 <- ggplot(model3Augment, aes(x = row_num, y = .std.resid)) +
  geom_point(alpha = 0.3) +
  geom_line() +
  geom_hline(yintercept = 0, linetype = "dashed", color = "red2") +
  xlab("Index") +
  ylab("Standardized residual") +
  ggtitle("Non-independance test\nNo residual pattern across row numbers") + 
    theme(plot.title = element_text(size = 8), axis.title = element_text(size = 8))

ggarrange(p1, p2, p3, p4, p5, p6, nrow = 2, ncol = 3)

Variable Selection 2

Insert variable `private`

Going back to the variabels that were ignored, private is fit into the final model and found to improve.

model5 <- lm(alumni_giving_rate2 ~ student_faculty_ratio + ranking2 + private,
             data = alumni)  
summary(model5)

## 
## Call:
## lm(formula = alumni_giving_rate2 ~ student_faculty_ratio + ranking2 + 
##     private, data = alumni)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.2235 -1.2151  0.0255  0.8514  3.1533 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           11.68523    1.47088   7.944 4.85e-10 ***
## student_faculty_ratio -0.17379    0.08707  -1.996  0.05214 .  
## ranking2              -0.80677    0.29216  -2.761  0.00836 ** 
## private1               1.75265    0.81309   2.156  0.03663 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.517 on 44 degrees of freedom
## Multiple R-squared:  0.6767, Adjusted R-squared:  0.6547 
## F-statistic:  30.7 on 3 and 44 DF,  p-value: 7.259e-11

The \(adjusted \: R^2\) increases again to 0.65 and \(MSE\) reduces further to 2.2.

But the significance of the student_faculty_ratio variable dips. The increase in \(adjusted \: R^2\) and \(MSE\) is not high enough to justify accepting the new model.

Insert variable `percent_of_classes_under_20`

This variable however does the opposite and decreases \(adjusted \: R^2\) and \(MSE\).

model6 <- lm(alumni_giving_rate2 ~ student_faculty_ratio + ranking2 +
               percent_of_classes_under_20,
             data = alumni)  
summary(model6)

## 
## Call:
## lm(formula = alumni_giving_rate2 ~ student_faculty_ratio + ranking2 + 
##     percent_of_classes_under_20, data = alumni)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.5751 -1.1999  0.1094  0.8352  3.9101 
## 
## Coefficients:
##                              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                 13.739852   2.666998   5.152 5.83e-06 ***
## student_faculty_ratio       -0.300137   0.079863  -3.758 0.000501 ***
## ranking2                    -0.755347   0.321999  -2.346 0.023557 *  
## percent_of_classes_under_20  0.008209   0.029902   0.275 0.784957    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.594 on 44 degrees of freedom
## Multiple R-squared:  0.6432, Adjusted R-squared:  0.6189 
## F-statistic: 26.44 on 3 and 44 DF,  p-value: 6.213e-10

Summary

The final model shows that the alumni_giving_rate depends on the ranking and the student_faculty_ratio variables.

\[alumniGivingRate^.505 - 1.98 = 14.636 - (0.319*studentFacultyRatio) - (0.798*log(ranking))\]

The \(adjusted \: R^2\) explains 62.67 of the alumni_giving_rate with a low mean-squared error of 2.43.
The negative relationship of the two predictor variables can be explained by the negative linear relationship that was observed earlier.

Alumni Donation Rate - A Case Study in Linear Regression

Alumni Donation Rate - A Case Study in Linear Regression

Introduction

Required Packages

Variable Descriptions

Univariate Analyses

percent_of_classes_under_20

student_faculty_ratio

private

state

ranking

Modelling

Basic Model

Variable Selection

Residual Diagnostics

Log Transformation

Residual Diagnostics

Box-Cox Transformation

Residual Diagnostics

Variable Selection 2

Insert variable private

Insert variable percent_of_classes_under_20

Summary

Insert variable `private`

Insert variable `percent_of_classes_under_20`