Alumni Donation Rate - A Case Study in Linear Regression
Introduction
Alumni donations are an important source of revenue for colleges and universities. Studying the effects of different variables on the donation rate could help predict a university’s revenue for a particular year. Understanding the factors that influence the percentage of alumni who make a donation could also help administrators implement policies that lead to increased revenue.
For example, research shows that students who are more satisfied with their contact with teachers are more likely to graduate and give back to their alma mater. Another factor could be whether or not the university is public. The national ranking of the university could also influence alumni to donate; Ivy League colleges have always had a high alumni donation rate. The state in which a school is located could affect the revenue as well.
This report quantifies the effect of the above-mentioned variables on the alumni donation rate. After exploratory analyses, a linear regression model is built and the flaws in the model are remedied with model diagnostic measures.
Required Packages
library(knitr)
library(tidyverse)
library(ggpubr)
library(broom)
library(DT)
library(car)
opts_chunk$set(message = FALSE, warning = FALSE)
The data analyzed in this report contains donation records for 48 national universities and comes from “America’s Best Colleges, Year 2000 Edition”. The data is appended with the state each school is located in and its US News & World Report ranking.
url <- "https://bgreenwell.github.io/uc-bana7052/data/alumni.csv"
alumni <- read.csv(url)
alumni$private <- as.factor(alumni$private)
alumni$state <- c("MA", "MA", "RI", "CA", "PA", "OH", "VA", "NY", "NY", "NH",
"NC", "GA", "DC", "MA", "MD", "PA", "MA", "NY", "IL", "PA",
"NJ", "TX", "CA", "MA", "LA", "CA", "CA", "CA", "CA", "CA",
"CA", "IL", "FL", "IL", "MI", "NC", "IN", "PA", "MN", "CA",
"TX", "VA", "WA", "WI", "TN", "NC", "MO", "CT")
alumni$state <- as.factor(alumni$state)
alumni$ranking <- c(38, 35, 14, 12, 25, 42, 38, 3, 16, 12,
8, 21, 22, 2, 10, 53, 3, 30, 10, 59,
1, 16, 27, 44, 22, 42, 38, 33, 19, 41,
30, 3, 35, 46, 27, 30, 18, 8, 33, 22,
49, 25, 59, 49, 14, 27, 19, 3)
Variable Descriptions
varDesc = c("Name of the school",
"The percentage of classes with fewer than 20 students",
"Ratio of students and faculty",
"Percentage of alumni that made a donation to the university",
"Whether or not the school is private",
"State the school is located in",
"US World News ranking for 2018")
colnames(alumni) %>%
as.data.frame() %>% select( "Variables" = 1) %>%
bind_cols(Description = varDesc) %>% kable()
| Variables | Description |
|---|---|
| school | Name of the school |
| percent_of_classes_under_20 | The percentage of classes with fewer than 20 students |
| student_faculty_ratio | Ratio of students to faculty |
| alumni_giving_rate | Percentage of alumni that made a donation to the university |
| private | Whether or not the school is private |
| state | State the school is located in |
| ranking | US News & World Report ranking for 2018 |
Univariate Analyses
The response variable of the model is alumni_giving_rate. Below is a summary of all the predictor variables.
percent_of_classes_under_20
The percentage of classes with fewer than 20 students ranges from 29% to 77%, averaging 55.73% across all the schools.
plot1 <- ggplot(alumni, aes(percent_of_classes_under_20)) + geom_histogram() +
ggtitle("Histogram of the variable") +
theme(plot.title = element_text(size = 8), axis.title = element_text(size = 8))
plot2 <- ggplot(alumni, aes(y = percent_of_classes_under_20)) + geom_boxplot() +
ggtitle("Boxplot to look for outliers") +
theme(plot.title = element_text(size = 8), axis.title = element_text(size = 8))
plot3 <- ggplot(alumni, aes(percent_of_classes_under_20, alumni_giving_rate)) +
geom_point() + geom_smooth(method = "lm", se = FALSE, color = "red") +
ggtitle("Scatter plot relating the two variables") +
theme(plot.title = element_text(size = 8), axis.title = element_text(size = 8))
ggarrange(plot1, plot2, plot3, nrow = 1, ncol = 3)
The univariate regression model for this variable gives the following coefficients. With such a low p-value, we can expect this variable to influence the alumni giving rate.
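A minimal sketch of the univariate fit that produces the coefficient table (the original chunk is not displayed):
# Univariate fit: giving rate vs. percentage of small classes
lm(alumni_giving_rate ~ percent_of_classes_under_20, data = alumni) %>%
  summary() %>%
  coef()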
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -7.3860676 6.5654723 -1.124986 2.664307e-01
## percent_of_classes_under_20 0.6577687 0.1147048 5.734448 7.228121e-07
The donation rate increases by 0.66 points for every one-point increase in the percentage of classes with fewer than 20 students.
student_faculty_ratio
The student-faculty ratio varies from 3 students per faculty member to 23, with an average of 11.54 students per faculty member.
plot1 <- ggplot(alumni, aes(student_faculty_ratio)) + geom_histogram() +
ggtitle("Histogram of student vs faculty ratio") +
theme(plot.title = element_text(size = 8), axis.title = element_text(size = 8))
plot2 <- ggplot(alumni, aes(y = student_faculty_ratio)) + geom_boxplot() +
ggtitle("Boxplot to look for outliers") +
theme(plot.title = element_text(size = 8), axis.title = element_text(size = 8))
plot3 <- ggplot(alumni, aes(student_faculty_ratio, alumni_giving_rate)) +
geom_point() + geom_smooth(method = "lm", se = FALSE, color = "red") +
ggtitle("Scatter plot relating the two variables") +
theme(plot.title = element_text(size = 8), axis.title = element_text(size = 8))
ggarrange(plot1, plot2, plot3, nrow = 1, ncol = 3)
The results of a univariate model are shown below. Again, we can expect this variable to strongly influence the response.
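A sketch of the corresponding univariate fit:
# Univariate fit: giving rate vs. student-faculty ratio
lm(alumni_giving_rate ~ student_faculty_ratio, data = alumni) %>%
  summary() %>%
  coef()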
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 53.013827 3.421450 15.494548 7.058813e-20
## student_faculty_ratio -2.057155 0.273716 -7.515653 1.544232e-09
The donation rate decreases by 2.06 points for every unit increase in the student-faculty ratio.
private
There are 33 private schools and 15 public schools in the dataset.
As seen in the graph below and in the coefficients of the regression model, whether or not a school is private clearly affects the donation rate it receives from its alumni.
plot1 <- ggplot(alumni, aes(private)) + geom_bar() +
ggtitle("Barchart showing the count of private schools") +
theme(plot.title = element_text(size = 8), axis.title = element_text(size = 8))
plot2 <- ggplot(alumni, aes(x = alumni_giving_rate, y = student_faculty_ratio,
group = private, color = private)) +
geom_point() + geom_smooth(method = "lm", se = FALSE, color = "red") +
ggtitle("The public and private schools have differing donation rates") +
theme(plot.title = element_text(size = 8), axis.title = element_text(size = 8))
ggarrange(plot1, plot2, nrow = 1, ncol = 2)
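The coefficients below come from a univariate fit on the private indicator; a minimal sketch:
# Univariate fit: giving rate vs. private/public indicator
lm(alumni_giving_rate ~ private, data = alumni) %>%
  summary() %>%
  coef()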
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 15.66667 2.540545 6.166655 1.628283e-07
## private1 19.78788 3.064013 6.458158 5.938532e-08
The summary above means that public schools receive an average giving rate of 15.67 points, while private schools average an additional 19.79 points.
state
The data contains schools from 25 states. Although the first graph makes it clear that the average donation varies greatly across states, the second graph shows that there are too few schools per state to analyze this difference reliably. Hence it would not be a good idea to include this variable in the final model.
plot1 <- alumni %>% group_by(state) %>%
summarise(avgDonation = mean(alumni_giving_rate)) %>%
ggplot(aes(state, avgDonation)) + geom_point() +
ggtitle("Distribution of averate donation across states") +
theme(plot.title = element_text(size = 8), axis.title = element_text(size = 8))
plot2 <- ggplot(alumni, aes(x = state, y = alumni_giving_rate, color = state)) +
geom_point() + geom_smooth(method = "lm", se = FALSE, color = "red") +
ggtitle("Donation rates grouped by state") +
theme(plot.title = element_text(size = 8), axis.title = element_text(size = 8),
legend.position = "none")
ggarrange(plot1, plot2, nrow = 2, ncol = 1)
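A quick tally of schools per state (a sketch) illustrates the sparsity: with 48 schools spread across 25 states, most states contribute only one or two observations.
# Number of schools per state, most frequent first
alumni %>% count(state, sort = TRUE)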
ranking
There is clearly a linear relationship between the rankings and the alumni donations: the higher-ranked schools are more likely to receive donations from their alumni.
This is further evident from the univariate model coefficients below.
plot1 <- alumni %>% arrange(by = ranking) %>%
ggplot(aes(x = 1:nrow(alumni), y = ranking)) + geom_point() +
ggtitle("The rankings of the 48 schools sorted in ascending order") +
theme(plot.title = element_text(size = 8), axis.title = element_text(size = 8))
plot2 <- ggplot(alumni, aes(ranking, alumni_giving_rate)) +
geom_point() + geom_smooth(method = "lm", se = FALSE, color = "red") +
ggtitle("Scatter plot relating the two variables") +
theme(plot.title = element_text(size = 8), axis.title = element_text(size = 8))
ggarrange(plot1, plot2, nrow = 2, ncol = 1)
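A sketch of the univariate fit behind these coefficients:
# Univariate fit: giving rate vs. ranking
lm(alumni_giving_rate ~ ranking, data = alumni) %>%
  summary() %>%
  coef()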
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 43.1998639 2.90195144 14.886488 3.298133e-19
## ranking -0.5422494 0.09647412 -5.620672 1.068157e-06
For every one-position improvement (decrease) in rank, the alumni donation rate increases on average by 0.54 points.
Modelling
Basic Model
We begin by modelling with all the variables that were found to be influential on the donation rate.
model1 <- lm(alumni_giving_rate ~ student_faculty_ratio +
percent_of_classes_under_20 + private + ranking,
data = alumni)
model1 %>% summary()
##
## Call:
## lm(formula = alumni_giving_rate ~ student_faculty_ratio + percent_of_classes_under_20 +
## private + ranking, data = alumni)
##
## Residuals:
## Min 1Q Median 3Q Max
## -14.7557 -5.5338 -0.5836 4.7865 22.3672
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 51.8087 14.5213 3.568 0.0009 ***
## student_faculty_ratio -1.2424 0.4911 -2.530 0.0152 *
## percent_of_classes_under_20 -0.1100 0.1881 -0.585 0.5618
## private1 6.7788 5.1073 1.327 0.1914
## ranking -0.2619 0.1119 -2.341 0.0240 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.632 on 43 degrees of freedom
## Multiple R-squared: 0.6227, Adjusted R-squared: 0.5876
## F-statistic: 17.74 on 4 and 43 DF, p-value: 1.138e-08
From the p-values of the coefficients, we observe that the percent_of_classes_under_20 and private variables do not have a significant influence in the model, contrary to what we observed during the univariate analyses. This could be explained by multicollinearity between the variables. The difference between the \(R^2\) and \(adjusted \: R^2\) also points in the same direction. The \(MSE\) is observed to be 69.65.
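To check the multicollinearity suspicion, the variance inflation factors below are presumably computed with car::vif() on the full model; a minimal sketch:
# Variance inflation factors for the full model
vif(model1)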
## student_faculty_ratio percent_of_classes_under_20
## 3.580330 3.886828
## private ranking
## 3.610477 1.976081
However, on checking the Variance Inflation Factors, we do not observe any values above 10, which weakens the multicollinearity theory. Hence, the two variables are dropped from the model for now and revisited later.
Variable Selection
With the two variables removed, the summary of the new model is
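The reduced fit is reconstructed below from the Call shown in the output; the object name model2 is assumed from its use in the later diagnostics and Box-Cox chunks.
model2 <- lm(alumni_giving_rate ~ student_faculty_ratio + ranking,
             data = alumni)
summary(model2)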
##
## Call:
## lm(formula = alumni_giving_rate ~ student_faculty_ratio + ranking,
## data = alumni)
##
## Residuals:
## Min 1Q Median 3Q Max
## -12.9086 -6.1546 0.2597 4.4597 21.2957
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 53.74558 3.24880 16.543 < 2e-16 ***
## student_faculty_ratio -1.55767 0.32534 -4.788 1.86e-05 ***
## ranking -0.25291 0.09978 -2.535 0.0148 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.609 on 45 degrees of freedom
## Multiple R-squared: 0.6072, Adjusted R-squared: 0.5898
## F-statistic: 34.79 on 2 and 45 DF, p-value: 7.38e-10
Only significant variables remain in the model, but we do not see an improvement: the \(adjusted \: R^2\) stays at 0.59 and the \(MSE\) comes to 72.5.
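The \(MSE\) figures quoted in this report are consistent with computing SSE/(n - 2) from the model residuals; a sketch of that calculation (an assumption, since the original chunk is not shown):
# Assumed MSE calculation: SSE / (n - 2); gives roughly 72.5 for model2
sum(residuals(model2)^2) / (nrow(alumni) - 2)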
Residual Diagnostics
Checking the fit of the model with the following residual analysis plots.
model2Augment <- augment(model2) %>% mutate(row_num = 1:n())
# Plot of residuals against fitted values (non-constant variance and non-linearity)
p1 <- ggplot(model2Augment, aes(x = .fitted, y = .std.resid)) +
geom_point(alpha = 0.3) +
geom_hline(yintercept = 0, linetype = "dashed", color = "red2") +
geom_hline(yintercept = c(-2, 2), linetype = "dotted") +
xlab("Fitted value") +
ylab("Standardized residuals") +
ggtitle("Non-constant variance & non-linearity test\nFitted values - equally spread around 0") +
theme(plot.title = element_text(size = 8), axis.title = element_text(size = 8))
# Plot of residuals against predictor variable (checking non-linearity).
p2 <- ggplot(model2Augment, aes(x = ranking, y = .std.resid)) +
geom_point(alpha = 0.3) +
geom_hline(yintercept = 0, linetype = "dashed", color = "red2") +
geom_hline(yintercept = c(-2, 2), linetype = "dotted") +
geom_smooth(color = "forestgreen", alpha = 0.1, se = FALSE) +
ylab("Standardized residuals") +
ggtitle("Non-linearity test\nX follows a slight exponential curve") +
theme(plot.title = element_text(size = 8), axis.title = element_text(size = 8))
p3 <- ggplot(model2Augment, aes(x = student_faculty_ratio, y = .std.resid)) +
geom_point(alpha = 0.3) +
geom_hline(yintercept = 0, linetype = "dashed", color = "red2") +
geom_hline(yintercept = c(-2, 2), linetype = "dotted") +
geom_smooth(color = "forestgreen", alpha = 0.1, se = FALSE) +
ylab("Standardized residuals") +
ggtitle("Non-linearity test\nX follows a curve") +
theme(plot.title = element_text(size = 8), axis.title = element_text(size = 8))
# Boxplot of residuals (outlying observations)
p4 <- ggplot(model2Augment, aes(y = .std.resid)) +
geom_boxplot() +
ggtitle("Outlying observations test\nOne outlier") +
theme(plot.title = element_text(size = 8), axis.title = element_text(size = 8))
# Normal probability Q-Q plot of residuals (non-normality).
p5 <- ggplot(model2Augment, aes(sample = .std.resid)) +
geom_qq(alpha = 0.3) +
geom_qq_line(linetype = "dashed", color = "red2") +
xlab("Theoretical quantile") +
ylab("Sample quantile") +
ggtitle("Non-normality test\nThe residuals almost follow a normal distribution") +
theme(plot.title = element_text(size = 8), axis.title = element_text(size = 8))
# Plot of residuals against time or another sequence (non-independence)
p6 <- ggplot(model2Augment, aes(x = row_num, y = .std.resid)) +
geom_point(alpha = 0.3) +
geom_line() +
geom_hline(yintercept = 0, linetype = "dashed", color = "red2") +
xlab("Index") +
ylab("Standardized residual") +
ggtitle("Non-independance test\nNo residual pattern across row numbers") +
theme(plot.title = element_text(size = 8), axis.title = element_text(size = 8))
ggarrange(p1, p2, p3, p4, p5, p6, nrow = 2, ncol = 3)
The slight exponential curve in the residuals against the ranking variable can be addressed by applying a log transformation.
Log Transformation
The new model after the transformation.
# Transforming ranking
alumni$ranking2 <- log(alumni$ranking)
model4 <- lm(alumni_giving_rate ~ student_faculty_ratio + ranking2,
data = alumni)
summary(model4)
##
## Call:
## lm(formula = alumni_giving_rate ~ student_faculty_ratio + ranking2,
## data = alumni)
##
## Residuals:
## Min 1Q Median 3Q Max
## -14.3976 -5.7894 0.6323 4.2420 21.5566
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 61.1314 3.9543 15.459 < 2e-16 ***
## student_faculty_ratio -1.4228 0.3136 -4.538 4.21e-05 ***
## ranking2 -5.2561 1.5878 -3.310 0.00184 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.253 on 45 degrees of freedom
## Multiple R-squared: 0.639, Adjusted R-squared: 0.623
## F-statistic: 39.83 on 2 and 45 DF, p-value: 1.103e-10
It’s observed that the \(adjusted \: R^2\) increases to 0.62 and \(MSE\) reduces to 66.63.
Residual Diagnostics
model4Augment <- augment(model4) %>% mutate(row_num = 1:n())
# Plot of residuals against fitted values (non-constant variance and non-linearity)
p1 <- ggplot(model4Augment, aes(x = .fitted, y = .std.resid)) +
geom_point(alpha = 0.3) +
geom_hline(yintercept = 0, linetype = "dashed", color = "red2") +
geom_hline(yintercept = c(-2, 2), linetype = "dotted") +
xlab("Fitted value") +
ylab("Standardized residuals") +
ggtitle("Non-constant variance & non-linearity test\nFitted values - equally spread around 0") +
theme(plot.title = element_text(size = 8), axis.title = element_text(size = 8))
# Plot of residuals against predictor variable (checking non-linearity).
p2 <- ggplot(model4Augment, aes(x = ranking2, y = .std.resid)) +
geom_point(alpha = 0.3) +
geom_hline(yintercept = 0, linetype = "dashed", color = "red2") +
geom_hline(yintercept = c(-2, 2), linetype = "dotted") +
geom_smooth(color = "forestgreen", alpha = 0.1, se = FALSE) +
ylab("Standardized residuals") +
ggtitle("Non-linearity test\nX not exponential anymore") +
theme(plot.title = element_text(size = 8), axis.title = element_text(size = 8))
p3 <- ggplot(model4Augment, aes(x = student_faculty_ratio, y = .std.resid)) +
geom_point(alpha = 0.3) +
geom_hline(yintercept = 0, linetype = "dashed", color = "red2") +
geom_hline(yintercept = c(-2, 2), linetype = "dotted") +
geom_smooth(color = "forestgreen", alpha = 0.1, se = FALSE) +
ylab("Standardized residuals") +
ggtitle("Non-linearity test\nX follows a curve") +
theme(plot.title = element_text(size = 8), axis.title = element_text(size = 8))
# Boxplot of residuals (outlying observations)
p4 <- ggplot(model4Augment, aes(y = .std.resid)) +
geom_boxplot() +
ggtitle("Outlying observations test\nOne outlier") +
theme(plot.title = element_text(size = 8), axis.title = element_text(size = 8))
# Normal probability Q-Q plot of residuals (non-normality).
p5 <- ggplot(model4Augment, aes(sample = .std.resid)) +
geom_qq(alpha = 0.3) +
geom_qq_line(linetype = "dashed", color = "red2") +
xlab("Theoretical quantile") +
ylab("Sample quantile") +
ggtitle("Non-normality test\nThe residuals almost follow a normal distribution") +
theme(plot.title = element_text(size = 8), axis.title = element_text(size = 8))
# Plot of residuals against time or another sequence (non-independence)
p6 <- ggplot(model4Augment, aes(x = row_num, y = .std.resid)) +
geom_point(alpha = 0.3) +
geom_line() +
geom_hline(yintercept = 0, linetype = "dashed", color = "red2") +
xlab("Index") +
ylab("Standardized residual") +
ggtitle("Non-independance test\nNo residual pattern across row numbers") +
theme(plot.title = element_text(size = 8), axis.title = element_text(size = 8))
ggarrange(p1, p2, p3, p4, p5, p6, nrow = 2, ncol = 3)
Box-Cox Transformation
The model can be further improved by applying a Box-Cox transformation on the response variable. The new model is presented below.
# Calculating the lambda for the transformation
bc <- MASS::boxcox(model2, plotit = FALSE)
lambda <- bc$x[which.max(bc$y)]
# Applying the transformation
alumni$alumni_giving_rate2 <- (alumni$alumni_giving_rate ^ lambda - 1) / lambda
model3 <- lm(alumni_giving_rate2 ~ student_faculty_ratio + ranking2,
data = alumni)
summary(model3)
##
## Call:
## lm(formula = alumni_giving_rate2 ~ student_faculty_ratio + ranking2,
## data = alumni)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.5503 -1.1945 0.1262 0.8061 3.9072
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 14.44137 0.75593 19.104 < 2e-16 ***
## student_faculty_ratio -0.31443 0.05994 -5.246 4.04e-06 ***
## ranking2 -0.78227 0.30354 -2.577 0.0133 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.578 on 45 degrees of freedom
## Multiple R-squared: 0.6426, Adjusted R-squared: 0.6267
## F-statistic: 40.45 on 2 and 45 DF, p-value: 8.83e-11
The \(adjusted \: R^2\) increases further to 0.63, and the \(MSE\) drops to 2.43. Note that the response is now on the transformed scale, so this \(MSE\) is not directly comparable to the earlier values.
Residual Diagnostics
Applying the same residual diagnostics on the new model.
model3Augment <- augment(model3) %>% mutate(row_num = 1:n())
# Plot of residuals against fitted values (non-constant variance and non-linearity)
p1 <- ggplot(model3Augment, aes(x = .fitted, y = .std.resid)) +
geom_point(alpha = 0.3) +
geom_hline(yintercept = 0, linetype = "dashed", color = "red2") +
geom_hline(yintercept = c(-2, 2), linetype = "dotted") +
xlab("Fitted value") +
ylab("Standardized residuals") +
ggtitle("Non-constant variance & non-linearity test\nFitted values - equally spread around 0") +
theme(plot.title = element_text(size = 8), axis.title = element_text(size = 8))
# Plot of residuals against predictor variable (checking non-linearity).
p2 <- ggplot(model3Augment, aes(x = ranking2, y = .std.resid)) +
geom_point(alpha = 0.3) +
geom_hline(yintercept = 0, linetype = "dashed", color = "red2") +
geom_hline(yintercept = c(-2, 2), linetype = "dotted") +
geom_smooth(color = "forestgreen", alpha = 0.1, se = FALSE) +
ylab("Standardized residuals") +
ggtitle("Non-linearity test\nX follows a slight exponential curve, but improved") +
theme(plot.title = element_text(size = 8), axis.title = element_text(size = 8))
p3 <- ggplot(model3Augment, aes(x = student_faculty_ratio, y = .std.resid)) +
geom_point(alpha = 0.3) +
geom_hline(yintercept = 0, linetype = "dashed", color = "red2") +
geom_hline(yintercept = c(-2, 2), linetype = "dotted") +
geom_smooth(color = "forestgreen", alpha = 0.1, se = FALSE) +
ylab("Standardized residuals") +
ggtitle("Non-linearity test\nX still follows a curve, but improved") +
theme(plot.title = element_text(size = 8), axis.title = element_text(size = 8))
# Boxplot of residuals (outlying observations)
p4 <- ggplot(model3Augment, aes(y = .std.resid)) +
geom_boxplot() +
ggtitle("Outlying observations test\nOne outlier") +
theme(plot.title = element_text(size = 8), axis.title = element_text(size = 8))
# Normal probability Q-Q plot of residuals (non-normality).
p5 <- ggplot(model3Augment, aes(sample = .std.resid)) +
geom_qq(alpha = 0.3) +
geom_qq_line(linetype = "dashed", color = "red2") +
xlab("Theoretical quantile") +
ylab("Sample quantile") +
ggtitle("Non-normality test\nThe residuals follow normal distribution much better") +
theme(plot.title = element_text(size = 8), axis.title = element_text(size = 8))
# Plot of residuals against time or another sequence (non-independence)
p6 <- ggplot(model3Augment, aes(x = row_num, y = .std.resid)) +
geom_point(alpha = 0.3) +
geom_line() +
geom_hline(yintercept = 0, linetype = "dashed", color = "red2") +
xlab("Index") +
ylab("Standardized residual") +
ggtitle("Non-independance test\nNo residual pattern across row numbers") +
theme(plot.title = element_text(size = 8), axis.title = element_text(size = 8))
ggarrange(p1, p2, p3, p4, p5, p6, nrow = 2, ncol = 3)
Variable Selection 2
Insert variable private
Going back to the variables that were ignored earlier, private is added to the final model to see whether it improves the fit.
model5 <- lm(alumni_giving_rate2 ~ student_faculty_ratio + ranking2 + private,
data = alumni)
summary(model5)
##
## Call:
## lm(formula = alumni_giving_rate2 ~ student_faculty_ratio + ranking2 +
## private, data = alumni)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.2235 -1.2151 0.0255 0.8514 3.1533
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.68523 1.47088 7.944 4.85e-10 ***
## student_faculty_ratio -0.17379 0.08707 -1.996 0.05214 .
## ranking2 -0.80677 0.29216 -2.761 0.00836 **
## private1 1.75265 0.81309 2.156 0.03663 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.517 on 44 degrees of freedom
## Multiple R-squared: 0.6767, Adjusted R-squared: 0.6547
## F-statistic: 30.7 on 3 and 44 DF, p-value: 7.259e-11
The \(adjusted \: R^2\) increases again to 0.65 and the \(MSE\) reduces further to 2.2. However, the significance of the student_faculty_ratio variable dips, and the gains in \(adjusted \: R^2\) and \(MSE\) are not large enough to justify accepting this model over the previous one.
Insert variable percent_of_classes_under_20
This variable, however, does the opposite and decreases the \(adjusted \: R^2\) without improving the \(MSE\).
model6 <- lm(alumni_giving_rate2 ~ student_faculty_ratio + ranking2 +
percent_of_classes_under_20,
data = alumni)
summary(model6)
##
## Call:
## lm(formula = alumni_giving_rate2 ~ student_faculty_ratio + ranking2 +
## percent_of_classes_under_20, data = alumni)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.5751 -1.1999 0.1094 0.8352 3.9101
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.739852 2.666998 5.152 5.83e-06 ***
## student_faculty_ratio -0.300137 0.079863 -3.758 0.000501 ***
## ranking2 -0.755347 0.321999 -2.346 0.023557 *
## percent_of_classes_under_20 0.008209 0.029902 0.275 0.784957
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.594 on 44 degrees of freedom
## Multiple R-squared: 0.6432, Adjusted R-squared: 0.6189
## F-statistic: 26.44 on 3 and 44 DF, p-value: 6.213e-10
Summary
The final model shows that the alumni_giving_rate depends on the ranking and student_faculty_ratio variables.
\[\frac{alumniGivingRate^{0.505} - 1}{0.505} = 14.441 - (0.314 \times studentFacultyRatio) - (0.782 \times \log(ranking))\]
The model explains 62.67% of the variation in alumni_giving_rate (\(adjusted \: R^2 = 0.6267\)), with a low mean squared error of 2.43 on the transformed scale.
The negative coefficients of the two predictors are consistent with the negative linear relationships that were observed in the univariate analyses.
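Because the response is Box-Cox transformed, predictions from the final model need to be inverted back to the original giving-rate scale. A minimal sketch, using a hypothetical school (the predictor values below are made up for illustration):
# Hypothetical school: 10 students per faculty member, ranked 20th
new_school <- data.frame(student_faculty_ratio = 10, ranking2 = log(20))
pred_bc <- predict(model3, newdata = new_school)
# Invert the Box-Cox transform: y = (lambda * y' + 1)^(1 / lambda)
(lambda * pred_bc + 1)^(1 / lambda)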