Generational Differences in Student Loan Debt and Implications on Homeownership Trends

MSDS Capstone 2024

Authors

Leann Kim

Kassandra Traieh

Introduction

      As current students, we have student loans at the forefront of our minds, looming over us as we juggle the stress of financial burdens and the need for education to grow and succeed in life. Societal pressures push us to graduate college, start our careers, buy a house, and start a family. But are Millennials and Gen Z-ers able to fall into place and follow in the footsteps of the generations before them? Or are they facing additional challenges with the outstanding debt they’re accumulating from student loans? This project aims to shed light on how differences in student loan debt between older and current generations influence homeownership trends.

      Historically, homeownership has been a cornerstone of wealth accumulation for American families, providing stability and financial security across generations (Association, 2024). However, in the aftermath of the Great Recession and amidst rising costs of higher education, younger generations are increasingly burdened by substantial student loan debt (Editors, 2024; Kopparam & Clemens, 2020; Martin & E., 2021). According to 2024 data, outstanding student loan debt in the United States has surpassed $1.75 trillion, which is a 67% increase from the previous decade (Governors of the Federal Reserve System, 2020). Research has shown that high levels of student debt can delay or reduce the likelihood of homeownership (Cooper & Wang, 2014; Gicheva & Thompson, 2015; Mezza et al., 2020). By comparing older generations (Baby Boomers and Gen X) with younger generations (Millennials), we can better understand how the landscape of student debt has evolved and its implications for long-term financial decisions such as homeownership.

      Understanding the differences in student loan debt composition across generations holds personal significance for us as members of the current generations facing the burden of student loans. It speaks to the experiences of students navigating the complexities of higher education financing and the challenges they face in achieving future financial stability, including homeownership. By examining these dynamics, our research aims to empower individuals with knowledge to make informed financial decisions and advocate for systemic reforms.

      This study also reflects on the evolving societal norms and expectations surrounding higher education and homeownership. It highlights the shift towards higher education attendance and the resulting increase in student loan debt, which affects traditional milestones such as homeownership. Ultimately, our research contributes to building a more inclusive and resilient society by addressing systemic barriers to economic opportunity and wealth accumulation.

      In summary, our research holds significant merit, with the potential to drive positive societal change. A better understanding of generational differences in student loan debt and its implications for homeownership rates can lead to more informed policies, such as targeted student loan forgiveness programs or tailored financial assistance for first-time homebuyers burdened by educational debt. It can also enhance educational programs by promoting financial literacy initiatives that help students make informed borrowing decisions. Furthermore, this understanding can contribute to a more equitable and prosperous future by addressing disparities in debt accumulation across generations and fostering greater economic mobility for younger individuals.

Data

Data Acquisition

      The acquisition of student loan data and homeownership data proved challenging due to several factors. A significant lack of publicly available data, especially on student loans dating back to 1990, was compounded by the unavailability of the necessary raw data on government portals and academic repositories. The data we encountered was predominantly aggregated by age group, with limited access to detailed, disaggregated datasets that would allow for more granular analysis. Furthermore, many of the existing studies that addressed related topics did not provide explicit citations or source details for their data, making it difficult to trace the origins and verify the reliability of the information used in previous research. This lack of transparency and accessibility posed significant obstacles to obtaining the comprehensive and historical data required for our study.

      To perform our analysis, we utilized data from the Federal Reserve and Census surveys within the 1989-2022 time frame. The Federal Reserve survey (Governors of the Federal Reserve System, 2020) consisted of age groups, the percentage of those who had a mortgage, and the percentage of those with education loans, and was stored in various tabs in an Excel spreadsheet. The Census survey (Bureau, 2024), also stored in various tabs in an Excel spreadsheet, included age groups, the total number of people who were surveyed, and the number of homeowners within each age group. To further explore our research question and to supplement our existing tables, we found data on the amount of debt students had at the time of their graduation (McLoughlin, 2023). This table included graduation year, amount of debt at graduation, average starting salary out of college, and the debt-to-income ratio. The original data structure can be seen in Figure 1 with an explanation of the features in the Data Dictionary below.

Figure 1: Entity-Relationship Diagram showing original features extracted from survey and student debt datasets

Table 1: Survey of Consumer Finances 1989-2022: a normally triennial cross-sectional survey of U.S. families showcasing family holdings of debt by selected characteristics of families and type of debt.

| Feature | Description |
|---|---|
| year | Survey year from 1989-2022 in increments of 3 years |
| age_group | Age groups: 18-34, 35-44, 45-54, 55-64, 65-74, 75+ |
| percent_mortgage | Percentage of families in the age group holding mortgage debt |
| percent_education_loan | Percentage of families in the age group holding education loan debt |



Table 2: Census Housing Vacancy Survey: historical data on rental and homeowner vacancy rates in the U.S.

| Feature | Description |
|---|---|
| year | Survey year from 1989-2022 in increments of 3 years |
| age_group | Age groups: 18-24, 25-29, 30-34, 35-39, 40-44, 45-49, 50-54, 55-59, 60-64, 65-69, 70-74, 75+ |
| total_surveyed | Total number of homes surveyed |
| total_owner | Total number of homes surveyed that were owned |



Table 3: Student Loan Debt by Year: average debt by year of graduation for students who graduated with a bachelor’s degree

| Feature | Description |
|---|---|
| grad_year | Graduation year from 1970-2021 in varying increments |
| debt_at_grad | Average student loan debt at graduation |
| avg_start_salary | Average starting salary after graduation |
| avg_debt_to_income | Average debt to average income percentage |

Data Cleaning & Feature Engineering

      The data preparation process involved extensive cleaning and feature engineering in Excel and PostgreSQL. It is crucial to emphasize that all data was aggregated, ensuring that no personally identifying information was included within the datasets. For the Federal Reserve data, we extracted variables such as the survey year, age groups, the percentage of families in those age groups with a mortgage loan, and the percentage of families in that age group with an education loan. Similarly, for the Census data, we extracted the survey year, age groups, the total number of homes surveyed, and the total number of homes that were owned. All these variables were organized in a structured manner to facilitate analysis. It is important to note that the age groups from the Federal Reserve and Census data exhibited overlap, although they were categorized differently. The Federal Reserve data utilized broader age categories, while the Census data employed more granular age bins. Specifically, the Federal Reserve age groups were: less than 35, 35-44, 45-54, 55-64, 65-74, and 75+. In contrast, the Census data age groups were segmented as follows: less than 25, 25-29, 30-34, 35-39, …, 70-74, and 75+.

      Once we created these two tables in PostgreSQL, we split each table’s age groups into a minimum age and a maximum age to make the two tables easier to join. With those new columns in both the Federal Reserve and Census tables, we joined our data on survey year, minimum age, and maximum age, which together acted as a composite join key. This join is visualized in Figure 2. In the join, we ensured that the broader Federal Reserve age ranges were matched with the more granular Census age ranges they contain. Although the join conditions account for overlapping age ranges, this approach duplicates data: when a Federal Reserve age range spans multiple Census age ranges, the same Federal Reserve values are repeated across multiple rows in the joined table. If not properly accounted for, this duplication can inflate counts or skew averages in subsequent analyses, so the joined data must be used with care in downstream processes.
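The split-and-join step described above can be sketched in pandas. This is a minimal illustration rather than the PostgreSQL queries used in the study: the helper names `split_age_group` and `add_age_bounds`, the floor of 18 and cap of 120 for open-ended bins, and all numeric values are assumptions made for the example.

```python
import pandas as pd

def split_age_group(label: str) -> tuple[int, int]:
    """Parse labels such as '35-44', '75+', or 'less than 35' into (min_age, max_age)."""
    label = label.strip().lower()
    if label.endswith("+"):
        return int(label[:-1]), 120            # assumed cap for the open-ended top bin
    if label.startswith("less than"):
        return 18, int(label.split()[-1]) - 1  # assumed floor of 18 for the youngest bin
    low, high = label.split("-")
    return int(low), int(high)

def add_age_bounds(df: pd.DataFrame, prefix: str) -> pd.DataFrame:
    """Replace age_group with <prefix>_min_age and <prefix>_max_age columns."""
    bounds = [split_age_group(label) for label in df["age_group"]]
    out = df.drop(columns="age_group")
    out[f"{prefix}_min_age"] = [low for low, _ in bounds]
    out[f"{prefix}_max_age"] = [high for _, high in bounds]
    return out

# Placeholder rows, not actual survey figures.
fed = add_age_bounds(pd.DataFrame({
    "year": [2022, 2022],
    "age_group": ["less than 35", "35-44"],
    "percent_mortgage": [30.0, 55.0],
}), "federal")
census = add_age_bounds(pd.DataFrame({
    "year": [2022] * 5,
    "age_group": ["less than 25", "25-29", "30-34", "35-39", "40-44"],
    "total_surveyed": [100, 200, 300, 400, 500],
}), "census")

# Pair every Federal Reserve row with every Census row from the same survey year,
# then keep only the Census bins that fall inside the Federal Reserve bin.
pairs = fed.merge(census, on="year")
contained = (
    (pairs["census_min_age"] >= pairs["federal_min_age"])
    & (pairs["census_max_age"] <= pairs["federal_max_age"])
)
joined = pairs[contained].reset_index(drop=True)
print(joined)
```

In this toy input the "less than 35" Federal Reserve row appears three times in `joined`, once per matching Census bin, which is the duplication effect the paragraph warns about.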

Figure 2: Process flow diagram: Splitting age groups and joining on year, min_age, and max_age


      Upon joining these two tables, we proceeded to assign generation names for each age range. This involved calculating the minimum and maximum birth years using the survey year along with the minimum and maximum ages. To classify each row into a generational cohort (Traditionalist, Baby Boomer, Generation X, Millennial, or Gen Z), we created a formula to manage overlaps by comparing the minimum and maximum birth years to determine the closest generational boundary alignment for each row. To help visualize this process, please see Figure 3. We also created new columns for minimum and maximum graduation years. This was done by assuming the average college graduation age to be 21 and adding this to the minimum and maximum birth years. It’s important to note that using a single average age may lead to inaccuracies in calculating the minimum and maximum graduation years, which could affect subsequent analyses and the classification of individuals into generational cohorts.
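As a rough illustration of the birth-year, generation, and graduation-year columns, the sketch below assigns each row to the generation whose birth-year range overlaps its own range the most. The overlap rule is one plausible reading of "closest generational boundary alignment" and is an assumption, as are the helper name `assign_generation`, the 1900 lower bound for Traditionalists, and the example rows. The remaining cutoffs follow commonly cited Pew Research Center ranges.

```python
import pandas as pd

# Birth-year ranges (inclusive). The Traditionalist lower bound is an assumed placeholder.
GENERATIONS = {
    "Traditionalist": (1900, 1945),
    "Baby Boomer": (1946, 1964),
    "Generation X": (1965, 1980),
    "Millennial": (1981, 1996),
    "Gen Z": (1997, 2012),
}
GRAD_AGE = 21  # assumed average college graduation age, as stated in the text

def assign_generation(min_birth: int, max_birth: int) -> str:
    """Return the generation sharing the most birth years with [min_birth, max_birth]."""
    def overlap(bounds):
        low, high = bounds
        return max(0, min(max_birth, high) - max(min_birth, low) + 1)
    # Ties (including no overlap at all) resolve to the earliest-listed generation.
    return max(GENERATIONS, key=lambda name: overlap(GENERATIONS[name]))

rows = pd.DataFrame({
    "year": [1989, 2022],
    "census_min_age": [35, 25],
    "census_max_age": [39, 29],
})

# The oldest respondents in a bin have the earliest birth year.
rows["census_min_birth_year"] = rows["year"] - rows["census_max_age"]
rows["census_max_birth_year"] = rows["year"] - rows["census_min_age"]
rows["census_generation_name"] = [
    assign_generation(lo, hi)
    for lo, hi in zip(rows["census_min_birth_year"], rows["census_max_birth_year"])
]
rows["census_min_grad_year"] = rows["census_min_birth_year"] + GRAD_AGE
rows["census_max_grad_year"] = rows["census_max_birth_year"] + GRAD_AGE
print(rows)
```

For the 2022 row, birth years 1993-1997 overlap the Millennial range in four years and the Gen Z range in one, so the row is labeled Millennial under this rule.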

Figure 3: Creating min_birth and max_birth year and assigning each row a generation using generation birth year ranges (Bialik & Fry, 2019)


      Finally, we integrated our student debt dataset, which included variables such as debt at graduation, average starting salary, and debt-to-income percentage by graduation year. The data source did not provide a downloadable file. Consequently, the values were manually entered into Excel and subsequently imported into PostgreSQL for analysis. We aligned each row of our joined dataset to the corresponding debt figures by comparing the minimum and maximum graduation years to determine the closest match to the graduation year in our debt dataset. Although the assignments were done as carefully as possible, there is a chance that some inaccuracies may remain due to variations in actual graduation ages and the potential misalignment of graduation years within the dataset.
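One way to express the closest-match alignment is to compare the midpoint of each row's graduation-year range with the graduation years available in the debt table. The midpoint rule is an assumption made for this sketch (the text does not specify the exact comparison), and the debt values below are placeholders rather than figures from the source.

```python
import numpy as np
import pandas as pd

# Placeholder debt table: the years and values are illustrative only.
debt = pd.DataFrame({
    "grad_year": [1970, 1980, 1990, 2000, 2010, 2020],
    "debt_at_grad": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
})
rows = pd.DataFrame({
    "census_min_grad_year": [1971, 2014],
    "census_max_grad_year": [1975, 2018],
})

# Distance from each row's midpoint graduation year to every year in the debt table.
midpoint = (rows["census_min_grad_year"] + rows["census_max_grad_year"]) / 2
gaps = np.abs(midpoint.to_numpy()[:, None] - debt["grad_year"].to_numpy()[None, :])
nearest = gaps.argmin(axis=1)  # index of the closest debt-table year per row

rows["grad_year"] = debt["grad_year"].to_numpy()[nearest]
rows["debt_at_grad"] = debt["debt_at_grad"].to_numpy()[nearest]
print(rows)
```

Because every row is forced onto its nearest available year, rows whose graduation range falls between two sparse debt-table years inherit a value that may be several years off, which is one source of the misalignment noted above.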

The Data Dictionary below details the dataset and features.

Table 4: Survey of Consumer Finances 1989-2022: a normally triennial cross-sectional survey of U.S. families showcasing family holdings of debt by selected characteristics of families and type of debt.

| Feature | Description |
|---|---|
| year | year of survey, in increments of 3 |
| census_min_birth_year | minimum birth year calculated from year and census_min_age |
| census_max_birth_year | maximum birth year calculated from year and census_max_age |
| census_generation_name | generation name derived from census_min_birth_year and census_max_birth_year |
| census_min_grad_year | minimum graduation year derived from census_min_birth_year + 21 years |
| census_max_grad_year | maximum graduation year derived from census_max_birth_year + 21 years |
| debt_at_grad | average debt at year of graduation |
| avg_start_salary | average starting salary post graduation |
| avg_debt_to_income | average debt to income percentage |
| census_min_age | minimum age derived from age group |
| census_max_age | maximum age derived from age group |
| total_surveyed | total number of those surveyed |
| total_owner | total number of those surveyed who own a home |
| federal_min_age | minimum age derived from age group |
| federal_max_age | maximum age derived from age group |
| percent_mortgage | percentage of those surveyed with mortgages |
| percent_education_loan | percentage of those surveyed with education loans |
| generation_order | generations encoded (Baby Boomer = 1, Generation X = 2, Millennial = 3) |

      As mentioned previously, all the data from the Federal Reserve and Census surveys were anonymized to protect individual privacy, preventing the identification of specific individuals. Additionally, the data was presented in aggregated forms such as percentages and averages, rather than detailed individual records, to further safeguard privacy. However, despite these measures, the potential for bias remains a concern, as survey respondents may not fully represent the entire population, with certain demographics potentially being underrepresented. This is especially the case for our student loan debt dataset, where the sample population is unknown. Additionally, working with aggregated data can weaken the statistical power of analyses and increase the likelihood of Type II errors (failing to detect a true effect) or result in misleading statistics. Furthermore, although the Federal Reserve and Census data were obtained from reputable institutions, we were unable to verify the source of the student debt data. Consequently, we must depend on this analyst as a credible source for student debt information. We acknowledge the limitations of this dataset but feel that its inclusion was crucial for providing a more comprehensive analysis. We advise readers to consider the findings from this source alongside the more robust datasets and to be mindful of the potential impact on our study’s overall conclusions.

Methods

      In our dataset, there were relatively few features to choose from that could be associated with homeownership and student debt. Initially, we selected specific features and employed linear regression models to evaluate the influence of each feature on homeownership rates. To explore more intricate, non-linear relationships among the features, we then applied a Random Forest machine learning model. This approach allowed us to build upon the findings from our linear regression analysis and assess the relative importance of each feature. These methodologies together provide a comprehensive framework for understanding the potential impact of student loan debt and generational factors on homeownership rates.

1. Materials List (Software used)

      The following section details the software and libraries utilized in our analysis. The tools listed were instrumental in processing data, performing statistical analyses, and generating visualizations. The specific software and versions used are as follows:

Table 5: Purpose of software and libraries used

| Software/Library | Version | Purpose |
|---|---|---|
| Python | 3.11.5 | Used for scripting and data analysis |
| Pandas | 2.0.3 | Data manipulation and analysis, particularly for handling structured data |
| NumPy | 1.24.3 | Numerical computing, used for array operations and mathematical functions |
| statsmodels | 0.14.0 | Implementation of statistical models, specifically used for the ordinary least squares (linear regression) model |
| scikit-learn | 1.3.0 | Machine learning library, used for implementing a random forest model |
| SciPy | 1.11.1 | Used for statistical analysis and scientific computing |
| Matplotlib | 3.7.2 | Visualization library, used to create static visualizations |
| Seaborn | 0.12.2 | Visualization library built on Matplotlib, used for creating attractive and informative statistical graphics |
| RStudio | 2023.06.2+561 | Used for creating additional visualizations outside of the Python environment |
| ggplot2 | 3.4.0 | Visualization library in R, used for creating complex and multi-layered plots |

2. Data Exploration and Initial Analysis

      Our dataset initially comprised 144 rows, covering Traditionalist, Baby Boomer, Generation X, Millennial, and Gen Z data. We originally intended to incorporate trends for Generation Z into our analysis. However, upon integrating all our datasets, we discovered that the available data for Generation Z was insufficient for a robust analysis of their student debt and homeownership rates. Including Generation Z data would have risked distorting the analysis of Baby Boomers, Generation X, and Millennials. Consequently, we decided to exclude Generation Z data from our analysis.

      Additionally, while our survey datasets from the Census and Federal Reserve included detailed trends for Traditionalists, our student debt dataset only covered graduation years from 1970 onward. Since Traditionalists were born in 1945 or earlier and we assumed an average college graduation age of 21, their estimated graduation years fall in 1966 or earlier, before the first year covered by the student debt data. Because the available dataset did not capture their college graduation years, we had to exclude Traditionalists from our analysis.

      Following preliminary data cleaning and the focus on the Baby Boomer, Millennial, and Gen X cohorts, the dataset was refined to 92 rows, which were subsequently used for analysis. This reduction was essential to ensure the relevance and accuracy of the data in examining the targeted generational groups. While analyzing a small dataset can be manageable and insightful, it also comes with significant limitations related to statistical power, model complexity, and generalizability.

3. Identification of Significant Features

      As mentioned earlier, there were only a handful of features that we could employ in our analysis. In order to determine whether student debt affected homeownership trends, we created a variable to hold the homeownership rate, using the following formula:

\[ hr = \frac{to}{ts} \times 100 \]

      In this formula, to is the total number of surveyed households that own their home, corresponding to our total_owner variable, and ts is the total number of households surveyed, corresponding to our total_surveyed variable. The resulting variable, hr, is the percentage of surveyed households that are owner-occupied. This variable was essential for analyzing the correlation between student debt levels and homeownership rates, offering insights into the potential impact of financial burdens on the likelihood of homeownership.
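In pandas, the homeownership rate can be derived in a single vectorized step. The column name `homeownership_rate` matches the dependent variable named in this section; the two rows are placeholder values for illustration.

```python
import pandas as pd

# Placeholder counts, not survey figures.
df = pd.DataFrame({
    "total_owner": [40, 150],
    "total_surveyed": [100, 200],
})

# hr = to / ts * 100, computed row by row.
df["homeownership_rate"] = df["total_owner"] / df["total_surveyed"] * 100
print(df)
```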

      In our linear regression and random forest models, the dependent variable was homeownership_rate, while the independent variables included debt_at_grad, avg_debt_to_income, avg_start_salary, percent_mortgage, percent_education_loan, and generation_order.

4. Building a Linear Regression Model

      To examine the relationship between homeownership rates and average debt at graduation across generations, we utilized a linear regression model. While our primary focus is to shed light on the impact of student loan debt by generation on homeownership rates, we included the additional features mentioned above in our model to enhance its robustness and accuracy. We then created training and testing datasets with a 70%/30% split and used the statsmodels library to fit an Ordinary Least Squares (OLS) regression and evaluate the performance of the model.

      For the results of a linear regression analysis to be valid, certain assumptions about the data must be met. These assumptions include linearity, independence, homoscedasticity, normality, and lack of multicollinearity. We investigate each in the following sections.

4a. Testing for Linearity

      The linearity assumption requires that the relationship between the independent variables and the dependent variable is linear. To assess this, we plotted the residuals (errors) against the fitted (predicted) values and checked for systematic patterns. If the residuals display a clear pattern or trend, it suggests that the linear relationship assumption might be violated. Ideally, residuals should scatter randomly around zero, indicating that the linear model is appropriate.

      The residuals vs. fitted values plot in Figure 4 shows a generally random scatter, but there is a very slight parabolic tendency present. Despite this minor curvature, the pattern is not strong enough to warrant a departure from a linear regression model.

Figure 4: Linearity Assumption Satisfied: Residuals vs. Fitted Plot shows no strong curvature or patterns


4b. Testing for Independence

      The independence assumption stipulates that the residuals are independent of each other. To test for independence, we used the Durbin-Watson statistic, which tests for autocorrelation in the residuals. The Durbin-Watson statistic ranges from 0 to 4. A value close to 2 suggests no autocorrelation. If the value is significantly less than 2, it indicates positive autocorrelation, while a value significantly greater than 2 indicates negative autocorrelation.

      Our analysis yielded a Durbin-Watson statistic of 1.77. This value is somewhat close to 2, suggesting that while there is a tendency towards positive autocorrelation, it is not extreme. In practice, DW values close to 2 (within a range of about 1.5 to 2.5) are often considered acceptable (Datalab, 2021).

      Given that the Durbin-Watson statistic indicated some positive correlation, we conducted the Breusch-Godfrey test to confirm and assess the extent of autocorrelation. The test yielded a statistic of 0.948 and a p-value of 0.330. Since this p-value exceeds the conventional significance threshold of 0.05, we conclude that there is no statistically significant evidence of autocorrelation in the residuals. This suggests that the residuals are likely independent, supporting the assumption of independence.

4c. Testing for Homoscedasticity

      The homoscedasticity assumption requires that residuals have constant variance across all levels of the independent variable. Although we noted a minor curvature in the linearity assumption, Figure 5 shows no pronounced pattern of variance around the horizontal line, indicating that the homoscedasticity assumption is generally met.

Figure 5: Homoscedasticity Assumption Satisfied: Residuals are generally well-dispersed around the horizontal line


4d. Testing for Normality

      The normality assumption requires that the residuals are normally distributed. To assess this, we created a Q-Q plot of the residuals, seen in Figure 6. The points on the Q-Q plot slightly deviate but generally follow the 45-degree line, indicating that the residuals are approximately normally distributed.

Figure 6: Normality Assumption Satisfied: Residuals generally align with the 45-degree reference line


      We also performed the Shapiro-Wilk test for normality, which yielded a test statistic of 0.99 and a p-value of 0.86 (Statistics, 2018). Since the p-value is well above the conventional significance level of 0.05, there is no significant evidence to reject the null hypothesis of normality. This supports the assumption that the residuals are approximately normally distributed.
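The Shapiro-Wilk test is provided by SciPy. As a small sketch, the example below runs it on simulated residuals drawn from a normal distribution; in practice the input would be the residuals of the fitted OLS model.

```python
import numpy as np
from scipy import stats

# Simulated residuals (made-up data standing in for model residuals).
rng = np.random.default_rng(0)
residuals = rng.normal(loc=0.0, scale=5.0, size=64)

# Null hypothesis: the sample comes from a normal distribution.
statistic, p_value = stats.shapiro(residuals)

print(f"Shapiro-Wilk statistic: {statistic:.3f}, p-value: {p_value:.3f}")
if p_value > 0.05:
    print("No significant evidence against normality at the 0.05 level.")
else:
    print("Normality is rejected at the 0.05 level.")
```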

4e. Testing for Multicollinearity

      Multicollinearity occurs when independent variables are highly correlated, which can distort the estimation of regression coefficients (Data Science, 2018). We evaluated multicollinearity using the Variance Inflation Factor (VIF) from the statsmodels library.

Table 6: Variance Inflation Factors for half of the selected features exceed 10, signaling high multicollinearity

| Feature | VIF |
|---|---|
| debt_at_grad | 26.534943 |
| avg_start_salary | 21.435287 |
| avg_debt_to_income | 28.121863 |
| percent_mortgage | 1.411813 |
| percent_education_loan | 7.590111 |
| generation_order | 7.403381 |

      Half of our features had a VIF value exceeding 10, signaling a significant concern (University, 2018a). To test this further, we examined the condition number of the model, which was 241,000. There is no simple definition of what counts as a small or large condition number; however, a value larger than 1,000 generally indicates strong multicollinearity (Chen, 2024).

      As we can see in Figure 7, the correlation matrix reveals that percent_mortgage exhibits a moderate negative correlation of approximately -0.50 with the other variables. In contrast, the correlations among the remaining variables are notably higher, with values of 0.88 or above. This pattern suggests that while percent_mortgage is somewhat inversely related to the other variables, the high correlations among the latter indicate substantial multicollinearity within the dataset.

Figure 7: Correlation Matrix of Explanatory Variables: Displays strong pairwise correlations


      This high multicollinearity is a common issue in survey-based observational studies, where variables are often interrelated due to underlying socio-economic factors (University, 2018b). Such interdependencies reflect the inherent complexity of real-world data, where variables may be influenced by common factors or may capture overlapping aspects of the phenomena being studied. Despite efforts to mitigate multicollinearity by experimenting with different subsets of variables and dropping some from the analysis, these adjustments did not substantially reduce the multicollinearity problem. This persistent issue suggests that the high degree of correlation among the variables is an inherent characteristic of the data, making it difficult to isolate and interpret the individual effects of each variable.

      Consequently, disentangling the unique contributions of each variable becomes challenging, as their high correlations obscure the individual effects and complicate the interpretation of their relationships. This issue can compromise the reliability and interpretability of our linear regression model by leading to unstable coefficient estimates and inflated standard errors. Although the other assumptions of the model are satisfied, the presence of multicollinearity necessitates cautious interpretation of our results.

      Additionally, to ensure the validity of our model, we measure the following values:

  • R-squared
  • Adjusted R-squared
  • Mean Absolute Error (MAE)
  • Mean Squared Error (MSE)
  • Root Mean Squared Error (RMSE)
  • F-statistic

5. Building a Random Forest Model

      Given the limitations identified in the linear regression model, we opted to perform a Random Forest analysis to further investigate the relationship between homeownership rates and various predictor variables. The primary motivation for this choice was to address some of the challenges and assumptions inherent in linear regression that may impact the robustness and interpretability of the results. Random Forest models, being ensemble methods based on decision trees, are less sensitive to multicollinearity and do not require the assumptions of linearity, homoscedasticity, or normality of residuals. This makes them a more flexible and robust alternative when dealing with correlated predictors. Overall, the Random Forest model provides an opportunity to validate and complement the findings from our linear regression analysis, offering a more comprehensive approach to analyzing the predictors of homeownership rates and mitigating some of the limitations observed in the linear model.

      We utilized scikit-learn’s random forest implementation to train and test the model on a 70%/30% split of the dataset. Additionally, we plotted the feature importances to determine the relative importance of each feature.
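A compact sketch of this workflow with scikit-learn's `RandomForestRegressor` is given below. The data are synthetic, only three of the predictors are included, and the hyperparameters (`n_estimators=200`, `random_state=42`) are assumptions made for the example rather than the study's settings.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the 92-row analysis table (values are made up).
rng = np.random.default_rng(0)
n = 92
X = pd.DataFrame({
    "percent_mortgage": rng.uniform(10, 70, n),
    "percent_education_loan": rng.uniform(0, 40, n),
    "generation_order": rng.integers(1, 4, n),  # 1, 2, or 3
})
y = 20 + 0.7 * X["percent_mortgage"] - 3 * X["generation_order"] + rng.normal(0, 5, n)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=42
)

forest = RandomForestRegressor(n_estimators=200, random_state=42)
forest.fit(X_train, y_train)
pred = forest.predict(X_test)

# Hold-out metrics for comparison with the linear model.
mae = mean_absolute_error(y_test, pred)
mse = mean_squared_error(y_test, pred)
rmse = mse ** 0.5
r2 = r2_score(y_test, pred)

# Impurity-based importances; they sum to 1 across features.
importances = pd.Series(
    forest.feature_importances_, index=X.columns
).sort_values(ascending=False)

print(f"R-squared: {r2:.3f}, MAE: {mae:.3f}, MSE: {mse:.3f}, RMSE: {rmse:.3f}")
print(importances)
```

Impurity-based importances tend to split credit among correlated predictors, so with highly collinear features the ranking should be read as indicative rather than definitive.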

      To ensure the validity of our model and compare the fit with our linear regression model, we measure the following values:

  • R-squared
  • Adjusted R-squared
  • Mean Absolute Error (MAE)
  • Mean Squared Error (MSE)
  • Root Mean Squared Error (RMSE)

Results

Linear Regression Model

Model Fit

      To evaluate the effectiveness of the linear regression model in predicting the target variable, we examined several key performance metrics. These metrics provide insight into the model’s accuracy, fit, and overall statistical significance.

Model test statistics:
Table 7: Linear Regression model demonstrates a relatively good fit

| Metric | Value |
|---|---|
| R-squared | 0.712 |
| Adjusted R-squared | 0.681 |
| Mean Absolute Error (MAE) | 10.5018 |
| Mean Squared Error (MSE) | 130.3077 |
| Root Mean Squared Error (RMSE) | 11.4152 |
| F-statistic | 23.43 |
| P-value (F-statistic) | 9.63e-14 |

      The metrics displayed in Table 7 indicate the performance of the linear regression model. It demonstrates a relatively good fit, with an R-squared value of 0.712, indicating that 71.2% of the variability in homeownership rates is explained by the predictors. The Adjusted R-squared of 0.681 accounts for the number of predictors and indicates that the model remains robust even when adjusted for predictor complexity.

      The Mean Absolute Error (MAE) of 10.5018 represents the average magnitude of prediction errors, providing a straightforward measure of prediction accuracy. The Mean Squared Error (MSE) of 130.3077 and Root Mean Squared Error (RMSE) of 11.4152 further quantify the error magnitude, with RMSE being particularly useful for understanding error in the units of the target variable. Additionally, the F-statistic of 23.43, with a p-value of 9.63e-14, indicates that the overall model is statistically significant, meaning the predictors collectively contribute meaningfully to explaining the variability in the target variable.

      These metrics collectively demonstrate that the linear regression model has a good fit and effectively captures the relationship between the predictors and the target variable, though there may still be room for improvement in model performance by addressing multicollinearity.

Coefficient Results

      To understand the influence of each feature on homeownership rate, we examined the coefficients and corresponding p-values of the linear regression model. The coefficients indicate the strength and direction of the relationship between each feature and homeownership rate, while the p-values assess the statistical significance of these relationships.

Table 8: Linear regression coefficients ranked from least to most statistically significant

| Feature | Coefficient | P-Value |
|---|---|---|
| generation_order | -5.4525 | 0.317 |
| avg_debt_to_income | -0.7455 | 0.124 |
| debt_at_grad | 0.0016 | 0.063 |
| percent_education_loan | 0.7697 | 0.054 |
| avg_start_salary | -0.0013 | 0.025 |
| percent_mortgage | 0.6844 | 0.000 |

      We anticipated that debt_at_grad, percent_education_loan, and generation_order would be the most influential features. However, the observed relationships were not as straightforward as expected.

      The coefficient for debt_at_grad indicates a modest positive relationship with homeownership rate, suggesting that increased graduation debt is associated with higher homeownership rates. However, the p-value for this coefficient (0.063) exceeds the conventional 0.05 significance level, casting uncertainty on its precise impact. Similarly, the coefficient for percent_education_loan indicates an unexpected positive association between the share of families holding education loans and homeownership rates, with a p-value (0.054) just above the 0.05 threshold. This counterintuitive result may be due to several factors, including high multicollinearity, sampling issues, or specific economic conditions influencing the observed relationship. For instance, it may suggest that individuals with higher education debt, despite the financial burden, could achieve higher earning potential, enabling them to afford homeownership.

      In contrast, the coefficient for generation_order (-5.4525) is negative, indicating that the generational cohort to which an individual belongs may be inversely related to homeownership rates. Although it is the largest coefficient in magnitude, its p-value of 0.317 shows that the effect is not statistically significant within this model. Coefficient magnitudes also depend on the scale of each feature, so they are not directly comparable across features. This result could be influenced by various factors, including high multicollinearity among the features included in the regression model.

      The coefficient analysis reveals that percent_mortgage is the most statistically significant predictor, with a positive coefficient of 0.6844 and a p-value reported as 0.000 (below 0.001). This finding aligns with practical expectations, as mortgage debt is a critical determinant of homeownership. The variables avg_start_salary and avg_debt_to_income were included to refine the model. Notably, avg_start_salary has a slight negative association with homeownership rates (coefficient -0.0013, p = 0.025), suggesting that higher starting salaries might correlate with lower homeownership; the effect is statistically significant at the 0.05 level, though the coefficient is small. Additionally, while avg_debt_to_income shows a negative trend, its p-value of 0.124 indicates that this relationship is not statistically significant at the 0.05 level, suggesting potential weakness or confounding factors.
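
      The significance reading above can be reproduced mechanically. The following is a minimal Python sketch that hard-codes the coefficients and p-values from Table 8 and flags which features fall below a 0.05 threshold. The names TABLE_8, ALPHA, and summarize are illustrative assumptions and are not taken from our original analysis code.

```python
# Coefficients and p-values copied from Table 8 (linear regression output).
TABLE_8 = {
    "generation_order": (-5.4525, 0.317),
    "avg_debt_to_income": (-0.7455, 0.124),
    "debt_at_grad": (0.0016, 0.063),
    "percent_education_loan": (0.7697, 0.054),
    "avg_start_salary": (-0.0013, 0.025),
    "percent_mortgage": (0.6844, 0.000),
}

ALPHA = 0.05  # assumed conventional significance threshold


def summarize(table, alpha=ALPHA):
    """Return one row per feature, sorted from most to least significant."""
    rows = []
    for feature, (coefficient, p_value) in table.items():
        rows.append({
            "feature": feature,
            "coefficient": coefficient,
            "p_value": p_value,
            "direction": "positive" if coefficient > 0 else "negative",
            "significant": p_value < alpha,
        })
    return sorted(rows, key=lambda row: row["p_value"])


summary = summarize(TABLE_8)
significant_features = [row["feature"] for row in summary if row["significant"]]

for row in summary:
    flag = "significant" if row["significant"] else "not significant"
    print(f"{row['feature']:<24}{row['coefficient']:>9.4f}  "
          f"p={row['p_value']:.3f}  {row['direction']:<8}  {flag}")
```

      Under these assumptions only percent_mortgage and avg_start_salary fall below the threshold, while percent_education_loan (0.054) and debt_at_grad (0.063) narrowly miss it.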

      Overall, the coefficients offer a nuanced view of how differences in student loan debt between older and current generations affect homeownership trends but do not provide a definitive answer. To achieve a more definitive understanding of the impact of student loan debt on homeownership trends, several methodological improvements could be employed. Longitudinal studies, which track individuals over time, would allow for a clearer examination of how changes in student loan debt influence homeownership. Additionally, addressing multicollinearity through variable selection or principal component analysis, enhancing data quality, and incorporating larger, more diverse samples would strengthen the reliability of findings.

Random Forest Model

Model Fit

      In this model analysis, we utilized the same dependent and independent variables as our linear regression model and a similar 70% training and 30% testing split.

Model test statistics:
Table 9: Random Forest model demonstrates a better fit than linear regression model
Metric Value
Mean Squared Error (MSE) 75.7782
R-squared 0.7452
Adjusted R-squared 0.6724
Mean Absolute Error (MAE) 6.5159
Root Mean Squared Error (RMSE) 8.7051

      The Mean Squared Error was 75.78, well below our linear regression MSE of 130.30. Although an MSE of 75.78 indicates some error in the predictions, it is a useful benchmark for comparing the model’s performance with our linear regression model. The R-squared value was 0.745, which signifies that approximately 74.5% of the variability in homeownership rates can be explained by the model. The Adjusted R-squared (0.672) is lower than the R-squared, reflecting the adjustment for the number of predictors. This suggests that while the model is still strong, the inclusion of additional variables has not significantly improved the model’s explanatory power. Overall, the Random Forest model outperforms the linear regression model in terms of R-squared, MAE, MSE, and RMSE, indicating better prediction accuracy and fit to the data.
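
      The relationships among the metrics in Table 9 can be verified with a short calculation. The Python sketch below recomputes the RMSE from the MSE, the Adjusted R-squared from the R-squared, and the relative reduction in MSE compared with the linear regression model. The number of test observations is not stated in this section; the value of 28 used here is an assumption, chosen only because it reproduces the reported Adjusted R-squared with six predictors.

```python
import math

# Values reported in Table 9 (Random Forest) and the linear regression MSE
# quoted in the text.
RF_MSE = 75.7782
RF_R2 = 0.7452
LR_MSE = 130.30

N_PREDICTORS = 6   # the six features listed in Table 8
N_TEST_ROWS = 28   # ASSUMPTION: not stated in the text; inferred from Table 9


def adjusted_r2(r2, n_obs, n_predictors):
    """Adjusted R-squared: penalizes R-squared for the number of predictors."""
    return 1.0 - (1.0 - r2) * (n_obs - 1) / (n_obs - n_predictors - 1)


rmse = math.sqrt(RF_MSE)                         # RMSE is the square root of MSE
adj_r2 = adjusted_r2(RF_R2, N_TEST_ROWS, N_PREDICTORS)
mse_reduction = (LR_MSE - RF_MSE) / LR_MSE       # relative improvement over OLS

print(f"RMSE:               {rmse:.4f}")
print(f"Adjusted R-squared: {adj_r2:.4f}")
print(f"MSE reduction:      {mse_reduction:.1%}")
```

      With these inputs the sketch returns an RMSE of 8.7051 and an Adjusted R-squared of 0.6724, matching Table 9, and shows that the Random Forest MSE is roughly 42% lower than the linear regression MSE.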

Feature Importance Results

      To better understand the factors influencing homeownership rates, we analyzed the importance of various features using a Random Forest model. The feature importance table below provides a quantitative measure of how much each feature contributes to the prediction of homeownership rates.

Table 10: debt_at_grad and percent_education_loan rank as the top two most important features, after percent_mortgage
Feature Importance
generation_order 0.03846002578869567
avg_start_salary 0.0726717567809253
avg_debt_to_income 0.08913656748564287
debt_at_grad 0.09359463522665826
percent_education_loan 0.11617657198613383
percent_mortgage 0.5899604427319441

      To aid in visualizing the importance of different features, we created a plot of their respective importances. We excluded percent_mortgage from this visualization due to its dominance as the most significant feature.

      As illustrated in Figure 10, the variables percent_education_loan and debt_at_grad emerge as the most influential factors in predicting the homeownership rate once percent_mortgage is set aside. The importance of percent_education_loan (0.1162) and debt_at_grad (0.0936) indicates that student loan debt contributes meaningfully to the model’s predictions of homeownership rates. Because feature importance measures predictive contribution rather than the direction of an effect, this finding is consistent with, but does not by itself establish, the view that high levels of student loan debt are a substantial barrier to homeownership, likely due to the financial strain they impose on individuals, thereby limiting their ability to save for or afford a home.

      In contrast, generation_order has the lowest importance score (0.0385) among the features analyzed. This result implies that the generational cohort of an individual (such as Baby Boomer, Gen X, or Millennial) has minimal influence on predicting homeownership rates in comparison to more direct financial factors. The features avg_start_salary (0.0727) and avg_debt_to_income (0.0891) also contribute to the model, though their impact is less pronounced than that of student loan-related variables. This suggests that while starting salary and debt-to-income ratios are relevant factors in homeownership decisions, they are not as important as the financial burden imposed by student loans.
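
      The ranking discussed above can be reproduced from Table 10 directly. The Python sketch below sorts the reported importances, checks that they sum to one, and renormalizes the remaining features after excluding percent_mortgage, mirroring the exclusion used for the plot. The function names rank and renormalize_without are illustrative assumptions, not part of our original code.

```python
# Feature importances copied from Table 10 (Random Forest model).
IMPORTANCES = {
    "generation_order": 0.03846002578869567,
    "avg_start_salary": 0.0726717567809253,
    "avg_debt_to_income": 0.08913656748564287,
    "debt_at_grad": 0.09359463522665826,
    "percent_education_loan": 0.11617657198613383,
    "percent_mortgage": 0.5899604427319441,
}


def rank(importances):
    """Return (feature, importance) pairs from most to least important."""
    return sorted(importances.items(), key=lambda item: item[1], reverse=True)


def renormalize_without(importances, excluded):
    """Drop one feature and rescale the rest so they sum to one."""
    remaining = {k: v for k, v in importances.items() if k != excluded}
    total = sum(remaining.values())
    return {k: v / total for k, v in remaining.items()}


ranking = rank(IMPORTANCES)
total_importance = sum(IMPORTANCES.values())
student_loan_share = (IMPORTANCES["debt_at_grad"]
                      + IMPORTANCES["percent_education_loan"])
shares_without_mortgage = renormalize_without(IMPORTANCES, "percent_mortgage")

for feature, importance in ranking:
    print(f"{feature:<24}{importance:.4f}")
print(f"Total importance:          {total_importance:.4f}")
print(f"Student loan features sum: {student_loan_share:.4f}")
```

      The two student loan features together account for about 0.21 of total importance, compared with about 0.59 for percent_mortgage alone, which is why the mortgage feature is set aside when plotting the others.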

Figure 10: Generational cohort is the least significant predictor of homeownership rate.

      Overall, the feature importance analysis underscores the role of student loan debt, second only to mortgage debt, in predicting homeownership rates. The results highlight that financial factors, particularly those related to student loans and debt, are critical in understanding and predicting homeownership trends.

Conclusions

      In addressing our research question (how do differences in student loan debt between older and current generations affect homeownership trends?), our study offers both insightful and nuanced perspectives. The historical context and the analysis reveal a complex relationship between student loan debt and homeownership rates. Our findings indicate that student loan debt at graduation is an important factor in predicting homeownership rates, although it did not reach statistical significance at the 0.05 level in the linear regression. The data show a clear increase in student loan debt across generations, with Millennials bearing the highest levels, which is likely contributing to the challenges they face in achieving homeownership.

      Our analysis revealed a marked increase in average graduation debt from Baby Boomers to Millennials, with Millennials bearing the highest levels of debt. This rise in debt aligns with a trend of increasing financial burden on recent graduates. Our examination of homeownership rates indicated significant fluctuations influenced by economic events, such as the 2008 financial crisis, yet an overall upward trend in recent years.

      The linear regression model provided a good fit with an R-squared value of 0.712 and significant predictors such as mortgage debt, though the results were tempered by issues such as multicollinearity and the limitations of aggregated data. The Random Forest model outperformed the linear regression model in predictive accuracy, with an R-squared value of 0.745, and showed that student loan-related variables, particularly debt_at_grad and percent_education_loan, were the most influential factors affecting homeownership rates after percent_mortgage.

      While the Random Forest model provided more robust results, the linear regression model’s findings were also valuable, highlighting the complexity of the relationships between debt and homeownership. The models collectively suggest that high levels of student loan debt are a barrier to homeownership, though challenges like multicollinearity and data limitations need to be addressed. The confidence in our results is bolstered by the consistency between the two modeling approaches, despite the inherent limitations of the data.

      Through our linear regression and Random Forest analyses, we found that debt_at_grad has a less pronounced effect than mortgage debt but a still notable one. Similarly, percent_education_loan, while less influential than mortgage debt, also plays a role, suggesting that education loan debt does impact homeownership trends to a degree. Interestingly, generation_order, representing the sequence of generations, was found to be the least significant predictor of homeownership rates. While generational trends and experiences might shape broader economic and social patterns, the specific generational cohort an individual belongs to does not substantially explain variations in homeownership rates compared to the financial burden of student loans and mortgages.

      As illustrated in Figure 10, the variables percent_education_loan and debt_at_grad emerge as the most influential factors in predicting the homeownership rate after percent_mortgage. This highlights the significant role that student loan debt plays in determining an individual’s likelihood of owning a home. The substantial impact of percent_education_loan suggests that the proportion of education-related debt an individual carries is a determinant of homeownership. Similarly, debt_at_grad indicates that the amount of debt incurred by graduation also affects homeownership prospects.

      It is crucial to acknowledge that student loan debt is not the sole determinant of homeownership trends. Inflation, for instance, has affected purchasing power and contributed to rising housing costs, which can make homeownership more challenging for many. Additionally, extensive evidence highlights how student loan debt creates barriers in mortgage eligibility and credit scores. According to a study by Federal Reserve Board economists, a $1,000 increase in student loan debt is associated with a decrease of about 1.8 percentage points in the homeownership rate for public four-year college-goers, resulting in a delay in purchasing a home (Mezza et al., 2020). Furthermore, as addressed above, despite our efforts to mitigate bias, the potential for underrepresentation of certain demographics in the survey samples remains a concern. Student loan debt can exacerbate racial disparities. Racial wealth and income gaps are rooted in historical discriminatory housing policies, meaning that Black students, in particular, may face greater financial risks in pursuing higher education (Fallon, 2022). Not adequately capturing the experiences of these underrepresented groups can affect our research and must be taken into consideration when evaluating our results.

      In summary, while student loan debt does impact homeownership trends, it is just one of several factors contributing to the complexity of financial decisions related to homeownership. The rising debt levels faced by Millennials and Gen Z are significant but must be considered alongside other financial elements such as mortgage debt, income, and savings. Our research underscores the need for a holistic view when analyzing homeownership trends and suggests that addressing the broader financial challenges faced by younger generations may be crucial in facilitating increased homeownership rates. This insight also emphasizes the need for targeted policies and interventions that address financial barriers directly, rather than focusing solely on generational characteristics. By prioritizing measures that alleviate student loan debt and improve financial stability, stakeholders can more effectively support homeownership aspirations across different generational cohorts.

References

Association, N. B. M. (2024). Homeownership Month with City National Bank. https://www.labmba.org/blog/homeownership-month-with-city-national-bank
Bialik, K., & Fry, R. (2019). Millennial Life: How Young Adulthood Today Compares with Prior Generations. https://www.pewresearch.org/social-trends/2019/02/14/millennial-life-how-young-adulthood-today-compares-with-prior-generations-2/
Bureau, U. S. C. (2024). Housing Vacancies and Homeownership (CPS/HVS). https://www.census.gov/housing/hvs/data/histtabs.html
Campisi, N. (2022). Mortgage Rates History. https://www.forbes.com/advisor/mortgages/mortgage-rates-history/
Chen, G. (2024). Multicollinearity (and Model Validation). https://www.sjsu.edu/faculty/guangliang.chen/Math261a/Ch9slides-multicollinearity.pdf
Cooper, D. H., & Wang, J. C. (2014). Student Loan Debt and Economic Outcomes. https://www.bostonfed.org/publications/current-policy-perspectives/2014/student-loan-debt-and-economic-outcomes.aspx
Data Science, P. for. (2018). Linear Regression. https://www.pythonfordatascience.org/linear-regression-python/
Datalab, A. (2021). Understanding Durbin-Watson Test. https://medium.com/@analyttica/durbin-watson-test-fde429f79203
Editors, CFR.org. (2024). Is Rising Student Debt Harming the U.S. Economy? https://www.cfr.org/backgrounder/us-student-loan-debt-trends-economic-impact
Fallon, K. (2022). How Student Loan Debt Affects the Racial Homeownership Gap. https://housingmatters.urban.org/articles/how-student-loan-debt-affects-racial-homeownership-gap
Gicheva, D., & Thompson, J. (2015). The Effects of Student Loans on Long-Term Household Financial Stability. https://research.upjohn.org/cgi/viewcontent.cgi?filename=8&article=1249&context=up_press&type=additional
Governors of the Federal Reserve System, B. of. (2020). Consumer Credit Outstanding. https://www.federalreserve.gov/releases/g19/HIST/cc_hist_memo_levels.html
Kopparam, R., & Clemens, A. (2020). The Rising Number of U.S. Households with Burdensome Student Debt Calls for a Federal Response. https://equitablegrowth.org/the-rising-number-of-u-s-households-with-burdensome-student-debt-calls-for-a-federal-response/
Martin, E. C., & Dwyer, R. E. (2021). Financial Stress, Race, and Student Debt During the Great Recession. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10373449/
McLoughlin, D. (2023). Student Loan Debt Statistics by Year. https://wordsrated.com/student-loan-debt-by-year/
Mezza, A., Ringo, D., Sherlund, S., & Sommer, K. (2020). Student Loans and Homeownership. Journal of Labor Economics, 38(1).
Statistics, L. (2018). Testing for Normality Using SPSS Statistics. https://statistics.laerd.com/spss-tutorials/testing-for-normality-using-spss-statistics.php#:~:text=If%20the%20Sig.,deviate%20from%20a%20normal%20distribution
University, T. P. S. (2018a). Detecting Multicollinearity Using Variance Inflation Factors. https://online.stat.psu.edu/stat462/node/180/#:~:text=The%20general%20rule%20of%20thumb,of%20serious%20multicollinearity%20requiring%20correction
University, T. P. S. (2018b). What Is Multicollinearity? https://online.stat.psu.edu/stat501/lesson/12/12.1