The dataset for this exercise (males.xlsx) contains data for young working males in the USA with some professional and personal characteristics. Please only use data for the year 1987. We want to explain the log wages from the other variables using the following model:
Logwageᵢ = β₁ + β₂ schoolᵢ + β₃ experᵢ + β₄ unionᵢ + β₅ marᵢ + β₆ blackᵢ + β₇ hispᵢ + εᵢ
We assume that all εᵢ and all explanatory variables are independent and that εᵢ are independently distributed with expectation 0 and variance σ².
A. Compare summary statistics of all the variables in the model and provide a brief interpretation.
According to the summary statistics table, there are 545 observations for each variable in 1987. The means are measured in different units, so they cannot be used to rank how strongly each variable influences log wages. Years of schooling has the largest mean simply because it is measured in years, while Black has the lowest mean (0.12) because it is a 0/1 dummy whose mean is the share of Black respondents in the sample.
For 1987, 26.2% of the respondents are union members, 61.5% are married, 11.6% are Black and 15.6% are Hispanic.
B. Estimate the parameters by OLS. Report and interpret the estimation results, including the R2. Pay attention to economic interpretation as well as statistical significance.
This table provides the R and R2 values.
- The R value is the multiple correlation coefficient, that is, the correlation between the observed and the fitted values of the dependent variable (Jain, 2019). In this case, the value is 0.394.
- The R2 value is the coefficient of determination: the share of the total variation in the dependent variable, log wages, that is explained by the independent variables (union membership, Hispanic, Black, years of schooling, marital status and experience). In this case, the value is 0.155, so the model explains about 15.5% of the variation in log wages, which is a modest fit.
- The adjusted R2 corrects R2 for the number of regressors in the model. In this case, the value is 0.145, which is close to R2 (0.155), so the fit is not merely a result of the number of regressors.
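The link between R2 and adjusted R2 can be checked by hand. The sketch below applies the standard formula to the values reported above; the count of k = 7 estimated coefficients (six regressors plus the intercept) is an assumption taken from the model equation, and the function name is illustrative.

```python
def adjusted_r2(r2: float, n: int, k: int) -> float:
    """Adjusted R2 for n observations and k estimated coefficients
    (the intercept counts as one of the k coefficients)."""
    return 1.0 - (1.0 - r2) * (n - 1) / (n - k)


# R2 = 0.155 and n = 545 are the values reported in the text.
# k = 7 is an assumption: six regressors plus the intercept.
print(round(adjusted_r2(0.155, 545, 7), 4))
```

The result is close to the reported 0.145. A small gap is expected because the R2 used as input is itself rounded to three decimals.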
Interpretation of Pearson’s correlation values
Independent variable name | Pearson correlation value | Result |
Years of schooling | 0.337 | Positive correlation |
Union member or not | 0.082 | Very weak positive correlation |
Married | 0.140 | Very weak positive correlation |
Black | -0.148 | Very weak negative correlation |
Hispanic | -0.018 | Very weak negative correlation |
Years of participating in the labour market (age-6-school) | -0.203 | Weak negative correlation |
Interpretation of significance (2-tailed) values
Independent variable name | Significance (2-tailed) value | Result (at the 5% significance level) |
Years of schooling | <0.001 | Significant |
Union member or not | 0.054 | Not significant (0.054 > 0.05) |
Married | 0.001 | Significant |
Black | <0.001 | Significant |
Hispanic | 0.675 | Not significant |
Years of participating in the labour market (age-6-school) | <0.001 | Significant |
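The significance values in this table follow from the correlation coefficients and the sample size. The sketch below is a rough check rather than the SPSS computation: it converts a Pearson correlation into a t-statistic and uses the standard normal distribution as an approximation to the t distribution, which is an assumption that is reasonable only because n = 545 is large.

```python
import math


def corr_p_value(r: float, n: int) -> float:
    """Approximate two-sided p-value for a Pearson correlation r
    computed from n observations."""
    # t-statistic for H0: the population correlation is zero
    t = r * math.sqrt(n - 2) / math.sqrt(1.0 - r * r)
    # Normal approximation to the t distribution (assumes large n)
    return math.erfc(abs(t) / math.sqrt(2.0))


# Correlations with log wages as reported in the table above
for name, r in [("union", 0.082), ("hispanic", -0.018)]:
    print(name, round(corr_p_value(r, 545), 3))
```

For union membership the approximate p-value lies just above 0.05, which is why that correlation is not significant at the 5% level even though it is close to the threshold.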
The next table is the ANOVA table, which tests whether the regressors jointly explain a significant part of the variation in log wages.
Elements of this table relevant for interpreting the results are:
- P-value (Sig. value): The test is carried out at the 5% significance level. In the ANOVA table, the p-value is <0.001. Therefore, the model as a whole is significant.
- F-ratio: It represents the improvement in the prediction of the dependent variable obtained by fitting the model, relative to the inaccuracy that remains in the model (Jain, 2019). An F-ratio well above 1 indicates that the regressors have joint explanatory power. In the ANOVA table, the value is 16.437.
Because the p-value of the ANOVA table is below the 5% significance level, we reject the null hypothesis that all slope coefficients are jointly zero.
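The F-ratio of the ANOVA table can be recovered from R2 alone. The sketch below uses the standard relation between the two for a model with an intercept; the count of k = 7 coefficients is an assumption based on the model equation, and the function name is illustrative.

```python
def f_from_r2(r2: float, n: int, k: int) -> float:
    """Overall F-statistic for H0: all k - 1 slope coefficients are zero,
    computed from R2, n observations and k coefficients (incl. intercept)."""
    return (r2 / (k - 1)) / ((1.0 - r2) / (n - k))


# R2 = 0.155 and n = 545 are reported in the text; k = 7 is an assumption.
print(round(f_from_r2(0.155, 545, 7), 2))
```

The value is close to the reported 16.437, with the small difference due to the rounding of R2.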
C. On the basis of the results in b, test the null hypothesis that being a union member, ceteris paribus affects a person’s expected wage by a 5% significant level. Also, test the joint hypothesis that race does not affect wages. In each case formulate the null and alternative hypotheses and present the test statistic.
Interpretation will be as follows:
Independent variable name | Sig. value | Hypothesis test result (5% significance level) | Interpretation |
Years of schooling | <0.001 | Null hypothesis rejected (<0.001 < 0.05) | Years of schooling have a significant effect on log wages, because the Sig. value is below 0.05. Since the dependent variable is in logs, the B value of 0.088 means that one additional year of schooling is associated with a wage that is approximately 8.8% higher, ceteris paribus. |
Years of participation in the labour market (age-6-school) | 0.890 | Null hypothesis not rejected (0.890 > 0.05) | Experience has no significant effect on log wages in this model, because the Sig. value of 0.890 is above 0.05. |
Union member or not | 0.006 | Null hypothesis rejected (0.006 < 0.05) | Union membership has a significant effect on log wages, because the Sig. value of 0.006 is below 0.05. The B value of 0.117 means that union members earn approximately 11.7% more than comparable non-members, ceteris paribus. |
Married | 0.020 | Null hypothesis rejected (0.020 < 0.05) | Marital status has a significant effect on log wages, because the Sig. value of 0.020 is below 0.05. The B value of 0.091 means that married men earn approximately 9.1% more than comparable unmarried men, ceteris paribus. |
Black | 0.001 | Null hypothesis rejected (0.001 < 0.05) | Being Black has a significant negative effect on log wages, because the Sig. value of 0.001 is below 0.05. The B value of -0.088 means that Black respondents earn approximately 8.8% less than comparable non-Black respondents, ceteris paribus. |
Hispanic | 0.420 | Null hypothesis not rejected (0.420 > 0.05) | Being Hispanic has no significant effect on log wages, because the Sig. value of 0.420 is above 0.05. |
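Because the dependent variable is the log wage, a B value is read as an approximate proportional change in the wage for a one-unit change in the regressor (one more year of schooling, or a dummy switching from 0 to 1). The sketch below shows the exact conversion, 100·(exp(B) − 1); the helper name is illustrative and the inputs are the B values quoted in the table above.

```python
import math


def pct_effect(b: float) -> float:
    """Exact percentage change in the wage implied by a coefficient b
    in a regression with log wage as the dependent variable."""
    return 100.0 * (math.exp(b) - 1.0)


# B values as reported in the table above
for name, b in [("school", 0.088), ("union", 0.117), ("married", 0.091)]:
    print(name, round(pct_effect(b), 1))
```

For small coefficients the exact figure stays close to 100·B, which is why reading 0.117 as "roughly 12%" is a common shortcut.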
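Therefore, the analysis suggests that years of schooling, union membership and marital status have a significant positive relationship with log wages, while being Black has a significant negative relationship. Experience and Hispanic are not significant at the 5% level.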
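The joint hypothesis that race does not affect wages (H0: β₆ = β₇ = 0 against H1: at least one of them is non-zero) is usually tested with an F-test that compares the residual sum of squares of the full model with that of a model without black and hisp. The sketch below illustrates the mechanics on synthetic data, since males.xlsx is not reproduced here; all numbers in the data-generating step are made up and the function names are illustrative.

```python
import numpy as np


def rss(X: np.ndarray, y: np.ndarray) -> float:
    """Residual sum of squares of an OLS regression of y on X."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return float(resid @ resid)


def joint_f(X_unres: np.ndarray, X_res: np.ndarray, y: np.ndarray) -> float:
    """F-statistic for dropping the columns that X_unres has and X_res lacks."""
    n, k = X_unres.shape
    j = k - X_res.shape[1]  # number of restrictions
    rss_u, rss_r = rss(X_unres, y), rss(X_res, y)
    return ((rss_r - rss_u) / j) / (rss_u / (n - k))


# Synthetic data (assumption): NOT the males.xlsx sample.
rng = np.random.default_rng(0)
n = 545
school = rng.integers(8, 17, n).astype(float)
black = (rng.random(n) < 0.12).astype(float)
hisp = ((rng.random(n) < 0.16) & (black == 0)).astype(float)
y = 1.0 + 0.09 * school - 0.5 * black + rng.normal(0.0, 0.4, n)

const = np.ones(n)
X_u = np.column_stack([const, school, black, hisp])  # unrestricted model
X_r = np.column_stack([const, school])               # race dummies dropped
f_stat = joint_f(X_u, X_r, y)
print(round(f_stat, 2))  # compare with the 5% critical value of F(2, n - k), about 3.0
```

With the real data, the same comparison would use all seven coefficients in the unrestricted model and the five non-race coefficients in the restricted one.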
D. Consider a more general model that includes experᵢ². Compare this model with the model given above using R2, adjusted R2 and t-test. What is your conclusion?
This table provides the R and R2 values for the model that includes experᵢ².
- The R value in this model is 0.406, which is higher than in the previous model (0.394).
- The R2 value in this model is 0.165, which is higher than in the previous model (0.155). R2 never decreases when a regressor is added, so this increase alone does not show that the new model is better.
- The adjusted R2 in this model is 0.154, compared with 0.145 in the original model. Because the adjusted R2 penalises the extra regressor and still increases, adding experᵢ² improves the fit.
Interpretation will be as follows:
Independent variable name | Original model: t | Original model: Sig. value | New model (with experᵢ²): t | New model (with experᵢ²): Sig. value | Hypothesis test result (new model, 5% significance level) | Interpretation |
Years of schooling | 6.689 | <0.001 | 7.100 | <0.001 | Null hypothesis rejected (<0.001 < 0.05) | Schooling has a significant positive effect on log wages; the t-value is above 1.96 in both models. In the new model, one additional year of schooling is associated with a wage that is approximately 9.5% higher (B value 0.095). |
Years of participation in the labour market (age-6-school) | -0.139 | 0.890 | -2.542 | 0.011 | Null hypothesis rejected (0.011 < 0.05) | In the original model, experience is not significant (t = -0.139, Sig. value 0.890), so the null hypothesis was not rejected. In the new model, the coefficient is negative and significant (t = -2.542, which is below -1.96, with a Sig. value of 0.011), so the null hypothesis is rejected. |
Union member or not | 2.733 | 0.006 | 3.006 | 0.003 | Null hypothesis rejected (0.003 < 0.05) | Union membership has a significant positive effect on log wages; the t-value is above 1.96 in both models. In the new model, union members have log wages that are 0.129 higher than those of comparable non-members (B value), roughly 13% to 14% higher wages. |
Marital status | 2.342 | 0.020 | 2.361 | 0.019 | Null hypothesis rejected (0.019 < 0.05) | Marital status has a significant positive effect on log wages; the t-value is above 1.96 in both models. Married men earn approximately 9.1% more than comparable unmarried men (B value 0.091). |
Black | -3.253 | 0.001 | -3.191 | 0.002 | Null hypothesis rejected (0.002 < 0.05) | The t-values in both models are below -1.96, so being Black has a significant negative effect on log wages. In the new model, Black respondents have log wages that are 0.194 lower than those of comparable non-Black respondents (B value), roughly 18% lower wages. |
Hispanic | 0.806 | 0.420 | 0.628 | 0.530 | Null hypothesis not rejected (0.530 > 0.05) | Being Hispanic has no significant effect on log wages in either model: the t-values (0.806 and 0.628) are below 1.96 in absolute value and both Sig. values are above 0.05. |
Experience squared | n/a | n/a | 2.551 | 0.011 | Null hypothesis rejected (0.011 < 0.05) | The coefficient on experience squared is positive and significant (t = 2.551 > 1.96; B value 0.010). Combined with the negative coefficient on experience, this means that the effect of an additional year of experience on log wages depends on the level of experience already reached. |
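In the new model, which includes experᵢ², the significance of experience changes: experience now affects log wages significantly (p-value = 0.011), and experᵢ² is itself significant (p-value = 0.011). Both R2 and adjusted R2 are higher than in the original model, so the more general model is preferred. Log wages are still not affected by Hispanic (p-value = 0.530 > 0.05), so Hispanic is the only variable we could consider removing from the new model.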
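The figures used in this comparison can be cross-checked against each other. The sketch below does two things under stated assumptions: it converts t-values into two-sided p-values using the standard normal approximation (reasonable only because the sample is large), and it computes the F-statistic for adding experᵢ² from the two reported R2 values, which should be close to the square of the t-value of experᵢ². The count of k = 8 coefficients in the new model is an assumption, and the function names are illustrative.

```python
import math


def two_sided_p(t: float) -> float:
    """Approximate two-sided p-value for a t-statistic
    (standard normal approximation, assumes a large sample)."""
    return math.erfc(abs(t) / math.sqrt(2.0))


def f_for_added_regressors(r2_new: float, r2_old: float,
                           n: int, k_new: int, j: int = 1) -> float:
    """F-statistic for the j regressors added to move from the old
    model to the new model, computed from the two R2 values."""
    return ((r2_new - r2_old) / j) / ((1.0 - r2_new) / (n - k_new))


# t-values of exper (original model, new model) and of exper squared
for t in (-0.139, -2.542, 2.551):
    print(t, round(two_sided_p(t), 3))

# R2 values reported in the text; k_new = 8 is an assumption
f_added = f_for_added_regressors(0.165, 0.155, 545, 8)
print(round(f_added, 2), round(2.551 ** 2, 2))
```

A t-value of -2.542 corresponds to a p-value of about 0.011, not 0.11, and the F-statistic from the R2 change is close to 2.551², as expected for a single added regressor (the gap comes from rounding in the reported R2 values).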
E. Save the OLS residuals from the original model. Run a regression where you try to explain the residuals from the explanatory variables in the original regression. What do you find? Explain.
The histogram of the residuals roughly follows the shape of the normal curve that is superimposed on it.
The scatterplot gives a general idea of the relationship between log wages and the six independent variables. There appears to be a positive relationship, as there are more points in the bottom-left and top-right quarters of the plot than in the top-left and bottom-right quarters.
In the normal P-P plot of the regression standardized residuals, the points more or less follow the diagonal line, so we assume that the residuals are approximately normally distributed.
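For the auxiliary regression that the exercise asks for, a general property of OLS applies: the residuals are orthogonal to every regressor by construction, so regressing them on the same explanatory variables gives coefficients of zero and an R2 of zero (up to rounding error). The sketch below demonstrates this on synthetic data, which is an assumption made only because the original data file is not reproduced here.

```python
import numpy as np

# Synthetic data (assumption): three arbitrary regressors plus an intercept.
rng = np.random.default_rng(1)
n = 545
X = np.column_stack([np.ones(n), rng.normal(size=(n, 3))])
y = X @ np.array([1.0, 0.5, -0.3, 0.2]) + rng.normal(0.0, 0.5, n)

# Step 1: original regression and its residuals
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta

# Step 2: regress the residuals on the same regressors
gamma, *_ = np.linalg.lstsq(X, resid, rcond=None)
fitted = X @ gamma
# The residuals have mean zero because X contains an intercept,
# so the total sum of squares is simply resid @ resid.
r2_aux = float(fitted @ fitted) / float(resid @ resid)

print(np.round(gamma, 10))
print(r2_aux)
```

The zero coefficients say nothing about whether the model assumptions hold; they follow mechanically from the OLS first-order conditions.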
F. Extend the model to investigate whether black union members benefit more from union membership than non-black union members. Estimate the extended model and test the hypothesis.
There are 143 union members in the 1987 sample. Looking at the mean values in the table above, 78.32% of them are non-Black (mean = 0.7832) and 21.68% are Black (mean = 0.2168). These means are sample shares and do not by themselves show how the log wages of either group are affected.
Table for Non-Black Union
Table for Black Union
Interpretation will be as follows:
Independent variable name | t-value | Sig. value | Hypothesis test result (5% significance level) | Interpretation |
Non-Black Union Members | 2.109 | 0.037 | Null hypothesis rejected (0.037 < 0.05) | The coefficient is positive and significant (t = 2.109 > 1.96; Sig. value 0.037 < 0.05). The B value of 0.174 means that non-Black union members have log wages that are 0.174 higher than those of Black union members. |
Black Union Members | -2.109 | 0.037 | Null hypothesis rejected (0.037 < 0.05) | The coefficient is negative and significant (t = -2.109 < -1.96; Sig. value 0.037 < 0.05). The B value of -0.174 mirrors the result above: Black union members have log wages that are 0.174 lower than those of non-Black union members. |
In conclusion, these results suggest that non-Black union members benefit more from membership than Black union members, since their log wages are significantly higher (by 0.174).
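A common alternative way to address this question is to keep the full sample and add an interaction term unionᵢ × blackᵢ to the original model. Its coefficient measures how much larger (or smaller) the union premium is for Black workers, and its t-statistic tests H0: no difference. The sketch below shows the mechanics on synthetic data; the data-generating numbers, including the interaction effect of 0.2, are hypothetical and are not estimates from males.xlsx.

```python
import numpy as np


def ols(X: np.ndarray, y: np.ndarray):
    """OLS coefficients and their conventional standard errors."""
    n, k = X.shape
    xtx_inv = np.linalg.inv(X.T @ X)
    beta = xtx_inv @ X.T @ y
    resid = y - X @ beta
    s2 = float(resid @ resid) / (n - k)  # estimate of the error variance
    se = np.sqrt(s2 * np.diag(xtx_inv))
    return beta, se


# Synthetic data (assumption): a larger sample than the real one, for stability.
rng = np.random.default_rng(2)
n = 5000
union = (rng.random(n) < 0.26).astype(float)
black = (rng.random(n) < 0.12).astype(float)
true_delta = 0.2  # hypothetical extra union premium for Black workers
y = (1.5 + 0.12 * union - 0.15 * black
     + true_delta * union * black + rng.normal(0.0, 0.3, n))

X = np.column_stack([np.ones(n), union, black, union * black])
beta, se = ols(X, y)
t_delta = beta[3] / se[3]  # t-statistic for H0: interaction coefficient = 0
print(round(beta[3], 3), round(t_delta, 2))
```

In the real application the remaining regressors (school, exper, mar, hisp) would stay in the model alongside the interaction term.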
G. Make plots that can be used to investigate heteroskedasticity
To check for heteroskedasticity, we run a linear regression with log wages as the dependent variable and the following independent variables: years of schooling, years of participation in the labour market (age-6-school), marital status, Black, Hispanic and union membership. We then inspect the residuals.
In the scatterplot, the points are spread out and do not form a clear pattern. We therefore conclude that the regression model (log wages on the six independent variables: union, marital status, exper, school, Black and Hispanic) shows no visible sign of heteroskedasticity.
In the auxiliary regression that uses the squared residuals as the dependent variable, the ANOVA test has a p-value of 0.152, which is greater than 0.05. We therefore do not reject the null hypothesis of homoskedasticity, and there is no evidence of a heteroskedasticity problem.
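The auxiliary regression of the squared residuals on the regressors is the idea behind the Breusch-Pagan test. The sketch below implements the n·R2 (Koenker) version of that test on synthetic data with one regressor, so that the chi-squared distribution with 1 degree of freedom can be evaluated with the standard library. The data are deliberately heteroskedastic and entirely made up, so the test should reject here, unlike in the analysis above.

```python
import math
import numpy as np


def breusch_pagan_lm(X: np.ndarray, y: np.ndarray) -> float:
    """n * R2 from regressing squared OLS residuals on X (X includes a constant)."""
    n = X.shape[0]
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    e2 = (y - X @ beta) ** 2
    gamma, *_ = np.linalg.lstsq(X, e2, rcond=None)
    fitted = X @ gamma
    r2 = 1.0 - np.sum((e2 - fitted) ** 2) / np.sum((e2 - e2.mean()) ** 2)
    return float(n * r2)


# Synthetic data (assumption): the error spread grows with x.
rng = np.random.default_rng(3)
n = 1000
x = rng.uniform(1.0, 5.0, n)
X = np.column_stack([np.ones(n), x])
y = 1.0 + 0.5 * x + x * rng.normal(0.0, 0.3, n)

lm = breusch_pagan_lm(X, y)
# Upper-tail probability of a chi-squared variable with 1 degree of freedom
p_value = math.erfc(math.sqrt(lm / 2.0))
print(round(lm, 1), p_value)
```

With six regressors, as in the model above, the statistic would be compared with a chi-squared distribution with 6 degrees of freedom instead.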
References
George, D., & Mallery, P. (2013). IBM SPSS Statistics 21 Step by Step: A Simple Guide and Reference. Available at: https://mymoodle.lnu.se/pluginfile.php/7349976/mod_book/chapter/428979/IBM_SPSS_Statistics_Brief_Guide%20%281%29.pdf.
Accredited Professional Statistician For Hire. (2022). Use and Interpret Multiple Regression in SPSS. [online] Available at: https://www.scalestatistics.com/multiple-regression.html [Accessed 25 Jan. 2023].
Jain, R. (2019). How to interpret the results of the linear regression test in SPSS? [online] Knowledge Tank. Available at: https://www.projectguru.in/interpret-results-linear-regression-test-spss/ [Accessed 25 Jan. 2023].
Pallant, J. (2010). SPSS Survival Manual: A Step by Step Guide to Data Analysis Using SPSS. Maidenhead: Open University Press/McGraw-Hill.