
SAS Data Analysis

Part I: Descriptive Statistics 

1. Using the data set IRIS in the SASHELP library, generate summary statistics for the variables PetalLength and PetalWidth. Include the mean, standard deviation, median, the number of non-missing observations, and the number of missing observations. Then make a histogram of the variable PetalLength and describe its distribution.

proc means data=SASHELP.IRIS chartype mean std median n nmiss vardef=df 

qmethod=os;

var PetalLength PetalWidth;

run;

proc univariate data=SASHELP.IRIS vardef=df noprint;

var PetalLength PetalWidth;

histogram PetalLength PetalWidth / normal(noprint);

run;

Variable      Label               Mean         Std Dev      Median       N     N Miss
PetalLength   Petal Length (mm)   37.5800000   17.6529823   43.5000000   150   0
PetalWidth    Petal Width (mm)    11.9933333    7.6223767   13.0000000   150   0

From the histogram, we can see that the distribution of PetalLength does not follow the normal curve very well; PetalLength appears to be a non-normally distributed variable.
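As a quick visual cross-check (a minimal sketch, not part of the required output), a kernel density estimate can be overlaid alongside the fitted normal curve in PROC UNIVARIATE so the departure from normality is easier to see:

proc univariate data=SASHELP.IRIS noprint;
   var PetalLength;
   histogram PetalLength / normal kernel;  /* overlay fitted normal curve and kernel density */
run;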

2. Using the same dataset in question 1, check for normality for the variable PetalLength. Add a Normal Quantile-Quantile (Q-Q) plot and compute the sample skewness for the variable. 

proc univariate data=sashelp.iris normal plot;

var petallength;

qqplot petallength /normal(mu=est sigma=est color=red l=1);

run;

proc freq data=sashelp.iris;

tables species/chisq;

run;

We can see that the data points in the tails do not fall along the straight reference line. This clear departure from the line in the Q-Q plot indicates that PetalLength likely does not follow a normal distribution.

Tests for Normality
Test                 Statistic          p Value
Shapiro-Wilk         W      0.876268    Pr < W      <0.0001
Kolmogorov-Smirnov   D      0.198154    Pr > D      <0.0100
Cramer-von Mises     W-Sq   1.222285    Pr > W-Sq   <0.0050
Anderson-Darling     A-Sq   7.678546    Pr > A-Sq   <0.0050

According to the SAS documentation (Sas.com, 2023), the Shapiro-Wilk test is preferred when the sample size is less than 2000. Since the p-value for each normality test is less than 0.05, we reject the null hypothesis of normality for each test. This means there is sufficient evidence to conclude that PetalLength is not normally distributed.
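If the test results are needed downstream (for example, to report the Shapiro-Wilk p-value programmatically), the table can be captured with ODS OUTPUT. This is a minimal sketch; the data set name work.normtests is just an illustrative choice:

ods output TestsForNormality=work.normtests;   /* capture the normality-test table */
proc univariate data=sashelp.iris normal;
   var PetalLength;
run;
proc print data=work.normtests;
run;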

The UNIVARIATE Procedure

Variable: PetalLength (Petal Length (mm))

Moments
N                 150          Sum Weights        150
Mean              37.58        Sum Observations   5637
Std Deviation     17.6529823   Variance           311.627785
Skewness          -0.2748842   Kurtosis           -1.4021034
Uncorrected SS    258271       Corrected SS       46432.54
Coeff Variation   46.9744075   Std Error Mean     1.44135997

The variable PetalLength has a negative skewness value (-0.27), which indicates that the data are skewed to the left.
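If only the shape statistics are of interest, the same skewness (and kurtosis) figures can be requested directly from PROC MEANS without the full UNIVARIATE output; a minimal sketch:

proc means data=sashelp.iris n mean std skew kurt maxdec=4;
   var PetalLength;
run;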

3. Using the same data as in exercise 1, compute the one-way table of the frequencies for the variable Species. 

proc freq data=SASHELP.IRIS;

tables Species / plots=(freqplot cumfreqplot);

run;

The FREQ Procedure

Iris Species
Species      Frequency   Percent   Cumulative Frequency   Cumulative Percent
Setosa       50          33.33     50                     33.33
Versicolor   50          33.33     100                    66.67
Virginica    50          33.33     150                    100.00
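If the one-way table is also needed as a data set (for example, for later reporting), PROC FREQ can write it out; this is an optional sketch, and work.species_freq is just an illustrative name:

proc freq data=sashelp.iris;
   tables Species / out=work.species_freq outcum;  /* save counts, percents, and cumulative columns */
run;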

4. Using the same dataset, generate a box-plot for the variable PetalLength using Species as the category variable. Remove the species Virginica from the data presented in the boxplot.

data work.filter;

set sashelp.iris;

where Species in ('Setosa', 'Versicolor');

run;

proc print data=work.filter;

run;

ods graphics / reset width=6.4in height=4.8in imagemap;

proc sgplot data=WORK.FILTER;

vbox PetalLength / category=Species;

yaxis grid;

run;

ods graphics / reset;

  • The median line of the Setosa box lies entirely outside the Versicolor box, so there is likely a difference between PetalLength in Setosa and PetalLength in Versicolor.
  • Petal length in Setosa is noticeably shorter than in Versicolor; this obvious difference between the two species is worth further investigation.
  • The box for Setosa sits much lower than the box for Versicolor and is also shorter, meaning its data points cluster tightly around the center values (see the quartile sketch after this list).
  • The box for Versicolor is taller and has longer whiskers, indicating a larger range, more variable data, and a wider, more scattered distribution.
  • The outliers (the points plotted beyond the whiskers) for Setosa and Versicolor lie more than 1.5 times the interquartile range beyond the box edges.
  • The Setosa box is completely below the Versicolor box, which indicates a difference between the two groups.
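To put numbers behind the visual comparison, the quartiles and interquartile range for each species can be computed directly; a minimal sketch using the filtered data set created above:

proc means data=work.filter q1 median q3 qrange maxdec=1;
   class Species;
   var PetalLength;
run;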

Part II: One-sample Tests 

1. Using the data set Heart in the SASHELP library, use a one-way t-test to determine if the mean weight of the population from which the sample was drawn is equal to 150 lbs. Include a test of normality and generate a histogram and box-plot. Should you be concerned that the test for normality rejects the hypothesis at the 0.05 significance level? 

proc ttest data=sashelp.heart h0=150 plots(showh0) alpha=0.05;

   var weight;

run;

The TTEST Procedure

Variable: Weight

N      Mean    Std Dev   Std Err   Minimum   Maximum
5203   153.1   28.9154   0.4009    67.0000   300.0

Mean    95% CL Mean      Std Dev   95% CL Std Dev
153.1   152.4   153.9    28.9154   28.3704   29.4820

DF     t Value   Pr > |t|
5202   7.70      <.0001

Since the p-value is less than .05 (t = 7.70, p < .0001), we reject the null hypothesis. This means we have sufficient evidence to say that the mean weight of the population from which the sample was drawn is different from 150 lbs.
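As a cross-check on the reported statistic, the one-sample t value can be reproduced by hand from the summary statistics, t = (sample mean - 150) / (std / sqrt(n)); a minimal sketch using the values from the output above:

data _null_;                      /* reproduce t = (mean - 150) / (std / sqrt(n)) */
   mean = 153.086681;
   std  = 28.9154261;
   n    = 5203;
   t    = (mean - 150) / (std / sqrt(n));
   put t= 8.2;                    /* prints roughly 7.70, matching PROC TTEST */
run;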

The PROC TTEST summary panel shows the histogram with overlaid normal and kernel densities, a box plot, the 95% confidence interval for the mean, and the null value of 150 lbs. The confidence interval excludes the null value, consistent with rejecting the null hypothesis at alpha = 5%.

The curvilinear shape of the Q-Q plot suggests a possible slight deviation from normality. 

proc univariate data=sashelp.heart normal plot;

var weight;

qqplot weight /normal(mu=150 sigma=est color=red l=1);

run;

proc freq data=sashelp.heart;

tables weight/chisq;

run;

Tests for Normality
Test                 Statistic           p Value
Kolmogorov-Smirnov   D      0.048465     Pr > D      <0.0100
Cramer-von Mises     W-Sq   3.05329      Pr > W-Sq   <0.0050
Anderson-Darling     A-Sq   18.50518     Pr > A-Sq   <0.0050

Since the p-value for each normality test is less than .05, we reject the null hypothesis for each test. This means there is sufficient evidence to conclude that Weight is not normally distributed, which is consistent with the histogram and Q-Q plot. However, with such a large sample (n = 5,203), the sampling distribution of the mean is approximately normal by the central limit theorem, so the rejection of normality is not a serious concern for the validity of the one-sample t-test.

The UNIVARIATE Procedure

Variable: Weight

Moments
N                 5203         Sum Weights        5203
Mean              153.086681   Sum Observations   796510
Std Deviation     28.9154261   Variance           836.101866
Skewness          0.55594115   Kurtosis           0.52275608
Uncorrected SS    126284474    Corrected SS       4349401.91
Coeff Variation   18.8882703   Std Error Mean     0.40086919

The variable Weight has a positive skewness value (0.55), which indicates that the data are skewed to the right (a right-skewed distribution), consistent with the histogram and Q-Q plot.

Part III: Two-sample tests 

1. Using the Heart data in the SASHELP library, compare the Systolic (systolic blood pressure) for males and females (the variable Sex indicates gender, where the values F and M indicate female and male, respectively). Generate a histogram and box-plot to investigate the distribution of Systolic for males and females. Can we assume that these are normally distributed based on the histogram?

proc means data=SASHELP.HEART chartype mean std median n nmiss vardef=df 

qmethod=os;

var Systolic;

class Sex;

run;

proc univariate data=SASHELP.HEART vardef=df noprint;

var Systolic;

class Sex;

histogram Systolic / normal(noprint);

run;

Analysis Variable : Systolic
Sex      N Obs   Mean          Std Dev      Median        N      N Miss
Female   2873    136.8861817   25.9835883   130.0000000   2873   0
Male     2336    136.9383562   20.6535522   134.5000000   2336   0

From the histograms, we can see that Systolic for both males and females is slightly right-skewed, with a tail on the right side of the distribution. This is a positively skewed, non-normal distribution.
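A paneled histogram puts the two groups on a common axis, which makes comparing the shapes easier; a minimal sketch using PROC SGPANEL (an optional alternative to the class-level histograms above):

proc sgpanel data=sashelp.heart;
   panelby Sex / columns=2;   /* one panel per gender, side by side */
   histogram Systolic;
   density Systolic;          /* overlay a fitted normal density */
run;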

proc sgplot data=SASHELP.HEART;

vbox Systolic / category=Sex;

yaxis grid;

run;

ods graphics / reset;

  • The male and female Systolic boxes overlap with one another, which indicates that the groups are not very different.
  • The median line of the female box is slightly lower than the median line of the male box, so there may be a small difference between Systolic in females and males.
  • The box for males is shorter than the box for females, which means the male values cluster more tightly around the center values.
  • The box for females is taller and has longer whiskers, indicating a larger range, more variable data, and a wider, more scattered distribution.
  • The outliers (the data points plotted beyond the whiskers) for both females and males lie above the boxes, more than 1.5 times the interquartile range beyond the upper quartile.

proc sgplot data=SASHELP.HEART;

  vbox Systolic / group=sex;

  keylegend / title="Systolic for Males and Females";

run;

  • The median values: the line in the middle of the box plot for males is higher than the line for females, which indicates that males have a higher median Systolic.
  • The dispersion: the box plot for females is slightly longer than for males, which indicates that Systolic values are more spread out among females.
  • The skewness: the line in the middle of the box plot for males sits near the center of the box, which indicates that the distribution of Systolic shows little skew within the box.

2. Conduct the tests for normality. Do you reject the null hypothesis that the data values are normally distributed? 

proc univariate data=sashelp.heart normal;

var systolic;

class sex;

run;

Variable: Systolic

Sex = Female

Moments
N                 2873         Sum Weights        2873
Mean              136.886182   Sum Observations   393274
Std Deviation     25.9835883   Variance           675.14686
Skewness          1.51582354   Kurtosis           3.98506675
Uncorrected SS    55772798     Corrected SS       1939021.78
Coeff Variation   18.9818928   Std Error Mean     0.48476506

Basic Statistical Measures
Location                       Variability
Mean     136.8862              Std Deviation         25.98359
Median   130.0000              Variance              675.14686
Mode     120.0000              Range                 218.00000
                               Interquartile Range   30.00000

Tests for Location: Mu0=0
Test          Statistic        p Value
Student's t   t    282.3763    Pr > |t|    <.0001
Sign          M    1436.5      Pr >= |M|   <.0001
Signed Rank   S    2064251     Pr >= |S|   <.0001

Tests for Normality
Test                 Statistic          p Value
Kolmogorov-Smirnov   D      0.124774    Pr > D      <0.0100
Cramer-von Mises     W-Sq   10.35425    Pr > W-Sq   <0.0050
Anderson-Darling     A-Sq   60.9063     Pr > A-Sq   <0.0050
  • Since the p-value for each normality test is less than .05, we reject the null hypothesis for each normality test.
  • This means there is sufficient evidence to conclude that Systolic for females is not normally distributed.
Quantiles (Definition 5)
Level        Quantile
100% Max     300
99%          230
95%          185
90%          170
75% Q3       150
50% Median   130
25% Q1       120
10%          110
5%           106
1%           98
0% Min       82

Extreme Observations
Lowest              Highest
Value   Obs         Value   Obs
82      554         280     2726
86      1125        286     4173
89      2425        290     5099
90      4829        294     3251
90      3236        300     4629

Variable: Systolic

Sex = Male

Moments
N                 2336         Sum Weights        2336
Mean              136.938356   Sum Observations   319888
Std Deviation     20.6535522   Variance           426.569218
Skewness          1.33026148   Kurtosis           3.6509985
Uncorrected SS    44800976     Corrected SS       996039.123
Coeff Variation   15.0823719   Std Error Mean     0.42732504

Basic Statistical Measures
Location                       Variability
Mean     136.9384              Std Deviation         20.65355
Median   134.5000              Variance              426.56922
Mode     140.0000              Range                 186.00000
                               Interquartile Range   22.50000

Tests for Location: Mu0=0
Test          Statistic        p Value
Student's t   t    320.4548    Pr > |t|    <.0001
Sign          M    1168        Pr >= |M|   <.0001
Signed Rank   S    1364808     Pr >= |S|   <.0001

Tests for Normality
Test                 Statistic          p Value
Kolmogorov-Smirnov   D      0.117447    Pr > D      <0.0100
Cramer-von Mises     W-Sq   5.452689    Pr > W-Sq   <0.0050
Anderson-Darling     A-Sq   32.9541     Pr > A-Sq   <0.0050
  • Since the p-value for each normality test is less than .05, we reject the null hypothesis for each normality test.
  • This means there is sufficient evidence to conclude that Systolic for males is not normally distributed.
Quantiles (Definition 5)
Level        Quantile
100% Max     276.0
99%          204.0
95%          174.0
90%          162.0
75% Q3       146.0
50% Median   134.5
25% Q1       123.5
10%          114.0
5%           110.0
1%           102.0
0% Min       90.0

Extreme Observations
Lowest              Highest
Value   Obs         Value   Obs
90      2495        234     3115
94      5209        246     3608
94      4821        250     3574
96      5078        260     3953
96      4549        276     3598

  • Systolic for both females and males shows a positive skewness value (1.52 and 1.33, respectively), which indicates that the data are skewed to the right (right-skewed distributions) and do not follow a normal distribution.
  • From the histograms we can see that the distributions of Systolic for males and for females do not follow the normal curve very well, which agrees with the results of the normality tests that we performed (an optional distribution-free check is sketched after this list).
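Because both groups depart from normality, a distribution-free comparison could be run as a sanity check alongside the t-test in the next question; this is an optional sketch, not part of the required output:

proc npar1way data=sashelp.heart wilcoxon;   /* Wilcoxon rank-sum test for Systolic by Sex */
   class Sex;
   var Systolic;
run;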

3. Run a two-sample t-test comparing the Systolic (systolic blood pressure) for males and females, where H0 states that the difference in population means is zero (use a 1% significance level). Can we assume equal variances for males and females?

proc ttest data=sashelp.heart h0=0 plots(showh0) alpha=0.01;

  var systolic;

  class sex;

run;

The TTEST Procedure

Variable: Systolic

Sex          Method          N      Mean      Std Dev   Std Err   Minimum   Maximum
Female                       2873   136.9     25.9836   0.4848    82.0000   300.0
Male                         2336   136.9     20.6536   0.4273    90.0000   276.0
Diff (1-2)   Pooled                 -0.0522   23.7419   0.6614
Diff (1-2)   Satterthwaite          -0.0522             0.6462

Sex          Method          Mean      99% CL Mean         Std Dev   99% CL Std Dev
Female                       136.9     135.6     138.1     25.9836   25.1277   26.8955
Male                         136.9     135.8     138.0     20.6536   19.9016   21.4604
Diff (1-2)   Pooled          -0.0522   -1.7565   1.6522    23.7419   23.1564   24.3556
Diff (1-2)   Satterthwaite   -0.0522   -1.7173   1.6130

Equality of Variances
Method     Num DF   Den DF   F Value   Pr > F
Folded F   2872     2335     1.58      <.0001
  • The ratio of the sample variances (Folded F = 1.58) is not dramatically large, but with samples this size the test is sensitive to even modest differences in variance.
  • The Equality of Variances test is significant (F = 1.58, Pr > F < .0001), so we reject the null hypothesis of equal population variances. The variances therefore appear to be unequal, and the Satterthwaite (unequal-variance) result is the appropriate one to report.
Method          Variances   DF       t Value   Pr > |t|
Pooled          Equal       5207     -0.08     0.9371
Satterthwaite   Unequal     5204.4   -0.08     0.9357
  • From the t-test table above we therefore report the Satterthwaite (unequal variances) result: t = -0.08 with 5204.4 degrees of freedom and p = 0.9357. Since the p-value is far greater than the 1% significance level, we fail to reject the null hypothesis that the difference in population mean Systolic between males and females is zero.
  • Based on the Folded F test, we cannot assume equal variances for males and females at the 1% significance level, which is why the Satterthwaite test is used (see the variance-ratio sketch after this list).
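The Folded F statistic is simply the ratio of the larger to the smaller sample variance; a minimal sketch reproducing the value from the PROC UNIVARIATE output above:

data _null_;                          /* Folded F = larger variance / smaller variance */
   var_female = 675.14686;            /* from PROC UNIVARIATE, Sex = Female */
   var_male   = 426.569218;           /* from PROC UNIVARIATE, Sex = Male   */
   folded_f   = max(var_female, var_male) / min(var_female, var_male);
   put folded_f= 8.2;                 /* prints roughly 1.58, matching PROC TTEST */
run;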


Part IV: Linear Regression Analysis 

1. Using the Cars dataset in the SASHELP library, run a multiple regression with MSRP as the dependent variable and the three variables Horsepower, Weight, and Length as the predictor variables. Interpret the results. Is there evidence of multicollinearity?

proc reg data=sashelp.cars;

model msrp = horsepower weight length;

run;

The REG Procedure

Model: MODEL1

Dependent Variable: MSRP

Number of Observations Read   428
Number of Observations Used   428

Analysis of Variance
Source            DF    Sum of Squares   Mean Square   F Value   Pr > F
Model             3     1.141437E11      38047899508   342.60    <.0001
Error             424   47087920181      111056416
Corrected Total   427   1.612316E11
  • The overall F value of the regression is 342.60 and the corresponding p-value is <.0001.
  • Since the p-value is less than .05, we conclude that the regression model as a whole is statistically significant.
Root MSE         10538      R-Square   0.7079
Dependent Mean   32775      Adj R-Sq   0.7059
Coeff Var        32.15371
  • In this regression model, the observed MSRP values fall an average of about 10,538 units from the regression line.
  • The R-Square value shows that about 71% of the variation in MSRP can be explained by the predictor variables (Horsepower, Weight, and Length). In general, the larger the R-Square of a regression model, the better the predictor variables are able to predict the value of the response variable.
Parameter Estimates
Variable     Label          DF   Parameter Estimate   Standard Error   t Value   Pr > |t|
Intercept    Intercept      1    23506                7312.46937       3.21      0.0014
Horsepower                  1    240.30557            9.19131          26.14     <.0001
Weight       Weight (LBS)   1    0.15769              1.10990          0.14      0.8872
Length       Length (IN)    1    -231.66378           49.29999         -4.70     <.0001
  • We can use the parameter estimates in this table to write the fitted regression equation (a scoring sketch follows the bullet points below):

MSRP = 23506 + 240.31*(Horsepower)+0.16*(Weight)-231.67*(Length).

  • Horsepower (b = 240.31) is statistically significant (p < .0001), and its coefficient is positive, which indicates that higher horsepower is associated with a higher Manufacturer's Suggested Retail Price (MSRP), as we would expect.
  • Next, the weight of the car (b = 0.16, p = 0.89) is not significant, which indicates that, after accounting for horsepower and length, weight is not an important predictor of MSRP.
  • Finally, length (b = -231.67, p < .0001) is significant, and its negative coefficient indicates that, holding the other predictors constant, longer cars have a lower MSRP.
  • We can conclude that higher horsepower is associated with higher MSRP, that the weight of the car is not related to MSRP in this model, and that shorter length is associated with higher MSRP.
  • We may decide to remove weight from the model since it is not statistically significant.
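As an illustration of how the fitted equation could be used, the sketch below scores a hypothetical car with the coefficients from the table above; the horsepower, weight, and length values are made up purely for the example:

data _null_;
   /* hypothetical car: values chosen only for illustration */
   horsepower = 200;
   weight     = 3500;
   length     = 180;
   pred_msrp  = 23506 + 240.31*horsepower + 0.16*weight - 231.67*length;
   put pred_msrp= comma10.;   /* predicted MSRP from the fitted equation */
run;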

proc reg data=sashelp.cars;

 model msrp = horsepower weight length / vif tol collin;

run;

quit;

The REG Procedure

Model: MODEL1

Dependent Variable: MSRP

Number of Observations Read   428
Number of Observations Used   428

Analysis of Variance
Source            DF    Sum of Squares   Mean Square   F Value   Pr > F
Model             3     1.141437E11      38047899508   342.60    <.0001
Error             424   47087920181      111056416
Corrected Total   427   1.612316E11

Root MSE         10538      R-Square   0.7079
Dependent Mean   32775      Adj R-Sq   0.7059
Coeff Var        32.15371
Parameter Estimates
Variable     Label          DF   Parameter Estimate   Standard Error   t Value   Pr > |t|   Tolerance   Variance Inflation
Intercept    Intercept      1    23506                7312.46937       3.21      0.0014     .           0
Horsepower                  1    240.30557            9.19131          26.14     <.0001     0.59659     1.67619
Weight       Weight (LBS)   1    0.15769              1.10990          0.14      0.8872     0.36579     2.73381
Length       Length (IN)    1    -231.66378           49.29999         -4.70     <.0001     0.51908     1.92648

In the results above, the lowest tolerance value is 0.37 (no value falls below 0.1), so the tolerance analysis does not indicate a threat of multicollinearity. Likewise, the highest variance inflation factor is 2.73 (no value above 10), also indicating a lack of multicollinearity. (A sketch of the relationship between the two measures follows.)
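Tolerance and the variance inflation factor are two views of the same quantity, since VIF = 1 / tolerance; a minimal sketch checking this against the Weight row in the table above:

data _null_;                           /* VIF = 1 / tolerance, checked for Weight */
   tolerance_weight = 0.36579;         /* from the Parameter Estimates table */
   vif_weight = 1 / tolerance_weight;
   put vif_weight= 8.5;                /* prints roughly 2.73381 */
run;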

Collinearity Diagnostics
                                         Proportion of Variation
Number   Eigenvalue   Condition Index   Intercept    Horsepower   Weight       Length
1        3.91800      1.00000           0.00031069   0.00370      0.00101      0.00019710
2        0.06295      7.88951           0.01641      0.61180      0.00035152   0.00664
3        0.01720      15.09139          0.05081      0.36632      0.64372      0.00161
4        0.00185      45.97695          0.93247      0.01818      0.35491      0.99155

Based on the table above, the only large condition index (45.98) is associated mainly with the intercept and Length rather than with a pair of substantive predictors, and the tolerance and VIF values reported earlier are all well within acceptable limits. Hence we can conclude that there is no serious indication of multicollinearity among the predictors.

References

Zach (2022). SAS: How to Use Proc Univariate for Normality Tests – Statology. [online] Statology. Available at: https://www.statology.org/sas-proc-univariate-normality-test/ [Accessed 2 Feb. 2023].

Sas.com. (2023). SAS Help Center. [online] Available at: https://documentation.sas.com/doc/en/pgmsascdc/v_024/fedsqlref/p1p5rkkh0zqyhcn1qf5jqpk8nosr.htm [Accessed 2 Feb. 2023].
