Commonly Used Statistical Procedures in SAS
There are over 30 different statistics you can request with the MEANS procedure. If you don’t specify any options, PROC MEANS will print the number of non-missing values, the mean, the standard deviation, and the minimum and maximum values for each variable.
The VAR statement specifies which numeric variables to use in the analysis. If it is absent, SAS uses all numeric variables.
If you use the PROC MEANS statement with no other statements, you will get statistics for all observations and all numeric variables in your data set. The BY statement performs separate analyses for each level of the variables in the list. The data must first be sorted by those variables, for example with PROC SORT.
The CLASS statement performs separate analyses for each level of the variables in the list, but its output is more compact than with the BY statement, and the data do not have to be sorted first.
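The difference between BY and CLASS can be sketched as follows. This is a minimal illustration, assuming the SASHELP.CLASS sample data set that ships with SAS (variables Sex, Height, Weight); the name class_sorted is a hypothetical work data set chosen for this example.

```sas
/* BY-group processing needs sorted input, so sort a copy first */
PROC SORT DATA=sashelp.class OUT=class_sorted;
    BY sex;
RUN;

/* BY: a separate analysis for each level of sex */
PROC MEANS DATA=class_sorted;
    VAR height weight;
    BY sex;
RUN;

/* CLASS: the same grouping in one compact table, no sorting needed */
PROC MEANS DATA=sashelp.class;
    VAR height weight;
    CLASS sex;
RUN;
```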
Sometimes you want to save summary statistics for further analysis, or to merge them with another data set. The OUTPUT statement writes the statistics to the SAS data set named in its OUT= option. The NOPRINT option in the PROC MEANS statement tells SAS there is no need to produce and print any results, since the results are saved.
The SAS data set created by the OUTPUT statement will contain all the variables defined after OUT=data-set-name; any variables listed in the BY or CLASS statement; and two new variables, _TYPE_ and _FREQ_.
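A short sketch of saving statistics to a data set, again assuming the SASHELP.CLASS sample data. The output data set name class_stats and the variable names mean_height, mean_weight, sd_height, and sd_weight are hypothetical names chosen for this illustration.

```sas
/* NOPRINT suppresses the printed report; results go to class_stats */
PROC MEANS DATA=sashelp.class NOPRINT;
    VAR height weight;
    CLASS sex;
    /* names after each keyword map, in order, to the VAR variables */
    OUTPUT OUT=class_stats MEAN=mean_height mean_weight
                           STD=sd_height sd_weight;
RUN;

/* Inspect the saved statistics, including _TYPE_ and _FREQ_ */
PROC PRINT DATA=class_stats;
RUN;
```

With a CLASS statement, the output data set also holds a row with _TYPE_=0 containing the statistics for all observations combined.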
The most obvious reason for using PROC FREQ is to create tables showing the distribution of categorical data values. In addition, PROC FREQ can reveal irregularities in your data. The basic form of PROC FREQ is
PROC FREQ;
TABLES variable-combinations;
To produce a one-way frequency table, just list the variable names. For example, the following statement produces a frequency table listing the number of observations for each value of the variable “group”:
TABLES group;
To produce a cross-tabulation, list the variables separated by an asterisk. The next statement produces a cross-tabulation showing the number of observations for each combination of gender by group:
TABLES gender * group;
You can specify any number of table requests in a single TABLES statement, and you can have as many TABLES statements as you wish. If you want the tables to use formatted values, add a FORMAT statement. Options, if any, appear after a slash in the TABLES statement. Options for controlling the output of PROC FREQ include
LIST prints cross-tabulations in list format
MISSING includes missing values in frequency statistics
NOCOL suppresses printing of column percentages
NOROW suppresses printing of row percentages
OUT=data-set writes a data set containing frequencies
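The table requests and options above might be combined as in the following sketch. It assumes the SASHELP.CLASS sample data set (variables Sex and Age) in place of the gender and group variables used in the text; sex_age_counts is a hypothetical output data set name.

```sas
PROC FREQ DATA=sashelp.class;
    /* one-way frequency table */
    TABLES sex;
    /* cross-tabulation without row and column percentages */
    TABLES sex * age / NOROW NOCOL;
    /* list format, counting missing values, saved to a data set */
    TABLES sex * age / LIST MISSING OUT=sex_age_counts;
RUN;
```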
PROC PLOT and PROC CHART
The scatter plots and bar charts produced by these procedures provide an easy and intuitive way to get a feel for your data. To get more sophisticated plots, use PROC GPLOT.
The basic form of PROC PLOT is
PROC PLOT;
PLOT vertical-variable * horizontal-variable;
The PLOT statement tells SAS which variables to plot and how. SAS plots the first variable on the vertical axis and the second on the horizontal. You can have any number of PLOT statements and any number of plot requests in a single PLOT statement. By default, SAS uses letters to mark the points on your plot: A for a single observation, B for two observations at the same point, C for three, and so on. To substitute a different character, such as an asterisk, specify it in quotation marks after an equal sign:
PLOT x8 * x9 = '*';
You can also use a third variable as the plot character, making a convenient label for each point. A statement of the following form tells SAS to use the first letter from the variable “Name” to mark each point:
PLOT vertical-variable * horizontal-variable = Name;
You can plot more than one variable on the vertical axis by using the OVERLAY option, which draws the plot requests of a single PLOT statement on the same set of axes.
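The plotting variants described above can be tried together in one step. This is a sketch only, assuming the SASHELP.CLASS sample data set (variables Name, Age, Height, Weight) rather than the x8 and x9 variables used in the text.

```sas
PROC PLOT DATA=sashelp.class;
    /* default plotting symbols: A, B, C, ... by number of points */
    PLOT weight * height;
    /* use an asterisk for every point */
    PLOT weight * height = '*';
    /* label each point with the first letter of Name */
    PLOT weight * height = name;
    /* two vertical variables on the same axes */
    PLOT weight * age = 'W' height * age = 'H' / OVERLAY;
RUN;
QUIT;
```

PROC PLOT stays active after RUN, so the QUIT statement is what ends the procedure.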
To plot a frequency bar chart or histogram, use PROC CHART. The HBAR statement generates a chart with horizontal bars, while VBAR generates vertical bars.
The LEVELS= option can be used to define the number of bars in the chart. An alternative to LEVELS= is to specify the midpoints by using
MIDPOINTS= lower_limit TO upper_limit BY interval;
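Because SAS treats a numeric chart variable as continuous, a chart would likely have midpoints that are not integers, even when the variable takes only integer values. To avoid this, add the DISCRETE option, which produces one bar for each distinct value.
You can also specify the GROUP= option to produce a side-by-side graph. See 3.19.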
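The chart options above might look like this in practice. The sketch assumes the SASHELP.CLASS sample data set; the midpoint values are arbitrary choices for illustration.

```sas
PROC CHART DATA=sashelp.class;
    /* let SAS choose the midpoints for five bars */
    VBAR height / LEVELS=5;
    /* horizontal bars with explicitly chosen midpoints */
    HBAR weight / MIDPOINTS=60 TO 140 BY 20;
    /* one bar per distinct value of age */
    VBAR age / DISCRETE;
    /* side-by-side charts, one group of bars per level of sex */
    VBAR age / DISCRETE GROUP=sex;
RUN;
```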
PROC UNIVARIATE & PROC MEANS
PROC UNIVARIATE gives lots of descriptive statistics which describe the distribution of a single variable. These statistics include the mean, median, mode, standard deviation, skewness, etc. The VAR statement indicates which variable(s) you want to describe. Without a VAR statement, SAS will calculate statistics for all numeric variables in your data set.
You can specify other options in the PROC UNIVARIATE statement, such as PLOT or NORMAL. The NORMAL option produces tests for normality, while the PLOT option produces three plots of your data (stem-and-leaf plot, box plot, and normal probability plot). You can use a BY or CLASS statement to obtain a separate analysis for each BY or CLASS group.
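A minimal sketch of these options, assuming the SASHELP.CLASS sample data set (variables Sex, Height, Weight):

```sas
/* NORMAL adds tests for normality; PLOT adds the descriptive plots */
PROC UNIVARIATE DATA=sashelp.class NORMAL PLOT;
    VAR height weight;
    /* separate analysis for each level of sex */
    CLASS sex;
RUN;
```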
Skewness indicates how symmetrical the distribution is (whether it is more spread out on one side), while kurtosis indicates how flat or peaked the distribution is. As SAS reports them, the normal distribution has values of zero for both skewness and kurtosis.
PROC UNIVARIATE prints out all the summary statistics: mean, variance, quartiles, extremes, t tests, standard error, and so on. But if you know you want only a few of these statistics, PROC MEANS is a better way to go. With PROC MEANS, you can ask for just the statistics that you want by listing their keywords in the PROC MEANS statement.
The CLM option requests confidence limits for the mean. The default confidence level is 0.95, or 95%. If you want a different confidence level, request it with the ALPHA= option in the PROC MEANS statement. For example, if you want 99% confidence limits, specify ALPHA=0.01 along with the CLM option.
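Requesting only selected statistics, with 99% confidence limits, could be written as below. The sketch assumes the SASHELP.CLASS sample data set; the choice of statistics is illustrative.

```sas
/* N, mean, standard error, and 99% confidence limits for the mean */
PROC MEANS DATA=sashelp.class N MEAN STDERR CLM ALPHA=0.01;
    VAR height weight;
RUN;
```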
A correlation coefficient measures the linear relationship between two variables. If two variables were completely unrelated they would have a correlation of zero. If they were perfectly correlated they would have a correlation of 1.0 or -1.0.
The basic statement
PROC CORR;
tells SAS to compute correlations between all the numeric variables in the most recent data set. You can add the VAR and WITH statements to specify variables. Variables listed in the VAR statement appear across the top of the table of correlations, while variables listed in the WITH statement appear down the side of the table.
By default, PROC CORR computes Pearson product-moment correlation coefficients. The output starts with descriptive statistics for each variable and then lists the correlation matrix, which contains the correlation coefficients and, for each one, the probability of getting a larger absolute value if the true correlation were zero.
Many options are available with PROC CORR including options for saving statistics in an output data set.
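The VAR and WITH statements, and one way of saving the results, can be sketched as follows. The example assumes the SASHELP.CLASS sample data set; class_corr is a hypothetical output data set name, and OUTP= is the PROC CORR option that saves Pearson correlations.

```sas
/* height and weight across the top, age down the side */
PROC CORR DATA=sashelp.class;
    VAR height weight;
    WITH age;
RUN;

/* save the Pearson correlation matrix instead of printing it */
PROC CORR DATA=sashelp.class OUTP=class_corr NOPRINT;
    VAR age height weight;
RUN;
```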
The REG procedure is a general-purpose procedure for regression that
- handles multiple regression models
- provides nine model-selection methods
- allows interactive changes both in the model and in the data used to fit the model
- allows linear equality restrictions on parameters
- tests linear hypotheses and multivariate hypotheses
- produces collinearity diagnostics, influence diagnostics, and partial regression leverage plots
- saves estimates, predicted values, residuals, confidence limits, and other diagnostic statistics in output SAS data sets
- generates plots of data and of various statistics
- “paints” or highlights scatter plots to identify particular observations or groups of observations
- uses, optionally, correlations or crossproducts for input
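In the MODEL statement, the dependent variable goes to the left of the equal sign and the independent variables to the right:
MODEL y=x1 x2;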
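A complete regression step might look like the following sketch. It assumes the SASHELP.CLASS sample data set, with Weight as the response and Height and Age as predictors; class_fit, predicted, and residual are hypothetical names chosen for this example.

```sas
PROC REG DATA=sashelp.class;
    /* regress weight on height and age */
    MODEL weight = height age;
    /* save predicted values and residuals for later inspection */
    OUTPUT OUT=class_fit P=predicted R=residual;
RUN;
QUIT;
```

PROC REG is interactive, so RUN executes the statements submitted so far and QUIT ends the procedure.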
The LOGISTIC procedure fits logistic models, in which the response can be either dichotomous or polychotomous. Stepwise model selection is available. You can request regression diagnostics, as well as predicted and residual values. The syntax is similar to that of the REG procedure.
MODEL y=x1 x2;
For binary response data that are grouped, use the events/trials form of the response:
MODEL r/n=x1 x2;
Here, n represents the number of trials and r represents the number of events.
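The events/trials form can be illustrated with a small made-up data set. Everything in the DATA step below is hypothetical and exists only to make the example runnable: the data set name dose_response, the predictor dose, and the counts themselves.

```sas
/* Hypothetical grouped data: r events out of n trials per dose level */
DATA dose_response;
    INPUT dose n r;
    DATALINES;
1 20 2
2 20 5
3 20 9
4 20 14
5 20 18
;
RUN;

/* events/trials syntax: r events in n trials, modeled on dose */
PROC LOGISTIC DATA=dose_response;
    MODEL r/n = dose;
RUN;
```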