Education, Higher--Research

Model
Digital Document
Publisher
Florida Atlantic University
Description
The cross-validated classification accuracy of predictive discriminant analysis (PDA) and logistic regression (LR) models was compared for the two-group classification problem. Thirty-four real data sets varying in number of cases, number of predictor variables, degree of group separation, relative group size, and equality of group covariance matrices were employed for the comparison. PDA models were built based on assumptions of multivariate normality and equal covariance matrices, and cases were classified using Tatsuoka's (1988, p. 351) minimum chi-square rule. LR models were built using the International Mathematical and Statistical Library (IMSL) subroutine Categorical Generalized Linear Model (CTGLM), available with the 32-bit Microsoft Fortran v4.0 Powerstation. CTGLM uses a nonlinear approximation technique (Newton-Raphson) to determine maximum likelihood estimates of model parameters. The group with the higher log-likelihood probability was used as the LR prediction. Cross-validated hit-rate accuracy of PDA and LR models was estimated using the leave-one-out procedure. McNemar's (1947) statistic for correlated proportions was used in the statistical comparisons of PDA and LR hit-rate estimates for separate-group and total-sample proportions (critical z = 2.58, α = .01). Total-sample and separate-group cross-validated classification accuracy obtained by PDA was not significantly different from that obtained by LR in any of the 31 data sets for which maximum likelihood estimates of LR model parameters could be calculated. This was true regardless of assumptions made about population sizes (i.e., equal or unequal). Neither theoretical nor data-based considerations were helpful in predicting these results.
Although it does not appear from these data to make a difference which classification model is used, use of the method described in this study for comparing PDA and LR models will enable researchers to select the optimal classification model for a specific data set, regardless of data conditions.
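
The comparison step described above can be illustrated with a minimal Python sketch. It assumes that each model's leave-one-out predictions have already been scored as correct or incorrect for the same cases, and it uses the common z form of McNemar's statistic without a continuity correction; the abstract does not state which form was used, so that choice is an assumption. The function names and the example data are hypothetical and are not taken from the study.

```python
import math


def mcnemar_z(correct_a, correct_b):
    """McNemar z statistic for two classifiers scored on the same cases.

    correct_a and correct_b are equal-length sequences of booleans, one entry
    per case, marking whether each model classified that case correctly.
    Assumption: no continuity correction is applied.
    """
    if len(correct_a) != len(correct_b):
        raise ValueError("both models must be scored on the same cases")
    # Only discordant cases (one model right, the other wrong) carry information.
    a_only = sum(1 for a, b in zip(correct_a, correct_b) if a and not b)
    b_only = sum(1 for a, b in zip(correct_a, correct_b) if b and not a)
    if a_only + b_only == 0:
        return 0.0  # the models agree on every case
    return (a_only - b_only) / math.sqrt(a_only + b_only)


def differs_significantly(z, critical=2.58):
    """Two-tailed decision at alpha = .01, using the critical value in the abstract."""
    return abs(z) > critical


# Hypothetical leave-one-out results for ten cases (not data from the study).
pda_hits = [True, True, True, True, False, True, False, True, True, True]
lr_hits = [True, True, False, True, False, True, True, True, True, True]

z = mcnemar_z(pda_hits, lr_hits)
print(z, differs_significantly(z))
```

The same function can be applied to the total sample or to the cases of one group at a time, which corresponds to the total-sample and separate-group comparisons mentioned in the description.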