Discriminant analysis

Model
Digital Document
Publisher
Florida Atlantic University
Description
This study investigates the problem of the cultural boundary between the Kissimmee-Lake Okeechobee and East Okeechobee culture areas. The problem is addressed using sites along the geographical region known as the Loxahatchee Scarp, focusing mainly on three sites: Whitebelt I (8PB220), Whitebelt III (8PB222) and JR244 (8MT1327). The study compares ceramic type data using the multivariate statistical technique of discriminant analysis. The relative frequencies of ceramic types from the test sites are compared with those from other sites that have generally accepted cultural affiliations. The ceramic frequencies are used to sort each site's levels into one of several culture areas: the Glades, Indian River, Kissimmee-Lake Okeechobee and East Okeechobee culture areas. The results demonstrate the utility of discriminant analysis for sorting levels within sites into appropriate culture areas. The analysis suggests that although ceramics are a key component in determining where a site fits into the broader scheme of known archaeological culture areas, ceramics alone are a poor determinant unless other factors, such as lithic or shell tools, are also considered.
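The sorting step described above can be illustrated with a minimal sketch of a two-group Fisher linear discriminant. This is not the study's procedure: the study sorts levels into four culture areas, and its data and software are not given here. The group names `area_a` and `area_b`, the two ceramic-type frequencies per level, and every number below are made up purely for illustration.

```python
# Hypothetical relative frequencies of two ceramic types per excavation level.
# All values are invented for illustration; they are not data from the study.
area_a = [(0.70, 0.10), (0.62, 0.15), (0.75, 0.08), (0.66, 0.09)]
area_b = [(0.30, 0.40), (0.25, 0.48), (0.35, 0.37), (0.28, 0.45)]


def mean_vector(rows):
    """Column means of a list of equal-length tuples."""
    n = len(rows)
    return [sum(r[j] for r in rows) / n for j in range(len(rows[0]))]


def pooled_covariance(group_a, group_b):
    """Pooled within-group covariance matrix (assumes equal covariances)."""
    dim = len(group_a[0])
    scatter = [[0.0] * dim for _ in range(dim)]
    for rows in (group_a, group_b):
        m = mean_vector(rows)
        for r in rows:
            d = [r[j] - m[j] for j in range(dim)]
            for i in range(dim):
                for j in range(dim):
                    scatter[i][j] += d[i] * d[j]
    dof = len(group_a) + len(group_b) - 2
    return [[v / dof for v in row] for row in scatter]


def fit_two_group_lda(group_a, group_b):
    """Return (weights, threshold) for a two-feature linear discriminant."""
    ma, mb = mean_vector(group_a), mean_vector(group_b)
    (s00, s01), (s10, s11) = pooled_covariance(group_a, group_b)
    det = s00 * s11 - s01 * s10  # must be non-zero for the inverse to exist
    inv = [[s11 / det, -s01 / det], [-s10 / det, s00 / det]]
    diff = [ma[0] - mb[0], ma[1] - mb[1]]
    w = [inv[0][0] * diff[0] + inv[0][1] * diff[1],
         inv[1][0] * diff[0] + inv[1][1] * diff[1]]
    # With equal priors, the cut point is the score of the midpoint of the means.
    mid = [(ma[0] + mb[0]) / 2, (ma[1] + mb[1]) / 2]
    threshold = w[0] * mid[0] + w[1] * mid[1]
    return w, threshold


def classify(level, w, threshold):
    """Assign a level to group 'A' or 'B' by its discriminant score."""
    score = w[0] * level[0] + w[1] * level[1]
    return "A" if score > threshold else "B"


w, threshold = fit_two_group_lda(area_a, area_b)
print(classify((0.72, 0.11), w, threshold))
print(classify((0.27, 0.44), w, threshold))
```

Only two of several ceramic types are used as features here. If the features were the relative frequencies of every type, they would sum to one for each level and the pooled covariance matrix would be singular, so one type would have to be dropped or the data transformed first.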
Model
Digital Document
Publisher
Florida Atlantic University
Description
The cross-validated classification accuracy of predictive discriminant analysis (PDA) and logistic regression (LR) models was compared for the two-group classification problem. Thirty-four real data sets varying in number of cases, number of predictor variables, degree of group separation, relative group size, and equality of group covariance matrices were employed for the comparison. PDA models were built on assumptions of multivariate normality and equal covariance matrices, and cases were classified using Tatsuoka's (1988, p. 351) minimum chi-square rule. LR models were built using the International Mathematical and Statistical Library (IMSL) subroutine Categorical Generalized Linear Model (CTGLM), available with the 32-bit Microsoft Fortran PowerStation v4.0. CTGLM uses a nonlinear approximation technique (Newton-Raphson) to determine maximum likelihood estimates of model parameters. The group with the higher log-likelihood probability was used as the LR prediction. Cross-validated hit-rate accuracy of PDA and LR models was estimated using the leave-one-out procedure. McNemar's (1947) statistic for correlated proportions was used in the statistical comparisons of PDA and LR hit-rate estimates for separate-group and total-sample proportions (critical z = 2.58, α = .01). Total-sample and separate-group cross-validated classification accuracy obtained by PDA was not significantly different from that obtained by LR in any of the 31 data sets for which maximum likelihood estimates of LR model parameters could be calculated. This was true regardless of assumptions made about population sizes (i.e., equal or unequal). Neither theoretical nor data-based considerations were helpful in predicting these results.
Although it does not appear from these data to make a difference which classification model is used, use of the method described in this study for comparing PDA and LR models will enable researchers to select the optimal classification model for a specific data set, regardless of data conditions.
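The comparison step can be sketched as follows, assuming each model's leave-one-out predictions have already been scored as correct or incorrect for the same cases. The function name `mcnemar_z` and the two hit lists are hypothetical. The statistic shown is the common uncorrected form z = (b - c) / sqrt(b + c), which may differ in detail from the exact computation used in the study.

```python
import math

# Critical value quoted in the abstract for a two-tailed test at alpha = .01.
CRITICAL_Z = 2.58


def mcnemar_z(hits_model_1, hits_model_2):
    """McNemar z for two paired lists of per-case correct/incorrect flags.

    b counts cases only model 1 classified correctly, c counts cases only
    model 2 classified correctly. Cases where both models agree do not
    contribute to the statistic.
    """
    b = sum(1 for h1, h2 in zip(hits_model_1, hits_model_2) if h1 and not h2)
    c = sum(1 for h1, h2 in zip(hits_model_1, hits_model_2) if h2 and not h1)
    if b + c == 0:
        return 0.0  # the models never disagree, so there is no evidence of a difference
    return (b - c) / math.sqrt(b + c)


# Made-up leave-one-out outcomes for ten cases (1 = correct, 0 = incorrect).
pda_hits = [1, 1, 1, 0, 1, 1, 0, 1, 1, 1]
lr_hits = [1, 1, 0, 1, 1, 1, 0, 1, 1, 0]

z = mcnemar_z(pda_hits, lr_hits)
significant = abs(z) > CRITICAL_Z
print(f"z = {z:.3f}, significant at alpha = .01: {significant}")
```

Because the test uses only the discordant cases, two models with very different overall hit rates on few disagreements can still fail to differ significantly, which is worth keeping in mind for small data sets.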
Model
Digital Document
Publisher
Florida Atlantic University
Description
This research concerns the information-theoretic (IT) aspects of data-sequence patterns and the development of discriminant algorithms that distinguish the features of underlying sequence patterns with characteristic, inherent stochastic attributes. Potential applications of such algorithms include bioinformatic data mining. Consistent with this scope, the research considers information theory and entropy in relation to sequence patterns with stochastic attributes, such as the DNA sequences of molecular biology. Applying information-theoretic concepts (essentially in Shannon's sense), the following distinct sets of metrics are developed and applied in algorithms for data-sequence pattern discrimination: (i) divergence or cross-entropy measures of the Kullback-Leibler type and of the general Csiszár class; (ii) statistical distance measures; (iii) ratio metrics; (iv) a Fisher-type linear-discriminant measure; and (v) a complexity metric based on information redundancy. These measures are adopted to ascertain codon-noncodon delineations in DNA sequences that consist of crisp and/or fuzzy nucleotide domains across their chains. The Fisher measure is also used in motif detection. The algorithms are tested on DNA sequences from human and some bacterial organisms. The relative efficacy of the metrics and the algorithms is determined and discussed, and the potential of such algorithms to supplement prevailing methods is indicated. Scope for future studies is identified in terms of persisting open questions.
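As a minimal sketch of the first metric family, the Kullback-Leibler divergence between the nucleotide distributions of two sequence segments can be computed as below. This is an illustration under stated assumptions, not the algorithm developed in the research: the single-nucleotide alphabet, the add-one pseudocount, the function names, and the two toy segments are all hypothetical choices.

```python
import math
from collections import Counter

ALPHABET = "ACGT"


def nucleotide_distribution(seq, pseudocount=1.0):
    """Estimate P(base) from a sequence; the pseudocount avoids zero probabilities."""
    counts = Counter(seq.upper())
    total = sum(counts[b] for b in ALPHABET) + pseudocount * len(ALPHABET)
    return {b: (counts[b] + pseudocount) / total for b in ALPHABET}


def kl_divergence(p, q):
    """Kullback-Leibler divergence D(p || q) in bits."""
    return sum(p[b] * math.log2(p[b] / q[b]) for b in ALPHABET)


def symmetric_kl(p, q):
    """Symmetrised divergence, D(p || q) + D(q || p)."""
    return kl_divergence(p, q) + kl_divergence(q, p)


# Toy segments invented for illustration: one GC-rich, one AT-rich.
segment_1 = "GCGCGGCCGCGATGCGCCGG"
segment_2 = "ATATTAAATTTAGATATTAA"

p = nucleotide_distribution(segment_1)
q = nucleotide_distribution(segment_2)
print(f"D(p||q) = {kl_divergence(p, q):.3f} bits")
print(f"symmetric = {symmetric_kl(p, q):.3f} bits")
```

A larger divergence indicates that the two segments have more dissimilar base compositions. A delineation method built on this idea would typically slide a window along the chain and compare each window against reference distributions, but the window size and references would be further assumptions.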