Modeling the combined influence of complexity and quality in supervised learning

Type
Article
Publication date
2022
Journal
Intelligent Data Analysis
Citations (Scopus)
1
Authors
De Avila Mendes R.
Da Silva L.A.
Abstract
© 2022 - IOS Press. All rights reserved.
Data classification is a data mining task in which an algorithm, fitted to a training dataset, is used to predict the class of an unclassified object under analysis. A significant part of a classification algorithm's performance depends on the dataset's complexity and quality. Data Complexity involves investigating the effects of dimensionality, the overlap of descriptive attributes, and the separability of the classes. Data Quality focuses on aspects such as noisy data (outliers) and missing values. Both factors are fundamental to classification performance. However, very few studies in the literature examine the relationship between these factors or highlight their significance. This paper applies Structural Equation Modeling with the Partial Least Squares Structural Equation Modeling (PLS-SEM) algorithm and, in an innovative manner, relates the contributions of Data Complexity and Data Quality to Classification Quality. Experimental analysis with 178 datasets obtained from the OpenML repository showed that controlling complexity improves classification results more than data quality does. Additionally, the paper presents a visual tool for analyzing datasets from the perspective of classification performance, along the dimensions proposed to represent the structural model.
Scopus subjects
Class separability, Classification algorithm, Data classification, Data complexity, Data mining tasks, Data quality, Object class, Performance, Structural equation models, Training dataset