An Analysis of Different Text Representation Schemes for an Immune Clustering Algorithm
Type
Conference paper
Publication date
2025
Journal
Lecture Notes in Networks and Systems
Citations (Scopus)
0
Authors
Ferraria M.A.
Balbi P.P.
de Castro L.N.
Abstract
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2025. This research investigates the challenges and effectiveness of three families of text representation (standard vector, grammar-based, and distributed) when applied to clustering short texts. The study explores Bag-of-Words as the standard vector representation; Linguistic Inquiry and Word Count (LIWC), Part-of-Speech Tagging (POS-Tagging), and the Medical Research Council Psycholinguistic Database (MRC) as grammar-based representations; and Word2Vec, fastText, Doc2Vec, and SentenceBERT as distributed representations. With the aiNet bio-inspired clustering algorithm, grammar-based representations perform surprisingly competitively despite their simplicity, whereas standard vector representations exhibit known challenges such as high dimensionality. The study contributes insights into the properties of different text representations, providing a foundation for optimizing their application in clustering tasks with short and informal texts.
Scopus subjects
Bag of words, Clusterings, Immune clustering algorithms, Natural Computing, Parts-of-speech tagging, Representation method, Representation schemes, Short texts, Text representation, Text-mining