An Analysis of Different Text Representation Schemes for an Immune Clustering Algorithm

Ferraria M.A.; Balbi P.P.; de Castro L.N.

Type

Conference paper

Publication date

2025

Periodical

Lecture Notes in Networks and Systems

Citations (Scopus)

0

Authors

Ferraria M.A.
Balbi P.P.
de Castro L.N.

Abstract

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2025. This research investigates the challenges and effectiveness of three families of text representation (standard vector, grammar-based, and distributed) when applied to clustering short texts. The study explores Bag-of-Words as the standard vector representation; Linguistic Inquiry and Word Count (LIWC), Part-of-Speech Tagging (POS-Tagging), and the Medical Research Council Psycholinguistic Database (MRC) as grammar-based representations; and Word2Vec, fastText, Doc2Vec, and SentenceBERT as distributed representations. Using the aiNet bio-inspired clustering algorithm, the results reveal surprising findings: grammar-based representations demonstrate competitive performance despite their simplicity, while standard vectors exhibit known challenges such as high dimensionality. The study contributes insights into the properties of different text representations, providing a foundation for optimizing their application in clustering tasks with short and informal texts.
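
The high-dimensionality issue noted for standard vectors can be illustrated with a minimal Bag-of-Words sketch. It is not the authors' pipeline: the whitespace tokenizer, the helper names bag_of_words and sparsity, and the three sample texts are all assumptions made for illustration. It only shows that, for short texts, each new document tends to add vocabulary entries while filling very few of them.

```python
from collections import Counter


def bag_of_words(texts):
    """Return (vocabulary, count vectors) for a list of short texts.

    Hypothetical helper: tokenization is a plain lowercase whitespace
    split, which is an assumption and not the preprocessing of the paper.
    """
    tokenized = [t.lower().split() for t in texts]
    # One dimension per distinct token seen anywhere in the collection.
    vocabulary = sorted({tok for doc in tokenized for tok in doc})
    index = {tok: i for i, tok in enumerate(vocabulary)}
    vectors = []
    for doc in tokenized:
        vec = [0] * len(vocabulary)
        for tok, count in Counter(doc).items():
            vec[index[tok]] = count
        vectors.append(vec)
    return vocabulary, vectors


def sparsity(vectors):
    """Fraction of zero entries across all vectors (hypothetical helper)."""
    total = sum(len(v) for v in vectors)
    zeros = sum(1 for v in vectors for x in v if x == 0)
    return zeros / total if total else 0.0


# Made-up short, informal texts used only to exercise the sketch.
texts = [
    "great phone love it",
    "battery died again ugh",
    "love the new update",
]
vocab, vecs = bag_of_words(texts)
print(len(vocab), round(sparsity(vecs), 2))
```

Even with three four-word texts, the vectors have more dimensions than any single text has words, and most entries are zero; on a realistic corpus the vocabulary grows far larger, which is the kind of difficulty a distance-based clustering algorithm has to cope with.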

Scopus subjects

Bag of words, Clusterings, Immune clustering algorithms, Natural Computing, Parts-of-speech tagging, Representation method, Representation schemes, Short texts, Text representation, Text-mining