A genetic algorithm approach for topic clustering: A centroid-based encoding scheme

Dionisios N. Sotiropoulos, Demitrios E. Pournarakis, George M. Giaglis

    Research output: Chapter in Book/Report/Conference proceeding, Conference contribution, peer-reviewed

    Abstract

    This paper addresses the problem of topic clustering with a novel genetic algorithm approach that introduces a centroid-based encoding scheme and scales to large volumes of textual data. The proposed topic clustering method builds on the Latent Dirichlet Allocation (LDA) probabilistic topic modeling framework and aims to identify cluster formations that are optimal in terms of semantic coherence. Our work reformulates the clustering problem as a discrete optimization problem within the n-dimensional standard simplex, since all LDA-based data patterns correspond to n-valued probability distribution vectors. The novelty of the proposed genetic algorithm lies primarily in the adaptation of the centroid-based encoding scheme: cluster assignments are extracted implicitly by assigning each data point to its nearest cluster center. Experiments were conducted on a large corpus of Twitter posts relating to the UBER transportation network. The results indicate a significant improvement in extracting semantically focused groups of documents compared with traditional clustering algorithms such as k-means. The superiority of the proposed genetic algorithm is further supported by measurements of the intra- and inter-cluster semantic distances of the obtained cluster formations.
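
The implicit assignment step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the flat chromosome layout, the renormalization of each centroid onto the simplex, and the use of Euclidean distance are all assumptions made for illustration, and the genetic operators and the fitness function are not shown because the abstract does not specify them.

```python
import math


def decode_centroids(chromosome, k, n):
    """Split a flat chromosome of k*n genes into k centroids.

    Assumption: each centroid is clipped to non-negative values and
    renormalized so that it lies on the standard simplex (sums to 1).
    """
    centroids = []
    for i in range(k):
        genes = [max(g, 0.0) for g in chromosome[i * n:(i + 1) * n]]
        total = sum(genes)
        # Fall back to the uniform distribution if every gene is zero.
        centroids.append([g / total for g in genes] if total > 0 else [1.0 / n] * n)
    return centroids


def assign_to_nearest(points, centroids):
    """Return, for each point, the index of its nearest centroid.

    Assumption: Euclidean distance; the abstract does not name the metric.
    """
    return [
        min(range(len(centroids)), key=lambda c: math.dist(p, centroids[c]))
        for p in points
    ]


def mean_intra_cluster_distance(points, centroids, labels):
    """Mean distance from each point to its assigned centroid (lower is tighter)."""
    return sum(math.dist(p, centroids[l]) for p, l in zip(points, labels)) / len(points)


# Toy data: four documents described by 3-topic probability vectors (hypothetical values).
docs = [[0.9, 0.05, 0.05], [0.8, 0.1, 0.1], [0.1, 0.1, 0.8], [0.05, 0.15, 0.8]]

# One candidate solution: two centroids of dimension 3 stored back to back.
chromosome = [0.85, 0.1, 0.05, 0.1, 0.1, 0.8]

centroids = decode_centroids(chromosome, k=2, n=3)
labels = assign_to_nearest(docs, centroids)  # [0, 0, 1, 1] for this toy data
tightness = mean_intra_cluster_distance(docs, centroids, labels)
```

Under this layout the chromosome length depends only on the number of clusters and topics, not on the number of documents, which is one plausible reading of why a centroid-based encoding scales to large corpora.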

    Original language: English
    Title of host publication: IISA 2016 - 7th International Conference on Information, Intelligence, Systems and Applications
    Publisher: Institute of Electrical and Electronics Engineers Inc.
    ISBN (Electronic): 9781509034291
    Publication status: Published - 14 Dec 2016
    Event: 7th International Conference on Information, Intelligence, Systems and Applications, IISA 2016 - Chalkidiki, Greece
    Duration: 13 Jul 2016 to 15 Jul 2016

    Other

    Other: 7th International Conference on Information, Intelligence, Systems and Applications, IISA 2016
    Country/Territory: Greece
    City: Chalkidiki
    Period: 13/07/16 to 15/07/16
