This paper addresses the problem of topic clustering with a novel genetic algorithm that uses a centroid-based encoding scheme and scales well to large volumes of textual data. The proposed method is anchored on the Latent Dirichlet Allocation (LDA) probabilistic topic modeling framework and aims to identify cluster formations that are optimal in terms of semantic coherence. Our work reformulates the clustering problem as a discrete optimization problem within the n-dimensional standard simplex, since all LDA-based data patterns correspond to n-valued probability distribution vectors. The novelty of the proposed genetic algorithm lies primarily in its adaptation of the centroid-based encoding scheme: cluster assignments are extracted implicitly by assigning each data point to its nearest cluster center. Experiments were conducted on a large corpus of Twitter posts relating to the UBER transportation network. The results indicate a significant improvement in extracting semantically focused groups of documents compared with traditional clustering algorithms such as k-means. The superiority of the proposed genetic algorithm is further supported by measurements of the intra- and inter-cluster semantic distances of the obtained cluster formations.
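To make the centroid-based encoding concrete, the following is a minimal sketch and not the authors' implementation. It assumes a chromosome is a flat list of k·n non-negative genes, that each centroid is normalized onto the simplex, and that nearness is measured with Euclidean distance. The abstract does not specify the distance measure, the fitness function, or the genetic operators, so the function names `decode`, `assign`, and `fitness` and the within-cluster distance objective are all illustrative assumptions.

```python
import math


def decode(chromosome, k, n):
    """Split a flat gene list into k centroids, each normalized onto the simplex."""
    centroids = []
    for i in range(k):
        genes = [abs(g) for g in chromosome[i * n:(i + 1) * n]]
        total = sum(genes)
        # Assumption: fall back to the uniform distribution if all genes are zero.
        centroids.append([g / total for g in genes] if total > 0 else [1.0 / n] * n)
    return centroids


def assign(points, centroids):
    """Implicit cluster assignment: index of the nearest centroid for each point."""
    return [
        min(range(len(centroids)), key=lambda j: math.dist(p, centroids[j]))
        for p in points
    ]


def fitness(chromosome, points, k, n):
    """Assumed objective: total distance of points to their assigned centroids."""
    centroids = decode(chromosome, k, n)
    labels = assign(points, centroids)
    return sum(math.dist(p, centroids[j]) for p, j in zip(points, labels))


# Toy LDA-style topic distributions over n = 3 topics (made-up values).
points = [[0.9, 0.05, 0.05], [0.8, 0.1, 0.1], [0.1, 0.1, 0.8]]
chromosome = [9, 0.5, 0.5, 1, 1, 8]  # encodes k = 2 centroids
labels = assign(points, decode(chromosome, 2, 3))
```

Because the chromosome stores only centroids, its length depends on the number of clusters and topics rather than on the number of documents, which is one plausible reading of why such an encoding scales to large corpora.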
|Title of host publication||IISA 2016 - 7th International Conference on Information, Intelligence, Systems and Applications|
|Publisher||Institute of Electrical and Electronics Engineers Inc.|
|Publication status||Published - 14 Dec 2016|
|Event||7th International Conference on Information, Intelligence, Systems and Applications, IISA 2016 - Chalkidiki, Greece|
|Duration||13 Jul 2016 → 15 Jul 2016|