BERT-Based Deep Embedded Clustering for Topic Modeling

Authors: Cahyadi, Danu Julian; Murfi, Hendri; Satria, Yudi; Abdullah, Sarini; Widyaningsih, Yekti
Information
Journal: 2024 International Conference on Computer, Control, Informatics and its Applications (IC3INA)
Publisher: Institute of Electrical and Electronics Engineers Inc. (IEEE)
Volume & Issue: 2024 issue
Pages: 331-336
Publication Year: 2024
ISSN: 2994-5933
Source Type: Scopus
Citations
Scopus: 2
Google Scholar: 2
PubMed: 2
Abstract
Topic detection is a powerful method for uncovering the latent structures in a document. A general framework for clustering-based topic detection consists of two steps: representation learning and topic detection with clustering. In this study, bidirectional encoder representations from transformers (BERT) is used for representation learning because it can capture the context of each word based on its surrounding words. The text representations obtained from BERT are then used for topic detection with clustering. Deep embedded clustering (DEC) and improved deep embedded clustering (IDEC) are the clustering models used in this study. Both are deep learning-based clustering techniques that simultaneously transform the data into a lower-dimensional space and optimize the clusters. Combining BERT as the text representation model with DEC and IDEC yields a deep learning model for topic detection. After the word sets that represent the topics are obtained, the models are evaluated by examining their sensitivity to hyperparameters and their topic coherence values. The simulations showed that DEC and IDEC are robust to hyperparameter changes. DEC and IDEC also outperformed uniform manifold approximation and projection (UMAP)-based K-means and eigenspace-based fuzzy c-means (EFCM) in terms of Word2Vec-based topic coherence (TC-W2V). © 2024 IEEE.
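The clustering objective that DEC optimizes can be illustrated with a small NumPy sketch. It is not the authors' implementation. It follows the standard DEC formulation (a Student's t soft assignment, a sharpened target distribution, and a KL-divergence loss) and replaces the BERT embeddings with synthetic 2-D points. The function names, the toy data, the fixed centroids, and alpha = 1 are all assumptions made for illustration.

```python
import numpy as np

def soft_assign(z, centroids, alpha=1.0):
    """Student's t soft assignment q[i, j] of point i to cluster j."""
    # Squared Euclidean distances between every point and every centroid: (n, k).
    d2 = ((z[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    q = (1.0 + d2 / alpha) ** (-(alpha + 1.0) / 2.0)
    return q / q.sum(axis=1, keepdims=True)

def target_distribution(q):
    """Sharpened target p: square q, divide by soft cluster size, renormalize rows."""
    w = q ** 2 / q.sum(axis=0)
    return w / w.sum(axis=1, keepdims=True)

def kl_divergence(p, q):
    """KL(P || Q), the clustering loss minimized in DEC."""
    return float(np.sum(p * np.log(p / q)))

# Synthetic stand-in for low-dimensional document embeddings (assumption:
# in the study these would come from BERT followed by the DEC/IDEC encoder).
rng = np.random.default_rng(0)
z = np.vstack([
    rng.normal(-2.0, 0.5, size=(20, 2)),
    rng.normal(2.0, 0.5, size=(20, 2)),
])
centroids = np.array([[-2.0, -2.0], [2.0, 2.0]])  # hypothetical initial centroids

q = soft_assign(z, centroids)
p = target_distribution(q)
loss = kl_divergence(p, q)
labels = q.argmax(axis=1)  # hard topic assignment per document
```

In a full model, the gradient of this loss would update both the encoder weights and the centroids. IDEC, as usually described, adds an autoencoder reconstruction loss to the same objective so that the embedding keeps the local structure of the data.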