A Split-then-Join Approach to Abstractive Summarization for Very Long Documents in a Low Resource Setting

Authors: Fazry, Lhuqita; Yulianti, Evi
Information
Journal: Proceedings - IC2IE 2025: 8th 2025 International Conference of Computer and Informatics Engineering: Human-Machine Synergy Brings Together the Physical and Digital Worlds
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: -
Publication Year: 2025
ISBN: 979-833159028-4
Source Type: Scopus
Abstract
Current state-of-the-art abstractive text summarization models, such as BIGBIRD-PEGASUS, are limited by a maximum input capacity of 4,096 tokens, significantly hindering their application to very long documents (exceeding 20,000 tokens). A common practice is to truncate such documents, but this leads to substantial information loss and degraded summarization quality. To address this, we propose a novel "Split-then-Join" (SPIN) approach that enables BIGBIRD-PEGASUS to effectively summarize very long documents in a low-resource setting. Our method strategically augments the training dataset by splitting very long document-summary pairs into smaller, manageable parts that fit within the model's token limit. We introduce three SPIN variants for document-summary pairing during training and evaluate them on subsets of the arXiv and BigPatent datasets specifically filtered for documents longer than 20,000 tokens. Our experimental results demonstrate that SPIN 3 consistently outperforms the baseline BIGBIRD-PEGASUS model and the other SPIN variants, achieving ROUGE-1 scores of 41.7 on arXiv and 35.6 on BigPatent. These findings highlight the efficacy of our split-then-join strategy for very long document summarization and underscore that salient information can be distributed throughout the document, not solely at its beginning. This research provides a practical solution for abstractive summarization of extremely lengthy texts, particularly in scenarios with limited computational resources. © 2025 IEEE.
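
The split-then-join idea described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not say how the three SPIN variants pair document parts with summary parts, so the sketch covers only a generic split and join step. It assumes whitespace-separated words as a stand-in for model tokens, and `split_into_chunks`, `split_then_join`, and the `summarize` callable are hypothetical names introduced for illustration.

```python
from typing import Callable, List

# Input limit stated in the abstract for BIGBIRD-PEGASUS.
MAX_TOKENS = 4096


def split_into_chunks(tokens: List[str], max_tokens: int = MAX_TOKENS) -> List[List[str]]:
    """Split a token list into consecutive parts of at most max_tokens each."""
    if max_tokens <= 0:
        raise ValueError("max_tokens must be positive")
    return [tokens[i:i + max_tokens] for i in range(0, len(tokens), max_tokens)]


def split_then_join(
    document: str,
    summarize: Callable[[str], str],
    max_tokens: int = MAX_TOKENS,
) -> str:
    """Summarize each part separately, then join the partial summaries.

    Assumption: whitespace-separated words approximate tokens. A real system
    would count tokens with the model's own subword tokenizer.
    `summarize` is a hypothetical stand-in for a call to a summarization model.
    """
    tokens = document.split()
    chunks = split_into_chunks(tokens, max_tokens)
    partial_summaries = [summarize(" ".join(chunk)) for chunk in chunks]
    return " ".join(partial_summaries)


if __name__ == "__main__":
    # Toy "summarizer" that keeps the first word of each part.
    print(split_then_join("a b c d e", lambda text: text.split()[0], max_tokens=2))
```

Because every part is summarized, content from the middle and end of the document can reach the final summary, whereas truncation to the first 4,096 tokens would discard it.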