Department of Information Systems, King Abdulaziz University, Jeddah, Saudi Arabia
Assistant Professor, Department of Information Systems, King Abdulaziz University, Jeddah, Saudi Arabia
Associate Professor, Department of Information Systems, King Abdulaziz University, Jeddah, Saudi Arabia
The rapid growth of digital content calls for effective Automatic Text Summarization (ATS) systems. Although summarization of high-resource languages has improved considerably, summarization of Arabic text remains understudied, particularly with respect to comparative studies of document preprocessing methods and word-embedding algorithms. This paper examines how four factors affect graph-based extractive summarization of Arabic news articles: preprocessing method, word embedding, ranking algorithm, and compression ratio. Experiments were conducted on the Essex Arabic Summary Corpus (EASC) with four preprocessing methods (Khoja, Farasa, Qalsadi, and Stanza), two word embedding models (GloVe and AraBERT), two ranking algorithms (PageRank and HITS), and two compression ratios (30% and 40%). Summary quality was measured by the ROUGE-1 F-score. All four factors showed statistically significant differences (p < 0.001). GloVe outperformed AraBERT (average ROUGE-1 F-score of 0.389 vs. 0.36), and the higher compression ratio (40%) performed better than the lower one (30%). Among the preprocessing methods, Khoja and Farasa yielded similar ROUGE-1 F-scores (0.381 and 0.379, respectively), whereas Stanza scored noticeably lower (0.364). The interactions between the preprocessing method and the word embedding model, ranking algorithm, and compression ratio were also statistically significant. Future research will use larger and more varied datasets, together with human evaluation, to provide more comprehensive guidance on choosing preprocessing and representation strategies for Arabic ATS systems. Further work will also explore integrating supervised summarization techniques, deep learning-based systems, and multilingual summarization systems.
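As an illustration of the pipeline the abstract describes, the following is a minimal sketch of graph-based extractive summarization with PageRank ranking and a ROUGE-1 F-score check. It is not the authors' implementation. All function names are hypothetical, sentence vectors are assumed to be precomputed (for example, by averaging GloVe or AraBERT embeddings after preprocessing), and the compression ratio is assumed here to mean the fraction of sentences kept, which may differ from the definition used in the paper.

```python
import math
from collections import Counter


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def pagerank(weights, damping=0.85, max_iter=100, tol=1e-8):
    """Weighted PageRank by power iteration over an n x n edge-weight matrix.

    Simplification: nodes with no outgoing weight do not redistribute their score.
    """
    n = len(weights)
    out_sum = [sum(row) for row in weights]
    scores = [1.0 / n] * n
    for _ in range(max_iter):
        updated = []
        for i in range(n):
            incoming = sum(
                weights[j][i] / out_sum[j] * scores[j]
                for j in range(n) if out_sum[j] > 0
            )
            updated.append((1.0 - damping) / n + damping * incoming)
        delta = sum(abs(a - b) for a, b in zip(updated, scores))
        scores = updated
        if delta < tol:
            break
    return scores


def summarize(sentences, vectors, ratio=0.4):
    """Keep roughly `ratio` of the sentences, ranked by PageRank, in document order."""
    n = len(sentences)
    if n == 0:
        return []
    # Sentence graph: edge weight = cosine similarity, negatives clamped to 0,
    # no self-loops.
    weights = [
        [max(0.0, cosine(vectors[i], vectors[j])) if i != j else 0.0
         for j in range(n)]
        for i in range(n)
    ]
    scores = pagerank(weights)
    keep = max(1, round(ratio * n))  # assumption: ratio counts sentences, not words
    ranked = sorted(range(n), key=lambda i: scores[i], reverse=True)[:keep]
    return [sentences[i] for i in sorted(ranked)]


def rouge1_f(candidate_tokens, reference_tokens):
    """ROUGE-1 F-score: harmonic mean of clipped unigram precision and recall."""
    overlap = sum((Counter(candidate_tokens) & Counter(reference_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(candidate_tokens)
    recall = overlap / len(reference_tokens)
    return 2 * precision * recall / (precision + recall)
```

Calling `summarize(sentences, vectors, ratio=0.3)` or `ratio=0.4` would correspond to the two compression settings compared in the study, under the sentence-count assumption stated above. Swapping `pagerank` for a HITS scorer would give the other ranking variant.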
This is an open access article distributed under the Creative Commons Attribution Non-Commercial (CC BY-NC) License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
The statements, opinions and data contained in the journal are solely those of the individual authors and contributors and not of the publisher and the editor(s). We stay neutral with regard to jurisdictional claims in published maps and institutional affiliations.