Weighted Improvements to TextRank and Lightweight Optimization of BART
Improving the TextRank Algorithm by Fusing Semantic Embeddings and Multi-Dimensional Features
This group of papers focuses on optimizing the classic TextRank algorithm by introducing external semantic information (e.g., BERT, Word2Vec, and FastText word vectors) and statistical features (e.g., TF-IDF, positional weights, part of speech, and BM25). The aim is to overcome the limitations of traditional graph models in capturing long-range semantic associations and representing sentence importance, thereby improving the accuracy of keyword extraction and summary generation. A minimal code sketch of the core idea follows the reference list below.
- Enhancing Arabic Extractive Summarization with TF-IDF-Weighted AraBERT Sentence Embeddings and Semantic Clustering(Wadeea R. Naji, Suresha, Fahd A. Ghanem, 2025, The Indonesian Journal of Computer Science)
- ONTO-TDM domain ontology population for a specific discipline(Rosana Abdoune, Lydia Lazib, Farida Bouarab-Dahmani, J. Fernández-breis, 2024, Applied Ontology)
- IWF-TextRank Keyword Extraction Algorithm Modelling(Liyan Zhang, Wenhui Wang, Jian Ma, Yuan Wen, 2024, Applied Sciences)
- TextRank Keyword Extraction Algorithm Using Word Vector Clustering Based on Rough Data-Deduction(Ning Zhou, Wenqian Shi, Renyu Liang, Na Zhong, 2022, Computational Intelligence and Neuroscience)
- Enhanced TextRank using weighted word embedding for text summarization(Evi Yulianti, Nicholas Pangestu, Meganingrum Arista Jiwanggi, 2023, International Journal of Electrical and Computer Engineering (IJECE))
- Perbandingan TextRank Berbasis TF-IDF dan Word2Vec dalam Peringkasan Teks Berita Bahasa Indonesia(Yohannes Christian Gurning, Samuel Cristian Saragih, Yuyun Yusnida Lase, Julham Julham, 2025, Jurnal Komputer Teknologi Informasi Sistem Informasi (JUKTISI))
- TextRank keyword extraction method weighted by multivariate quantitative indexes(Xin Luan, Wenya Gao, Ming Chen, Dalei Song, 2021, No journal)
- Chinese News Text Abstract Extraction Using Improved MMR(Yuanyuan Zheng, Yunqing Liu, HongChuan Qin, 2021, 2021 International Conference on Electronic Information Engineering and Computer Science (EIECS))
- Research on Computational Pragmatics Model Based on Improved Bayesian Algorithm(Fang Qi, 2025, 2025 2nd International Conference on Intelligent Computing and Robotics (ICICR))
- A Keyword Extraction Method for Transportation Industry Standards based on improved TextRank(Shaoyang Zhang, Runze Ye, Jiye Wang, Kang Yang, 2021, 2021 IEEE 3rd International Conference on Frontiers Technology of Information and Computer (ICFTIC))
- Multiple Choice Question Generation Based on the Improved TextRank(Lai Wei, Guo-sheng Hao, Xia Wang, Shuoshuo Meng, Xiaohan Yang, Yi Zhu, 2024, Proceedings of the 3rd International Conference on Internet Technology and Educational Informatization, ITEI 2023, November 24–26, 2023, Zhengzhou, China)
- A Hybrid Semantic Representation Method Based on Fusion Conceptual Knowledge and Weighted Word Embeddings for English Texts(Zan Qiu, Guimin Huang, Xingguo Qin, Yabing Wang, Jiahao Wang, Ya Zhou, 2024, Inf.)
- Research on Confidentiality Management of Language Technology Resources Based on the Combination of BERT and TextRank Algorithms with a Language Technology Vocabulary(Muchao Chen, Yanjun Zhang, Ping Zhang, 2024, Proceedings of the 3rd International Conference on Signal Processing, Computer Networks and Communications)
- Keyword Acquisition for Language Composition Based on TextRank Automatic Summarization Approach(Yan Jiang, Chunlin Xiang, Lingtong Li, 2024, International Journal of Advanced Computer Science and Applications)
- Obfuscated PHP Webshell Detection Using the Webshell Tailored TextRank Algorithm(Hye Ju Lee, Seon-Jin Hwang, Millati Pratiwi, Yoon-Ho Choi, 2024, Proceedings of the 39th ACM/SIGAPP Symposium on Applied Computing)
- Advanced Text Summarization Model Incorporating NLP Techniques and Feature-Based Scoring(Estabraq Abdulreda Kadhim, M. Feizi-Derakhshi, Hadi S. Aghdasi, 2025, IEEE Access)
- Using the Ship-Gram Model for Japanese Keyword Extraction Based on News Reports(Miao Teng, 2021, Complex.)
- An Intelligent Duplicate Bug Report Detection Method Based on Technical Term Extraction(Xiaoxue Wu, Wenjing Shan, Wei Zheng, Zhiguo Chen, Tao Ren, Xiaobing Sun, 2023, 2023 IEEE/ACM International Conference on Automation of Software Test (AST))
- A Multidimensional-Weighted TextRank and LSTM-Attention Model for Network Public Opinion Sentiment Analysis(Minjie He, Qi Huang, 2025, Informatica)
- Text Recommendation Algorithm Fused with BERT Semantic Information(Xing Xie, Zifeng Ren, Yuming Gu, Chengwen Zhang, 2021, Proceedings of the 2021 5th International Conference on Computer Science and Artificial Intelligence)
- Implementation of text summarization on indonesian scientific articles using textrank algorithm with TF-IDF web-based(Jeremia Jordan Sihombing, Arnita Arnita, Said Iskandar Al Idrus, D. Y. Niska, 2024, Journal of Soft Computing Exploration)
- Design of Intelligent Extraction Method for Key Electronic Information Based on Neural Networks(Xiaoqin Chen, Xiaojun Cheng, 2024, International Journal of Advanced Computer Science and Applications)
- Improving TextRank Keyword Extraction Based on Word Embedding and Text Networks(Tao Wang, Chenhao Zhao, Shuang Ma, Qingyu Zou, 2025, 2025 44th Chinese Control Conference (CCC))
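To make the recurring recipe in this group concrete, the following is a minimal Python sketch of TextRank over sentences represented as TF-IDF-weighted averages of word embeddings. It assumes a pre-trained embedding lookup (`embeddings`, e.g., from Word2Vec or FastText); the function names and parameters are illustrative and not taken from any cited paper.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

def sentence_vectors(sentences, embeddings, dim=100):
    """Represent each sentence as the TF-IDF-weighted average of its word embeddings.
    `embeddings` is a dict-like mapping word -> np.ndarray; unknown words contribute zeros."""
    tfidf = TfidfVectorizer()
    weights = tfidf.fit_transform(sentences).toarray()    # (n_sentences, vocab_size)
    vocab = tfidf.get_feature_names_out()
    vectors = []
    for row in weights:
        vec, total = np.zeros(dim), (row.sum() or 1.0)
        for j, w in enumerate(row):
            if w > 0:
                vec += w * embeddings.get(vocab[j], np.zeros(dim))
        vectors.append(vec / total)
    return np.vstack(vectors)

def textrank(sim, d=0.85, iters=50):
    """Standard TextRank (PageRank power iteration) over a sentence-similarity matrix."""
    sim = sim.copy()
    np.fill_diagonal(sim, 0.0)
    col = sim.sum(axis=0, keepdims=True)
    col[col == 0] = 1.0
    M = sim / col                                   # column-stochastic transition matrix
    scores = np.full(sim.shape[0], 1.0 / sim.shape[0])
    for _ in range(iters):
        scores = (1 - d) / sim.shape[0] + d * M @ scores
    return scores

# Usage: cosine similarity between weighted sentence vectors defines the edge weights,
# and the top-ranked sentences form the extractive summary.
# V = sentence_vectors(sentences, embeddings)
# V_norm = V / (np.linalg.norm(V, axis=1, keepdims=True) + 1e-9)
# summary_idx = np.argsort(-textrank(V_norm @ V_norm.T))[:3]
```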
Applying TextRank in Vertical Domains and to Multi-Source Data
These papers examine the application of improved TextRank and its variants in specific scenarios, including social media (Twitter/Bilibili), e-commerce reviews, public health, financial transaction monitoring, stock price prediction, and low-resource language processing. They highlight the practical value of incorporating domain features (e.g., user engagement, leading-sentence weight, and sentiment orientation) when condensing key information; a sketch of such feature-biased ranking appears after the list below.
- From Tweets to Insights: Enhancing Summarization through Social Interaction Context(Anwar Alhenshiri, 2025, AlQalam Journal of Medical and Applied Sciences)
- Automatic Text Review Summarization of Digital Library System Application using TextRank Algorithm and TF-IDF(Ichwanul Muslim Karo Karo, Adidtya Perdana, Sri Dewi, 2024, 2024 4th International Conference of Science and Information Technology in Smart Administration (ICSINTESA))
- Automatic Text Summarization for Public Health WeChat Official Accounts Platform Base on Improved TextRank(Zixuan Cheng, Shunli Guo, 2022, Journal of Environmental and Public Health)
- Leading Sentence News TextRank(Phua Yeong Tsann, Yew Kwang Hooi, Mohd Fadzil bin Hassan, Matthew Teow Yok Wooi, 2021, 2021 International Conference on Intelligent Cybernetics Technology & Applications (ICICyTA))
- Hot word analysis and sentiment analysis of teaching videos based on Naive Bayes(Lisha Yao, Yiwen Zhang, Hongmei Li, 2025, Proceedings of the 2nd International Conference on Intelligent Education and Computer Technology)
- Construction and Study of a Mongolian Single-Document Summarization Dataset(Qirilige Qi, Qintu Si, Siriguleng Wang, Yongshun Han, 2025, Data Intelligence)
- Sentiment Analysis Using E-Commerce Review Keyword-Generated Image with a Hybrid Machine Learning-Based Model(Jiawen Li, Yuesheng Huang, Yayi Lu, Leijun Wang, Yongqi Ren, Rongjun Chen, 2024, Computers, Materials & Continua)
- Spam Identification Based on Text Feature Fusion(Yujie Jin, Tianbao Xie, 2024, Proceedings of the 2024 International Conference on Intelligent Education and Computer Technology)
- Design and application of transaction monitoring visualization system in banking financial business(Wei-Pei Fu, 2026, No journal)
- Research on stock price prediction using TextRank based text summarization technology and sentiment analysis(Hengxuan Cui, Yingjie Zhu, Fangqing Gu, Lianshuang Wang, 2022, 2022 18th International Conference on Computational Intelligence and Security (CIS))
- Cross-Platform Event Popularity Analysis via Dynamic Time Warping and Neural Prediction(Xiaofeng Gao, Wenyi Xu, Zixuan Zhang, Yan Tang, Guihai Chen, 2023, IEEE Transactions on Knowledge and Data Engineering)
- Research on Long Text Similarity Calculation Method Based on TextRank and BERT(Xi Zhao, Binglin Zhu, Xiaofeng Liu, 2024, 2024 4th Asia Conference on Information Engineering (ACIE))
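Several of the application papers above bias the ranking with domain signals such as leading-sentence position or comment engagement. A minimal sketch of that idea, expressed as a personalized teleport vector in the PageRank iteration (an illustrative formulation, not any specific paper's method):

```python
import numpy as np

def biased_textrank(sim, bias, d=0.85, iters=50):
    """TextRank where the random-jump distribution follows a domain bias vector,
    e.g. extra mass on a news article's leading sentences or on highly engaged comments."""
    sim = sim.copy()
    np.fill_diagonal(sim, 0.0)
    col = sim.sum(axis=0, keepdims=True)
    col[col == 0] = 1.0
    M = sim / col
    p = np.asarray(bias, dtype=float)
    p = p / (p.sum() or 1.0)                 # normalize the bias into a distribution
    scores = np.full(len(p), 1.0 / len(p))
    for _ in range(iters):
        scores = (1 - d) * p + d * M @ scores
    return scores

# Example bias for news: the lead sentence gets the most weight, later ones decay.
# bias = np.array([1.0 / (i + 1) for i in range(sim.shape[0])])
```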
Knowledge Distillation and Quantization for Compressing BART
This group of papers focuses on performance optimization of BART and its lightweight variants (e.g., DistilBART). Using knowledge distillation, model quantization, and latent-space compression (via inserted autoencoders), they markedly reduce parameter counts and inference latency while preserving generation quality, making the models suitable for resource-constrained settings such as video summarization and educational translation. An illustrative distillation-and-quantization sketch follows the reference list.
- The extraction of a brief summary from scientific documents using machine learning methods(Gulden Murzabekova, Galia Mukhamedrakhimova, Zhazira Taszhurekova, Yerbol Yerbayev, Zhanagul Doumcharieva, V. Makhatova, Moldir Tolganbaeva, S. Serikbayeva, 2025, Bulletin of Electrical Engineering and Informatics)
- The Perils of Naive Truncation: A Context Ablation Study for Dialogue Summarization on DialogSum(S. Rafi, L. V. S. R. Boddu, Devasish Viswanadh Kolla, Chandana Pothuguntla, Rukmini Bhandara Myla, 2025, 2025 5th International Conference on Artificial Intelligence and Signal Processing (AISP))
- Comparative Analysis of Pretrained Encoder-Decoder Transformer Models for Extreme Text Summarization(Tamma Rajyalakshmi, K. S. Kuppusamy, 2023, 2023 Second International Conference on Advances in Computational Intelligence and Communication (ICACIC))
- Evaluating Text Summaries Generated by Large Language Models Using OpenAI's GPT(Hassan Shakil, Atqiya Munawara Mahi, Phuoc Nguyen, Zeydy Ortiz, Mamoun T. Mardini, 2024, 2024 International Conference on Machine Learning and Applications (ICMLA))
- Comparative Analysis of News Articles Summarization using LLMs(Archanaa. N, S. B, Suwin Kumar. J.D. T, B. G., Srinath Doss, 2024, 2024 Asia Pacific Conference on Innovation in Technology (APCIT))
- Automated Summarization of E-Commerce Application Reviews for Generating Application Descriptions to Enhance Customer Insights(Jessica Berliani, Alicia Jocelyn Siahaya, L. A. Wulandhari, Ghinaa Zain Nabiilah, 2025, 2025 IEEE International Conference on Industry 4.0, Artificial Intelligence, and Communications Technology (IAICT))
- Exploring Abstractive Summarization: A Comparative Study of PEGASUS, DistilBART, BART, Microsoft ProphetNet, GPT-2, and GPT Models(Siddhi Kamalkar, K. C. R., 2025, 2025 7th International Conference on Intelligent Sustainable Systems (ICISS))
- Efficient Latent Space Compression for Lightning-Fast Fine-Tuning and Inference of Transformer-Based Models(A. A. Falaki, R. Gras, 2023, Mach. Learn. Knowl. Extr.)
- A Robust Approach to Fine-tune Pre-trained Transformer-based models for Text Summarization through Latent Space Compression(A. A. Falaki, R. Gras, 2022, 2022 21st IEEE International Conference on Machine Learning and Applications (ICMLA))
- AI Video Summarizer(Jay Kapadiya, 2025, International Journal for Research in Applied Science and Engineering Technology)
- Compression of Generative Pre-trained Language Models via Quantization(Chaofan Tao, Lu Hou, Wei Zhang, Lifeng Shang, Xin Jiang, Qun Liu, Ping Luo, Ngai Wong, 2022, ArXiv)
- Summarization and Translation of NPTEL Videos into Regional Indian Languages(Gnanapriya M A, P. A, R. Ramakrishnan, Susmitha Vekkot, Vivek Venugopal, 2025, 2025 International Conference on Advancements in Power, Communication and Intelligent Systems (APCI))
- DQ-BART: Efficient Sequence-to-Sequence Model via Joint Distillation and Quantization(Zheng Li, Zijian Wang, Ming Tan, Ramesh Nallapati, Parminder Bhatia, Andrew O. Arnold, Bing Xiang, Dan Roth, 2022, No journal)
- Comparative Performance Analysis of Transformer-Based Summarization Models Using Transcripts(Mahitha Tenneti, P. Prithvi, 2025, 2025 5th International Conference on Artificial Intelligence and Signal Processing (AISP))
- A Comparative Study on the Quality of Cybersecurity News Summarization Methods(Thanchanok Leartsathienchai, Somkiat Kosolsombat, Chiabwoot Ratanavilisagul, 2025, 2025 10th International Conference on Computational Intelligence and Applications (ICCIA))
- Metric-guided Distillation: Distilling Knowledge from the Metric to Ranker and Retriever for Generative Commonsense Reasoning(Xingwei He, Yeyun Gong, Alex Jin, Weizhen Qi, Hang Zhang, Jian Jiao, Bartuer Zhou, Biao Cheng, Sm Yiu, Nan Duan, 2022, No journal)
- CAKD: A Confidence-Aware Knowledge Distillation Approach for Building Compact and Efficient LLMs(Mohammad Basheer Kotit, Omama Hamad, K. Shaban, Ali Hamdi, 2025, 2025 IEEE/ACS 22nd International Conference on Computer Systems and Applications (AICCSA))
- WhatsApp Romanized Sinhala (Singlish) Group Chat Summarization Using NLP Techniques(Patabandi K. P. D. P, Rathsara K. M. A. C. D, Nirmani H. M. C, 2025, Asian Journal of Research in Computer Science)
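As a rough illustration of the distillation-plus-quantization recipe explored in this group, the sketch below blends a temperature-softened KL term against a teacher with the usual cross-entropy on labels, and then applies post-training dynamic quantization to the student. The checkpoint name is the public Hugging Face DistilBART model; the loss is a generic knowledge-distillation formulation, not the exact objective of DQ-BART or any other single cited work.

```python
import torch
import torch.nn.functional as F
from transformers import AutoModelForSeq2SeqLM

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """Blend soft-target KL (teacher guidance) with hard-label cross-entropy."""
    kl = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    ce = F.cross_entropy(
        student_logits.view(-1, student_logits.size(-1)),
        labels.view(-1),
        ignore_index=-100,          # ignore padded decoder positions
    )
    return alpha * kl + (1 - alpha) * ce

# Post-training dynamic quantization of the student's linear layers for CPU inference:
# student = AutoModelForSeq2SeqLM.from_pretrained("sshleifer/distilbart-cnn-12-6")
# quantized = torch.quantization.quantize_dynamic(student, {torch.nn.Linear}, dtype=torch.qint8)
```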
Structured Pruning and Efficient Deployment Strategies for Transformer Architectures
These papers study general-purpose compression techniques for BART and large language models (LLMs), including layer collapse (LaCo), attention-head pruning, magnitude- and activation-aware weight pruning (Wanda/Wanda++), structural pruning (e.g., LLM-Pruner), prompt compression (LLMLingua-2), and modular replacement, with the goal of deploying models efficiently on edge devices. A simplified pruning-criterion sketch appears after the reference list.
- LaCo: Large Language Model Pruning via Layer Collapse(Yifei Yang, Zouying Cao, Hai Zhao, 2024, No journal)
- Efficient Transformer-Based Abstractive Urdu Text Summarization Through Selective Attention Pruning(Muhammad Azhar, Adeen Amjad, Ghulam Farid, D. A. Dewi, M. Batumalay, 2025, Inf.)
- Modular Transformers: Compressing Transformers into Modularized Layers for Flexible Efficient Inference(Wangchunshu Zhou, Ronan Le Bras, Yejin Choi, 2023, No journal)
- Zero-Shot Sentiment Analysis Exploring BART Models(Konstantinos Kyritsis, I. Perikos, M. Paraskevas, 2023, 2023 IEEE/ACIS 8th International Conference on Big Data, Cloud Computing, and Data Science (BCD))
- TranscribEase – Speech and Document Summarizer(Nisarga S, Arpitha Devangavi, Prerana M Otageri, Prarthana N, 2024, June 2024)
- Deep learning-based modified transformer model for automated news article summarization(B. Srinivas, Lavanya Bagadi, K. Darimireddy Naresh, P. Surya Prasad, Sivaji Satrupalli, B. Anil Kumar, 2024, Facta universitatis - series: Electronics and Energetics)
- DRP: Distilled Reasoning Pruning with Skill-aware Step Decomposition for Efficient Large Reasoning Models(Yuxuan Jiang, Dawei Li, Francis Ferraro, 2025, ArXiv)
- Compact Language Models via Pruning and Knowledge Distillation(Saurav Muralidharan, Sharath Turuvekere Sreenivas, Raviraj Joshi, Marcin Chochowski, M. Patwary, M. Shoeybi, Bryan Catanzaro, Jan Kautz, Pavlo Molchanov, 2024, ArXiv)
- A Simple and Effective Pruning Approach for Large Language Models(Mingjie Sun, Zhuang Liu, Anna Bair, J. Z. Kolter, 2023, ArXiv)
- The Oracle and The Prism: A Decoupled and Efficient Framework for Generative Recommendation Explanation(Jiaheng Zhang, Daqiang Zhang, 2025, ArXiv)
- Reinforcement learning for LLM-based explainable TCM prescription recommendation with implicit preferences from small language models(XinYu Wang, Xiaohe Sun, Lei Yang, Yitong Zhang, Tao Yang, Jiadong Xie, Kongfa Hu, 2025, Chinese Medicine)
- BERT Busters: Outlier Dimensions that Disrupt Transformers(Olga Kovaleva, Saurabh Kulshreshtha, Anna Rogers, Anna Rumshisky, 2021, No journal)
- Improving Language Model Distillation through Hidden State Matching(Sayantani Dasgupta, Trevor Cohn, 2025, No journal)
- RSSN at Multilingual Counterspeech Generation: Leveraging Lightweight Transformers for Efficient and Context-Aware Counter-Narrative Generation(V. Ravindran, 2025, No journal)
- Wanda++: Pruning Large Language Models via Regional Gradients(Yifan Yang, Kai Zhen, Bhavana Ganesh, A. Galstyan, Goeric Huybrechts, Markus Muller, Jonas M. Kubler, R. Swaminathan, Athanasios Mouchtaris, S. Bodapati, Nathan Susanj, Zheng Zhang, Jack G. M. Fitzgerald, Abhishek Kumar, 2025, No journal)
- LLM-Pruner: On the Structural Pruning of Large Language Models(Xinyin Ma, Gongfan Fang, Xinchao Wang, 2023, ArXiv)
- Efficient self-attention with smart pruning for sustainable large language models(S. Belhaouari, Insaf Kraidia, 2025, Scientific Reports)
- LLMLingua-2: Data Distillation for Efficient and Faithful Task-Agnostic Prompt Compression(Zhuoshi Pan, Qianhui Wu, Huiqiang Jiang, Menglin Xia, Xufang Luo, Jue Zhang, Qingwei Lin, Victor Rühle, Yuqing Yang, Chin-Yew Lin, H. V. Zhao, Lili Qiu, Dongmei Zhang, 2024, No journal)
- FGeo-TP: A Language Model-Enhanced Solver for Geometry Problems(Yiming He, Jia Zou, Xiaokai Zhang, Na Zhu, Tuo Leng, 2024, ArXiv)
- DP-BART for Privatized Text Rewriting under Local Differential Privacy(Timour Igamberdiev, Ivan Habernal, 2023, ArXiv)
- Efficient Large Language Model Fine-Tuning with Joint Structural Pruning and Parameter Sharing(Rui Wang, Yumin Chen, Mengmeng Liu, Guiran Liu, Binrong Zhu, Wuyang Zhang, 2025, 2025 6th International Conference on Computer Vision and Data Mining (ICCVDM))
- PathRAG: Pruning Graph-based Retrieval Augmented Generation with Relational Paths(Boyu Chen, Zirui Guo, Zi Yang, Yuluo Chen, Junze Chen, Zhenghao Liu, Chuan Shi, Cheng Yang, 2025, ArXiv)
- What Kind of Visual Tokens Do We Need? Training-free Visual Token Pruning for Multi-modal Large Language Models from the Perspective of Graph(Yutao Jiang, Qiong Wu, Wenhao Lin, Wei Yu, Yiyi Zhou, 2025, No journal)
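For a concrete sense of the pruning criteria surveyed here, below is a simplified, layer-local sketch of the Wanda score (weight magnitude times input-activation norm), zeroing a fixed fraction of weights per output row. It illustrates the criterion only and is not a reproduction of the released Wanda or Wanda++ code.

```python
import torch

def wanda_prune_linear(weight: torch.Tensor, act_norm: torch.Tensor, sparsity: float = 0.5):
    """weight: (out_features, in_features); act_norm: per-input-feature L2 norm of
    calibration activations. Returns the weight matrix with low-importance entries zeroed."""
    score = weight.abs() * act_norm.unsqueeze(0)     # importance of each individual weight
    k = int(weight.size(1) * sparsity)               # number of weights to drop per row
    if k == 0:
        return weight
    threshold = torch.kthvalue(score, k, dim=1, keepdim=True).values
    mask = score > threshold                         # keep only the higher-scoring weights
    return weight * mask

# Usage: run a few calibration batches through the layer, accumulate the per-feature
# L2 norm of its inputs as `act_norm`, then replace the layer's weight with the pruned one.
```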
Improving Faithfulness and Optimizing Generation Strategies in BART Summarization
This group of papers addresses hallucination and factual consistency in abstractive summarization. By introducing keyword guidance (RAKE/KeyBERT), an entity-faithfulness reward (Entity Hallucination Index, EHI), retrieval augmentation (Re2G), and reinforcement-learning feedback, they refine BART's generation strategy to keep summaries accurate and logically coherent. An illustrative sketch of keyword-guided generation follows the reference list.
- Improving Abstractive News Summarization Using Keyword Extraction for Human-Like Summaries(Maria Linneke Adjie, Wilson Gregory Pribadi, L. A. Wulandhari, Ghinaa Zain Nabiilah, 2025, 2025 IEEE International Conference on Industry 4.0, Artificial Intelligence, and Communications Technology (IAICT))
- Improving Factual Error Correction for Abstractive Summarization via Data Distillation and Conditional-Generation Cloze(Yingqi Zhu, Yiyang Li, Lei Li, Dingxing Hu, Xueyi Hao, Dongsheng Chen, Xingyue Zhang, Zhejun Zhang, Yanquan Zhou, Marina Litvak, N. Vanetik, 2024, IEEE Transactions on Audio, Speech and Language Processing)
- Fine-Tuning Large Language Models Using Entity Hallucination Index for Text Summarization.(P. K., R. Balabantaray, K. Vittala, Muktikanta Sahu, 2026, Journal of visualized experiments : JoVE)
- Probing of Quantitative Values in Abstractive Summarization Models(Nathan M. White, 2022, ArXiv)
- Intelligent Multi-Document Summarisation for Extracting Insights on Racial Inequalities from Maternity Incident Investigation Reports(Georgina Cosma, Mohit Kumar Singh, Patrick Waterson, G. T. Jun, Jonathan Back, 2024, No journal)
- Re2G: Retrieve, Rerank, Generate(Michael R. Glass, Gaetano Rossiello, Md. Faisal Mahbub Chowdhury, Ankita Rajaram Naik, Pengshan Cai, A. Gliozzo, 2022, ArXiv)
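A minimal sketch of the keyword-guidance idea described in this group, using KeyBERT as a stand-in extractor (RAKE or YAKE would slot in the same way) and the public facebook/bart-large-cnn checkpoint. The prompt format and function name are illustrative assumptions, not the cited papers' exact pipelines.

```python
from transformers import pipeline
from keybert import KeyBERT

def keyword_guided_summary(article: str, max_len: int = 128) -> str:
    """Extract salient keywords, then prepend them to the input so the decoder is
    nudged toward covering (and staying faithful to) those key facts."""
    keywords = [kw for kw, _ in KeyBERT().extract_keywords(article, top_n=5)]
    summarizer = pipeline("summarization", model="facebook/bart-large-cnn")
    guided_input = "keywords: " + ", ".join(keywords) + " | article: " + article
    return summarizer(guided_input, max_length=max_len, truncation=True)[0]["summary_text"]
```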
Foundational Research on Weighted Graph Modeling and Knowledge Distillation
These papers supply the underlying mathematical support for graph-algorithm improvements and model compression, covering edge clustering in weighted networks, adjacency-matrix analysis, logit standardization, and cross-disciplinary graph-ranking applications (e.g., ranking biomedical driver genes and designing epidemic protection strategies), providing methodological references for optimizing TextRank and BART. A short sketch of logit standardization in distillation appears after the list.
- Topic-Weighted Kernels: Text Kernels Integrating Topic Weights and Deep Word Embeddings for Semantic Text Analytics(Nikhil V. Chandran, V. S. Anoop, S. Asharaf, 2025, IEEE Access)
- LogTIW: A log anomaly detection model based on TF-IDF weighted semantic features(Jia Kang, Junfeng Zhao, Zhengxin Li, 2024, 2024 International Joint Conference on Neural Networks (IJCNN))
- Protection Strategy against an Epidemic Disease on Edge-Weighted Graphs Applied to a COVID-19 Case(Ronald Manríquez, Camilo Guerrero-Nancuante, Carla Taramasco, 2021, Biology)
- Prioritization of cancer driver gene with prize-collecting steiner tree by introducing an edge weighted strategy in the personalized gene interaction network(Shaohua Zhang, Zhen-Nan Wang, Yan Li, Wei-feng Guo, 2022, BMC Bioinformatics)
- Extracting and ranking product features in consumer reviews based on evidence theory(Lixin Zhou, Li Tang, Zhenyu Zhang, 2022, Journal of Ambient Intelligence and Humanized Computing)
- SAW-GCN: A Semantic Class-Aware Weighted GCN for Offensive Language Detection(Zineb Ferhat Hamida, 2025, 2025 International Conference on Artificial Intelligence and Innovative Applications (AIIA))
- Logit Standardization in Knowledge Distillation(Shangquan Sun, Wenqi Ren, Jingzhi Li, Rui Wang, Xiaochun Cao, 2024, 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR))
- Crowd counting at the edge using weighted knowledge distillation(Muhammad Asif Khan, Hamid Menouar, Ridha Hamila, Adnan Abu-Dayya, 2025, Scientific Reports)
- Synchronization Acceleration of Networked Systems via Edge Addition to Single-Root Weighted Digraphs(Haosen Cao, Hai-Tao Zhang, Lihua Xie, 2025, IEEE Transactions on Automatic Control)
- Text Summarization using Textrank, Lexrank and Bart model(Sherin Mariam Jijo, Disha Panchal, Jalpa Ardeshana, Urvashi Chaudhari, 2024, 2024 15th International Conference on Computing Communication and Networking Technologies (ICCCNT))
- Automatic Text Summarization Method Based on Improved TextRank Algorithm and K-Means Clustering(Wenjun Liu, Yuyan Sun, Bao Yu, Hailan Wang, Qingcheng Peng, Mengshu Hou, Huan Guo, Hai Wang, Cheng Liu, 2024, Knowl. Based Syst.)
- An Order Approach for the Core Maintenance Problem on Edge-Weighted Graphs(Bin Liu, Zhenming Liu, Feiteng Zhang, 2021, No journal)
- Order based algorithms for the core maintenance problem on edge-weighted graphs(Feiteng Zhang, Bin Liu, Zhenming Liu, Q. Fang, 2022, Theor. Comput. Sci.)
- Model-based edge clustering for weighted networks with a noise component(Haomin Li, Daniel K. Sewell, 2025, Comput. Stat. Data Anal.)
- Some interlacing results on weighted adjacency matrices of graphs with degree-based edge-weights(Xueliang Li, N. Yang, 2023, Discret. Appl. Math.)
- An Interlayer Link Prediction Method Based on Edge-Weighted Embedding(Hefei Hu, Sirui Zhang, Yanan Wang, 2023, Complex.)
- Automatic Graph Modeling of Power Transformer Management Data(Wenqi Huang, Jiayi Zhang, Wei Xi, Yachen Tang, Qinyu Feng, Guangyi Liu, 2023, 2023 IEEE International Conference on Advanced Power System Automation and Protection (APAP))
- Learning to solve Minimum Cost Multicuts efficiently using Edge-Weighted Graph Convolutional Neural Networks(Steffen Jung, Margret Keuper, 2022, No journal)
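Among the methodological building blocks listed above, logit standardization in knowledge distillation is compact enough to show directly: z-scoring teacher and student logits before the softened softmax removes the scale mismatch between them. The sketch below is a simplified single-term form, not the full recipe of the cited CVPR 2024 paper.

```python
import torch
import torch.nn.functional as F

def standardize(logits: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Z-score logits along the class dimension so only their relative shape matters."""
    mean = logits.mean(dim=-1, keepdim=True)
    std = logits.std(dim=-1, keepdim=True)
    return (logits - mean) / (std + eps)

def zscore_kd_loss(student_logits, teacher_logits, T: float = 2.0) -> torch.Tensor:
    s = F.log_softmax(standardize(student_logits) / T, dim=-1)
    t = F.softmax(standardize(teacher_logits) / T, dim=-1)
    return F.kl_div(s, t, reduction="batchmean") * (T * T)
```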
This report brings together two frontier directions: weighted improvements to TextRank and lightweight optimization of BART. The research landscape clearly reflects how natural language processing balances semantic enhancement against computational efficiency. On one side, fusing deep-learning embeddings (BERT/Word2Vec) with multi-dimensional statistical features gives the classic unsupervised graph algorithm TextRank much stronger semantic awareness and has enabled its wide use in vertical domains. On the other side, for pretrained models represented by BART, knowledge distillation, structured pruning, quantization, and latent-space compression substantially lower the deployment barrier of large generative models, while retrieval augmentation and faithfulness-oriented strategies address the challenge of factual consistency in generated content. Together, these studies push efficient and accurate text analysis and generation technology toward industrial adoption.
A total of 101 related references were surveyed.
Selected abstracts from the surveyed literature follow; entries without an available abstract are omitted.
The length of a news article may influence people’s interest to read the article. In this case, text summarization can help to create a shorter representative version of an article to reduce people’s read time. This paper proposes to use weighted word embedding based on Word2Vec, FastText, and bidirectional encoder representations from transformers (BERT) models to enhance the TextRank summarization algorithm. The use of weighted word embedding is aimed to create better sentence representation, in order to produce more accurate summaries. The results show that using (unweighted) word embedding significantly improves the performance of the TextRank algorithm, with the best performance gained by the summarization system using BERT word embedding. When each word embedding is weighed using term frequency-inverse document frequency (TF-IDF), the performance for all systems using unweighted word embedding further significantly improve, with the biggest improvement achieved by the systems using Word2Vec (with 6.80% to 12.92% increase) and FastText (with 7.04% to 12.78% increase). Overall, our systems using weighted word embedding can outperform the TextRank method by up to 17.33% in ROUGE-1 and 30.01% in ROUGE-2. This demonstrates the effectiveness of weighted word embedding in the TextRank algorithm for text summarization.
In the process of keyword extraction, news text has its uniqueness. Keywords extraction of news text not only needs to pay attention to the difference of quantitative indexes of words, but also needs to consider the influence of phrases. In order to improve the keyword extraction effect of news texts, this paper constructs a keyword graph based on TextRank, improves the probability transition matrix by combining four quantitative indicators of node frequency, location, span and part of speech, realizing the weight difference of words. Considering the influence of word segmentation technology on phrases extraction, the reconstruction of phrases is completed according to the law of recombination and the concept of combinatorial entropy is defined to realize the filtering of reconstructed phrases. According to the statistical quantitative index of phrases, the linear weighted value is assigned to the reconstructed phrases, and finally, the TopN words or phrases are selected as keywords according to their weight value. Experimental results show that the proposed algorithm is not only superior to the traditional TextRank and TF-IDF algorithms, but also has great advantages compared with the improved PositionRank and MyWPMWRank algorithms, the F value of which can be increased by 9.75% at most, which effectively improves the keywords extraction effect of news text.
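The abstract above describes re-weighting the TextRank transition matrix with per-word quantitative indicators (frequency, position, span, part of speech). The paper's own code is not reproduced here; the snippet below is only an illustrative sketch of a node-weighted transition matrix of that kind, with the combined feature scores assumed as given inputs.

```python
import numpy as np

def weighted_transition(adjacency: np.ndarray, node_weights: np.ndarray) -> np.ndarray:
    """adjacency[i, j]: co-occurrence strength of words i and j;
    node_weights[i]: combined quantitative score of word i (frequency, position, ...)."""
    W = adjacency * node_weights[:, None]     # bias jumps toward high-scoring words
    col = W.sum(axis=0, keepdims=True)
    col[col == 0] = 1.0
    return W / col                            # column-stochastic matrix for the power iteration

# Fed into the same PageRank-style iteration as the earlier TextRank sketch, the
# top-N words by score become the extracted keywords or reconstructed phrases.
```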
This paper proposes a method for classifying the security level of language technology resources by combining the BERT and TextRank algorithms, aimed at improving the efficiency and accuracy of language technology resource classification. Initially, the BERT model is used to generate a language technology vocabulary, and based on this vocabulary, weighted sentence vectors for the target document are created. Subsequently, the BERT word vectors are integrated with the enhanced TextRank algorithm, where the similarity between sentence weights and classification rules determines the security level of the language technology resource. Experimental results demonstrate that the proposed method achieves superior accuracy compared to traditional classification methods. With the increasing significance of language technology in information security, the development of more efficient models for classification will be a crucial direction for future research.
The most common traditional approaches to summarizing large texts while retaining their importance are TF-IDF and TextRank. However, these methods often fail to retain narrative coherence and accuracy. This study’s improved summarization methodology overcomes these limitations by combining linguistic and semantic resources. Moreover, although it is more computationally complex, it efficiently combines higher quality with faster summarization. Specifically, the method relies on a weighted feature score scheme. For example, various textual features such as Named Entity Counts, Noun Counts, and Sentence Position contribute to the summarization quality appropriately. This study’s summarization algorithm was tested using the CNN, XSum and BBC Summarization datasets, which aggregate documents from different areas. The methodology was checked against traditional methods using ROUGE-1, ROUGE-2, ROUGE-L and BERTScore. The last one, BERTScore, evaluates the semantic similarity of the generated summaries and the references. This study shows that the proposed methodology generates summaries that are not only informative but also semantically faithful to the original textual information; it achieves high F1 scores across different evaluations: BERTScore (0.8857), ROUGE-1 (0.6388), ROUGE-2 (0.5662) and ROUGE-L (0.6421). It thus suggests that the approach is applicable in real-life situations and deserves further research.
This paper conducts a comprehensive and in-depth analysis of the popular vocabulary data in Bilibili's educational videos, covering multiple key indicators of the videos. Through quantitative statistical analysis, it aims to reveal the intrinsic connections and patterns among these indicators. Additionally, a Naive Bayes classifier combined with weighted TF-IDF and TextRank algorithms is employed to perform sentiment analysis on the bullet comments, categorizing them into positive and negative sentiments and calculating the proportion of each sentiment tendency. Through sentiment analysis, a deeper understanding of users' emotional attitudes and preferences towards the video content can be achieved. This paper analyzes the popular vocabulary data in Bilibili's educational videos, explores the patterns of video indicators, identifies knowledge hotspots and learning demands, and interprets users' emotional feedback. The research results can help educators grasp students' preferences, optimize the content of educational videos, and enhance the efficiency of knowledge dissemination. It provides data for educational institutions to build user profiles and improve evaluation systems, promoting the innovation of teaching models; its methods and perspectives can also offer new ideas for educational emotion research and teaching strategy improvement, contributing to the enhancement of the quality of smart education.
Social media platforms, particularly Twitter (now X), generate massive volumes of short and fragmented posts that pose significant challenges for information retrieval and knowledge extraction. Traditional tweet summarization methods rely primarily on the content of individual tweets, often producing incomplete and shallow summaries that fail to capture the broader discourse context. This study introduces a comment-aware summarization framework that integrates user replies and interactions into the summarization process, aiming to enhance informativeness, coherence, and contextual richness. A dataset of tweets and associated comments was manually collected and preprocessed. Two summarization interfaces were developed: (1) a baseline tweet-only system and (2) an experimental comment-aware system that combines tweets with semantically relevant, engagement-weighted comments. Both systems employed TF–IDF vectorization, cosine similarity, and the TextRank algorithm for extractive summarization. A within-subjects user study with 32 participants compared the two systems across quantitative metrics (task completion time, number of queries, correctness of results) and qualitative dimensions (satisfaction, credibility, ease of use, diversity of perspectives). Results showed that the comment-aware system required slightly longer task times but significantly reduced the number of queries (p < 0.001) and improved accuracy (p < 0.01). Participants consistently rated the comment-aware summaries as more informative and reflective of diverse perspectives. These findings highlight the value of socially enriched summarization in overcoming the limitations of short-form content and provide design recommendations for next-generation information retrieval systems.
With the rapid development of Internet finance, the traditional banking financial business monitoring methods have been unable to meet the growing business needs due to scalability and reliability problems. This study designed and implemented a visualization system for banking transaction monitoring based on computer technology. The core of this system lies in the use of efficient log capture technology, structured storage methods, and multi perspective visualization analysis technology, while integrating transaction volume prediction and warning functions. By adopting a text keyword extraction algorithm based on weighted TextRank and combining it with a dynamic threshold adjustment mechanism, the system can significantly reduce false alarm rates and improve response speed. In addition, by utilizing distributed graph storage technology, the system can maintain high performance even when the amount of data increases. This study aims to build an intuitive and efficient monitoring platform to accelerate business insights and enhance the risk management and business decision-making capabilities of banks.
With the rapid development of the Internet and other emerging media, how to find the needed information from massive electronic documents in time and accurately has become an urgent problem. A key electronic information extraction method based on neural network learning ideas has been proposed to solve the problems of time-consuming and difficult deep semantic feature mining in traditional text classification methods. Firstly, a weighted graph model was introduced to improve the TextRank keyword extraction algorithm, helping to capture complex data information and implicit semantics. The results indicate that the optimization method has the highest extraction accuracy (96.52%) on the CSL dataset, and its performance in feature extraction of information data is superior to other comparative models. Secondly, LSTM is combined with a self-attention mechanism to extract key features of contextual semantic information. The results indicate that this optimization method has relatively small training and testing errors in data classification, and tends to converge in the later stages of iteration. The accuracy of information extraction reached 94.37%, which is better than other comparative models. The keyword extraction integrity of the fusion model on the THUCNews dataset and Sogou News dataset was 86.2 and 84.1, respectively, with consistency of 96.3 and 94.7, and grammatical correctness of 92.1 and 92.2, respectively. The neural network-based extraction method proposed in this study can not only effectively improve the accuracy of information extraction, but also adapt to the changing data environment, and has great potential for application in the field of electronic information processing.
Faced with the problem of text recommendation with massive data on the Internet, the use of a recommendation method based on deep learning combined with semantic information will improve the accuracy of the recommendation results. Therefore, we propose a HyReB (Hybrid Recommendation algorithm combining BERT and CNN network). The algorithm HyReB uses the BERT word vector as the input of the CNN network and incorporates external semantic information in features extraction and topic classification. Then we combine BERT and TextRank algorithms to extract text keywords and calculate the BERT word vector similarity of topic word. Finally, we do the weighted calculation of the label proportion of the recommended text and the similarity of the topic word vector to make the text top-N recommendation. The HyReB algorithm makes user interest extraction more refined and incorporates BERT semantic information into the text recommendation. Experiments show that the feature extraction of HyReB is more accurate and has a better recommendation effect when performing small-scale accurate text recommendation.
DNA methylation is a critical epigenetic modification that plays a central role in gene regulation, cellular differentiation, and the development of various diseases, including cancers. Aberrant methylation patterns have emerged as both biomarkers and mechanistic drivers in pathogenesis, underscoring the urgent need for precise and efficient predictive tools. Although some deep learning techniques have advanced methylation prediction, most existing models are trained independently on single-species datasets. This species-specific approach limits efficiency because each model can only handle one dataset at a time, often at the expense of predictive performance. Additionally, the state-of-the-art deep learning models tend to have enormous parameter counts and computational overhead, making them impractical for integration into local offline software applications. To overcome these challenges, we propose AutoFE-Pointer, a lightweight and novel framework that harnesses an improved softened pointer network to dynamically extract and weight features from diverse DNA sequences. AutoFE-Pointer is designed to simultaneously process 17 different benchmark datasets spanning multiple species, achieving superior performance compared to models that are trained individually on single-species data. In doing so, it not only offers state-of-the-art predictive accuracy and robust cross-species generalization but also significantly reduces computational demands, facilitating its deployment in local offline environments. This breakthrough represents a significant advancement in the field of epigenetic modeling and computational biology.
The development of information technology has significantly changed how information is accessed, necessitating readers to absorb content efficiently and make quick decisions. To address this challenge, this research developed a text summarization system specifically for Indonesian scientific articles using a web-based implementation of the TextRank and TF-IDF algorithms. TextRank was selected for its capability to identify key sentences without requiring training data, while TF-IDF was employed to weight words based on their frequency within the document. The dataset comprised 100 scientific articles in Indonesian from the Unimed Kode Journal, covering the years 2022-2024. The summarization process included several critical stages: text preprocessing, TF-IDF weighting, cosine similarity calculation, and sentence ranking. The resulting summaries were rigorously evaluated by language experts and website specialists using a Likert scale to assess both the quality of the summaries and the usability of the system. The findings demonstrated that the system effectively generated summaries that retained essential information from the original articles, with the highest accuracy observed at a 50% compression rate (88.533%). Additionally, the system achieved good performance at 40% compression (85.133%) and 30% compression (81.26%). The web-based system allows users to input article text and quickly obtain a summary, offering a practical tool for researchers and readers to efficiently comprehend academic content.
Webshells, which are malicious tools that enable unauthorized command execution on web pages, pose threats, such as server attacks and data breaches. In response, system administrators continuously monitor web vulnerabilities and potential webshell attacks. However, attackers often obfuscate webshells to avoid detection. Existing static webshell detection methods commonly rely on static statistical features and focus on obfuscation characteristics. Consequently, this approach may inadvertently increase the likelihood of falsely identifying an obfuscated normal file as a webshell. That is, if obfuscated general files are incorporated into a dataset for algorithm and information protection purposes, an obfuscation bias problem arises, thereby diminishing the webshell detection efficacy. To mitigate this, we propose a webshell detection methodology that leverages a webshell-tailored TextRank algorithm. This approach aims to detect webshells without introducing the biases associated with obfuscation techniques. This method involves deobfuscating the source code and generating a feature matrix from the operation code (opcode) and Abstract Syntax Tree (AST) derived from the code. Its performance was compared with those of machine learning algorithms such as Random Forest (RF), Support Vector Machines (SVM), and Extreme Gradient Boosting (XGBoost).
Keywords are used to provide a concise summary of the text, enabling the quick understanding of core information and assisting in filtering out irrelevant content. In this paper, an improved TextRank keyword extraction algorithm based on word vectors and multi-feature weighting (IWF-TextRank) is proposed to improve the accuracy of keyword extraction by comprehensively considering multiple features of words. The key innovation is demonstrated through the application of a backpropagation neural network, combined with sequential relationship analysis, to calculate the comprehensive weight of words. Additionally, word vectors trained using Word2Vec are utilised to enhance the model’s semantic understanding. Finally, the effectiveness of the algorithm is verified from various aspects using traffic accident causation data. The results show that this algorithm demonstrates a significant optimisation effect in keyword extraction. Compared with the traditional model, the IWF-TextRank algorithm shows significant improvement in precision (P-value), recall (R-value), and F-value.
The availability of mobile apps for digital library systems serves the needs of library visitors and allows users to give reviews of their experience. The summarized user experience can be an insightful input for the development of the mobile app. However, the large number of reviews will take a long time to read and summarize. Thus, a technique should be developed that can provide summarization quickly. Text summarization is a natural language processing technique for extracting information and producing simplified versions of texts, and an example of a popular algorithm is TextRank. In some cases, the algorithm is not optimal without proper feature extraction for identifying sentence rank. The purpose of this study is to provide a text summary from the text reviews using a combination of the TextRank algorithm and Term Frequency-Inverse Document Frequency (TF-IDF). In addition, this study also analyzes feature extraction techniques in presenting the summary. These methods are evaluated using ROUGE-1 and ROUGE-2. As a result, the top 10 reviews with the highest sentence rank scores were extracted for summarization. Besides, TF-IDF contributes better than Bag of Words in presenting text summaries, where it achieved a score of 0.6014 ROUGE-1 and 0.6173 ROUGE-2.
In natural language processing, text summarization is crucial for applications like information retrieval, content generation, resource optimization, and legal and academic research. It creates a concise version of the original text without omitting crucial information, thereby facilitating the efficient understanding and processing of large data volumes. Researchers are increasingly focusing on developing more effective summarization techniques. This paper reviews existing text summarization methods, including the TextRank algorithm, fuzzy logic, Latent Semantic Analysis, and deep learning techniques. We have implemented TextRank, LexRank, and the BART model on the news summary dataset. For evaluation, we utilized ROUGE-1, ROUGE-2, and ROUGE-L metrics, considering precision, recall, and F-measure.
It is important to extract keywords from text quickly and accurately for composition analysis, but the accuracy of traditional keyword acquisition models is not high. Therefore, in this study, the Best Match 25 algorithm was first used to preprocess the compositions and evaluate the similarity between sentences. Then, TextRank was used to extract the abstract and construct the segmentation and named entity models, and finally verify the research content. The results show that in the performance test, the Best Match 25 similarity algorithm has higher accuracy, recall rate and F1 value, the average running time is only 2182 ms, and it has the largest area under the receiver operating characteristic curve, significantly higher than other models, reaching 0.954. The accuracy of the TextRank algorithm is above 90%, the average accuracy over 100 text analyses is 94.23%, and the average recall rate and F1 value are 96.67% and 95.85%, respectively. In the comparison of the application of the four methods, the research model shows obvious advantages: the average keyword coverage rate is 94.54%, the average processing time for 16 texts is 11.29 seconds, and the average 24-hour memory usage is only 15.67%, which is lower than the other three methods. The experimental results confirm the superiority of the model in terms of keyword extraction accuracy. This research not only provides a new technical tool for language composition teaching and evaluation, but also provides a new idea and method for keyword extraction research in the field of natural language processing.
Aiming at the problem that the text input length of the BERT model is limited to 512 characters, and the number of words in long text is large, it is impossible to input long text directly into the BERT model to better obtain the meaning of words, this paper proposes a new long text similarity calculation method based on TextRank and BERT. Firstly, the method uses the TextRank algorithm to extract a certain number of key words and sentences from the long text, to reduce the length of the text and retain the most important semantic information as far as possible. Then, the vector representation of the extracted sentence information is obtained through the BERT model, and it is input into the network layer of Bi-LSTM and attention mechanism to further extract features. Finally, the feature representation of the two pieces of text is obtained and the similarity is calculated to determine whether it matches. Experiments show that the proposed method can effectively solve the problem that the number of words in long text cannot be input into the BERT model, and effectively reduce the amount of calculation of the model, and the model can also better extract deep semantic information. Experiments on the public Chinese long text news datasets CNSE (Chinese News Same Event) and CNSS (Chinese News Same Story) achieve 82.86% and 87.91% accuracy respectively.
Currently, strategies for generating multiple choice questions (MCQ) seldom take the analysis of semantic and syntactic dependency features into consideration. A Chinese MCQ generation method is proposed based on the improved TextRank algorithm with semantic similarity and dependency relatedness to extract keywords, primarily entities, as knowledge points for MCQs. Verb weight is introduced to improve the accuracy of initial weight in keyword extraction to obtain knowledge points for MCQs more precisely in texts. Synonyms based on Word2Vec are used to generate distractors for MCQs, which are filtered to ensure that each distractor refers to a different entity. Experiments show that compared to human generated questions, the accuracy of identification is 59.5%, and the F1 value is 0.58. The aspect of keyword extraction in the cloze questions generation task evaluation metrics shows some improvement. The calculated question difficulty exhibits a strong negative correlation with answer accuracy.
Text summarization is a crucial natural language processing (NLP) task that facilitates efficient information retrieval and comprehension by condensing large volumes of information into informative and concise summaries. Methods of extractive and abstractive summarization have evolved significantly with the advent of large language models (LLMs). Abstractive summarization is investigated in this paper using state-of-the-art LLMs such as PEGASUS, DistilBART, BART, Microsoft ProphetNet, GPT-2, and GPT. Development was done based on the CNN/Daily Mail corpus, a standard collection widely known for having an extensive quantity of news articles and human-annotated summaries. Sequence-based measures of similarity and ROUGE scores, emphasizing unigram and bigram overlaps, were utilized as measures of the quality of the summarization. The potentialities and limitations of LLMs are highlighted within this paper, which further looks into present methodology and research within abstractive summarization. The findings emphasize the need for further research in this field by demonstrating that LLMs can generate abstractive summaries that are both brief and semantically relevant.
This research examines the effectiveness of OpenAI's GPT models as independent evaluators of text summaries generated by six transformer-based models from Hugging Face: DistilBART, BERT, ProphetNet, T5, BART, and PEGASUS. We evaluated these summaries based on the essential properties of a high-quality summary: concision, relevance, coherence, and readability using traditional metrics such as ROUGE and Latent Semantic Analysis (LSA). Uniquely, we employ GPT not as a summarizer, but as an evaluator, allowing it to independently assess summary quality without predefined metrics. Our analysis revealed significant correlations between GPT evaluations and traditional metrics, particularly in assessing relevance and coherence. The results demonstrate the potential of GPT as a robust tool for evaluating text summaries, providing insights that complement established metrics while providing a basis for comparative analysis of transformer-based models in natural language processing tasks.
Recent advancements in large language models (LLMs) have led to notable improvements in abstractive summarization quality. However, hallucination - especially entity-level hallucination where non-existent or incorrect entities are introduced - remains a critical challenge. In this work, we propose a reward-driven fine-tuning framework for summarization models using the Entity Hallucination Index (EHI) as a guiding metric. The methodology here begins with generating initial summaries from pre-trained models such as Flan-T5, DistilBART, and Mistral (or other popular LLM) on structured transcript datasets, XSUM. We compute EHI by extracting named entities from both generated summaries and gold references, evaluating precision, and penalizing fabricated entities. The fine-tuning process is guided by reinforcement learning, where EHI serves as the reward signal. We adopt a REINFORCE-style update mechanism to optimize the summarization model towards maximizing entity faithfulness. Experiments demonstrate that models fine-tuned with EHI achieve lower hallucination rates without compromising informativeness. Furthermore, we show that EHI-guided models generalize better on out-of-domain summarization tasks, suggesting enhanced robustness. The approach here offers a practical direction for improving factuality in summarization, emphasizing the critical role of accurate entity representation.
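The EHI-based reward described above hinges on checking whether the entities in a generated summary are supported by the source. Purely as an illustrative stand-in (not the cited paper's implementation or exact metric definition), an entity-precision score can be computed with spaCy as follows; it assumes the `en_core_web_sm` model is installed.

```python
import spacy

nlp = spacy.load("en_core_web_sm")   # assumes the small English NER model is available

def entity_precision(summary: str, source: str) -> float:
    """Fraction of named entities in the summary that also appear in the source text."""
    summary_ents = {ent.text.lower() for ent in nlp(summary).ents}
    if not summary_ents:
        return 1.0                   # nothing asserted, so nothing hallucinated
    source_text = source.lower()
    supported = sum(1 for ent in summary_ents if ent in source_text)
    return supported / len(summary_ents)

# A reward such as entity_precision(generated_summary, source_document) can then drive
# a REINFORCE-style update of the summarizer, as the abstract describes.
```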
In order to improve educational content accessibility for a variety of audiences, this study proposes a system for summarizing, translating, and turning NPTEL lecture transcripts into audio in regional Indian languages. Transcripts are preprocessed, succinct summaries are produced using sophisticated models like DistilBART, BART, T5, Google Pegasus, and BERT-sum, and they are assessed using BLEU and ROUGE metrics. DistilBART outperformed the others with the highest BLEU score (45.13), ROUGE-1 F1 score (74.31), and ROUGE-2 F1 score (68.29), thereby showing competency in summarizing educational content. Technical words were kept in English for consistency while summaries were translated into Tamil, Telugu, Malayalam, Kannada, and Hindi using the Google API. Google Text-to-Speech (TTS) was used to turn the translated summaries into audio, resulting in outputs that were clear and natural and evaluated by native speakers and subject matter experts. By bridging linguistic and technological divides, this scalable multi-modal solution makes excellent educational materials accessible to audiences that do not speak English.
This study proposes a machine learning-based approach for automatic summarization of scientific documents using a fine-tuned DistilBART model a lightweight and efficient version of the bidirectional and auto-regressive transformers (BART) architecture. The model was trained on a large corpus of 12,540 scientific articles (2015–2023) collected from the arXiv repository, enabling it to effectively capture domain-specific terminology and structural patterns. The proposed pipeline integrates advanced text preprocessing techniques, including tokenization, stopword removal, and stemming, to enhance the quality of semantic representation. Experimental evaluation demonstrates that the fine-tuned DistilBART achieves high summarization performance, with ROUGE-2=0.472 and ROUGE-L=0.602, outperforming baseline transformer-based models. Unlike conventional approaches, the method shows strong applicability beyond academic research, including automated indexing of technical documentation, metadata extraction in digital libraries, and real-time text processing in embedded natural language processing (NLP) systems. The results highlight the potential of transformer-based summarization to accelerate scientific knowledge discovery and improve the efficiency of information retrieval across various domains.
The spread of video content on websites such as YouTube has created an urgent demand for effective applications that process and summarize long videos. In this paper, the author introduces a web application, the YouTube video summarizer, which uses AI to automate transcription, summarization, and translation of YouTube video content. It is developed with Streamlit serving as the frontend and is based on a multi-model AI pipeline: OpenAI Whisper for speech-to-text transcription, a DistilBART model for abstractive text summarization, and Facebook's NLLB-200 for multilingual translation. The app takes a YouTube URL, downloads the audio, and creates a summary, which can optionally be translated into various languages, all within a user-friendly interface. We evaluate the functional accuracy of the system, its processing latency, and the quality of its output. Findings of a user study show that the time to extract important information from a video is significantly reduced, and that the generated summaries are coherent and relevant. The system demonstrates the usefulness of an integrated AI pipeline for automating content digestion, making video information more accessible and actionable. We discuss the system architecture, the issues faced during implementation, and further improvements that could make it more scalable and multimodal.
With the growing popularity of WhatsApp group chats, especially in Sri Lanka, users increasingly face challenges of information overload, leading to missed or unread important messages. While solutions exist for summarizing English-typed messages, there has been no significant attempt to summarize Singlish, a unique typing style where Sinhala words are written using the English alphabet. This research aims to address this gap by developing a Natural Language Processing (NLP)-based system to automatically summarize Singlish-typed WhatsApp group chats over 24 hours. Using exported chat data without media attachments, a customized data pre-processing pipeline was developed to clean, tokenize, and extract keywords from the chats. Two popular summarization models, facebook/bart-base and sshleifer/distilbart-cnn-12-6, were employed to generate concise summaries, which were then distributed to users via email. The system was evaluated through information retrieval metrics and human assessments to ensure relevance and quality. The study highlights the challenges of processing Singlish due to its informal variations and lack of language resources and sets a foundation for future improvements in chat summarization for low-resource languages. The developed solution not only enhances user productivity but also contributes to the broader field of localized NLP research.
In today’s world, people frequently find themselves reluctant to read lengthy news articles, making news summarization a crucial tool for efficient information consumption. This study investigates the use of Large Language Models (LLMs), specifically BART and DistilBART, to generate abstractive news summaries based on the CNN Daily Mail News dataset, which includes text and highlight data. To address the limitation of highlights that often only capture parts of sentences, keyword extraction techniques such as RAKE, YAKE, and KeyBERT were incorporated to guide the summarization method, improving the quality of the generated summaries. The results demonstrate that the BART and DistilBART models, when combined with keyword extraction methods, produce more balanced and coherent summaries compared to the gold summaries. The BART model combined with RAKE obtained the greatest ROUGE-1 rating of 0.3733, while the DistilBART model with KeyBERT(paraphrase-MiniLM-L6-v2) reached a ROUGE-1 score of 0.3722, showcasing the effectiveness of keyword extraction techniques in improving news summarization performance but with just the right tuned hyperparameters.
In the era of digital commerce, online reviews play a vital role in shaping consumer decisions and perceptions, particularly on major e-commerce platforms like Tokopedia and Shopee in Indonesia. However, despite their popularity, the overwhelming volume of user-submitted reviews can make it difficult for consumers to quickly identify relevant insights. This research addresses this issue by focusing on abstractive text summarization of user reviews, specifically using T5-small and DistilBART CNN 12-6 models. The methodology includes data collection from the Google Play Store, followed by preprocessing, manual summarization, and model training. The models are evaluated based on ROUGE metrics (ROUGE-1, ROUGE-2, and ROUGE-L) to assess their performance in creating coherent, human-like summaries. Results show that DistilBART CNN 12-6 outperforms T5-small in all evaluation metrics, with fine-tuning and hyperparameter adjustments enhancing both models. Stopword removal slightly improved DistilBART’s performance while having minimal impact on T5. The findings provide a foundation for developing summarization techniques that enhance the accessibility of user feedback, helping consumers make informed decisions while supporting application improvements
In today’s fast-paced digital world, cybersecurity incidents often occur without warning and during times when individuals are unavailable to monitor them, such as while they are asleep. The ability to comprehend such incidents quickly and accurately is crucial for mitigating risks and implementing timely responses. This study leverages advancements in Natural Language Processing (NLP) to compare fine-tuned models, such as DistilBART and T5 trained on news datasets, against general-purpose models like LLaMA and Gemma. The goal is to evaluate their effectiveness in summarizing cybersecurity news articles with a focus on ensuring accuracy and relevance for rapid situational awareness. The models are assessed using a comprehensive set of metrics, including ROUGE-1, ROUGE-2, ROUGE-L, BERTScore Precision, BERTScore Recall and BERTScore F1. Beyond summarization, clustering methods are explored for extracting keywords from articles to identify patterns and insights within the text. The study highlights the comparative strengths and weaknesses of domain-specific versus general-purpose models, providing critical insights into their performance. The findings highlight the importance of tailored training in achieving high-quality, context-aware summaries for cybersecurity applications.
This paper presents a comparative performance analysis of state-of-the-art Transformer-based abstractive summarization models applied to lecture transcript data from the National Programme on Technology Enhanced Learning (NPTEL). Unlike news or encyclopedic text, NPTEL transcripts contain spoken-language artifacts, domain-specific technical terms, and long context spans, which make summarization significantly more challenging. Four models—DistilBART, BART-Large-CNN, T5-Large, and FLAN-T5-Base—were benchmarked under identical conditions. In addition to conventional evaluation, we introduce a multi-dimensional assessment framework that measures summarization time, redundancy reduction, lexical richness, and TF-IDF keyword coverage, providing a more holistic evaluation than ROUGE alone. Experimental results show that BART-Large-CNN and T5-Large excel in technical coverage and richness, while DistilBART achieves low-latency inference suitable for resource-constrained or real-time deployments. A Streamlit-based deployment demonstrates practical integration for e-learning platforms. To ensure a controlled and reproducible evaluation, we analyze a representative subset drawn from a single NPTEL course (Digital Circuits and Systems) rather than the entire corpus, enabling careful chunk-level diagnostics without cross-course confounds.
In the rapidly evolving domain of Natural Language Processing (NLP), the efficiency of Large Language Models (LLMs) in generating abstractive text summaries plays a pivotal role in information synthesis. This study advances the understanding of LLM performance by conducting a comprehensive evaluation of seven cutting-edge models on four distinct datasets. The models selected for comparison include Distilbart-cnn-12-6, Led-base-16384, Google’s Bigbird, Microsoft’s ProphetNet, Facebook’s BART, T5 fine-tuned, and Google’s PEGASUS. Each model’s summarization prowess is rigorously assessed using a battery of metrics: ROUGE, METEOR, BERTScore, Cosine Similarity, and BLEU. The goal is to discern the intricate relationship between dataset characteristics and model efficacy, delivering insights into the inherent advantages and limitations of each model in handling specific data contexts. The results contribute to a refined understanding of LLM applicability, offering empirical evidence to aid in the selection of the most suitable model for varied summarization needs.
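The metric battery mentioned above is straightforward to reproduce on a toy pair of texts. The snippet below computes ROUGE-1/2/L and BERTScore F1 using the rouge-score and bert-score packages; the choice of these particular implementations and the toy sentences are assumptions, since the study does not name its tooling.

```python
# Minimal multi-metric evaluation sketch (assumed tooling: rouge-score and bert-score).
from rouge_score import rouge_scorer
from bert_score import score as bertscore

reference = "The council approved the new budget on Tuesday."
candidate = "The new budget was approved by the council."

# ROUGE-1/2/L F-measures.
scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)
rouge = scorer.score(reference, candidate)
print({k: round(v.fmeasure, 4) for k, v in rouge.items()})

# BERTScore precision/recall/F1 (tensors of length 1 for a single pair).
P, R, F1 = bertscore([candidate], [reference], lang="en")
print("BERTScore F1:", round(F1.item(), 4))
```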
This research introduces a summarization app using Flask and the Hugging Face API. It summarizes audio, PDFs, and text efficiently. It features robust audio summarization by converting speech to text and then summarizing it, and it also supports PDF and text summarization. Pre-trained NLP models like DistilBART offer accurate and concise summaries. The user interface is intuitive, allowing seamless interaction. Overall, it's a comprehensive tool for extracting insights from various content sources, enhancing productivity.
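As an illustration of the app's text-summarization path, a minimal Flask endpoint wrapping a DistilBART pipeline could look like the sketch below. The route name, local inference (rather than the hosted Hugging Face API), and the distilbart-cnn-12-6 checkpoint are assumptions for the sketch, not the app's actual implementation.

```python
# Hypothetical Flask endpoint backed by a local DistilBART summarization pipeline.
from flask import Flask, request, jsonify
from transformers import pipeline

app = Flask(__name__)
summarizer = pipeline("summarization", model="sshleifer/distilbart-cnn-12-6")

@app.route("/summarize", methods=["POST"])
def summarize():
    # Expects JSON of the form {"text": "..."} and returns the generated summary.
    text = request.get_json(force=True).get("text", "")
    result = summarizer(text, max_length=120, min_length=30, do_sample=False)
    return jsonify({"summary": result[0]["summary_text"]})

if __name__ == "__main__":
    app.run(debug=True)
```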
The amount of textual data on the internet is increasing enormously, so data summarization into text has become essential. As generating text summaries manually is an arduous task and humans are generally prone to make mistakes, deep learning techniques have evolved to overcome this problem. Modified transformer-based deep learning models with varying encoder-decoder and feed-forward network layers are proposed to develop an abstractive summary of the news articles. The proposed transformer model provides the advantage of parallelization with the help of multiple attention head layers to process long sentences, and hence, better text summarization performance is achieved. These models are trained on an “in-shorts” dataset, and the proposed model is compared with the PEGASUS-CNNdaily-mail, BART-large-CNN, and DistilBART-CNN-12-6 models on the CNN/DailyMail dataset. The performance is evaluated in terms of the ROUGE score by comparing it with the existing Recurrent Neural Network (RNN) model. The suggested transformer model achieved a ROUGE score of 0.33, surpassing the RNN model score of 0.17. This innovative approach can be employed on extensive textual data to extract summaries or headlines.
The BART model is an advanced adaptation of transformers introduced by Facebook. It has incorporated elements from both BERT and GPT transformers, enabling significant advancements in language understanding and general speech processing. Utilizing both encoder and decoder components, BART proves versatile for various tasks, including translation, text completion, automatic sentence generation, entity recognition, sentiment analysis, and more. In this study, we focus on the pretrained models BART and a modified version called DistilBART in the context of Zero-Shot Text Classification. In the experimental study we dive into the Zero-Shot technique applied to various pretrained Transformers. Our analysis demonstrates that, depending on the model we utilize, we can achieve F1 scores of up to 88%, showcasing the model's effectiveness in discerning classes for this Sentiment Analysis problem using the Zero-Shot Text Classification technique.
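A compact way to reproduce the zero-shot setup described above is the transformers zero-shot classification pipeline, which reframes classification as natural language inference with a BART model. The facebook/bart-large-mnli checkpoint, the example review, and the candidate labels below are illustrative assumptions; the study's exact checkpoints are not specified here.

```python
# Zero-shot sentiment classification sketch with a BART-MNLI checkpoint (assumed choice).
from transformers import pipeline

classifier = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")

result = classifier(
    "The battery died after two days and the screen scratches easily.",
    candidate_labels=["positive", "negative", "neutral"],
)
# The pipeline returns labels sorted by score; print the top prediction.
print(result["labels"][0], round(result["scores"][0], 3))
```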
Comparative Analysis of Pretrained Encoder-Decoder Transformer Models for Extreme Text Summarization
Text summarization plays a pivotal role in condensing crucial information from huge volumes of text. This study investigates the utilization of pre-trained transformer models within the domain of text summarization, with a specific emphasis on extreme summarization. It delves into the effectiveness of two prominent models, text-to-text transfer transformers and bidirectional and auto-regressive transformers, when applied to the task of summarization, providing a comparative analysis of their capabilities. The experiments conducted in this study involve the utilization of the XSum and SciTLDR datasets. While fine-tuning T5 and BART on summarization tasks is a standard approach, we delve into the performance of these models without fine-tuning. Additionally, we explore the potential of other pretrained models, such as PEGASUS and DistilBART, in generating concise and coherent summaries. This study contributes to the understanding of how pre-trained transformer models can be harnessed effectively for text summarization, especially in extreme summarization scenarios. The findings shed light on the performance, challenges, and potential of these models, opening avenues for further research in the field of automatic text summarization.
Abstractive text summarization has recently become a popular approach, but data hallucination remains a serious problem, including with quantitative data. We propose a set of probing tests to evaluate the efficacy of abstract summarization models’ modeling of quantitative values found in the input text. Our results show that in most cases, the encoders of recent SOTA-performing models struggle to provide embeddings that adequately represent quantitative values in the input compared to baselines, and in particular, they outperform random representations in some, but surprisingly not all, cases. Under our assumptions, this suggests that the encoder’s performance contributes to the quantity hallucination problem. One model type in particular, DistilBART-CDM, was observed to underperform randomly initialized representations for several experiments, and performance versus BERT suggests that standard pretraining and fine-tuning approaches for the summarization task may play a role in underperformance for some encoders.
In healthcare, thousands of safety incidents occur every year, but learning from these incidents is not effectively aggregated. Analysing incident reports using AI could uncover critical insights to prevent harm by identifying recurring patterns and contributing factors. To aggregate and extract valuable information, natural language processing (NLP) and machine learning techniques can be employed to summarise and mine unstructured data, potentially surfacing systemic issues and priority areas for improvement. This paper presents I-SIRch:CS, a framework designed to facilitate the aggregation and analysis of safety incident reports while ensuring traceability throughout the process. The framework integrates concept annotation using the Safety Intelligence Research (SIRch) taxonomy with clustering, summarisation, and analysis capabilities. Utilising a dataset of 188 anonymised maternity investigation reports annotated with 27 SIRch human factors concepts, I-SIRch:CS groups the annotated sentences into clusters using sentence embeddings and k-means clustering, maintaining traceability via file and sentence IDs. Summaries are generated for each cluster using offline state-of-the-art abstractive summarisation models (BART, DistilBART, T5), which are evaluated and compared using metrics assessing summary quality attributes. The generated summaries are linked back to the original file and sentence IDs, ensuring traceability and allowing for verification of the summarised information. Results demonstrate BART's strengths in creating informative and concise summaries.
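The cluster-then-summarise pattern in I-SIRch:CS can be sketched roughly as follows: sentence embeddings are grouped with k-means and each cluster is summarised by an abstractive model. The embedding model, number of clusters, and example sentences below are assumptions for illustration, not the framework's actual configuration.

```python
# Rough cluster-then-summarise sketch (models, k, and sentences are illustrative assumptions).
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans
from transformers import pipeline

sentences = [
    "Staff shortage delayed the escalation of care.",
    "Communication between teams was incomplete during handover.",
    "The escalation policy was not followed overnight.",
]

# Embed the annotated sentences and group them with k-means.
embedder = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = embedder.encode(sentences)
k = 2
labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(embeddings)

# Summarise each cluster with an abstractive model.
summarizer = pipeline("summarization", model="sshleifer/distilbart-cnn-12-6")
for c in range(k):
    cluster_text = " ".join(s for s, l in zip(sentences, labels) if l == c)
    print(c, summarizer(cluster_text, max_length=40, min_length=5)[0]["summary_text"])
```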
Clustering is a fundamental task in network analysis, essential for uncovering hidden structures within complex systems. Edge clustering, which focuses on relationships between nodes rather than the nodes themselves, has gained increased attention in recent years. However, existing edge clustering algorithms often overlook the significance of edge weights, which can represent the strength or capacity of connections, and fail to account for noisy edges, i.e., connections that obscure the true structure of the network. To address these challenges, the Weighted Edge Clustering Adjusting for Noise (WECAN) model is introduced. This novel algorithm integrates edge weights into the clustering process and includes a noise component that filters out spurious edges. WECAN offers a data-driven approach to distinguishing between meaningful and noisy edges, avoiding the arbitrary thresholding commonly used in network analysis. Its effectiveness is demonstrated through simulation studies and applications to real-world datasets, showing significant improvements over traditional clustering methods. Additionally, the R package “WECAN” has been developed to facilitate its practical implementation.
Synchronization Acceleration of Networked Systems via Edge Addition to Single-Root Weighted Digraphs
Distributed networked systems with Laplacian dynamics play a crucial role in various fields, such as engineering, biology, systems science, social science, and physics. To optimize the synchronization performance of such networks, topological adjustments have been identified as effective and efficient means. This article establishes a theoretical framework for accelerating synchronization convergence by introducing an additional directed edge or increasing an edge weight in a network topology. The convergence speed is quantified by the second smallest real part of the Laplacian matrix eigenvalues, called the Fiedler eigenvalue. We develop a necessary and sufficient condition guaranteeing synchronization acceleration with an additional edge in single-root digraphs, enabling all such edges to be efficiently screened out through computation of the graph Laplacian eigenspace. Moreover, for a sufficiently small topological variation, the accelerating extent and the optimal edge can be approximately estimated as well. Numerical examples demonstrate the effectiveness of the proposed method for identifying accelerating edges.
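As a numerical companion to the abstract above, the sketch below computes the Fiedler eigenvalue (the second smallest real part of the Laplacian spectrum) of a small weighted digraph before and after adding a directed edge. The Laplacian convention (in-degree Laplacian L = D_in - A^T) and the toy graph are assumptions for illustration only.

```python
# Toy check of synchronization acceleration via edge addition (assumed Laplacian convention).
import numpy as np
import networkx as nx

def fiedler_real_part(G):
    A = nx.to_numpy_array(G, weight="weight")   # A[i, j] = weight of edge i -> j
    L = np.diag(A.sum(axis=0)) - A.T             # in-degree Laplacian, one common convention
    eig = np.sort(np.linalg.eigvals(L).real)
    return eig[1]                                # second smallest real part

G = nx.DiGraph()
G.add_weighted_edges_from([(0, 1, 1.0), (1, 2, 1.0), (2, 3, 0.5)])  # single-root chain
before = fiedler_real_part(G)

G.add_edge(0, 3, weight=0.8)                     # candidate directed edge addition
after = fiedler_real_part(G)
print(f"Fiedler eigenvalue: {before:.3f} -> {after:.3f}")  # larger value = faster convergence
```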
Presently, users usually register accounts on online social networks (OSNs). Identifying the same user in different networks is also known as interlayer link prediction. Most existing interlayer link prediction studies use embedding methods, which represent nodes in a common representation space by learning mapping functions. However, these studies often directly model links within the pre-embedding layer as equal weights, fail to effectively distinguish the strength of edge relationships, and do not fully utilize network topology information. In this paper, we propose an interlayer link prediction model based on weighted embedding of connected edges within the network layer, which models the links within the network layer as weighted graphs to better represent the network and then uses appropriate embedding methods to represent the network in a low-dimensional space. After embedding, vector similarity and distance similarity are used as comprehensive evaluation scores. This paper has conducted a large number of simulation experiments on actual networks. The results show that our proposed model has higher prediction accuracy in all aspects than current advanced models and can achieve the highest accuracy when the training frequency is low, which proves the validity of the proposed model.
Visual crowd counting has gained serious attention during the last couple of years. The consistent contributions to this topic have now solved several inherited challenges such as scale variations, occlusions, and cross-scene applications. However, these works attempt to improve accuracy and often ignore model size and computational complexity. Several practical applications employ resource-limited stand-alone devices like drones to run crowd models and require real-time inference. Though there have been some good efforts to develop lightweight shallow crowd models offering fast inference time, the relevant literature dedicated to lightweight crowd counting is limited. One possible reason is that lightweight deep-learning models suffer from accuracy degradation in complex scenes due to limited generalization capabilities. This paper addresses this important problem by proposing knowledge distillation to improve the learning capability of lightweight crowd models. Knowledge distillation enables lightweight models to emulate deeper models by distilling the knowledge learned by the deeper model during the training process. The paper presents a detailed experimental analysis with three lightweight crowd models over six benchmark datasets. The results report a clear significance of the proposed method supported by several ablation studies.
The minimum cost multicut problem is the NP-hard/APX-hard combinatorial optimization problem of partitioning a real-valued edge-weighted graph such as to minimize the total cost of the partition. While graph convolutional neural networks (GNN) have proven to be promising in the context of combinatorial optimization, most of them are only tailored to or tested on positive-valued edge weights, i.e. they do not comply to the nature of the multicut problem. We therefore adapt various GNN architectures including Graph Convolutional Networks, Signed Graph Convolutional Networks and Graph Isomorphic Networks to facilitate the efficient encoding of real-valued edge costs. Moreover, we employ a reformulation of the multicut ILP constraints to a polynomial program as loss function that allows to learn feasible multicut solutions in a scalable way. Thus, we provide the first approach towards end-to-end trainable multicuts. Our findings support that GNN approaches can produce good solutions in practice while providing lower computation times and largely improved scalability compared to LP solvers and optimized heuristics, especially when considering large instances.
Cancer is a heterogeneous disease in which tumor genes cooperate as well as adapt and evolve to the changing conditions of individual patients. Discovering personalized cancer driver genes that can inform diagnosis and targeted drugs for individual patients is a meaningful task. However, most existing methods mainly rank potential personalized cancer driver genes by considering only patient-specific node information on the gene/protein interaction network. These methods ignore the personalized edge weight information in the gene interaction network, leading to false positive results. In this work, we present a novel algorithm (called PDGPCS) to predict Personalized cancer Driver Genes based on the Prize-Collecting Steiner tree model by considering personalized edge weight information. PDGPCS first constructs the personalized weighted gene interaction network by integrating the personalized gene expression data and prior known gene/protein interaction network knowledge. Then the gene mutation data and pathway data are integrated to quantify the impact of each mutant gene on every dysregulated pathway with the prize-collecting Steiner tree model. Finally, according to each mutant gene’s aggregated impact score on all dysregulated pathways, the mutant genes are ranked to prioritize the personalized cancer driver genes. Experimental results on four TCGA cancer datasets show that PDGPCS has better performance than other personalized driver gene prediction methods. In addition, we verified that the personalized edge weights of the gene interaction network can improve prediction performance. PDGPCS can more accurately identify personalized driver genes and takes a step further toward personalized medicine and treatment. The source code of PDGPCS can be freely downloaded from https://github.com/NWPU-903PR/PDGPCS.
No abstract available
Simple Summary: Infectious diseases have been part of human history. Countless epidemics have produced high mortality rates in vulnerable populations. With the understanding of the spread of these types of diseases, population groups have been able to adapt and better cope with infections. Given the COVID-19 pandemic, one of the strategies used is the modeling of infectious diseases with the aim of establishing protection measures for people and stopping the spread of the epidemic. Our study evaluates protection strategies through infectious disease modeling with COVID-19 data in a commune in Chile. The results of the simulations indicate that the model generates important protection for the population by recognizing the super-propagating people (bridge nodes). This type of protection can be key in the fight against COVID-19. Abstract: Among the diverse and important applications that networks currently have is the modeling of infectious diseases. Immunization, or the process of protecting nodes in the network, plays a key role in stopping diseases from spreading. Hence the importance of having tools or strategies that allow the solving of this challenge. In this paper, we evaluate the effectiveness of the DIL-Wα ranking in immunizing nodes in an edge-weighted network with 3866 nodes and 6,841,470 edges. The network is obtained from a real database and the spread of COVID-19 was modeled with the classic SIR model. We apply the protection to the network, according to the importance ranking list produced by DIL-Wα, considering different protection budgets. Furthermore, we consider three different values for α; in this way, we compare how the protection performs according to the value of α.
No abstract available
Let G be a graph, let d_i denote the degree of a vertex v_i in G, and let f(x, y) be a real symmetric function. Then one obtains an edge-weighted graph by assigning each edge v_i v_j of G the weight f(d_i, d_j). Hence, we have a weighted adjacency matrix A_f(G) of G, in which the ij-entry equals f(d_i, d_j) if v_i v_j ∈ E(G) and 0 otherwise. In this paper, we obtain uniform interlacing inequalities for the weighted adjacency eigenvalues under several kinds of graph operations, including edge subdivision, vertex deletion and vertex contraction. In addition, if f(x, y) is increasing in the variable x, examples are given to show that the interlacing inequalities are best possible for each type of operation. This paper attempts to unify the study of spectral properties of the weighted adjacency matrices of graphs with degree-based edge weights.
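The construction above is easy to verify numerically. The sketch below builds A_f(G) for a small path graph with an example symmetric function f(x, y) = (x + y) / 2 (the choice of f and of the graph is illustrative only) and prints its eigenvalues.

```python
# Degree-based weighted adjacency matrix A_f(G): entry (i, j) = f(d_i, d_j) on edges, 0 otherwise.
import numpy as np
import networkx as nx

def weighted_adjacency(G, f):
    nodes = list(G.nodes())
    deg = dict(G.degree())
    A_f = np.zeros((len(nodes), len(nodes)))
    for a, u in enumerate(nodes):
        for b, v in enumerate(nodes):
            if G.has_edge(u, v):
                A_f[a, b] = f(deg[u], deg[v])
    return A_f

G = nx.path_graph(4)                                    # degrees: 1, 2, 2, 1
A_f = weighted_adjacency(G, lambda x, y: (x + y) / 2)   # example symmetric f
print(np.sort(np.linalg.eigvalsh(A_f)))                 # weighted adjacency eigenvalues
```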
No abstract available
High-quality models across various natural language processing tasks, such as summarization and chatbots, often rely on large architectures, making them computationally intensive and challenging to deploy in resource-constrained environments. While knowledge distillation enables smaller student models to approximate the performance of larger teacher models, existing methods frequently encounter significant trade-offs between accuracy and efficiency. Additionally, uncertain predictions from teacher models can negatively impact the student’s learning process. In this paper, we introduce CAKD, a novel approach that optimizes the training of student models by selectively emphasizing the teacher model’s most reliable predictions using confidence scores. By integrating entropy-based confidence weighting into the distillation loss, CAKD effectively prioritizes high-confidence samples, resulting in improved performance and efficiency. Our experiments on text summarization (using a BART-based model on the CNN/DM dataset) and chatbot tasks (using a Llama-based model on the DailyDialog and PersonaChat datasets) demonstrate that CAKD achieves significant performance gains over larger teacher models, with improvements of 10.53, 2.1 and 0.38 ROUGE-L points respectively.
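A rough sketch of the entropy-based confidence weighting idea is given below: the per-sample KL divergence between teacher and student distributions is scaled by the teacher's confidence, defined here as one minus the normalized entropy of its prediction. The exact form of CAKD's weighting and its application to token-level summarization losses are not reproduced; this is an assumed simplification.

```python
# Hedged sketch of an entropy-weighted distillation loss (not CAKD's exact formulation).
import torch
import torch.nn.functional as F

def confidence_weighted_kd_loss(student_logits, teacher_logits, T=2.0):
    # Soft targets from the teacher and log-probabilities from the student.
    t_prob = F.softmax(teacher_logits / T, dim=-1)
    s_logprob = F.log_softmax(student_logits / T, dim=-1)

    # Per-sample KL divergence between teacher and student distributions.
    kl = F.kl_div(s_logprob, t_prob, reduction="none").sum(dim=-1)

    # Confidence = 1 - normalized entropy of the teacher distribution (assumed definition).
    entropy = -(t_prob * torch.log(t_prob.clamp_min(1e-12))).sum(dim=-1)
    max_entropy = torch.log(torch.tensor(float(teacher_logits.size(-1))))
    confidence = 1.0 - entropy / max_entropy

    return (confidence * kl).mean() * (T ** 2)

# Toy usage: a batch of 4 examples with 10 classes and random logits.
loss = confidence_weighted_kd_loss(torch.randn(4, 10), torch.randn(4, 10))
print(loss.item())
```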
Improving factual consistency in abstractive summarization has been a focus of recent research. One promising approach is the post-editing method. However, previous works have yet to make sufficient use of factual factors in summaries and suffer from the negative effect of the training datasets. In this paper, we first propose a novel factual error correction model FactCloze based on a conditional-generation cloze task. FactCloze can construct the causality among factual factors while being able to determine whether the blank can be answered. Then, we propose a data distillation method to generate a more faithful summarization dataset SummDSC via multiple-dimensional evaluation. We validate our method on both non-LLM and LLM-generated datasets. Besides BART and T5, we implement FactCloze using DeepSeek prompt. Finally, we examine the differences between LLM-based and traditional evaluation metrics for factual error correction.
Commonsense generation aims to generate a realistic sentence describing a daily scene under the given concepts, which is very challenging, since it requires models to have relational reasoning and compositional generalization capabilities. Previous work focuses on retrieving prototype sentences for the provided concepts to assist generation. They first use a sparse retriever to retrieve candidate sentences, then re-rank the candidates with a ranker. However, the candidates returned by their ranker may not be the most relevant sentences, since the ranker treats all candidates equally without considering their relevance to the reference sentences of the given concepts. Another problem is that re-ranking is very expensive, but only using retrievers will seriously degrade the performance of their generation models. To solve these problems, we propose the metric distillation rule to distill knowledge from the metric (e.g., BLEU) to the ranker. We further transfer the critical knowledge summarized by the distilled ranker to the retriever. In this way, the relevance scores of candidate sentences predicted by the ranker and retriever will be more consistent with their quality measured by the metric. Experimental results on the CommonGen benchmark verify the effectiveness of our proposed method: (1) Our generation model with the distilled ranker achieves a new state-of-the-art result. (2) Our generation model with the distilled retriever even surpasses the previous SOTA.
The integration of Large Language Models (LLMs) into explainable recommendation systems often leads to a performance-efficiency trade-off in end-to-end architectures, where joint optimization of ranking and explanation can result in suboptimal compromises. To resolve this, we propose Prism, a novel decoupled framework that rigorously separates the recommendation process into a dedicated ranking stage and an explanation generation stage. This decomposition ensures that each component is optimized for its specific objective, eliminating inherent conflicts in coupled models. Inspired by knowledge distillation, Prism leverages a powerful, instruction-following teacher LLM (FLAN-T5-XXL) as an Oracle to produce high-fidelity explanatory knowledge. A compact, fine-tuned student model (BART-Base), the Prism, then specializes in synthesizing this knowledge into personalized explanations. Our extensive experiments on benchmark datasets reveal a key finding: the distillation process not only transfers knowledge but also acts as a noise filter. Our 140M-parameter Prism model significantly outperforms its 11B-parameter teacher in human evaluations of faithfulness and personalization, demonstrating an emergent ability to correct hallucinations present in the teacher's outputs. While achieving a 24x speedup and a 10x reduction in memory consumption, our analysis validates that decoupling, coupled with targeted distillation, provides an efficient and effective pathway to high-quality, and perhaps more importantly, trustworthy explainable recommendation.
As demonstrated by GPT-3 and T5, transformers grow in capability as parameter spaces become larger and larger. However, for tasks that require a large amount of knowledge, non-parametric memory allows models to grow dramatically with a sub-linear increase in computational cost and GPU memory requirements. Recent models such as RAG and REALM have introduced retrieval into conditional generation. These models incorporate neural initial retrieval from a corpus of passages. We build on this line of research, proposing Re2G, which combines both neural initial retrieval and reranking into a BART-based sequence-to-sequence generation. Our reranking approach also permits merging retrieval results from sources with incomparable scores, enabling an ensemble of BM25 and neural initial retrieval. To train our system end-to-end, we introduce a novel variation of knowledge distillation to train the initial retrieval, reranker and generation using only ground truth on the target sequence output. We find large gains in four diverse tasks: zero-shot slot filling, question answering, fact checking and dialog, with relative gains of 9% to 34% over the previous state-of-the-art on the KILT leaderboard. We make our code available as open source.
The increasing size of generative Pre-trained Language Models (PLMs) has greatly increased the demand for model compression. Despite various methods to compress BERT or its variants, there are few attempts to compress generative PLMs, and the underlying difficulty remains unclear. In this paper, we compress generative PLMs by quantization. We find that previous quantization methods fail on generative tasks due to the homogeneous word embeddings caused by reduced capacity and the varied distribution of weights. Correspondingly, we propose a token-level contrastive distillation to learn distinguishable word embeddings, and a module-wise dynamic scaling to make quantizers adaptive to different modules. Empirical results on various tasks show that our proposed method outperforms the state-of-the-art compression methods on generative PLMs by a clear margin. With comparable performance to the full-precision models, we achieve 14.4x and 13.4x compression rates on GPT-2 and BART, respectively.
In an effort to reinforce both the interpretability and accuracy of prescription recommendations in Traditional Chinese Medicine (TCM), this study puts forward a two-stage training framework that integrates knowledge distillation from a teacher model with implicit preference-driven reinforcement learning grounded in a compact model. First, GPT-4o is employed to parse structured TCM clinical records, creating high-quality distillation samples. These are then used to guide Low-Rank Adaptation (LoRA)-based fine-tuning of the Qwen2.5-7B model, enabling it to generate explainable outputs in the format of "symptom analysis—prescription recommendation—prescription explanation". Then, a lightweight BART (Bidirectional and Auto-Regressive Transformers) model is trained to learn the mapping relation between symptoms and prescriptions. Its outputs are compared with those of the large model to construct preference pairs, which are subsequently utilized in Direct Preference Optimization (DPO)-based reinforcement tuning to further align the model with potentially better recommendations. The suggested model achieves a P@30 of 35.62% and F1@30 of 37.36%, outperforming existing baselines. Knowledge distillation contributes to the improvement of the model's generalization and explainability, while implicit preference-based reinforcement further enhances F1@30 by 2.01%. Overall, the model obtains more desirable performance in both accuracy and explainability. The recommended approach not only improves the quality and transparency of TCM prescription recommendations, but also offers a fruitful strategy for building trustworthy and clinically applicable intelligent TCM decision-support systems.
Pre-trained Transformer models like T5 and BART have advanced the state of the art on a wide range of text generation tasks. Compressing these models into smaller ones has become critically important for practical use. Common neural network compression techniques such as knowledge distillation or quantization are limited to static compression where the compression ratio is fixed. In this paper, we introduce Modular Transformers, a modularized encoder-decoder framework for flexible sequence-to-sequence model compression. Modular Transformers train modularized layers that have the same function of two or more consecutive layers in the original model via module replacing and knowledge distillation. After training, the modularized layers can be flexibly assembled into sequence-to-sequence models that meet different performance-efficiency trade-offs. Experimental results show that after a single training phase, by simply varying the assembling strategy, Modular Transformers can achieve flexible compression ratios from 1.1x to 6x with little to moderate relative performance drop.
We proposed a technique to reduce the decoder’s number of parameters in a sequence-to-sequence (seq2seq) architecture for automatic text summarization. This approach uses a pre-trained Autoencoder (AE) trained on top of an encoder’s output to reduce its embedding dimension, which significantly reduces the summarizer model’s decoder size. Two experiments were performed to validate the idea: a custom seq2seq architecture with various pre-trained encoders and incorporating the approach in an encoder-decoder model (BART) for text summarization. Both studies showed promising results in terms of ROUGE score. However, the impressive outcome is the 54% decrease in the inference time and a 57% drop in GPU memory usage while fine-tuning with minimal quality loss (4.5% R1 score). It significantly reduces the hardware requirement to fine-tune large-scale pre-trained models. It is also shown that our approach can be combined with other network size reduction techniques (e.g. Distillation) to further reduce any encoder-decoder model parameters count. The implementation and checkpoints are available on GitHub.
This paper focuses on task-agnostic prompt compression for better generalizability and efficiency. Considering the redundancy in natural language, existing approaches compress prompts by removing tokens or lexical units according to their information entropy obtained from a causal language model such as LLaMa-7B. The challenge is that information entropy may be a suboptimal compression metric: (i) it only leverages unidirectional context and may fail to capture all essential information needed for prompt compression; (ii) it is not aligned with the prompt compression objective. To address these issues, we propose a data distillation procedure to derive knowledge from an LLM to compress prompts without losing crucial information, and in the meantime introduce an extractive text compression dataset. We formulate prompt compression as a token classification problem to guarantee the faithfulness of the compressed prompt to the original one, and use a Transformer encoder as the base architecture to capture all essential information for prompt compression from the full bidirectional context. Our approach leads to lower latency by explicitly learning the compression objective with smaller models such as XLM-RoBERTa-large and mBERT. We evaluate our method on both in-domain and out-of-domain datasets, including MeetingBank, LongBench, ZeroScrolls, GSM8K, and BBH. Despite its small size, our model shows significant performance gains over strong baselines and demonstrates robust generalization ability across different LLMs. Additionally, our model is 3x-6x faster than existing prompt compression methods, while accelerating the end-to-end latency by 1.6x-2.9x with compression ratios of 2x-5x. Our code is available at https://aka.ms/LLMLingua-2.
Knowledge distillation involves transferring soft labels from a teacher to a student using a shared temperature-based softmax function. However, the assumption of a shared temperature between teacher and student implies a mandatory exact match between their logits in terms of logit range and variance. This side effect limits the performance of the student, considering the capacity discrepancy between them and the finding that the innate logit relations of the teacher are sufficient for the student to learn. To address this issue, we propose setting the temperature as the weighted standard deviation of the logits and performing a plug-and-play Z-score pre-process of logit standardization before applying softmax and Kullback-Leibler divergence. Our pre-process enables the student to focus on essential logit relations from the teacher rather than requiring a magnitude match, and can improve the performance of existing logit-based distillation methods. We also show a typical case where the conventional setting of sharing temperature between teacher and student cannot reliably yield an authentic distillation evaluation; nonetheless, this challenge is successfully alleviated by our Z-score. We extensively evaluate our method for various student and teacher models on CIFAR-100 and ImageNet, showing its significant superiority. The vanilla knowledge distillation powered by our pre-process can achieve favorable performance against state-of-the-art methods, and other distillation variants can obtain considerable gain with the assistance of our pre-process. The codes, pre-trained models and logs are released on GitHub.
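The sketch below illustrates the plug-and-play Z-score idea in isolation: logits are mean-centred and divided by their standard deviation (scaled by a base temperature) before softmax and KL divergence. The weighting of the standard deviation used in the paper is omitted, so treat this as a simplified assumption rather than the authors' exact loss.

```python
# Simplified Z-score logit standardization before the distillation KL loss.
import torch
import torch.nn.functional as F

def zscore(logits, base_temperature=2.0, eps=1e-7):
    # Standardize each logit vector: subtract the mean, divide by (scaled) standard deviation.
    mean = logits.mean(dim=-1, keepdim=True)
    std = logits.std(dim=-1, keepdim=True)
    return (logits - mean) / (base_temperature * (std + eps))

def standardized_kd_loss(student_logits, teacher_logits):
    t_prob = F.softmax(zscore(teacher_logits), dim=-1)
    s_logprob = F.log_softmax(zscore(student_logits), dim=-1)
    return F.kl_div(s_logprob, t_prob, reduction="batchmean")

# Toy usage on random logits for a batch of 4 examples with 100 classes.
loss = standardized_kd_loss(torch.randn(4, 100), torch.randn(4, 100))
print(loss.item())
```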
Automatic text summarization has become an essential solution for processing massive textual information, particularly in lengthy news articles. This study compares two variants of the TextRank algorithm using different weighting schemes: TF-IDF and Word2Vec, for summarizing Indonesian news texts. The dataset comprises 160 news articles from Kompas.com, which underwent preprocessing. Evaluation was conducted using ROUGE metrics (ROUGE-1, ROUGE-2, ROUGE-L), manual readability assessment, and execution runtime. The results indicate that TextRank with Word2Vec outperforms TF-IDF in both ROUGE scores (ROUGE-1 F1: 0.7033 vs 0.6454) and processing speed. These findings suggest that incorporating semantic representations into graph-based algorithms like TextRank significantly improves summary quality and runtime efficiency.
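For context, a minimal embedding-weighted TextRank pass might look like the sketch below: sentences are nodes, edges carry the cosine similarity of averaged Word2Vec vectors, and PageRank scores pick the summary sentences. The tiny English corpus, the Word2Vec model trained on it, and the similarity clamping are toy assumptions, not the paper's setup.

```python
# Toy TextRank sentence ranking with Word2Vec-based edge weights (assumptions noted above).
import numpy as np
import networkx as nx
from gensim.models import Word2Vec

sentences = [
    "the government announced a new economic policy",
    "the new policy lowers the price of basic goods",
    "the weather today is sunny across the region",
]
tokenized = [s.split() for s in sentences]

# Word vectors trained on the toy corpus itself (an assumption; real setups use larger corpora).
w2v = Word2Vec(tokenized, vector_size=50, min_count=1, seed=0)
vecs = np.array([np.mean([w2v.wv[w] for w in toks], axis=0) for toks in tokenized])

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

# Fully connected sentence graph with non-negative similarity weights.
G = nx.Graph()
for i in range(len(sentences)):
    for j in range(i + 1, len(sentences)):
        G.add_edge(i, j, weight=max(cosine(vecs[i], vecs[j]), 0.0))

scores = nx.pagerank(G, weight="weight")
top = max(scores, key=scores.get)
print("Top-ranked sentence:", sentences[top])
```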
When TextRank algorithm based on graph model constructs graph associative edges, the co-occurrence window rules only consider the relationships between local terms. Using the information in the document itself is limited. In order to solve the above problems, an improved TextRank keyword extraction algorithm based on rough data reasoning combined with word vector clustering, RDD-WRank, was proposed. Firstly, the algorithm uses rough data reasoning to mine the association between candidate keywords, expands the search scope, and makes the results more comprehensive. Then, based on Wikipedia online open knowledge base, word embedding technology is used to integrate Word2Vec into the improved algorithm, and the word vector of TextRank lexical graph nodes is clustered to adjust the voting importance of nodes in the cluster. Compared with the traditional TextRank algorithm and the Word2Vec algorithm combined with TextRank, the experimental results show that the improved algorithm has significantly improved the extraction accuracy, which proves that the idea of using rough data reasoning can effectively improve the performance of the algorithm to extract keywords.
With the rapid development of we-media information dissemination, the WeChat official accounts platform has become an important way for people to obtain health-related knowledge. However, the platform's information is redundant, miscellaneous, and overloaded. In order to meet users' increasingly accurate and efficient knowledge service needs, reorganizing and aggregating document knowledge resources is effective. Filtering information manually would inevitably incur huge labor and time costs, with little effect given the massive number of articles. This paper proposes a text summarization method for the WeChat platform based on improved TextRank that takes into account both user demands and sentence features during the summarization process. The data were crawled from the Sogou WeChat platform. The results show that the TextRank algorithm achieves a clear gain in the accuracy of text summarization extraction after fusing the Word2vec model. The improved TextRank method, integrating user demands and sentence features into the model, makes the text summarization results closer to the theme of the article and better able to meet user demands. Owing to the algorithm's complexity, this method is not suitable for automatic summarization of long or multiple documents.
News sentiment analysis is widely used in stock price forecasting, and existing research is mostly limited to sentiment mining of news headlines, ignoring the effective information contained in full news articles. This study introduces extractive text summarization technology into stock price prediction, uses the Word2Vec and TextRank algorithms to extract effective text information from full news articles, and adopts a comprehensive sentiment calculation method based on news headlines and news abstracts, taking into account the effective information of both headlines and original texts. The comprehensively calculated news sentiment value is input as a sentiment feature into the LSTM stock price prediction model, and a stock price prediction framework based on TextRank text summarization techniques and sentiment analysis is proposed. Finally, the stock transaction data of A-share CTG DUTY-FREE over 587 trading days, from December 25, 2019 to May 31, 2022, are selected for comparative experiments. The results show that the sentiment analysis algorithm based on TextRank text summarization technology proposed in this article extracts the sentiment value of news texts most effectively; in terms of prediction accuracy, the model is 11.67% higher than the benchmark model on the test set.
Automatic text summarization is a popular Natural Language Processing task, often used to condense lengthy content into a short summary; doing so manually is tedious and time-consuming. This study focuses on Malay news articles, with the aim of selecting representative sentences for Malay news headline generation. The dataset used in the experiment is a collection of multi-genre Malay news published between 2017 and 2019 from Bernama.com. In this study, a leading-sentence approach is applied in TextRank, with TF-IDF and Word2Vec as language models, to perform salient sentence extraction. In the experiment, the top-ranking sentences are extracted based on 15%, 20%, 25% and 30% of the original news content. The extracted content is evaluated against the original news headline using the ROUGE evaluation metric. The model shows that including the first sentence and the first two sentences of the news achieves significant improvement; this leading-sentence approach improves the F1 score from 1.36 to 7.98. Besides that, the experiment also proves that ROUGE scores decrease as the percentage of extraction increases. Thus, the proposed method is fast and resource-efficient compared to other state-of-the-art Natural Language Processing approaches.
Given the large scale and quantity of current transportation industry standards, efficiently extracting standard keywords to provide professional services is a problem that the industry needs to solve at present. According to the text characteristics of transportation industry standards, this paper proposes a keyword extraction method based on improved TextRank that uses the TF-IDF and Word2Vec algorithms. Different weights are then assigned according to factors such as position, word frequency, semantics and part of speech of the industry standard text, so as to quickly extract more authoritative keywords from industry standards. Experiments show that, compared with the classical TextRank, TF-IDF and Word2Vec algorithms, the proposed method achieves a considerable improvement in Precision, Recall and F value on the transportation industry standards dataset.
Automatic text summarization is a core task in natural language processing, aimed at compressing information and extracting key semantics. However, as a typical low-resource language, research on Mongolian automatic summarization has progressed slowly due to the lack of high-quality annotated corpora. To address this gap, this study proposes a hybrid strategy that integrates headline-guided filtering and graph-based ranking to construct MTESum, a single-document summarization dataset for Mongolian. Based on multi-source news articles, the dataset was built through systematic cleaning, BPE (Byte Pair Encoding) subword tokenization, and a combination of TextRank and Jaccard similarity-based redundancy elimination. The resulting dataset comprises 1,000 Mongolian news content–summary pairs. To validate the quality and applicability of MTESum, we conduct a series of experiments using four unsupervised extractive methods, TF-IDF, TextRank, mnTextRank, and Word2Vec + mnTextRank. Additionally, we perform human evaluation and statistical analysis to assess the dataset's summary quality. Experimental results show that the constructed dataset exhibits strong content coverage and structural consistency, providing a reliable foundation for Mongolian text summarization research and offering methodological insights for building summarization datasets in other low-resource languages.
In the context of the accelerated pace of daily life and the development of e-commerce, online shopping is a mainstream way for consumers to access products and services. To understand their emotional expressions in facing different shopping experience scenarios, this paper presents a sentiment analysis method that combines the e-commerce review keyword-generated image with a hybrid machine learning-based model, in which the Word2Vec-TextRank is used to extract keywords that act as the inputs for generating the related images by generative Artificial Intelligence (AI). Subsequently, a hybrid Convolutional Neural Network and Support Vector Machine (CNN-SVM) model is applied for sentiment classification of those keyword-generated images. For method validation, the data randomly comprised of 5000 reviews from Amazon have been analyzed. With superior keyword extraction capability, the proposed method achieves impressive results on sentiment classification with a remarkable accuracy of up to 97.13%. Such performance demonstrates its advantages by using the text-to-image approach, providing a unique perspective for sentiment analysis in the e-commerce review data compared to the existing works. Thus, the proposed method enhances the reliability and insights of customer feedback surveys, which would also establish a novel direction in similar cases, such as social media monitoring and market trend research.
Ontologies play a vital role in organizing and constructing knowledge across various domains, enabling effective knowledge management and sharing. The development of domain-specific ontologies, such as the ONTO-TDM ontology for teaching domain modeling, is essential for providing a comprehensive and standardized representation of knowledge within a given discipline. However, to maximize the usefulness and relevance of such ontologies, it is crucial to automate their population with domain-specific information, reducing manual work and ensuring scalability. This paper presents a novel method for ontology population by extracting and integrating relevant information from diverse sources. The method combines the TextRank algorithm with Word2Vec to enhance keyword extraction, capturing both semantic meaning and textual importance. Keywords are then annotated and used to train a machine learning classifier, which aids in integrating new instances into the ontology. Experiments show that the proposed method achieves a precision of 63.33%, a recall of 61.29% and an F1-score of 62.28%, significantly improving keyword extraction and ontology population accuracy compared to existing methods. This validates the method’s effectiveness in semi-automatically extracting relevant instances from diverse data sources, enhancing the efficiency and accuracy of ontology population, and advancing automated knowledge management in domain-specific contexts.
Network spam has long plagued Internet users. How to accurately and efficiently identify spam is an urgent problem to be solved. Up to now, a lot of research work has been proposed to identify the spam. On the basis of previous work, we propose a new model, which is based on text feature fusion. In particular, we use the TextRank algorithm to extract text keywords, use the Word2vec algorithm to vectorize the text, and finally use the feature fusion method based on attention mechanism to fuse the text features and input them into the BiLSTM model to verify its effect of identifying spam. Through experiments, we find that the accuracy of the proposed model can reach 0.78, which is better than the comparison model. Therefore, according to the analysis results, our method can effectively identify the microblog spam from the text level.
As the bug description data generated during the software maintenance cycle, bug reports are usually hastily written by different users, resulting in many redundant and duplicate bug reports (DBRs). Once the DBRs are repeatedly assigned to developers, it will inevitably lead to a serious waste of human resources, especially for large-scale open-source projects. Recently, many experts and scholars have devoted themselves to researching the detection of DBRs and put forward a series of detection methods for DBRs. However, there is still much room for improvement in the performance of DBR prediction. Therefore, this paper proposes a new method for detecting DBR based on technical term extraction, CTEDB (Combination of Term Extraction and DeBERTaV3) for short. This method first extracts technical terms from the text information of bug reports based on Word2Vec and TextRank algorithms. Then it calculates the semantic similarity of technical terms between different bug reports by combining Word2Vec and SBERT models. Finally, it completes the DBR detection task by combining the DeBERTaV3 model. The experimental results show that CTEDB has achieved good results in detecting DBR, and has obviously improved the accuracy, F1-score, recall and precision compared with the baseline approaches.
Nowadays, the primary media for information dissemination is shifting to online platforms. Events usually burst online through multiple modern online media. Therefore, predicting event popularity trends becomes crucial for online platforms to track public concerns and make appropriate decisions. However, little research focuses on event popularity prediction from a cross-platform perspective. Challenges stem from the vast diversity of events and media, limited access to aligned datasets across different platforms, and a considerable amount of noise in datasets. In this paper, we solve the cross-platform event popularity prediction problem by proposing a model named DancingLines, which is mainly composed of three parts. First, we propose TF-SW, a semantic-aware popularity quantification model based on Term Frequency with Semantic Weight. TF-SW obtains the event popularity based on Word2Vec and TextRank, and generates Event Popularity Time Series (EPTS). Then, we propose ωDTW-CD, a pairwise time series alignment model derived from Dynamic Time Warping (DTW) with Compound Distance (CD) for aligning the EPTS across several platforms. Finally, we aggregate the two time series and propose a neural-based prediction model implementing Long Short-Term Memory (LSTM) with an attention mechanism to obtain accurate event popularity predictions. Evaluation results based on large-scale real-world datasets demonstrate that DancingLines can efficiently characterize, align, and predict event popularity in a cross-platform manner.
Data for electrical components is maintained by different information systems in China. Currently, relational databases are commonly used to obtain and manage data for equipment maintenance purposes. However, as the modern power system develops, there is a growing need to process massive data from different sources, which poses challenges to relational databases due to their limitations in model flexibility, query efficiency, and scalability. Compared to traditional relational databases, graph databases handle fast-changing, interconnected data well, offer flexibility, and have higher query efficiency. This paper introduces a graph modeling technique for transformer equipment to quickly acquire and maintain data for transformer equipment operation and management purposes. An initial data screening is performed to select data with different sources, structures, and characteristics. Word2vec and K-means are adapted to define and cluster model candidate sets. TextRank is applied to perform disambiguation for those transformer model sets. The transformer management graph model is then optimized based on its business scenario. The resulting automatic model provides comprehensive transformer data management and supports transformer equipment operation and maintenance scenarios. This work significantly improves transformer automatic modeling and data management efficiency and compatibility. The results demonstrate the importance and potential of applying graph databases to electrical component data management.
In this paper, we conduct an in-depth study of Japanese keyword extraction from news reports, train word sets from preprocessed external documents into word vectors using the Skip-gram model in the deep learning tool Word2Vec, and calculate the cosine distance between word vectors. The sliding window in TextRank is designed to connect internal document information to improve in-text semantic coherence. The main idea is to use not only the statistical and structural features of words but also the semantic features of words extracted through word-embedding techniques, i.e., multifeature fusion, to obtain the importance weights of the words themselves and the attraction weights between words, and then iteratively calculate the final weight of each word through the graph model algorithm to determine the extracted keywords. To verify the performance of the algorithm, extensive simulation experiments were conducted on three different types of datasets. The experimental results show that the proposed keyword extraction algorithm can improve performance by up to 6.45% and 20.36% compared with existing word frequency statistics and graph model methods, respectively; MF-Rank can achieve a maximum performance improvement of 1.76% compared with PW-TF.
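To make the multi-feature fusion idea tangible, the sketch below combines word-vector similarity as co-occurrence edge weights with a node prior built from term frequency and first-occurrence position, then runs personalized PageRank. The specific features, window size, and prior formula are assumptions chosen for illustration; they are not the MF-Rank definitions.

```python
# Simplified multi-feature keyword TextRank sketch (feature choices are assumptions).
from collections import Counter
import numpy as np
import networkx as nx
from gensim.models import Word2Vec

tokens = ("keyword extraction uses graph ranking and word embedding features "
          "graph ranking scores words by co-occurrence and embedding similarity").split()

# Word vectors trained on the toy text itself (assumption; real setups use external corpora).
w2v = Word2Vec([tokens], vector_size=50, min_count=1, window=3, seed=0)

def sim(a, b):
    va, vb = w2v.wv[a], w2v.wv[b]
    return float(np.dot(va, vb) / (np.linalg.norm(va) * np.linalg.norm(vb) + 1e-12))

# Co-occurrence graph within a sliding window; edges weighted by embedding similarity.
G = nx.Graph()
window = 6
for i, w in enumerate(tokens):
    for j in range(i + 1, min(i + window, len(tokens))):
        G.add_edge(w, tokens[j], weight=max(sim(w, tokens[j]), 0.0))

# Node priors: term frequency combined with an earlier-position bonus, then normalized.
freq = Counter(tokens)
first = {w: tokens.index(w) for w in set(tokens)}
prior = {w: freq[w] * (1.0 + 1.0 / (1 + first[w])) for w in G.nodes()}
total = sum(prior.values())
prior = {w: p / total for w, p in prior.items()}

scores = nx.pagerank(G, personalization=prior, weight="weight")
print(sorted(scores, key=scores.get, reverse=True)[:5])   # top-5 keyword candidates
```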
When the traditional maximal marginal relevance (MMR) algorithm extracts a text summary, only the similarity between a sentence and the text is used as the sentence score. Although the extracted abstract is rich in content, not all sentences in the abstract are important, and the context information and global information of the sentences are ignored. In this paper, an improved maximal marginal relevance algorithm (WT-MMR) is proposed to extract text summaries that are rich in content, composed of important sentences, and free of redundancy. WT-MMR follows two principles: a sentence should be relevant to the text, and it should not be redundant with sentences already in the abstract. The experimental results show that the quality of abstracts extracted by the WT-MMR algorithm is about 10% higher than that of the traditional MMR algorithm, which proves that using Word2Vec, TextRank and semantic information for extracting text summaries improves the performance of the MMR algorithm.
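For reference, the generic MMR selection loop that WT-MMR builds on can be sketched as below: each step picks the sentence maximizing lambda * sim(sentence, document) - (1 - lambda) * max similarity to the already selected sentences. The Word2Vec/TextRank-based weighting that distinguishes WT-MMR is not reproduced; the embeddings here are random toy vectors.

```python
# Generic MMR selection sketch (baseline idea only, not the WT-MMR weighting).
import numpy as np

def mmr_select(sent_vecs, doc_vec, k=3, lam=0.7):
    def cos(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

    candidates = list(range(len(sent_vecs)))
    selected = []
    while candidates and len(selected) < k:
        def mmr_score(i):
            relevance = cos(sent_vecs[i], doc_vec)
            redundancy = max((cos(sent_vecs[i], sent_vecs[j]) for j in selected), default=0.0)
            return lam * relevance - (1 - lam) * redundancy
        best = max(candidates, key=mmr_score)
        selected.append(best)
        candidates.remove(best)
    return selected

rng = np.random.default_rng(0)
sent_vecs = rng.normal(size=(6, 50))          # toy sentence embeddings
doc_vec = sent_vecs.mean(axis=0)              # document vector as the mean embedding
print(mmr_select(sent_vecs, doc_vec, k=3))    # indices of the selected sentences
```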
Keyword extraction is a fundamental task in natural language processing and plays a pivotal role in applications such as information retrieval, text categorization, summarization, and semantic analysis. In this paper, we propose an improved TextRank keyword extraction algorithm (EMF-TextRank) that leverages word embedding and text networks while integrating multiple features of words. First, the word co-occurrence network is optimized using word embedding techniques. Subsequently, TextRank is enhanced by comprehensively evaluating word features across multiple dimensions, including semantics, statistics, network structure, and location. Finally, experimental results validate the effectiveness of the proposed algorithm. Compared to the traditional TextRank algorithm, the improved version exhibits superior performance across multiple datasets, offering new insights and application potential for keyword extraction tasks.
In today’s data-driven world, automatic text summarization is essential for extracting insights from large data volumes. While extractive summarization is well-studied, abstractive summarization remains limited, especially for low-resource languages like Urdu. This study introduces process innovation through transformer-based models—Efficient-BART (EBART), Efficient-T5 (ET5), and Efficient-GPT-2 (EGPT-2)—optimized for Urdu abstractive summarization. Innovations include strategically removing inefficient attention heads to reduce computational complexity and improve accuracy. Theoretically, this pruning preserves structural integrity by retaining heads that capture diverse linguistic features, while eliminating redundant ones. Adapted from BART, T5, and GPT-2, these optimized models significantly outperform their originals in ROUGE evaluations, demonstrating the effectiveness of process innovation and optimization for Urdu natural language processing.
Privatized text rewriting with local differential privacy (LDP) is a recent approach that enables sharing of sensitive textual documents while formally guaranteeing privacy protection to individuals. However, existing systems face several issues, such as formal mathematical flaws, unrealistic privacy guarantees, privatization of only individual words, as well as a lack of transparency and reproducibility. In this paper, we propose a new system, DP-BART, that largely outperforms existing LDP systems. Our approach uses a novel clipping method, iterative pruning, and further training of internal representations which drastically reduces the amount of noise required for DP guarantees. We run experiments on five textual datasets of varying sizes, rewriting them at different privacy guarantees and evaluating the rewritten texts on downstream text classification tasks. Finally, we thoroughly discuss the privatized text rewriting approach and its limitations, including the problem of the strict text adjacency constraint in the LDP paradigm that leads to the high noise requirement.
Multiple studies have shown that Transformers are remarkably robust to pruning. Contrary to this received wisdom, we demonstrate that pre-trained Transformer encoders are surprisingly fragile to the removal of a very small number of features in the layer outputs (<0.0001% of model weights). In case of BERT and other pre-trained encoder Transformers, the affected component is the scaling factors and biases in the LayerNorm. The outliers are high-magnitude normalization parameters that emerge early in pre-training and show up consistently in the same dimensional position throughout the model. We show that disabling them significantly degrades both the MLM loss and the downstream task performance. This effect is observed across several BERT-family models and other popular pre-trained Transformer architectures, including BART, XLNet and ELECTRA; we also show a similar effect in GPT-2.
The application of contemporary artificial intelligence techniques to geometric problem solving and automated deductive proof has long been a grand challenge in the interdisciplinary field of mathematics and artificial intelligence. This is the fourth article in a series of our works. In our previous work, we established a geometric formal system known as FormalGeo and annotated approximately 7000 geometric problems, forming the FormalGeo7k dataset. Although FGPS (Formal Geometry Problem Solver) can achieve interpretable algebraic equation solving and human-like deductive reasoning, it often experiences timeouts due to the complexity of its search strategy. In this paper, we introduce FGeo-TP (Theorem Predictor), which uses a language model to predict theorem sequences for solving geometry problems. We compare the effectiveness of various Transformer architectures, such as BART and T5, in theorem prediction, and use the predicted theorems to prune the search process of FGPS, thereby improving its performance in solving geometry problems. Our results demonstrate a significant increase in the problem-solving rate of the language-model-enhanced FGeo-TP on the FormalGeo7k dataset, rising from 39.7% to 80.86%. Furthermore, FGeo-TP exhibits notable reductions in solving time and search steps across problems of varying difficulty levels.
Retrieval-augmented generation (RAG) improves the response quality of large language models (LLMs) by retrieving knowledge from external databases. Typical RAG approaches split the text database into chunks, organizing them in a flat structure for efficient searches. To better capture the inherent dependencies and structured relationships across the text database, researchers propose to organize textual information into an indexing graph, known as graph-based RAG. However, we argue that the limitation of current graph-based RAG methods lies in the redundancy of the retrieved information, rather than its insufficiency. Moreover, previous methods use a flat structure to organize retrieved information within the prompts, leading to suboptimal performance. To overcome these limitations, we propose PathRAG, which retrieves key relational paths from the indexing graph, and converts these paths into textual form for prompting LLMs. Specifically, PathRAG effectively reduces redundant information with flow-based pruning, while guiding LLMs to generate more logical and coherent responses with path-based prompting. Experimental results show that PathRAG consistently outperforms state-of-the-art baselines across six datasets and five evaluation dimensions. The code is available at the following link: https://github.com/BUPT-GAMMA/PathRAG
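As a loose, hedged approximation of the idea (the decay constant and the degree-based penalty are our stand-ins, not the paper's flow formulation), the sketch below enumerates short paths between retrieved nodes in a networkx DiGraph, scores them by a decaying flow, and keeps only the most reliable paths to verbalize for the prompt.

```python
import networkx as nx

def score_path(graph, path, decay=0.8):
    """Flow-style path reliability: resource starts at 1.0 at the head node and
    decays with path length and each node's out-degree (a crude stand-in for the
    paper's flow-based pruning)."""
    flow = 1.0
    for u, _ in zip(path, path[1:]):
        flow *= decay / max(graph.out_degree(u), 1)
    return flow

def retrieve_paths(graph, sources, targets, cutoff=4, top_k=5):
    """Enumerate short simple paths between retrieved source/target nodes, keep
    the highest-scoring ones, and verbalize them for the prompt."""
    candidates = []
    for s in sources:
        for t in targets:
            for path in nx.all_simple_paths(graph, s, t, cutoff=cutoff):
                candidates.append((score_path(graph, path), path))
    candidates.sort(reverse=True, key=lambda x: x[0])
    return [" -> ".join(map(str, p)) for _, p in candidates[:top_k]]
```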
Recent Multimodal Large Language Models (MLLMs) often use a large number of visual tokens to compensate for their visual shortcomings, leading to excessive computation and obvious visual redundancy. In this paper, we investigate what kind of visual tokens are needed for MLLMs, and reveal that both foreground and background tokens are critical for MLLMs given the varying difficulties of examples. Based on this observation, we propose a graph-based method towards training-free visual token pruning, termed G-Prune. In particular, G-Prune regards visual tokens as nodes, and constructs their connections based on their semantic similarities. Afterwards, the information flow is propagated via weighted links, and the most important tokens after iterations are kept for MLLMs, which can be foreground or background. To validate G-Prune, we apply it to a recent MLLM called LLaVA-NeXT, and conduct extensive experiments on a set of benchmarks. The experimental results show that G-Prune can greatly reduce computation overhead while retaining high performance on both coarse- and fine-grained tasks. For instance, G-Prune can reduce 63.57% of the FLOPs of LLaVA-NeXT on VQA2.0 and TextVQA with only 0.95% and 2.34% accuracy drops, respectively.
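A small sketch of the mechanism as described, with the damping factor, iteration count, and keep ratio as illustrative choices: visual tokens form a similarity graph, importance is propagated over the weighted links, and the top-scoring tokens (foreground or background) are retained.

```python
import torch

def graph_token_prune(tokens, keep_ratio=0.4, iters=10, damping=0.85):
    """Sketch of graph-based token pruning: nodes are visual tokens, edge weights
    are pairwise cosine similarities, importance is propagated over the weighted
    graph, and the highest-scoring tokens are kept."""
    feats = torch.nn.functional.normalize(tokens, dim=-1)          # (N, d)
    sim = (feats @ feats.t()).clamp(min=0)                         # (N, N) similarity graph
    sim.fill_diagonal_(0)
    trans = sim / sim.sum(dim=-1, keepdim=True).clamp(min=1e-9)    # row-normalized links
    n = tokens.size(0)
    score = torch.full((n,), 1.0 / n)
    for _ in range(iters):
        score = (1 - damping) / n + damping * (trans.t() @ score)  # propagate importance
    keep = score.topk(int(keep_ratio * n)).indices.sort().values
    return tokens[keep], keep
```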
Large Language Models (LLMs) have revolutionized artificial intelligence by enabling multitasking across diverse fields. However, their high computational demands result in significant environmental impacts, particularly in terms of energy and water consumption. This paper addresses these issues by proposing an innovative compression approach to reducing LLM sizes. We focus on compressing the internal transformer layers, which are critical contributors to LLMs’ computational complexity. Our approach combines new mathematical and structural methods for model compression. We begin by applying Forward Propagation Pruning (FPP) to compress the embedding and feed-forward layers, utilizing a weight freezing and zeroing technique for suspected unused parameters. This reduces the number of trainable parameters, accelerating the overall training process and enabling faster convergence. Second, the Weight Matrix Folding method is introduced to prune the self-attention layer matrices with a simple and efficient mathematical formulation. This method integrates Identical Row Compression (IRC) to optimize the compression of the Query and Key matrices, alongside Diagonal Weight Compression (DWC), which reformulates the Value matrix into a diagonal structure. Consequently, this technique significantly diminishes parameter variability across the three matrices, enhancing consistency and performance while simplifying complexity. The compression approach is evaluated on three language modeling datasets and eight widely used classification datasets, comparing it to various pruning methods. Our method successfully compresses transformer layers by 99% and linear layers by 70%, resulting in an overall model compression of around 70%, while maintaining nearly the same accuracy. Notably, with moderate compression rates of 20% to 40%, model performance not only remained stable but even improved. This leads to substantial reductions in memory usage and computational demands, making LLMs more resource-efficient and highlighting the potential to optimize them for a more sustainable AI future.
This paper addresses the challenges of high computational cost and severe parameter redundancy in the fine-tuning of large language models. It proposes an efficient fine-tuning algorithm that integrates structural pruning with parameter sharing. The method operates from both the architectural and optimization perspectives. It prunes redundant connections dynamically while keeping the core model frozen and introduces task-conditioned cross-layer sharing modules to enhance representation power and parameter efficiency. A pruning residual compensation mechanism is designed to preserve semantic coherence, and a conditional sharing mapping is constructed to improve task-level consistency. The training objective jointly optimizes task loss, sparsity regularization, and inter-layer consistency constraints, achieving unified parameter compression and semantic retention. The proposed method is systematically evaluated using perplexity, accuracy, and inference speed-up across different pruning rates, learning rates, input lengths, and data distribution settings. Experimental results show that the algorithm consistently outperforms mainstream fine-tuning techniques across multiple dimensions. It achieves joint optimization of accuracy and efficiency with minimal parameter tuning, making it well-suited for large language model deployment and transfer learning across diverse scenarios.
While Large Reasoning Models (LRMs) have demonstrated success in complex reasoning tasks through long chain-of-thought (CoT) reasoning, their inference often involves excessively verbose reasoning traces, resulting in substantial inefficiency. To address this, we propose Distilled Reasoning Pruning (DRP), a hybrid framework that combines inference-time pruning with tuning-based distillation, two widely used strategies for efficient reasoning. DRP uses a teacher model to perform skill-aware step decomposition and content pruning, and then distills the pruned reasoning paths into a student model, enabling it to reason both efficiently and accurately. Across several challenging mathematical reasoning datasets, we find that models trained with DRP achieve substantial improvements in token efficiency without sacrificing accuracy. Specifically, DRP reduces average token usage on GSM8K from 917 to 328 while improving accuracy from 91.7% to 94.1%, and achieves a 43% token reduction on AIME with no performance drop. Further analysis shows that aligning the reasoning structure of training CoTs with the student's reasoning capacity is critical for effective knowledge transfer and performance gains.
Large Language Model (LLM) pruning seeks to remove unimportant weights for inference speedup with minimal accuracy impact. However, existing methods often suffer from accuracy degradation without full-model sparsity-aware fine-tuning. This paper presents Wanda++, a novel pruning framework that outperforms the state-of-the-art methods by utilizing decoder-block-level regional gradients. Specifically, Wanda++ improves the pruning score with regional gradients for the first time and proposes an efficient regional optimization method to minimize pruning-induced discrepancies between the dense and sparse decoder outputs. Notably, Wanda++ improves perplexity by up to 32% over Wanda on the language modeling task and generalizes effectively to downstream tasks. Moreover, despite updating weights with regional optimization, Wanda++ remains orthogonal to sparsity-aware fine-tuning, further reducing perplexity with LoRA to a great extent. Our approach is lightweight, pruning a 7B LLaMA model in under 10 minutes on a single H100 GPU.
Large language models (LLMs) targeting different deployment scales and sizes are currently produced by training each variant from scratch; this is extremely compute-intensive. In this paper, we investigate if pruning an existing LLM and then re-training it with a fraction (<3%) of the original training data can be a suitable alternative to repeated, full retraining. To this end, we develop a set of practical and effective compression best practices for LLMs that combine depth, width, attention and MLP pruning with knowledge distillation-based retraining; we arrive at these best practices through a detailed empirical exploration of pruning strategies for each axis, methods to combine axes, distillation strategies, and search techniques for arriving at optimal compressed architectures. We use this guide to compress the Nemotron-4 family of LLMs by a factor of 2-4x, and compare their performance to similarly-sized models on a variety of language modeling tasks. Deriving 8B and 4B models from an already pretrained 15B model using our approach requires up to 40x fewer training tokens per model compared to training from scratch; this results in compute cost savings of 1.8x for training the full model family (15B, 8B, and 4B). Minitron models exhibit up to a 16% improvement in MMLU scores compared to training from scratch, perform comparably to other community models such as Mistral 7B, Gemma 7B and Llama-3 8B, and outperform state-of-the-art compression techniques from the literature. We have open-sourced Minitron model weights on Huggingface, with corresponding supplementary material including example code available on GitHub.
As their size increases, Large Language Models (LLMs) are natural candidates for network pruning methods: approaches that drop a subset of network weights while striving to preserve performance. Existing methods, however, require either retraining, which is rarely affordable for billion-scale LLMs, or solving a weight reconstruction problem reliant on second-order information, which may also be computationally expensive. In this paper, we introduce a novel, straightforward yet effective pruning method, termed Wanda (Pruning by Weights and activations), designed to induce sparsity in pretrained LLMs. Motivated by the recent observation of emergent large-magnitude features in LLMs, our approach prunes weights with the smallest magnitudes multiplied by the corresponding input activations, on a per-output basis. Notably, Wanda requires no retraining or weight update, and the pruned LLM can be used as is. We conduct a thorough evaluation of Wanda on LLaMA and LLaMA-2 across various language benchmarks. Wanda significantly outperforms the established baseline of magnitude pruning and performs competitively against recent methods involving intensive weight updates. Code is available at https://github.com/locuslab/wanda.
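Since the scoring rule is spelled out in the abstract, a compact sketch is easy to give: each weight's score is its magnitude times the L2 norm of the corresponding input activation, and the lowest-scored weights are masked within each output row. The calibration-activation tensor and the sparsity level below are illustrative.

```python
import torch

def wanda_prune(weight, activations, sparsity=0.5):
    """Wanda-style pruning score |W_ij| * ||X_j||_2, with the smallest-scored
    weights removed independently within each output row (per-output comparison).
    `activations` holds calibration inputs to this layer, shape (n_samples, in_features)."""
    act_norm = activations.norm(p=2, dim=0)                       # (in_features,)
    score = weight.abs() * act_norm.unsqueeze(0)                  # (out_features, in_features)
    k = int(weight.shape[1] * sparsity)
    prune_idx = torch.topk(score, k, dim=1, largest=False).indices
    mask = torch.ones_like(weight, dtype=torch.bool)
    mask.scatter_(1, prune_idx, False)                            # drop k weights per row
    return weight * mask, mask
```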
Large language models (LLMs) have shown remarkable capabilities in language understanding and generation. However, such impressive capability typically comes with a substantial model size, which presents significant challenges in the deployment, inference, and training stages. With the LLM being a general-purpose task solver, we explore its compression in a task-agnostic manner, which aims to preserve the multi-task solving and language generation ability of the original LLM. One challenge to achieving this is the enormous size of the training corpus of the LLM, which makes both data transfer and model post-training over-burdensome. Thus, we tackle the compression of LLMs within the bound of two constraints: being task-agnostic and minimizing the reliance on the original training dataset. Our method, named LLM-Pruner, adopts structural pruning that selectively removes non-critical coupled structures based on gradient information, maximally preserving the majority of the LLM's functionality. The performance of pruned models can then be efficiently recovered through tuning techniques such as LoRA in merely 3 hours, requiring only 50K data samples. We validate LLM-Pruner on three LLMs, including LLaMA, Vicuna, and ChatGLM, and demonstrate that the compressed models still exhibit satisfactory capabilities in zero-shot classification and generation. The code is available at: https://github.com/horseee/LLM-Pruner
Large language models (LLMs) based on the Transformer architecture are witnessing a notable trend of size expansion, which brings considerable costs to both model training and inference. However, existing methods such as model quantization, knowledge distillation, and model pruning are constrained by various issues, including hardware support limitations, the need for extensive training, and alterations to the model's internal structure. In this paper, we propose a concise layer-wise structured pruner called Layer Collapse (LaCo), in which rear model layers collapse into a prior layer, enabling a rapid reduction in model size while preserving the model structure. Comprehensive experiments show that our method maintains an average task performance of over 80% at pruning ratios of 25-30%, significantly outperforming existing state-of-the-art structured pruning methods. We also conduct post-training experiments to confirm that LaCo effectively inherits the parameters of the original model. Additionally, we perform ablation studies on various settings of LaCo. Finally, we discuss our motivation from the perspective of layer-wise similarity and evaluate the performance of the pruned LLMs across various pruning ratios. Code: https://github.com/yangyifei729/LaCo
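As a rough, hedged illustration only (the exact merge rule and the similarity-based selection of which layers to collapse are the paper's), one simple way to fold several consecutive rear layers into an earlier one is to accumulate their parameter differences onto the retained layer:

```python
import copy
import torch

def collapse_layers(layers, start, num_merge):
    """Sketch: fold `num_merge` consecutive layers into the layer at index `start`
    by accumulating their parameter differences, then drop the folded layers.
    `layers` is a list (or ModuleList) of transformer blocks."""
    merged = copy.deepcopy(layers[start])
    merged_params = dict(merged.named_parameters())
    base_params = {n: p.clone() for n, p in layers[start].named_parameters()}
    with torch.no_grad():
        for layer in layers[start + 1 : start + 1 + num_merge]:
            for name, param in layer.named_parameters():
                merged_params[name].add_(param - base_params[name])
    return list(layers[:start]) + [merged] + list(layers[start + 1 + num_merge:])
```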
Conversational implicature is a core concept of pragmatics and is of great significance to natural language processing (NLP). However, current intelligent systems are insufficient in dealing with conversational implicature, especially the new generation of deep-learning-based, data-driven systems. Because pragmatic theories are not fully integrated, the depth and accuracy of semantic understanding are limited. Therefore, this paper models the expression and reasoning process of conversational implicature based on Bayesian theory. In the feature extraction stage, Normalized Google Distance (NGD) is introduced to measure the semantic correlation between words, and the node weights are recalculated. The key features are extracted by the improved NGD-TextRank algorithm, and redundant attributes are removed to reduce the dimensionality. In the classification process, feature items are weighted by their degree of influence, and these weights are integrated into the naive Bayes formula to construct a weighted naive Bayes classification algorithm. The results show that the model can effectively address pragmatic problems in conversational implicature and provide new theoretical support and a technical path for NLP and related fields.
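The NGD ingredient is a standard formula, so a small worked helper clarifies how word relatedness is computed from corpus counts; how exactly the resulting distances are mapped to TextRank edge weights is our assumption, not the paper's specification.

```python
import math

def ngd(fx, fy, fxy, n_docs):
    """Normalized Google Distance between terms x and y, given their document
    frequencies f(x), f(y), co-occurrence frequency f(x, y), and the total number
    of documents N; smaller values mean stronger semantic relatedness."""
    if fxy == 0:
        return float("inf")
    log_fx, log_fy, log_fxy = math.log(fx), math.log(fy), math.log(fxy)
    return (max(log_fx, log_fy) - log_fxy) / (math.log(n_docs) - min(log_fx, log_fy))

# One plausible (assumed) edge weighting for the NGD-TextRank graph:
# weight(x, y) = 1 / (1 + ngd(fx, fy, fxy, n_docs))
```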
Traditional text classification models, such as text kernels, primarily consider the syntactic aspects of text data. This paper introduces Topic-Weighted Kernels, a new text analytics framework that combines global topical themes with word-level semantics in a text kernel architecture. Three new text kernels are proposed to improve text analysis: (a) the Topic-Weighted Base Kernel, (b) the Topic-Weighted Word2Vec Kernel, and (c) the Topic-Weighted BERT (Bidirectional Encoder Representations from Transformers) Kernel. These kernels leverage topic modeling and deep word embeddings to capture thematic and semantic information within textual data. Text kernels consider global and local semantics for text analysis tasks and improve model performance. Experiments on diverse datasets demonstrate that Topic-Weighted Kernels outperform existing methods for text analysis tasks. The Topic-Weighted BERT Kernel achieves top-tier performance, with F1 scores reaching 99% on lighter datasets and significantly boosting performance on more complex datasets. For the tasks of multi-label text classification on the Reuters-90 dataset and sentiment analysis on the IMDB dataset, the model achieves F1 scores of 90.76% and 96.66%, respectively, demonstrating state-of-the-art performance. The Topic-Weighted Kernel approach improves performance while enabling a better contextual representation for various text analysis tasks such as single- and multi-label classification and sentiment analysis. The proposed framework integrates semantics from word embeddings and topic models into text kernels to capture intricate patterns in textual data that aid in more contextual text analytics.
Offensive language detection in Arabic social media remains a challenging task due to the linguistic richness, dialectal variations, and class imbalance between offensive and non-offensive content. In this paper, we introduce a Semantic Class-Aware Weighted Graph Convolutional Network (SAW-GCN), which leverages edge weights derived from semantic similarity to propagate information across related comments for enhancing Arabic offensive language classification. First, comments are encoded into high-dimensional semantic vectors using AraBERT, capturing contextual and dialectal nuances. To mitigate data imbalance, KMeans-SMOTE is employed to generate representative synthetic samples while preserving semantic coherence. A class-aware similarity graph is then constructed by connecting comments based on cosine similarity within each class, ensuring discriminative structural relationships. Experimental results demonstrate strong generalization, achieving a weighted F1-score of 94.4%, 97.5% ROC-AUC, 91% balanced accuracy, and 82% MCC, significantly improving over baseline methods. These findings highlight the effectiveness of combining transformer-based embeddings with weighted graph neural architectures for robust offensive language detection in Arabic social media.
The increasing amount of textual content across digital platforms, including social media, news and education, has made it difficult for users to extract useful information efficiently. Therefore, Automatic Text Summarization (ATS) becomes an essential tool for distilling large amounts of information while maintaining the core idea. Progress in Arabic ATS remains limited due to the scarcity of annotated datasets, the lack of Arabic-specific NLP tools and the high computational cost of LLMs. Additionally, traditional methods often fail to capture sentence-level semantics, limiting summary quality. To address this, we propose a scalable, unsupervised framework that uses TF-IDF-weighted AraBERT embeddings to generate rich sentence representations. To further capture document structure, sentences are grouped using k-means clustering. From each cluster, we identify the most representative sentences using centroid similarity and apply Maximal Marginal Relevance (MMR) as a post-processing redundancy-removal step to eliminate sentences that are too similar. Experimental evaluation on the EASC dataset demonstrates that our weighted AraBERT model outperforms traditional embedding techniques such as FastText and unweighted AraBERT, achieving significant improvements across multiple ROUGE metrics.
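A minimal sketch of this pipeline under stated assumptions: sentence vectors are TF-IDF-weighted averages of token embeddings (e.g. from AraBERT, not loaded here), k-means groups them, the sentence nearest each centroid becomes a candidate, and a simplified MMR-style filter drops near-duplicates; thresholds and cluster counts are illustrative.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics.pairwise import cosine_similarity

def summarize(sent_token_embs, sent_tfidf, k=3, sim_threshold=0.7, n_pick=5):
    """sent_token_embs[i]: (n_tokens, d) token embeddings for sentence i;
    sent_tfidf[i]: matching (n_tokens,) TF-IDF weights."""
    # TF-IDF-weighted average of token embeddings -> one vector per sentence.
    sent_vecs = np.stack([
        (w[:, None] * e).sum(0) / (w.sum() + 1e-9)
        for e, w in zip(sent_token_embs, sent_tfidf)
    ])
    labels = KMeans(n_clusters=k, n_init=10).fit_predict(sent_vecs)
    # Candidate = sentence closest to its cluster centroid.
    candidates = []
    for c in range(k):
        idx = np.where(labels == c)[0]
        centroid = sent_vecs[idx].mean(0, keepdims=True)
        candidates.append(idx[cosine_similarity(sent_vecs[idx], centroid).argmax()])
    # Simplified MMR-style filter: skip candidates too similar to ones already kept.
    selected = [candidates[0]]
    for i in candidates[1:]:
        max_sim = max(cosine_similarity(sent_vecs[[i]], sent_vecs[[j]])[0, 0] for j in selected)
        if max_sim < sim_threshold:
            selected.append(i)
    return sorted(selected)[:n_pick]
```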
The accuracy of traditional topic models may be compromised due to the sparsity of co-occurring vocabulary in the corpus, whereas conventional word embedding models tend to excessively prioritize contextual semantic information and inadequately capture domain-specific features in the text. This paper proposes a hybrid semantic representation method that combines a topic model that integrates conceptual knowledge with a weighted word embedding model. Specifically, we construct a topic model incorporating the Probase concept knowledge base to perform topic clustering and obtain topic semantic representation. Additionally, we design a weighted word embedding model to enhance the contextual semantic information representation of the text. The feature-based information fusion model is employed to integrate the two textual representations and generate a hybrid semantic representation. The hybrid semantic representation model proposed in this study was evaluated based on various English composition test sets. The findings demonstrate that the model presented in this paper exhibits superior accuracy and practical value compared to existing text representation methods.
In online computer systems, the detection of anomalous events is crucial for protecting the system from failures. System logs record detailed information about computing events and are widely used for system state analysis. Existing log-based anomaly detection methods are affected by the quality of semantic vectors. Semantic vectors obtained using Word2Vec or BERT only represent the semantics of whole sentences, disregarding the importance of individual words. To address these limitations, we have designed a TF-IDF-based weighting approach, LogTIW. Unlike LogRobust, which applies weighting during the construction of semantic features, LogTIW separately constructs semantic features and TF-IDF features of log templates after parsing. We extract semantic features using Transformer and LSTM models, and extract the weight information carried by TF-IDF using LSTM and linear layers. Then, the semantic features are weighted using the extracted weight features. Experimental results demonstrate that on the publicly available HDFS log dataset, LogTIW achieves precision, recall, and F1 scores all exceeding 99%. LogTIW outperforms state-of-the-art methods in anomaly detection.
Large-scale pre-trained sequence-to-sequence models like BART and T5 achieve state-of-the-art performance on many generative NLP tasks. However, such models pose a great challenge in resource-constrained scenarios owing to their large memory requirements and high latency. To alleviate this issue, we propose to jointly distill and quantize the model, where knowledge is transferred from the full-precision teacher model to the quantized and distilled low-precision student model. Empirical analyses show that, despite the challenging nature of generative tasks, we were able to achieve a 16.5x model footprint compression ratio with little performance drop relative to the full-precision counterparts on multiple summarization and QA datasets. We further pushed the limit of compression ratio to 27.7x and presented the performance-efficiency trade-off for generative tasks using pre-trained models. To the best of our knowledge, this is the first work aiming to effectively distill and quantize sequence-to-sequence pre-trained models for language generation tasks.
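The distillation half of this recipe is standard enough to sketch: the quantized student is trained against a soft target from the full-precision teacher plus the usual hard-label loss. The temperature, mixing weight, and the omission of the quantization machinery itself (e.g. fake-quantized linear layers in the student) are our simplifications, not the paper's exact setup.

```python
import torch
import torch.nn.functional as F

def distill_loss(student_logits, teacher_logits, labels, temperature=2.0, alpha=0.5):
    """Soft KL term against the full-precision teacher plus the hard-label loss;
    the quantized student is trained with this combined objective."""
    soft = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * temperature ** 2
    hard = F.cross_entropy(
        student_logits.view(-1, student_logits.size(-1)),
        labels.view(-1),
        ignore_index=-100,
    )
    return alpha * soft + (1 - alpha) * hard
```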
This paper presents a technique to reduce the number of parameters in a transformer-based encoder–decoder architecture by incorporating autoencoders. To discover the optimal compression, we trained different autoencoders on the embedding space (encoder’s output) of several pre-trained models. The experiments reveal that reducing the embedding size has the potential to dramatically decrease the GPU memory usage while speeding up the inference process. The proposed architecture was included in the BART model and tested for summarization, translation, and classification tasks. The summarization results show that a 60% decoder size reduction (from 96 M to 40 M parameters) will make the inference twice as fast and use less than half of GPU memory during fine-tuning process with only a 4.5% drop in R-1 score. The same trend is visible for translation and partially for classification tasks. Our approach reduces the GPU memory usage and processing time of large-scale sequence-to-sequence models for fine-tuning and inference. The implementation and checkpoints are available on GitHub.
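A minimal sketch of the idea, with illustrative dimensions rather than the paper's configuration: an autoencoder squeezes the encoder's output states into a smaller space so that a narrower decoder can consume them, and is trained to reconstruct the original states.

```python
import torch
import torch.nn as nn

class EmbeddingAutoencoder(nn.Module):
    """Compress encoder output states to a smaller dimension and reconstruct them;
    the compressed states would feed a correspondingly smaller decoder."""

    def __init__(self, d_model=1024, d_compressed=512):
        super().__init__()
        self.compress = nn.Sequential(nn.Linear(d_model, d_compressed), nn.GELU())
        self.reconstruct = nn.Linear(d_compressed, d_model)

    def forward(self, hidden_states):                  # (batch, seq, d_model)
        compressed = self.compress(hidden_states)      # (batch, seq, d_compressed)
        return compressed, self.reconstruct(compressed)

# Training this on frozen BART encoder outputs with an MSE reconstruction loss,
# then feeding `compressed` to a narrower decoder, is one way to realize the
# reported memory and latency savings.
```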
A common shortcut in dialogue summarization is to truncate a conversation to its most recent turns, assuming this reduces cost without hurting summary quality. We test this assumption through a controlled context ablation study on DialogSum using DistilBART as the main model, with T5-small and BART-large for confirmation. Truncation consistently hurts quality: BERTScore drops by -0.0115 and ROUGE-L by -0.0705 (p < 0.0001), showing that removing early context acts as lossy compression that distorts meaning. Meanwhile, efficiency barely improves: shrinking input length by about 75% yields only a ~10% gain in throughput on an NVIDIA T4 GPU. These findings reveal that naive truncation is counterproductive; future dialogue summarization systems should adopt adaptive, content-aware context selection rather than blunt length reduction.
This report synthesizes two frontier directions: weighted improvements to the TextRank algorithm and lightweight optimization of the BART model. The research landscape clearly illustrates how natural language processing balances semantic enrichment against computational efficiency. On one hand, fusing deep-learning embeddings (BERT/Word2Vec) with multi-dimensional statistical features gives the traditionally unsupervised, graph-based TextRank algorithm stronger semantic awareness and broad applicability in vertical domains. On the other hand, for pretrained models represented by BART, techniques such as knowledge distillation, structured pruning, quantization, and latent-space compression significantly lower the deployment barrier of large generative models, while retrieval augmentation and faithfulness-oriented strategies address the challenge of factual consistency in generated content. Together, these studies advance the industrial adoption of efficient and accurate text analysis and generation technologies.