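A Survey of Natural Language Processing Applications in the Analysis of Listed Companies' Financial Reports
Information Extraction and Structural Parsing of Financial Documents
This group of studies applies foundational NLP and computer vision techniques to extract named entities, relations, and key indicators from unstructured financial reports and to recognize complex document layouts, so that the data can be processed in structured form.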
- A Named Entity Recognition Model Based on BERT Model and Lexical Fusion in the Financial Regulation Field(Xiaoguo Wang, X. Pan, Chao Chen, Jianwen Cui, 2023, 2023 5th International Conference on Frontiers Technology of Information and Computer (ICFTIC))
- Financial Named Entity Recognition: How Far Can LLM Go?(Yi Lu, Yintong Huo, 2025, No journal)
- Specific customer risk disclosure and IPO approval rate: evidence based on machine learning and text analysis(Xiqiong He, Sibo Wang, Jiayue Li, 2024, Applied Economics Letters)
- Towards Cognitive Intelligence in Financial Document Analysis: A Multimodal LLM Framework for Risk Reasoning and Due Diligence(Manshan Lin, 2025, Journal of Language)
- SoarGraph: Numerical Reasoning over Financial Table-Text Data via Semantic-Oriented Hierarchical Graphs(Fengbin Zhu, Moxin Li, Junbin Xiao, Fuli Feng, Chao Wang, Tat-seng Chua, 2023, Companion Proceedings of the ACM Web Conference 2023)
- Green Innovation and Corporate Green Transformation: Evidence from Text Mining of Annual Reports(Yong Li, 2024, Proceedings of the 2024 3rd International Conference on Algorithms, Data Mining, and Information Technology)
- Detecting Document Structure in a Very Large Corpus of UK Financial Reports(Mahmoud El-Haj, Paul Rayson, S. Young, M. Walker, 2014, International Conference on Language Resources and Evaluation)
- Enhancing Financial Named Entity Recognition through Adaptive Few-Shot Learning: A Comparative Study of Pre-trained Language Models(Ziyi Wang, 2024, Journal of Advanced Computing Systems)
- Artificial intelligence in visually rich document digitization for efficient asset management(I. P. D. D. Silva, Ester Deschamps Macedo, A. Gomes, Victor Crisóstomo Mellia, Caio Henrique de Morais Ferreira Pinto, Rômulo César Dias de Andrade, B. L. Bezerra, Roberta Fagundes, 2025, CONTRIBUCIONES A LAS CIENCIAS SOCIALES)
- Training LayoutLM from Scratch for Efficient Named-Entity Recognition in the Insurance Domain(Benno Uthayasooriyar, Antoine Ly, Franck Vermet, Caio Corro, 2024, No journal)
- A dataset for document level Chinese financial event extraction(Yubo Chen, Tong Zhou, Sirui Li, Jun Zhao, 2025, Scientific Data)
- FinMMDocR: Benchmarking Financial Multimodal Reasoning with Scenario Awareness, Document Understanding, and Multi-Step Computation(Zichen Tang, E. Haihong, Rongjin Li, Jiacheng Liu, Lin Jia, Zhuodi Hao, Zhongjun Yang, Yuanze Li, H. Tian, Xinyi Hu, Peizhi Zhao, Yuan Liu, Zhengyun Wang, Xianghe Wang, Yiling Huang, Xueyuan Lin, Ruofei Bai, Zijian Xie, Qian Huang, Rui Cao, Haocheng Gao, 2025, Proceedings of the AAAI Conference on Artificial Intelligence)
- Fine-grained document-level financial event argument extraction approach(Ze Chen, Wanting Ji, Linlin Ding, Baoyan Song, 2023, Engineering Applications of Artificial Intelligence)
- Doc2EDAG: An End-to-End Document-level Framework for Chinese Financial Event Extraction(Shun Zheng, W. Cao, W. Xu, Jiang Bian, 2019, Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP))
- DEXTER - Data EXTraction & Entity Recognition for Low Resource Datasets(Nihal V. Nayak, Pratheek Mahishi, Sagar Rao, 2019, AAAI Spring Symposium Combining Machine Learning with Knowledge Engineering)
- FinMind-Y-Me at the Regulations Challenge Task: Financial Mind Your Meaning based on THaLLE(Pantid Chantangphol, Pornchanan Balee, Kantapong Sucharitpongpan, Chanatip Saetia, Tawunrat Chalothorn, 2025, No journal)
- Leveraging Large Language Models for Few-Shot KPI Extraction from Financial Reports(Tobias Deußer, Cong Zhao, Daniel Uedelhoven, Lorenz Sparrenberg, L. Hillebrand, Christian Bauckhage, R. Sifa, 2024, 2024 IEEE International Conference on Big Data (BigData))
- A comparative study on ML-based approaches for Main Entity Detection in Financial Reports(Thanos Konstantinidis, Y. Xu, Tony Constantinides, D. Mandic, 2023, 2023 24th International Conference on Digital Signal Processing (DSP))
- Financial named entity recognition based on conditional random fields and information entropy(Shuwei Wang, Ruifeng Xu, Bin Liu, Lin Gui, Yu Zhou, 2014, 2014 International Conference on Machine Learning and Cybernetics)
- KPI-BERT: A Joint Named Entity Recognition and Relation Extraction Model for Financial Reports(L. Hillebrand, Tobias Deußer, T. Khameneh, Bernd Kliem, Rüdiger Loitz, C. Bauckhage, R. Sifa, 2022, 2022 26th International Conference on Pattern Recognition (ICPR))
- A Generative Approach for Comprehensive Financial Event Extraction at the Document Level(Jinan Zou, Yanxi Liu, Yuankai Qi, Hai Cao, Lingqiao Liu, Javen Qinfeng Shi, 2023, 4th ACM International Conference on AI in Finance)
- A Prior Information Enhanced Extraction Framework for Document-level Financial Event Extraction(Haitao Wang, Tong Zhu, Mingtao Wang, Guoliang Zhang, Wenliang Chen, 2021, Data Intelligence)
- Financial Report Entity-Relation Extraction Combining BiLSTM-CRF and FGM Adversarial Training(Robert J. H. Miller, Sarah E. Hamilton, 2026, Frontiers in Business and Finance)
- Advancements in Financial Document Structure Extraction: Insights from Five Years of FinTOC (2019-2023)(Juyeon Kang, Mauli Mehulkumar Patel, A. Agrawal, Simhadri Sevitha, R. Srinivasa, Sandra Bellato, M. A. Kumar, Ngawang Dempa Tsang, Mo El-Haj, 2023, 2023 IEEE International Conference on Big Data (BigData))
- Computer Vision-Based Framework for Data Extraction From Heterogeneous Financial Tables: A Comprehensive Approach to Unlocking Financial Insights(Iftakhar Ali Khandokar, Priya Deshpande, 2025, IEEE Access)
- Domain Adaption of Named Entity Recognition to Support Credit Risk Assessment(J. Alvarado, Karin M. Verspoor, Timothy Baldwin, 2015, Australasian Language Technology Association Workshop)
- Named Entity Recognition Based Approach for Automatic Turkish Financial Document Verification(A. Toprak, Metin Turan, 2025, 2025 10th International Conference on Computer Science and Engineering (UBMK))
- The Financial Document Causality Detection Shared Task (FinCausal 2025)(Antonio Moreno-Sandoval, Jordi Porta-Zamorano, Blanca Carbajo-Coronado, Yanco Torterolo, D. Samy, 2025, No journal)
- Semantic Causality Knowledge Graph with Ontology Integration for Financial Analysis(Chinthakunta Manjunath, L. H, Pooja Maken, Manohar M, Samuel Shaju, 2025, ITM Web of Conferences)
- Enhanced Named Entity Recognition algorithm for financial document verification(A. Toprak, M. Turan, 2023, The Journal of Supercomputing)
- Transformer-Based Approach for Automatic Semantic Financial Document Verification(A. Toprak, Metin Turan, 2024, IEEE Access)
- A Pluggable Multi-Task Learning Framework for Sentiment-Aware Financial Relation Extraction(Jinming Luo, Hailin Wang, 2025, 2025 International Joint Conference on Neural Networks (IJCNN))
- SEHF: A Summary-Enhanced Hierarchical Framework for Financial Report Sentiment Analysis(Haozhou Li, Qinke Peng, Xinyuan Wang, Xu Mou, Yonghao Wang, 2024, IEEE Transactions on Computational Social Systems)
- A Named Entity Extraction System for Historical Financial Data(Wassim Swaileh, T. Paquet, S. Adam, Andres Rojas Camacho, 2020, Lecture Notes in Computer Science)
- Multi-Stage Retriever Model for Document Classification Using Fine Tuned Embedding Model(Navaneeth Amarnath, Saiteja Tallam, Pratyusha Rasamsetty, Deepak Kumar, 2025, 2025 IEEE 15th Annual Computing and Communication Workshop and Conference (CCWC))
- Deep Learning for Effective Classification and Information Extraction of Financial Documents(Valentin-Adrian Serbanescu, Maruf A. Dhali, 2025, Proceedings of the 14th International Conference on Pattern Recognition Applications and Methods)
- Hybrid Model for Financial Named Entity Recognition in Ukrainian using CRF, BiLSTM, and BERT(V. Ivanenko, 2025, WSEAS TRANSACTIONS ON SYSTEMS)
- A Domain-Specific Transformer Approach for Financial Statement Fraud Detection(Matin N. Ashtiani, Shanzi Rui, B. Raahemi, 2025, Proceedings of the Canadian Conference on Artificial Intelligence)
- Optimizing Financial Named Entity Recognition with Pre-Trained Language Model(Victor Gayuh Utomo, G. F. Shidik, Muljono, Purwanto, 2025, 2025 9th International Conference on Information Technology, Information Systems and Electrical Engineering (ICITISEE))
- Finalyze: A RAG-Based Framework for Intelligent Financial Document Analysis(Shloka Mehta, Tisha Negandhi, Sunil Ghane, 2025, 2025 5th International Conference on Emerging Research in Electronics, Computer Science and Technology (ICERECT))
- Comparative Evaluation of Transformer-Based and Baseline NER Models for Kenyan Financial Text(Chandaka Giri Babu, P. Soni, Ramesh Chundi, S. S., 2026, 2026 9th International Conference on Computational Intelligence in Data Science (ICCIDS))
- Table Extraction from Financial and Transactional Documents(Rama Krishna Raju Samantapudi, 2025, International journal of IoT)
- A DOMAIN-ADAPTIVE QUESTION ANSWERING FRAMEWORK FOR FINANCIAL TEXTS WITH MULTI-TASK SEMANTIC REASONING(Ali Raza, Fatima Noor, 2024, Computing and Applications reviews)
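Sentiment and Narrative, Readability, and Disclosure Quality in Financial Reports
This group of studies examines how textual tone and linguistic complexity affect corporate disclosure quality, ESG performance, and stakeholder psychology, and analyzes how management uses language strategies to shape market expectations.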
- NLP-Driven Sentiment Analysis of Earnings Calls and Its Impact on Stock Volatility(Venkata Sai Nageen, 2024, International Journal of AI, BigData, Computational and Management Studies)
- Financial sentiment analysis with FUNNEL: filtered UNion for NER-based ensemble labeling(William Nordansjö, Fredrik Fourong, M. Qasim, 2025, Digital Finance)
- Annual report tone and divergence of opinion: evidence from textual analysis(Zhihao Qin, Menglin Cui, 2024, Journal of Applied Economics)
- Digital Text Analytics in Narrative Analysis of Financial Reports: A Computational Approach to Interpret Accounting Language(Khairul Khairul, Sulaiman Ahmad, Syahrudin Marpaung, Ratna Ratna, Jonni Hamonangan Silaen, 2025, International Journal of Business and Applied Economics)
- Cost of Equity Capital and Annual Report Tone Manipulation(Jiang Jian, Fan Yang, Minglang Liu, Yi Liu, 2023, Emerging Markets Finance and Trade)
- Sentiment Analysis for Financial Markets(Piyush Jawale, Saahil Jawale, Dhaval Ingale, M. Shetty, 2023, International Journal for Research in Applied Science and Engineering Technology)
- Sentiment Analysis of ESG Disclosures on Annual Report in Thailand(Tharnsaithong Hirunsri, W. Nadee, 2024, 2024 8th International Conference on Information Technology (InCIT))
- Enhancing Financial Sentiment Analysis Ability of Language Model via Targeted Numerical Change-Related Masking(Hui Do Jung, Beakcheol Jang, 2024, IEEE Access)
- The Impact of Annual Report Tone on Corporate Financing: A Case Study of Tesla, Inc.(Zhujun Zhang, 2026, Exploring Science Academic Conference Series)
- A Sentiment Analysis Model for Annual Reports Based on FinBERT and Dual Channel Attention(Chuan Zhan, Xiaoyu Hu, 2024, Proceedings of the 2024 4th International Conference on Artificial Intelligence, Big Data and Algorithms)
- Unraveling the synergy of environmental information disclosure quality and tone on corporate green innovation: based on text mining technology(Shu Hu, Chen Zhang, Yuanpu Ji, Di Pan, Dandan Zhu, 2025, Chinese Management Studies)
- Market sentiment analysis using multimodal transformers: Integrating earnings calls, social media, and technical indicators(Rahul Modak, 2023, Global Journal of Engineering and Technology Advances)
- Performance measurement through corporate communication: evidence from Indian manufacturing firms(P. Pant, Ashok Nimiwal, Shantanu Dutta, S.P. Sarmah, 2025, International Journal of Productivity and Performance Management)
- IFRS adoption and the readability of corporate annual reports: evidence from an emerging market(Ibrahim El-Sayed Ebaid, 2023, Future Business Journal)
- Development of an Automatic Summarization System based on Large Language Models for Annual Report Analysis(M. Rizki, Yudi Wibisono, Eddy Prasetyo Nugroho, 2025, Brilliance: Research of Artificial Intelligence)
- Automated Financial Report Summarization Using Python: A PDF-Based Approach(Fahmi Rizky Nugraha, 2025, Scientific Journal of Information System)
- From complexity to clarity: unraveling the determinants of annual report readability and tone in an emerging market(S. Samarakoon, Rudra Pratap Pradhan, R. P. C. R. Rajapakse, Premjit Sahoo, 2025, Quality & Quantity)
- A critical genre analysis of MD&A discourse in corporate annual reports(Yubin Qian, 2020, Discourse & Communication)
- ESG disclosure, public perception and corporate financial performance: An empirical study based on textual analysis.(Yuangao Chen, Ziye Xie, Lu Wang, Liyuan Zhu, 2025, Journal of Environmental Management)
- Measuring Technostress in Corporate Culture: Insights from the 10-K Annual Reports(Nayera Eltamboly, Magdy Farag, Mohamed Gomaa, Maysa Ali Mohamed Abdallah, 2026, Journal of Risk and Financial Management)
- The Impact of Corporate Digital Transformation Disclosures on Accounting Information Quality: An ESG Governance Perspective Using Text Mining Techniques(Meng Li, 2024, Proceedings of the 2024 5th International Conference on Big Data Economy and Information Management)
- Assessment of TCFD Voluntary Disclosure Compliance in the Spanish Energy Sector: A Text Mining Approach to Climate Change Financial Disclosures(Matías Domínguez-Quiñones, I. Aliende, L. Escot, 2025, World)
- The readability of corporate sustainability narratives in a crisis: an analysis of linguistic features using Coh-Metrix(L. Bini, Silvia Fissi, 2025, Social Responsibility Journal)
- A Text Analysis Approach for Assessing Annual Report Quality: A Comparative Case Study of Vietnamese and Chinese Companies(Truong Thi Minh Duc, 2026, International Journal of World Economic Research)
- Identifying Sustainability Efforts in Company’s Reports Using Text Mining and Machine Learning(Evangelos Xevelonakis, Tanbir Mann, 2024, Athens Journal of Sciences)
- The impact of green information disclosure on stock price synchronicity: an analysis based on NER and TF-BIDF textual techniques(Yining Cao, Suchang Yang, 2025, Future Business Journal)
- Evaluating Green Supply Chain Practices in Southeast Asia: A Text Mining Approach on Corporate Sustainability Reports(Andi Dwi Anjani, Iqlillah Nur Aida, Faishal Muhammad, 2025, Journal of Management and Informatics)
- Measuring the Readability of Sustainability Reports: A Corpus-Based Analysis Through Standard Formulae and NLP(Nils Smeuninx, Bernard De Clerck, W. Aerts, 2020, International Journal of Business Communication)
- Sustainable disclosures of polish banks – text mining analysis(Martyna Białek-Szkudlarek, 2025, Journal of Finance and Financial Law)
- Incorporating an Unsupervised Text Mining Approach into Studying Logistics Risk Management: Insights from Corporate Annual Reports and Topic Modeling(David L. Olson, Bongsug Chae, 2023, Information)
- The impact of supply chain uncertainty on dual innovation within manufacturing enterprises: An empirical analysis based on computational modeling and text mining(Rongyan Jia, Jiaqi Meng, 2025, Proceedings of the 2025 International Conference on Information Economy, Data Modeling and Cloud Computing)
- Board Media Background on Annual Report Readability: Evidence from Indonesia(Nirwanda Nila Sari, Siti Nur 'Aini, Iman Harymawan, 2025, InFestasi)
- AI-POWERED METADATA INTELLIGENCE: CLUSTERING FINANCIAL REPORTS FOR DYNAMIC DISCOVERY(Preeta Pillai, 2024, INTERNATIONAL JOURNAL OF COMPUTER ENGINEERING AND TECHNOLOGY)
- Information Disclosure and Corporate Earnings Management: A Textual Analysis Based on the Annual Report(Mei Zhou, 2019, Proceedings of the 2019 4th International Conference on Modern Management, Education Technology and Social Science (MMETSS 2019))
- Fostering Digital Business Model Innovation: The Role of Government Venture Capital in High‐Tech SMEs(Jiaxin Liu, Hang Yin, Bowen Liu, 2026, Creativity and Innovation Management)
- Annual report text’s positive tone and corporate green innovation: Evidence from China(Yange Gao, Jian Feng, 2024, PLOS ONE)
- The Effect of Annual Report Readability on Analysts’ Earnings Forecast Accuracy(Seung Jae Lee, 2025, Korean Accounting Information Association)
- Trends of CEO Messages in Corporate Sustainability Reports: Text Mining and CONCOR Analysis(Yoojin Shin, Hyejin Lee, 2026, Sustainability)
- Mining the emotional information in the audio of earnings conference calls: A deep learning approach for sentiment analysis of securities analysts' follow-up behavior(Chen Yuan, Dongmei Han, Xiaofeng Zhou, 2023, International Review of Financial Analysis)
- Future Financial Impact Analysis from Sentiment and Indicators Analysis(Oleksii Ivanov, V. Kobets, 2025, Computational Economics)
- Exploring gender stereotypes in financial reporting: An aspect-level sentiment analysis using big data and deep learning(Fabiola Jeldes, Tiago Ferreira, David Díaz, Rodrigo Ortiz, 2024, Heliyon)
- Does the annual report readability improve corporate R&D investment? Evidence from China(Ya-Guang Du, S. Li, Nan‐Ting Kuo, Danielle Li, 2024, International Journal of Disclosure and Governance)
- Legitimating Financial Deficit in Corporate Annual Reports: Evidence in the U.S.(Xiaoling Zhong, Y. Zhou, 2024, Journal of Education and Culture Studies)
- The Impact of Couple's Joint Holdings on the Annual Report’s tone: An Analysis Based on Text Information(Qianqian Zhu, Peiyao Hu, J. Han, 2023, Highlights in Business, Economics and Management)
- Revealing Insights: Sentiment Analysis of Indian Annual Reports(Chaithra Chaithra, Biju R. Mohan, 2024, 2024 3rd International Conference for Innovation in Technology (INOCON))
- Does the linguistic complexity of annual reports affect the corporate leasing decision?(Danlin Chi, Hasibul Chowdhury, Nicolas Eugster, Jiayi Zheng, 2025, Journal of Financial Research)
- Performance pressure and annual report text manipulation: Evidence from China(Yanxi Li, Delin Meng, Lan Wang, 2024, Business Ethics, the Environment & Responsibility)
- Are Delayed Annual Reports a Sign of Bad News? A Predictive Analysis Using Machine Learning And Sentiment Analysis(Abdullah Kursat Merter, Yavuz Selim Balcıoğlu, Sedat Çerez, Gökhan Özer, 2024, International Conference on Eurasian Economies)
- Identifying greenwashing in corporate‐social responsibility reports using natural‐language processing(Nina Gorovaia, Michalis Makrominas, 2024, European Financial Management)
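The two quantities that recur throughout this group, tone and readability, can be illustrated with a minimal Python sketch. It is not taken from any of the listed papers. The word lists are tiny hypothetical stand-ins for a full finance sentiment dictionary, the syllable counter is a crude vowel-group heuristic, and the function names are assumptions made for illustration. Tone is computed as (positive - negative) / (positive + negative), and readability uses the Gunning Fog formula 0.4 * (words per sentence + 100 * share of words with three or more syllables).

```python
import re

# Hypothetical mini word lists, for illustration only; empirical studies
# typically rely on a full finance-specific sentiment dictionary.
POSITIVE = {"growth", "improve", "improved", "strong", "gain"}
NEGATIVE = {"loss", "decline", "declined", "weak", "risk"}


def tokenize(text):
    """Lowercase the text and return its alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def net_tone(text):
    """Net tone in [-1, 1]: (pos - neg) / (pos + neg), or 0.0 if no hits."""
    tokens = tokenize(text)
    pos = sum(t in POSITIVE for t in tokens)
    neg = sum(t in NEGATIVE for t in tokens)
    return (pos - neg) / (pos + neg) if pos + neg else 0.0


def count_syllables(word):
    """Crude syllable estimate: number of vowel groups, at least 1."""
    return max(1, len(re.findall(r"[aeiouy]+", word)))


def fog_index(text):
    """Gunning Fog index; higher values indicate harder-to-read text."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    tokens = tokenize(text)
    if not sentences or not tokens:
        return 0.0
    complex_words = sum(count_syllables(t) >= 3 for t in tokens)
    return 0.4 * (len(tokens) / len(sentences)
                  + 100 * complex_words / len(tokens))
```

Scores like these are usually computed per report or per section (for example, the MD&A) and then used as explanatory variables in a regression. A real pipeline would also need language-specific tokenization, which matters for the Chinese annual reports studied in several of the entries above.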
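Financial Performance Prediction, Fraud Detection, and Intelligent Risk Auditing
This group of studies combines textual semantic features with financial indicators, using machine learning and intelligent auditing frameworks to build models for financial fraud detection, financial distress early warning, and market trend prediction.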
- Big data analytics for financial auditing practices; Identification of conceptual patterns, implications and challenges using text mining(K. Musunuru, 2024, Contaduría y Administración)
- Financial Statement Fraud Detection using Analysis of Corporate Social Responsibility Disclosure and Annual Report Readability with Earnings Management as Moderating Variable(B. L. Handoko, Agata Laras Permana Gita Prastiwi, 2022, Proceedings of the 9th International Conference on Management of e-Commerce and e-Government)
- An Analysis on Financial Statement Fraud Detection for Chinese Listed Companies Using Deep Learning(Xiuguo Wu, Shengyong Du, 2022, IEEE Access)
- Application of Machine Learning for Fraud Detection in Corporate Annual Financial Reports(M. Syahrudin, Jusra Tampubolon, Fuad Fazil Osmanov, 2025, Journal of Investigative Auditing & Financial Crime)
- Leveraging Rough Set Theory to Enhance the Performance of Financial Statement Fraud Detection Model(Chengzhi Niu, Yougan Zhu, Hong He, Boyang Li, Yijia Ren, 2024, Proceedings of the 2024 4th International Conference on Artificial Intelligence, Big Data and Algorithms)
- A Technology-Driven Auditing Framework for Standardizing ESG Reporting Across Global Disclosure Systems(Anshor Alfayed Tanjung, Jevon Nainggolan, I. Muda, R. Regin, M. Thariq, 2025, AVE Trends in Intelligent Social Letters)
- Financial text analysis and credit risk assessment using a GPT-4 and improved BERT fusion model(H. Tan, Y. Xie, 2025, PLOS One)
- Algorithmic Models and Practical Applications in Intelligent Financial Systems(Man Zhou, Chunhui Bi, Jie Chen, 2025, Proceedings of the 2025 2nd International Conference on Digital Economy, Blockchain and Artificial Intelligence)
- Research and Empirical Evidence of Machine Learning based Financial Statement Analysis Methods(Yaotang Fan, 2024, Scalable Computing: Practice and Experience)
- Developing a Novel Audit Risk Metric Through Sentiment Analysis(Xiao Wang, Feng Sun, Min Gyeong Kim, H. Na, 2025, Sustainability)
- An analysis on Financial Statement Fraud Detection for Listed Companies using DCNN-LSTM-AE-AM Model(Rajib Bhattacharya, Rakesh Kumar, U. Rajeswari, Jagtap Aparna Prakash, Neha Barodia, Sushma Sawadatkar, 2024, 2024 Asian Conference on Intelligent Technologies (ACOIT))
- Leveraging Large Language Models in Financial Statement Fraud Detection of Listed Companies(Changhao Song, Min Liu, Chuanghao Dong, Lu Zhang, Changjian Fang, 2025, 2025 Thirteenth International Conference on Advanced Cloud and Big Data (CBD))
- Is It Worth the Effort? Considerations on Text Mining in AI-Based Corporate Failure Prediction(Tobias Nießner, Stefan Nießner, M. Schumann, 2023, Information)
- Fine-Grained Sentiment Analysis for Enhanced Financial Distress Prediction(Ming Zhang, Jiazhen Chen, V. Palade, 2024, 2024 IEEE 3rd International Conference on Electrical Engineering, Big Data and Algorithms (EEBDA))
- Financial Forecasting Using Character N-Gram Analysis and Readability Scores of Annual Reports(Matthew Butler, Vlado Keselj, 2009, Lecture Notes in Computer Science)
- Predicting Stock Market Trends Through a Hybrid Machine Learning Framework That Combines Technical Indicators with News Sentiment Analysis(Jianjiang Li, 2025, Applied and Computational Engineering)
- Temporal Evolution of Sentiment in Earnings Calls and Its Relationship with Financial Performance(Zhuxuanzi Wang, Toan Khang Trinh, Wenbo Liu, Chenyao Zhu, 2025, Applied and Computational Engineering)
- Crypto Accounting Market Dynamics: Advanced Econometric Analysis of Earnings Impact with BERT-Powered Gen AI Models(Karina Kasztelnik, Steven Campbell, Eva Jermakowitz, 2025, Journal of Global Awareness)
- Semantic Graph Learning for Trend Prediction from Long Financial Documents(Bolun Xia, Aparna Gupta, Mohammed J. Zaki, 2024, 2024 IEEE Symposium on Computational Intelligence for Financial Engineering and Economics (CIFEr))
- Security exchange commission forms K-10 filings – Positive and negative word occurrence dataset 1995–2008(Piotr Staszkiewicz, Richard Staszkiewicz, 2022, Data in Brief)
- Financial Statement Analysis Based on RNN-RBM Model(Fang Liu, 2024, Journal of Electrical Systems)
- Estimating Profitability Decomposition Frameworks via Machine Learning: Implications for Earnings Forecasting and Financial Statement Analysis(Oliver Binz, Katherine Schipper, Kevin R. Standridge, 2025, Journal of Accounting and Economics)
- Leveraging Large Language Models and Prompt Settings for Context-Aware Financial Sentiment Analysis(Rabbia Ahmed, Sadaf Abdul Rauf, Seemab Latif, 2024, 2024 5th International Conference on Advancements in Computational Sciences (ICACS))
- Study of the Impact of Annual Report Text Tone on Corporate Financing Constraints(Mingjie Wu, 2023, Highlights in Business, Economics and Management)
- Self-Supervised Learning for Financial Statement Fraud Detection with Limited and Imbalanced Data(Jianlin Lai, Anzhuo Xie, Hanrui Feng, Yi Wang, Ruoyi Fang, 2025, Proceedings of the 4th International Conference on Artificial Intelligence and Intelligent Information Processing)
- Financial Statement Fraud Detection Through an Integrated Machine Learning and Explainable AI Framework(Tsolmon Sodnomdavaa, Gunjargal Lkhagvadorj, 2025, Journal of Risk and Financial Management)
- Enhancement of fraud detection for narratives in annual reports(Yuh-Jen Chen, Chun-Han Wu, Yuh-Min Chen, Hsi Li, Huei-Kuen Chen, 2017, International Journal of Accounting Information Systems)
- Research on financial fraud detection by integrating latent semantic features of annual report text with accounting indicators(Weilong Liu, Zhongguo Wang, Xv Zhang, 2025, Journal of Accounting & Organizational Change)
- Attention-Driven Dual-Level Cost-Sensitive Stacking for Financial Statement Fraud Detection(Matin N. Ashtiani, B. Raahemi, 2025, 2025 International Conference on Advanced Machine Learning and Data Science (AMLDS))
- AI-Driven Accounting Automation: Leveraging NLP for Financial Document Processing(Namanyay Goel, S.P.N. Singh, 2025, International Journal of Research in Modern Engineering & Emerging Technology)
- Managerial long‐termism and corporate innovation: Evidence from China through text mining approaches(Ning Xu, Di Zhang, Yingjie Bai, 2023, Business Strategy and the Environment)
- FINANCIAL STATEMENT ANALYSIS OF HEROMOTOCORP(S. Kumar, M. Rajeshwar Reddy, R. Gowthami, 2025, Journal of Science Engineering Technology and Management Sciences)
- FinReflectKG: Agentic Construction and Evaluation of Financial Knowledge Graphs(Abhinav Arun, Fabrizio Dimino, T. Agarwal, Bhaskarjit Sarmah, Stefano Pasquali, 2025, International Conference on AI in Finance)
- Mining Executive Compensation Data from SEC Filings(Chengmin Ding, Ping Chen, 2006, 22nd International Conference on Data Engineering Workshops (ICDEW'06))
- Data Asset Information Disclosure and Capital Market Efficiency: Empirical Research Based on Big Data Text Mining(Shuhui Wang, 2024, Membrane Technology)
- Corporate digitalization and green innovation: Evidence from textual analysis of firm annual reports and corporate green patent data in China(Liting Fang, Zhaohua Li, 2024, Business Strategy and the Environment)
- AI-Augmented Compliance Adaptive Risk Intelligence for Detecting Emerging Financial Crime Patterns in Multinational Corporations(Ogunkola Michael, 2025, International Journal of Scientific Research and Modern Technology)
- Fine-tuning ClimateBert transformer with ClimaText for the disclosure analysis of climate-related issues in corporates’ financial and non-financial reports(E.C. Garrido-Merchán, Cristina González-Barthe, M. Coronado-Vaca, 2026, Neural Computing and Applications)
- Equity Research Report-Driven Investment Strategy in Korea Using Binary Classification on Stock Price Direction(Poongjin Cho, Ji Hwan Park, Jae Wook Song, 2021, IEEE Access)
- Unlocking Financial Statement Fraud Detection: Tracking Disclosure Changes via Representation Learning(Yue Yu, Zhen Wu, Yanni Han, Zhuoqun Li, Wenqi Wei, 2025, ICASSP 2025 - 2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP))
- Beyond Compliance: Multi-Dimensional Text Mining Analysis of Corporate Sustainability Reporting(Irge Sener, Ahmet Anıl Karapolatgil, 2025, Amfiteatru Economic)
- Topic Trends in Sustainability Disclosure of German DAX 40 Companies—A Text Mining-Based Analysis(Tobias Contala, A. Gerk, Johannes Hoettler, Ricardo Buettner, 2024, IEEE Access)
- Speech emotion recognition and text sentiment analysis for financial distress prediction(P. Hájek, Michal Munk, 2023, Neural Computing and Applications)
- SGFNet: A semantic graph-based multimodal network for financial invoice information extraction(Shun Luo, Juan Yu, 2024, Expert Systems with Applications)
- Evaluating the Impact of Digital Transformation and Sustainability Strategies on Earnings Management: A Text Mining Approach(A. Wibowo, Iis Istianah, Nia Pramita Sari, Dovi Septiari, 2024, Asia Pacific Fraud Journal)
- Research on Financial Statement Analysis Methods Based on Machine Learning(Xinyu Zhu, Leyan Jiang, Yixin Gao, Yuqian Yin, 2023, Proceedings of the 2023 3rd Guangdong-Hong Kong-Macao Greater Bay Area Artificial Intelligence and Big Data Forum)
- Text Mining and Machine Learning on 10-K Risk Factors and Net Income: Evidence from Apple(Sherry Huang, Shi-Ming Huang, 2025, International Journal of Computer Auditing)
- Exploration of the relationship between SDGs and CSR reports with text mining techniques for stock exchange companies in Taiwan(Tai-Kuei Yu, Jeou-Shyan Horng, I. Chang, Chih‐Hsing Liu, Sheng-Fang Chou, Tai-Yi Yu, 2025, Discover Computing)
- AI Focus and Organizational Agility: A Text Mining Analysis of KOSPI 200 Corporate Annual Reports(Byeong-Gi Kim, Han-Gyun Woo, 2025, Journal of Korea Technology Innovation Society)
- Conceptual ascendant feature extraction of a financial corpus(Ali Mohamed Al-Jaoua, J. M. Alja'am, Helmi Hammami, Fethi Ferjani, Firas Laban, N. Semmar, H. Essafi, S. Elloumi, 2010, 2010 IEEE International Conference on Progress in Informatics and Computing)
- The Circular Economy in Corporate Reporting: Text Mining of Energy Companies’ Management Reports(Márcia R. C. Santos, A. Rolo, Dulce Matos, L. Carvalho, 2023, Energies)
- Financial Statement Fraud Detection via Large Language Models(Zehra Erva Ergun, Emre Sefer, 2025, Intelligent Systems in Accounting, Finance and Management)
- Construction of Financial Statement Analysis and Prediction Model Based on Deep Learning Algorithm(Jingjing Liu, Yanwei Li, Degang Yang, 2024, Proceeding of the 2024 5th International Conference on Computer Science and Management Technology)
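Other Macro-Level and Special-Topic Studies
This group of studies examines the role of financial reports in the macro regulatory environment, growth-rate forecasting, and inter-firm network structures, offering complementary cross-domain perspectives.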
- Prediction of Growth Rate of Operating Income Using Securities Reports(Tetsuya Nakatoh, H. Amano, S. Hirokawa, 2013, 2013 Second IIAI International Conference on Advanced Applied Informatics)
- Impact of EU non-financial reporting regulation on Spanish companies’ environmental disclosure: a cutting-edge natural language processing approach(Javier Villacampa-Porta, M. Coronado-Vaca, E.C. Garrido-Merchán, 2025, Environmental Sciences Europe)
- Study on the Relationship Between Corporate Culture and Corporate Performance in IT industry in Japan via Text Mining(Eri Uchida, Yasunobu Kino, 2023, Procedia Computer Science)
- Lazy Network: A Word Embedding‐Based Temporal Financial Network to Avoid Economic Shocks in Asset Pricing Models(George Adosoglou, Seonho Park, Gianfranco Lombardo, S. Cagnoni, P. Pardalos, 2022, Complexity)
- A Retrieval-Augmented Multi-Agent System for Financial Statement Analysis(2024, Journal of Computational Analysis and Applications)
- How does managerial perception of uncertainty affect corporate investment during the COVID-19 pandemic: A text mining approach(Ying Chen, Yosuke Kimura, Kotaro Inoue, 2024, Pacific-Basin Finance Journal)
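The survey shows that applications of natural language processing to financial report analysis have evolved into a systematic methodology centered on deep learning and large language models. Current research rests on four pillars: (1) high-precision information extraction and structured analysis based on deep models; (2) narrative analysis of report sentiment and readability and their effects on market signals and disclosure quality; (3) predictive modeling that combines textual information with quantitative financial data, covering fraud detection, compliance early warning, and financial forecasting; and (4) supplementary studies on special topics such as the macro regulatory environment and inter-firm relationship networks. Overall, the field is moving from traditional word-frequency statistics toward complex multimodal semantic reasoning and intelligent audit decision support.
152 related publications in total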
A textual analysis of corporate‐social responsibility (CSR) reports reveals that companies engaged in environmental violations report differently from firms with a clean record. The violators issue longer, more positive and more frequent reports to relay environmental content that is more copious but less readable. The violator firms appear to modify their reporting practices right after committing a violation. The findings suggest that culpable firms exploit the current unregulated–unaudited state of CSR reporting as a means of greenwashing and call for institutional change. Our results are robust to a number of industry‐firm characteristics, including board composition, ownership dispersion and international presence.
Financial reports are often lengthy, complex, and filled with domain-specific jargon, making it difficult for analysts and stakeholders to extract key insights efficiently. This study proposes an automated summarization system using Natural Language Processing (NLP) techniques to generate concise and coherent summaries of financial reports. The system employs a two-stage summarization architecture combining extractive and abstractive methods based on Transformer models such as BART, PEGASUS, and T5. Evaluation on simulated financial document datasets demonstrates that the hybrid two-stage model achieves the highest ROUGE scores and information retention rates compared to single-model baselines. The results indicate that NLP-driven summarization can significantly reduce analysts’ workload and improve financial decision-making speed.
Financial reports contain not only numerical data but also narratives that reflect managerial perceptions, strategies, and communications. This study aims to develop a digital text analytics approach to interpret accounting language in annual reports of companies on the Indonesia Stock Exchange (IDX) for the 2020–2024 period. Using Natural Language Processing (NLP) methods, this study analyzed 50 annual reports in the Management Discussion and Analysis (MD&A) section. The analysis stages included text pre-processing, tokenization, sentiment analysis, and topic modeling using the Latent Dirichlet Allocation (LDA) approach. The results show that financial report narratives in Indonesia tend to have a highly optimistic tone, although this does not always align with actual financial performance. The topic analysis revealed three main narrative patterns: (1) growth and strategy narratives, (2) risk and mitigation narratives, and (3) social responsibility narratives. This study provides a theoretical contribution to the development of a computational linguistics approach in accounting and offers practical implications for auditors, analysts, and regulators in assessing the transparency of financial information.
With the rapid development of artificial intelligence technology, intelligent financial systems are gradually replacing traditional financial management models. This research explores the practical applications of commonly used algorithmic models in intelligent financial systems, focusing on financial forecasting, risk assessment, anomaly transaction detection, and automated decision support. Experimental results show that deep learning-based prediction models improved accuracy by 23.7% compared to traditional statistical methods; the integration of knowledge graphs and natural language processing technology enhanced financial report analysis efficiency by 35%; and intelligent risk assessment models can identify financial risks 8-12 days before risk events occur. Comprehensive empirical analysis confirms that the rational application of algorithmic models can significantly improve financial management efficiency, reduce operational costs, and enhance enterprise decision-making intelligence.
Advancing growth and internationalisation in the financial system increased the contribution of multinational corporations (MNCs) to becoming the most vulnerable target of advanced financial crimes. Rule-based compliance environments cannot keep up with new threats, which usually results in violations of regulations, loss of finances, and reputational harm. This conceptual research paper will examine how Artificial Intelligence (AI) could offer an effective compliance process and highlight emerging trends of financial crimes through the combination of Artificial Intelligence (AI) and adaptive risk intelligence. Based on the existing theories of Fraud Triangle, Enterprise Risk Management (ERM), and Adaptive Systems Theory, the paper suggests an AI-augmented compliance framework integrating such technologies as machine learning, natural language processing and anomaly detection. It also looks at some critical challenges, such as data privacy, regulatory limitations, a lack of standardization, and AI systems' explainability. The report shows the advantages MNCs could enjoy, such as active search of risks, enhanced worldwide compliance administration, and smart utilization of resources. The paper has established the importance of dynamic and self-learning compliance architectures that can balance technology developments and regulatory requirements through a thorough literature review and theoretical underpinnings. It ends with strategic propositions of ethical and scalable application of AI in cross-border compliance functions. Nevertheless, with the supplement of AI-powered adaptive systems, there is a chance to redesign how financial crimes are detected and raise corporate resilience in the fast-changing, highly risky market environment.
No abstract available
As enterprises grapple with vast volumes of financial data and report artifacts, finding relevant reports in legacy and cloud-based business intelligence systems becomes increasingly complex. This article explores how machine learning (ML) can be used to enable intelligent metadata classification and clustering of financial reports to support dynamic, user-centric discovery. By applying natural language processing (NLP), unsupervised learning algorithms, and user interaction analytics, organizations can shift from static folder-based hierarchies to adaptive, recommendation-driven interfaces that improve efficiency, security, and decision-making.
Public companies in the US stock market must annually report their activities and financial performances to the SEC by filing the so‐called 10‐K form. Recent studies have demonstrated that changes in the textual content of the corporate annual filing (10‐K) can convey strong signals of companies’ future returns. In this study, we combine natural language processing techniques and network science to introduce a novel 10‐K‐based network, named Lazy Network, that leverages year‐on‐year changes in companies’ 10‐Ks detected using a neural network embedding model. The Lazy Network aims to capture textual changes derived from financial or economic changes on the equity market. Leveraging the Lazy Network, we present a novel investment strategy that attempts to select the least disrupted and stable companies by capturing the peripheries of the Lazy Network. We show that this strategy earns statistically significant risk‐adjusted excess returns. Specifically, the proposed portfolios yield up to 95 basis points in monthly five‐factor alphas (over 12% annually), outperforming similar strategies in the literature.
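The year-on-year comparison of filings described in this abstract can be illustrated with a minimal sketch. It is not the Lazy Network itself: the paper detects changes with a neural network embedding model, whereas this example substitutes plain bag-of-words cosine similarity as a stand-in. The function names and the sample texts are hypothetical.

```python
import math
import re
from collections import Counter


def bag_of_words(text: str) -> Counter:
    """Lowercase the text and count alphabetic tokens."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def yearly_change(filings: dict[int, str]) -> dict[int, float]:
    """Map each year to 1 - similarity with the previous year's filing."""
    years = sorted(filings)
    return {
        year: 1.0 - cosine_similarity(bag_of_words(filings[prev]),
                                      bag_of_words(filings[year]))
        for prev, year in zip(years, years[1:])
    }


# Hypothetical excerpts standing in for full 10-K text (assumption).
filings = {
    2019: "We face competition and supply chain risk.",
    2020: "We face competition and supply chain risk.",
    2021: "A pandemic disrupted operations and demand fell sharply.",
}
changes = yearly_change(filings)
```

In this toy data the 2020 filing repeats the 2019 text, so its change score is close to zero, while the 2021 filing shares almost no vocabulary with 2020 and scores much higher. A firm with persistently low scores would be a candidate for the "least disrupted" group the abstract refers to.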
The exponential growth of environmental, social, and governance (ESG) reporting has resulted in a bewildering variety of frameworks and mixed data quality. There is a possibility that greenwashing and comparability concerns for stakeholders will arise in the absence of a single auditing standard. This is because firms report on their non-financial performance more frequently. This study presents a novel technology-driven auditing system to integrate multiple global reporting standards into a single compliance metric. In this article, Natural Language Processing and statistical scoring methods are utilised to evaluate the credibility of self-reported environmental, social, and governance (ESG) information against global standards. The assessment is based on a sample of 477 different CSR reports. To identify gaps in governance transparency and social responsibility indicators, the research uses Python-based text-mining algorithms and domestically built scoring matrices. The findings suggest that, despite environmental reporting progressing towards maturity, the social and governance features remain inadequately adapted to subjective interpretation and inconsistency. The proposed method offers a standardised framework for external auditors to assess environmental, social, and governance (ESG) assertions, thereby contributing to greater openness and accountability in the capital market.
This study aims to address the limited understanding of how linguistic signals in corporate disclosures influence firm performance in emerging economies, where unique ownership structures, market concentration and policy uncertainties shape disclosure practices. By exploring these dynamics, the study provides insights into the complex interplay between communication strategies and financial outcomes. The study uses annual reports to extract the embedded tones/signals using natural language processing (NLP) techniques. Specifically, this paper performs the sentence segmentation, preprocessing and parsing of textual content of the annual reports using Gensim, Spacy and the Regular Expression package in Python at several phases of parsing the text. Furthermore, we test our proposed hypotheses using a panel data regression approach. The findings report that uncertain, litigious and financial constraint signals have a negative and significant effect on the firm’s financial outcomes. It highlights that the uncertain or litigation-related content in the annual report disclosure deteriorates the financial outcomes. Most importantly, the study reveals that promoter ownership and the Herfindahl index positively moderate the relationship between embedded signals and financial performance. However, policy uncertainties and business group (BG) affiliation negatively moderate the relationship between embedded signals and financial performance. This study uniquely contributes to the literature by employing advanced NLP techniques to decode embedded signals in corporate disclosures and assess their impact on financial performance. It provides novel insights into the moderating roles of promoter ownership, market concentration, EPU and BG – factors that are especially pertinent in the context of emerging economies.
This study introduces an innovative approach for quantifying the technostress phenomenon, drawing on textual narratives from the firm’s annual report. Based on a dataset covering the Standard and Poor’s 500 (S&P 500) index firms, we analyze 2532 10-K annual reports and highlight the key contributors of technostress across six different dimensions of technostress using a combined score. A major advantage of the new six-dimensional scoring framework is that it offers a set of objective metric proxies to capture technostress without bias, utilizing a refined list of 42 key clues derived through factor analysis. Also, it adopts natural language processing, revealing hidden patterns and anomalies that indicate technostress. We further validate this framework by applying fixed-effect regression models to examine the impact of technostress on productivity. The main results imply that the four technostress dimensions presented in techno-risks, insecurity, uncertainty, and invasion negatively impact firms’ productivity. This framework offers practical implications for firms, allowing them to generate a rich profile concerning the degree of technostress associated with existing practices, highlighting the crucial need for advanced interventions, facilitating comparisons with other firms from the same or different industries, as well as cross-country comparisons.
The advancement of digital technology plays a critical role in shaping business model transformation among high‐tech SMEs. Government venture capital (GVC) provides both financial and strategic resources for digital business model innovation (DBMI) and has attracted growing scholarly attention. Drawing on resource‐based theory, this study develops a theoretical framework to examine how GVC influences DBMI in high‐tech SMEs, with particular attention to the GVC certification hypothesis and resource mechanisms. We measure DBMI in Chinese listed high‐tech SMEs using a text analysis approach. This involves constructing and expanding a DBMI lexicon with word vectors based on natural language processing, preprocessing annual report texts, and calculating DBMI levels based on keyword frequencies. Investment data on GVC participation are collected to form an unbalanced panel dataset, and a multiperiod difference‐in‐differences (DID) model is employed to estimate the direct, mediating, and heterogeneous effects of GVC on DBMI. In addition, a PSM‐DID approach is applied to further test the robustness of the results. The empirical results indicate that GVC significantly enhances DBMI in high‐tech SMEs. This effect is particularly pronounced in firms characterized by high ownership concentration, nonstate ownership, and CEO nonduality. Furthermore, GVC promotes DBMI by alleviating financing constraints and enhancing firms' intellectual capital. This study contributes to resource‐based theory by elucidating the role of digital technology in business model innovation and highlighting the importance of GVC in fostering innovation and growth among high‐tech SMEs. The findings offer important theoretical and practical implications for policymakers and business managers, helping high‐tech SMEs leverage policy support to advance DBMI.
No abstract available
This research proposes and examines an investment strategy that combines natural language processing of equity research reports published in the Korean financial market with machine learning algorithms for binary classification. First, we extract part-of-speech tags from the reports using KoNLPy and Mecab. Then, we define 33 features as the input variables and perform binary classification of the price direction of the stocks recommended in the reports using various machine learning algorithms. We investigate model performance in detail by dividing the entire period into three sub-periods: pre-COVID-19 for the sideways market, COVID-19 for the crashing market, and post-COVID-19 for the extremely bullish market. We confirm that the random forest is the best classifier for all periods, so we use the stocks it predicts as positive in the test set as the investment universe for monthly rebalancing and buy-and-hold investment. The proposed strategy shows a significantly higher return on investment than the benchmarks during the pre-COVID-19 and COVID-19 periods, whereas its return is comparable during the post-COVID-19 period.
No abstract available
Earnings calls are among the most important channels through which companies report their financial performance, strategy, and prospects to investors, analysts, and regulators. Unlike regulatory filings, which are usually static and highly formal, earnings calls are two-way communications whose undertones, language choices, and emotions can materially affect market perception. Recent advances in natural language processing (NLP) allow researchers to measure such sentiment quantitatively, yielding predictive signals about how markets will react. The paper analyzes the extent to which sentiment extracted from earnings call transcripts can predict and explain short-term stock volatility. We train both lexicon-based and transformer-based deep learning models, including FinBERT, to learn sentiment dimensions including positivity, negativity, uncertainty, and litigious tone. Volatility is measured as realized volatility computed from intraday prices around each earnings call. Regression-based models and machine learning classifiers are then employed to find predictive relationships. The findings indicate that the more sophisticated NLP models are more effective than dictionary-based methods, and that uncertainty and negative tone are closely associated with volatility. The work contributes to financial text analytics by addressing gaps in the study of how narrative disclosure analysis interacts with market risk modeling, with practical and theoretical implications for investors, analysts, and policymakers.
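The abstract does not spell out its volatility formula. A common definition of realized volatility is the square root of the sum of squared intraday log returns over the event window, and the sketch below assumes that definition. The function name and the price series are hypothetical.

```python
import math


def realized_volatility(prices: list[float]) -> float:
    """Square root of the sum of squared log returns between consecutive prices."""
    if len(prices) < 2:
        raise ValueError("need at least two prices")
    log_returns = [math.log(p1 / p0) for p0, p1 in zip(prices, prices[1:])]
    return math.sqrt(sum(r * r for r in log_returns))


# Hypothetical intraday prices (assumption), one quiet day and one earnings-call day.
calm_day = [100.0, 100.1, 100.0, 100.2, 100.1]
event_day = [100.0, 101.5, 99.0, 102.0, 100.5]

rv_calm = realized_volatility(calm_day)
rv_event = realized_volatility(event_day)
```

A per-call value such as `rv_event` would then serve as the dependent variable when regressing volatility on the sentiment scores.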
No abstract available
No abstract available
No abstract available
No abstract available
Corporate disclosure has become more descriptive and less quantitative over time. As a result, textual analysis has gained popularity in finance and business; however, it requires massive computing power. The paper presents a panel dataset of the raw frequencies of positive and negative words across 90,463 Forms 10-K filed with the Securities and Exchange Commission (SEC) in EDGAR (the Electronic Data Gathering, Analysis, and Retrieval system) over the period 1995–2008. The dataset consists of 456 variables. The texts of the forms were retrieved from the SEC servers and processed using text mining techniques. The data are relevant for archival analysis of the sentiment of the financial statements and financial reporting of SEC registrants, and can be reused to create tone or sentiment indexes. The long time series allows for dynamic analysis, and the dataset reduces the computing power required for further research.
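Raw positive and negative word counts of this kind are often condensed into a single tone index. The following is a minimal sketch, assuming a net-tone definition of (positive - negative) / (positive + negative). The two word lists are tiny hypothetical stand-ins, not the dictionary used to build the dataset.

```python
import re
from collections import Counter

# Hypothetical mini word lists for illustration only (assumption).
POSITIVE = {"growth", "improve", "strong", "gain"}
NEGATIVE = {"loss", "decline", "impairment", "litigation"}


def tone_counts(text: str) -> tuple[int, int]:
    """Return (positive, negative) word counts for a piece of text."""
    tokens = Counter(re.findall(r"[a-z]+", text.lower()))
    pos = sum(tokens[w] for w in POSITIVE)
    neg = sum(tokens[w] for w in NEGATIVE)
    return pos, neg


def net_tone(pos: int, neg: int) -> float:
    """Net tone in [-1, 1]; 0.0 when no tone words are present."""
    total = pos + neg
    return 0.0 if total == 0 else (pos - neg) / total


sample = ("Strong growth offset the impairment loss; margins improve "
          "despite litigation and a decline in volume.")
pos, neg = tone_counts(sample)
tone = net_tone(pos, neg)
```

With precomputed frequencies such as those in the dataset, only `net_tone` is needed, which is where the savings in computing power come from.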
No abstract available
This study investigates how organisations respond to Corporate Sustainability Reporting Directive (CSRD) requirements through systematic analysis of corporate communications and sustainability reports. We employ multi-dimensional text mining analysis of 500 companies across 12 industries, using lexical analysis, sentiment analysis
Purpose This paper aims to explore how different textual features of environmental information disclosure (EID) individually and synthetically affect corporate green innovation (CGI) from the comprehensive perspective of information quality and tone. Design/methodology/approach Chinese A-share listed manufacturing enterprises from 2010 to 2022 are research objects. Based on constructing the evaluation framework of EID quality with more fine-grained evaluation methods and capturing intonation features by applying text mining technologies, this study uses the Poisson regression model to investigate the influence mechanism of EID textual features on CGI. Findings The results indicate that both the quality and tone of EID have positive influences on CGI. There exists a positive interactive effect between EID quality and tone in promoting CGI. Mechanism tests demonstrate that EID can affect CGI through stakeholders’ resource effects, including increasing environmental subsidies, easing financing constraints and enhancing green reputation. Government environmental penalties can weaken the positive effects of EID textual features on CGI. The relationship between EID and CGI is distinct concerning whether enterprises are heavily polluted. Originality/value First, this study innovatively investigates the synergy of EID quality and tone on CGI from the view of information decision-making usefulness and reveals the influence mechanism of EID textual features on CGI from the perspective of stakeholders’ resource effect, which can advance the understanding of the signaling roles of EID textual features. Second, the proposed novel fine-grained measurement of EID quality with text mining technologies of this paper can improve the accuracy and specificity compared with the traditional discrete scoring method.
This study addresses the growing imperative for environmentally responsible supply chain management in Southeast Asia and the challenges of assessing corporate sustainability disclosures. Although companies increasingly produce sustainability reports, the extent to which these documents reflect genuine green practices remains unclear. This research systematically evaluates how five major Southeast Asian firms, including Unilever SEA, Nestlé Indonesia, Indofood, Danone, and ThaiBev, articulate green supply chain initiatives in reports published between 2022 and 2023. Employing a qualitative exploratory design, the study integrates document analysis with text mining and thematic coding; approximately 33,000 words from the five reports were processed, yielding 1,300 occurrences of green supply chain terms categorized into three themes: eco-packaging, green logistics, and carbon tracking. The results reveal a pronounced imbalance: eco-packaging comprised 54 percent of keywords (n = 702), green logistics 29 percent (n = 377), and carbon tracking 17 percent (n = 221). Unilever’s 9,300-word report contained 350 mentions of eco-packaging, while Danone’s 5,900-word report featured 310; carbon tracking averaged under 45 references per report. The study introduces a replicable text mining framework for ESG disclosure analysis and underscores the need for more balanced reporting, including Scope 3 emissions data. Future mixed-method approaches that combine computational analysis with qualitative validation are advocated. The findings provide evidence for policymakers and investors to refine ESG guidelines and highlight the potential of computational tools to enhance corporate accountability in sustainability reporting
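The reported shares follow directly from the counts given in the abstract. A short check, using only figures stated there (the variable names are illustrative):

```python
theme_counts = {"eco-packaging": 702, "green logistics": 377, "carbon tracking": 221}
total_terms = sum(theme_counts.values())  # all green supply chain term occurrences

# Share of each theme among all occurrences, in whole percent.
shares = {theme: round(100 * n / total_terms) for theme, n in theme_counts.items()}

# Average carbon-tracking mentions per report (five reports).
carbon_per_report = theme_counts["carbon tracking"] / 5

# Eco-packaging mentions per 1,000 words for the two reports with stated lengths.
unilever_density = 1000 * 350 / 9300
danone_density = 1000 * 310 / 5900
```

Normalized by length, Danone's shorter report is denser in eco-packaging terms (about 52.5 per 1,000 words) than Unilever's (about 37.6), a comparison that raw counts alone would reverse.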
No abstract available
This study employs text-mining techniques to obtain information regarding corporate digital transformation disclosures and the exploitation of yearly report development. It aims to confirm the presence of information manipulation in such disclosures and its later impact on the quality of accounting information. A-share listed companies in China from 2013 to 2023 have been examined to analyze the impact of digital transformation information disclosure on accounting information quality. The regression analysis yields the following results: (1) Increased digital transformation disclosures are significantly associated with a decline in accounting information quality, (2) moderation tests reveal that robust corporate ESG governance alleviates the negative effects of digital transformation disclosures on accounting information quality, (3) mechanism analysis suggests that digital transformation disclosures contribute to heightened manipulation of the tone in annual reports, indicating potential information manipulation in corporate disclosures. The use of a one-period lag and the substitution of dependent and independent factors further support the findings of this study. Furthermore, empirical data from influence mechanism tests suggests that increased corporate ESG governance does increase the quality of accounting information by avoiding information manipulation in digital transformation reporting.
Taking listed firms from 2010 to 2023 as research samples, we measure the degree of corporate green transformation by extracting keywords from annual reports based on text mining and empirically examining green innovation's impact on corporate green transformation. The results show that green innovation can improve corporate green transformation. The moderating mechanism analysis demonstrates that when the degree of market competition increases, the relationship between green innovation and corporate green transformation is stronger. In addition, the relationship between them is more significant in areas with low degree of intellectual property protection, and the moderating effect of market competition is further enhanced. Finally, heterogeneity test shows that the positive impact of green innovation on corporate green transformation for non-state-owned firms and firms in heavily polluting industries is more significant.
No abstract available
Adhering to long‐termism is essential for sustaining and promoting corporate innovation in the face of future uncertainty. In this study, we use machine learning and text analysis methods to construct a proxy variable for managerial long‐termism and explore how managerial long‐termism affects corporate innovation using a sample of 13,117 firm‐year observations from 2010 to 2020 in China. We find managerial long‐termism has a positive effect on corporate innovation, which is mainly achieved by mitigating agency problems. The positive effect of managerial long‐termism on corporate innovation is more pronounced when an enterprise faces stronger economic policy uncertainty and has more slack resources. Additional analysis shows that managerial long‐termism accelerates exploratory innovation and contributes to the high‐quality development of enterprises. Our findings advance research on long‐termism in corporate governance and generate meaningful insights into the antecedents of corporate innovation from the perspective of time preference.
How can useful information extracted from unstructured data be used to contribute to a better prediction of corporate failure or bankruptcy? In this research, we examine a data set of 2,163,147 financial statements of German companies that are triple classified, i.e., solvent, financially distressed, and bankrupt. By classifying text features in terms of granularity and linguistic level of analysis, we show results for the potentials and limitations of approaches developed in this way. This study gives a first approach to evaluate and classify the likelihood of success of text mining approaches for extracting features that enhance the training database of AI-based solutions and improve corporate failure prediction models developed in this way. Our results are an indication that the adaptation of additional information sources for the financial evaluation of companies is indeed worthwhile, but approaches adapted to the context should be used instead of unspecific general text mining approaches.
This paper explores the implementation of the circular economy in the energy sector. The research findings contribute to our understanding of the practical application of the circular economy, enabling policymakers and stakeholders to make informed decisions and develop targeted strategies. The study analyzes 88 Portuguese companies’ reports, examining the presence of circular economy strategies and initiatives. The results reveal that energy sector companies tend to prioritize reporting their greenhouse gas reduction efforts over their circular economy strategies. The findings align with previous studies in the oil and gas industry, emphasizing the significance of sustainability reporting and potential biases in reporting practices. The study also identifies a gap between circular economy terminology and its representation in reports, indicating the need for greater incorporation of circular economy-oriented initiatives in the energy sector. The research highlights the role of technology in fostering innovation and calls for strategic alliances and knowledge sharing to drive circular economy practices. Further research is recommended to understand the barriers to implementing circular economy practices and identify effective solutions. Overall, this paper provides valuable insights for advancing the circular economy in the energy sector and achieving broader sustainability goals.
This study examined the Securities and Exchange Commission (SEC) annual reports of selected logistics firms over the period from 2006 through 2021 for risk management terms. The purpose was to identify which risks are considered most important in supply chain logistics operations. Section 1A of the SEC reports includes risk factors. The COVID-19 pandemic has had a heavy impact on global supply chains. We also know that trucking firms have long had difficulties recruiting drivers. Fuel price has always been a major risk for airlines but also can impact shipping, trucking, and railroads. We were especially interested in pandemic, personnel, and fuel risks. We applied topic modeling, enabling us to identify some of the capabilities of unsupervised text mining as applied to SEC reports. We demonstrate the identification of terms, the time dimension, and correlation across topics by the topic model. Our analysis confirmed expectations about COVID-19’s impact, personnel shortages, and fuel. It also revealed common themes regarding the risks involved in international trade and perceived regulatory risks. We conclude with the supply chain management risks identified and discuss means of mitigation.
Sustainability has become a central concern globally, and efforts to enhance it are being made across various fields. In line with this trend, corporate sustainability reports have become more widely published. These reports provide both financial and non-financial information on a company’s sustainability. In this context, this study aims to, first, analyze the main keywords contained in CEO messages. Second, it examines whether the keywords emphasized by CEOs change in response to shifts in corporate risk under economic uncertainty. Finally, it identifies how the categories of words included in these messages are classified. To address these research questions, text analysis was selected as the methodology. Specifically, a qualitative research approach using text mining and CONCOR analysis was conducted on the text from sustainability reports. According to the Term Frequency and Term Frequency-Inverse Document Frequency analyses, the most frequently occurring keywords were ESG, Sustainable, Society, Stakeholders, Growth, Environment, Effort, and Future. Centrality analysis identified the following keywords as having high centrality: Sustainable, ESG, Society, Environment, Growth, Effort, and Stakeholders. Finally, CONCOR analysis revealed four clusters: Eco-friendly Energy, ESG Management, Global Crisis, and Technological Competitiveness. This study is significant in that it analyzes the major keywords and their changes within unstructured text data using text mining and CONCOR analysis, and it suggests the possibility of future quantitative analysis of non-financial information using these keywords.
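The Term Frequency and TF-IDF ranking mentioned in this abstract can be illustrated with a minimal sketch. It assumes the CEO messages are already tokenized; the sample messages, the function name `tf_idf`, and the particular weighting (corpus-level term count times the log of inverse document frequency) are illustrative assumptions, not the study's actual data or formula.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return (term counts, TF-IDF scores) for a list of tokenized documents."""
    # Term frequency: how often each token occurs across the whole corpus.
    tf = Counter(tok for doc in docs for tok in doc)
    # Document frequency: in how many documents each token appears at least once.
    df = Counter(tok for doc in docs for tok in set(doc))
    n_docs = len(docs)
    # One common weighting variant; other TF-IDF formulas exist.
    scores = {tok: tf[tok] * math.log(n_docs / df[tok]) for tok in tf}
    return tf, scores

# Hypothetical, already-tokenized CEO messages (not data from the study).
messages = [
    ["esg", "sustainable", "growth", "society"],
    ["esg", "environment", "effort", "future"],
    ["esg", "sustainable", "stakeholders", "growth"],
]
tf, scores = tf_idf(messages)
print(tf.most_common(3))
print(sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:3])
```

Under this weighting, a word that appears in every message (here "esg") has the highest raw frequency but a TF-IDF score of zero, which is why the two rankings can differ.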
No abstract available
This study investigates whether risk factor disclosures in 10-K annual filings contain predictive signals about firms’ future financial performance. Both structured (financial statements) and unstructured (risk factor narratives) data were analyzed using text mining and machine learning techniques. The JCAATs XBRL Connector and AI audit functions were employed to streamline data extraction, sentiment analysis, clustering, machine learning modeling, and SHAP-based interpretability. Sentiment scores and textual clusters were constructed as independent variables to explain subsequent-year net income. Empirical findings demonstrate that both sentiment and textual style are significant predictors of net income, supporting the view that risk disclosures provide forward-looking information. These findings extend language signal theory to the risk factor section and underscore the practical value of JCAATs in auditing and regulatory monitoring, highlighting how AI-driven text analytics can enhance disclosure assessment and financial supervision.
No abstract available
This study investigates voluntary compliance with the Task Force on Climate-Related Financial Disclosures (TCFD) framework in 64 financial, Environmental, Social, and Governance (ESG) reports from six Spanish IBEX-35 energy firms (2020–2023) and explores the implications for intangible assets and corporate reputation, employing empirical quantitative text mining and Natural Language Processing (NLP) in Python. A validated scale-based taxonomy within the TCFD framework applies query-driven rules to extract relevant text. This enables an evaluation of aspects of the reports, facilitating the development of a compliance index measuring each company’s adherence to TCFD recommendations. All companies showed year-on-year improvements (2023 was the most comprehensive), yet none fully adhered due to information gaps. Disparities in the disclosures of Scope 1, 2, and 3 persisted, suggesting reputational risks. The study offers a replicable methodological model that generates a compliance index, assesses the ‘being’ (‘true performance’) versus ‘seeming’ (‘external perception’) dichotomy within sustainability reports, and acts as a potential reputational barometer for stakeholders. By providing unprecedented evidence of TCFD reporting in the Spanish energy sector, this study closes a significant academic gap. Future research may analyze ESG reports using AI agents, study the impact of ESG on energy-intensive companies from AI data centers, supporting services like Copilot, ChatGPT, Claude, and Gemini, and extend this methodology to other industrial sectors.
Innovation is a key force behind social and economic advancement. This study employs computational econometric methods and text mining techniques to investigate the relationship between supply chain uncertainty and dual innovation in manufacturing enterprises. Using panel data from Chinese manufacturing firms between 2019 and 2023, we construct fixed-effects regression models and moderation effect models to empirically test the interaction mechanisms. Through natural language processing algorithms, we quantify digital transformation levels by analyzing keyword frequencies in corporate annual reports. The findings reveal that supply chain uncertainty significantly impairs both exploratory and exploitative innovation. Furthermore, digital transformation demonstrates a substantial negative moderating effect, with advanced text analytics showing that higher digitalization levels mitigate the adverse impacts of supply chain uncertainty on innovation activities. The research incorporates robustness checks through bootstrap sampling and alternative variable measurements, ensuring computational reliability. These findings provide valuable insights for manufacturing enterprises to develop data-driven supply chain strategies and digital transformation initiatives.
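A keyword-frequency measure of digital transformation, as described in this abstract, might be sketched as follows. The keyword list, the function name `digital_index`, and the log(1 + count) scaling are assumptions for illustration; the abstract does not state the dictionary or the transformation used. The word-boundary matching also assumes English text, so reports in Chinese would need a different tokenization step.

```python
import math
import re

# Hypothetical keyword list; the study's actual dictionary is not given in the abstract.
DIGITAL_KEYWORDS = ("big data", "cloud computing", "artificial intelligence", "blockchain")

def digital_index(report_text, keywords=DIGITAL_KEYWORDS):
    """Log-scaled count of digital-transformation keywords in one annual report."""
    text = report_text.lower()
    total = 0
    for kw in keywords:
        # Whole-word (or whole-phrase) matches only, case-insensitive via lower().
        total += len(re.findall(r"\b" + re.escape(kw) + r"\b", text))
    # log(1 + count) dampens the effect of very long reports (an assumed choice).
    return math.log(1 + total)

sample = "We invest in Big Data and blockchain. Blockchain pilots expanded."
print(digital_index(sample))
```

The resulting firm-year index could then enter a fixed-effects regression as the moderating variable.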
Purpose of the article: Sustainable development issues, particularly Environmental, Social, and Governance (ESG) factors, are becoming increasingly relevant in corporate reporting, driven by rising environmental awareness and regulatory requirements. The aim of the study is to evaluate the ESG disclosures of selected Polish banks. The paper fills a gap in the Polish academic literature in economics and finance by analyzing the volume and size of non-financial ESG disclosures through computer text mining techniques. Methodology: This study applies text mining techniques to evaluate the ESG volume and size from 107 financial reports issued by Polish banks between 2006 and 2023. For this purpose, selected tools for computer-based analysis of textual data (text mining) are used. The primary methods include analysis of emotional attitude (sentiment), analysis of the number of words regarding ESG, and analysis of the readability of the ESG content in company reports. Results of the research: The study reveals that ESG excerpts are more neutral or less optimistic compared to integrated reports, which tend to have a more positive tone. Additionally, sustainability disclosures are written in complex language, and the volume of these reports has been increasing over time, likely due to new regulations and growing awareness of sustainability issues. The study focuses on Polish banks but suggests expanding future research to other sectors.
Digital transformation has the potential to fundamentally change companies and create value. This study examines the impact of digital transformation and sustainability strategies on the earnings quality of Indonesian listed companies in three sectors: infrastructure, transportation and logistics, and consumer non-cyclicals. Using text mining techniques to measure the intensity of digital transformation and sustainability strategies, we found that a sustainability strategy shown in the company’s annual report reflects a lower level of earnings management, especially accrual earnings management, while companies with a digital transformation strategy, particularly in artificial intelligence technology, are less likely to engage in real earnings management. The findings of this study provide insights into the effect of digital transformation and sustainability on the quality of accounting information and corporate governance, and offer implications for corporate digital transformation and government regulation.
Big data analytics and related technologies are widely practiced in the corporate world across the globe. The ability of companies to collect, store, and analyze massive amounts of data and use such data for decisions is considered critical for a firm’s success. The auditing industry lags behind, with insufficient emphasis on and practice of big data analytics. This study assumes that big data processing technologies can impact financial auditing practices positively. Data sets were mined from the literature using text mining methodology. Conceptual patterns such as Auditing, Fraud, Risk, and Security were found to be highly influential in the literature. Opinion in the literature diverges for conceptual patterns such as Auditing, Fraud, and Risk but not for Security. A few potential implications were identified under four main categories: technologies, enablers, challenges, and compliance. Digital technologies, specifically Artificial Intelligence (AI) and Blockchains, were found to be enablers of a firm’s performance and growth. Fraud detection, forensics, and legitimacy were found to be among the challenges for compliance. In addition, big data analytics (BDA) was found to be a moderating variable for technologies like Blockchains and Artificial Intelligence (AI) and for challenges such as Risk and Security, but not for Fraud.
This study delves into the utilization of text mining to scrutinize social and environmental reports of companies, showcasing its effectiveness in evaluation. It explores various text mining techniques and practically applies decision tree, k-nearest neighbors, and naïve Bayes methods. The paper offers guidance on extracting pertinent terms related to four CSR dimensions: Environment, Employee, Social responsibility, and Human rights. Results demonstrate the successful differentiation of text based on these dimensions, leveraging a CSR-relevant dictionary by Pencel and Malascue. Employing document classification techniques, the study constructs four models using distinct text mining approaches for comparative analysis. Through this research, the valuable role of text mining in assessing social and environmental disclosures is underscored, providing insights into optimizing these techniques for evaluations and emphasizing their potential to enhance understanding and decision-making in corporate social responsibility assessments. Keywords: sustainability, text mining, machine learning, Corporate Social Responsibility - CSR, environmental reports
In the era of big data, the significance of data has skyrocketed. It has transformed into a new type of asset and a significant element of production, which contributes immensely to the sustainable development and competitiveness of enterprises. This research focuses on A-share listed companies in China over the period 2007 to 2022, and uses the Word2Vec neural network model and a deep learning text mining method to investigate the effects and mechanisms of data asset information disclosure on corporate capital market efficiency. The findings indicate that: 1) data asset information disclosure noticeably enhances corporate capital market efficiency; 2) data asset information disclosure has a positive impact on corporate capital market efficiency by enhancing corporate governance, increasing institutional investors’ shareholding, and improving stock liquidity; 3) data asset information disclosure has a more pronounced improvement effect on corporate capital market efficiency in growing and mature firms, and the effect is stronger in high-technology, state-owned, media-focused, and investor-focused enterprises. The research conclusions are of great significance for improving enterprises’ data asset information disclosure, enhancing the operating efficiency of the capital market, and fostering the development of a robust, high-quality capital market.
Given the increasing importance of sustainability in the business world, there has been growing interest in using topic modeling approaches to identify reported topics in corporate sustainability reports (CSR). Due to the inconsistent legal foundation and different sustainability standards, the content of individual reports can vary greatly. In this paper, the corporate social responsibility reports of DAX 40 companies from 2017 to 2021 are therefore analyzed using Latent Dirichlet Allocation (LDA). In particular, we attempt to identify topics that are suggested by the Global Reporting Initiative (GRI) Sustainability Standard for large public companies. In addition, a comparison is made throughout the years. The study shows that specific guidelines of the GRI can only be identified to a certain degree using LDA. Although some topics that partly reflect the content of the GRI can be found by the model, the overall structure of the GRI can’t be replicated. Overall, this study shows that an evaluation of the content of sustainability reports can be successful in terms of the relevance of the reported topics, although the results depend heavily on the respective pre-processing steps.
Annual reports are the corporate documents companies publish every year. These documents contain crucial company performance information and are often analyzed manually and objectively. Investors often ignore the annual report’s qualitative data and focus only on quantitative data. The literature has demonstrated that managers’ word choices, CSR initiatives, and sentiments expressed in annual reports are related to future stock returns, earnings, and management fraud. Therefore, the study aims to observe sentiment orientation in CEO letters, Management Discussion and Analysis (MD&A), and Corporate Social Responsibility (CSR) sections and to examine the relation between sentiment and company performance. The annual reports of NSE-listed companies are considered for the study. In the proposed approach, the results of the LM dictionary-based technique, Naive Bayes, SVM, RF, LSTM, and the FinBERT model are considered to determine the final sentiment. The annual report tone is calculated and compared with the performance indicators, i.e., Return on Assets (ROA) and Return on Equity (ROE).
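A dictionary-based tone score of the kind this abstract refers to could look like the sketch below. The two word sets are tiny hypothetical stand-ins for the Loughran-McDonald (LM) positive and negative lists, and the net-tone formula (positive minus negative, divided by their sum) is a commonly used convention assumed here; the abstract does not state which tone formula the study applies.

```python
import re

# Tiny hypothetical stand-ins for the LM positive and negative word lists.
POSITIVE = {"gain", "improve", "strong", "achieve"}
NEGATIVE = {"loss", "decline", "weak", "impairment"}

def tone(text):
    """Net tone in [-1, 1]: (positive - negative) / (positive + negative)."""
    tokens = re.findall(r"[a-z]+", text.lower())
    pos = sum(tok in POSITIVE for tok in tokens)
    neg = sum(tok in NEGATIVE for tok in tokens)
    if pos + neg == 0:
        return 0.0  # no sentiment-bearing words: treat the text as neutral
    return (pos - neg) / (pos + neg)

print(tone("Strong revenue gain offset a small loss"))
```

A score computed per section (CEO letter, MD&A, CSR) can then be compared with ROA and ROE for the same firm-year.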
By utilizing web-crawling and text analysis techniques on unstructured big data (text sets), this study examines to what extent investors disagree with the sentiment conveyed in annual reports. The main empirical findings suggest that the tone of annual reports significantly influences investor opinions. Specifically, a negative tone in annual reports is associated with high levels of divergence among investors’ opinions, whereas a positive tone correlates with lower divergence. In the robustness tests, the results remain consistent after controlling for various factors. After we control for Management Discussion and Analysis (MD&A), both positive and negative tones in annual reports continue to be significant predictors of divergences in investor opinions. Additionally, after controlling for future earnings quality, future cash flows, and future earnings surprises, investors still present high/low divergence of opinion in response to a negative/positive tone in annual reports. Moreover, the robustness of our analysis is assessed by employing alternative sentiment analysis word lists.
Financial reports serve as crucial resources for investors and researchers, providing analysts’ assessments of stocks that play a vital role in stock market applications. However, detecting analysts’ opinions and sentiments in financial reports is challenging. First, the formal and professional language used in these reports makes it difficult for previous methods to comprehend domain-specific knowledge. Second, financial reports often adopt lengthy and elaborate expressions to convey rich semantics, which exposes existing methods to contextual information loss, especially for long-range dependencies. To address these problems, we propose a summary-enhanced hierarchical framework (SEHF), which leverages summary information to enhance financial report sentiment analysis. Our framework incorporates a financial bidirectional and auto-regressive transformer (FinBART), equipped with extended position encoding, to summarize lengthy report articles and capture long-range interactions. To mitigate information loss, we first divide each report into segments and then propose the hierarchical analyst sentiment representation network (ASRN), which utilizes financial bidirectional encoder representations from transformers (FinBERT), bidirectional long short-term memory (BiLSTM)-Attention, and a dendrite (DD) network to fuse information in the generated summary and report segments. Notably, FinBART and FinBERT are pretrained on large-scale financial corpora to effectively understand professional expressions. Furthermore, we construct a new dataset, large-scale Chinese financial report (LCFR), to address the lack of supervised datasets. Experimental results on LCFR and a benchmark dataset show that SEHF significantly outperforms state-of-the-art (SOTA) baselines, and the ablation study highlights the effectiveness of aggregating sentiment information in the summary and report segments.
The concept of ESG has been gaining popularity in recent years in Thailand. ESG disclosures can be studied through the annual report or Form 56-1 One Report. However, the annual report contains a large number of statements, which take a long time to analyze manually. To address this problem, this research uses Transformer models to process the large amount of text data within annual reports. The purpose of this paper is to provide investors with knowledge on how to perform textual analysis of ESG disclosures, an analysis method that goes beyond financial numbers, through the development of language models specifically focused on ESG within annual reports. The paper then evaluates the performance of models that specifically focus on ESG aspects compared to models with a more general focus. Models focused specifically on ESG were found to have higher prediction accuracy. The next step was to use this ESG-specific model to perform a sentiment analysis of annual reports between 2020 and 2022 for 27 companies in the resource group. The study found that most companies gave the most weight to disclosing social aspects, and the sentiment of ESG disclosures was predominantly neutral.
This research introduces a novel multimodal transformer architecture for comprehensive market sentiment analysis that integrates traditionally disparate data sources: earnings call transcripts, social media content, and technical market indicators. Our approach addresses the limitations of unimodal sentiment analysis systems by leveraging the complementary nature of textual, social, and quantitative data streams. We developed a specialized transformer-based model with modality-specific encoders and cross-modal attention mechanisms, achieving superior performance compared to unimodal baselines. Experimental results on a diverse financial dataset demonstrate significant improvements in sentiment classification accuracy (87.2% vs. 78.4% baseline) and directional market prediction (74.3% vs. 63.8% baseline). The model shows particular strength in detecting subtle sentiment shifts during market volatility periods, with a 9.3% performance improvement over traditional methods. Our findings illustrate that integrating multiple information channels enables more nuanced market sentiment understanding, potentially offering valuable insights for financial decision-making and risk management.
No abstract available
This study investigates the temporal evolution patterns of sentiment in earnings call transcripts and their relationship with subsequent financial performance. While existing financial sentiment analysis typically examines isolated communication events, this research adopts a longitudinal approach to detect temporal sentiment patterns across consecutive quarterly reports. We develop a multimodal analytical framework that integrates lexicon-based and deep learning methods for sentiment extraction, applied to a comprehensive dataset of 18,240 earnings call transcripts from S&P 500 companies spanning 20 consecutive quarters (2018-2022). The WMSA-Bi-LSTM architecture with multi-head self-attention mechanisms is employed to capture contextual relationships in financial discourse. Our analysis reveals distinct industry-specific sentiment evolution patterns, with Information Technology and Communication Services sectors exhibiting higher sentiment volatility (0.342 and 0.327) compared to Utilities and Consumer Staples (0.128 and 0.145). Significant correlations are identified between sentiment momentum and subsequent quarter operating margin changes (r=0.523, p<0.001). Predictive modeling demonstrates that sentiment trajectory features effectively forecast operational metrics with accuracy up to 79.3% for one-quarter horizons, though performance degrades for market-based metrics and longer prediction windows. These findings extend existing financial sentiment analysis by establishing the temporal dimension as critical for understanding the relationship between corporate communication patterns and financial outcomes.
This paper proposes a hybrid model based on LSTM (Long Short-Term Memory) and BERT (Bidirectional Encoder Representations from Transformers) for stock market trend prediction, which integrates the quantitative analysis of technical indicators and qualitative analysis of news sentiment. The framework first utilizes a pre-trained FinBERT model, optimized for financial text processing, to extract fine-grained sentiment features from multi-source financial data, including official earnings reports, mainstream financial news, and social media financial discussions. These sentiment features, together with key technical indicators derived from historical stock price data (such as moving averages, relative strength index, and trading volume), are standardized and concatenated into a comprehensive feature vector, which is then fed into a two-layer LSTM network for capturing temporal dependencies and dynamic sequence prediction. The experimental results, derived from a dataset encompassing 50 top stocks from the S&P 500 over a span of 5 years, show that the proposed hybrid model consistently surpasses single-model methods, such as standalone LSTM, BERT, and traditional ARIMA, in terms of both short-term (1-day ahead) and medium-term (5-day ahead) prediction accuracy, as well as in its ability to generalize across various market conditions, including bullish, bearish, and volatile environments. This confirms the significant value of multi-source data fusion in financial forecasting, as it effectively compensates for the limitations of single data types and enhances the model's robustness to market noise.
This study introduces the Audit Risk Sentiment Value (ARSV), a novel audit risk proxy that leverages sentiment analysis to address limitations in traditional audit risk measures such as audit fees (LNFEE), audit hours (LNHOUR), and discretionary accruals (|MJDA|). Traditional proxies primarily capture quantitative dimensions, overlooking qualitative insights embedded in audit report narratives. By systematically analyzing sentiment and tone, ARSV captures nuanced audit risk dimensions that reflect the auditor’s risk perception. The study validates ARSV using a dataset of South Korean firms listed on the KOSPI from 2018 to 2023. The results demonstrate the ARSV’s superior explanatory power, as confirmed through the Vuong test, showing consistent performance across binary and continuous measures of explanatory language. ARSV bridges the gap between qualitative and quantitative audit risk assessments, offering significant benefits to auditors, regulators, and investors. Its ability to enhance the interpretability of audit reports improves transparency and trust in financial reporting, addressing stakeholder demands for actionable, forward-looking information. Furthermore, ARSV aligns with global trends emphasizing sustainability and accountability by integrating qualitative insights into audit practices. While this study provides robust evidence supporting ARSV effectiveness, its focus on South Korean firms may limit its generalizability. Future research should explore ARSV application in diverse regulatory and cultural contexts and refine the sentiment analysis tools using advanced machine learning techniques. Expanding ARSV to include other unstructured data, such as management commentary, could further enhance its applicability. This study marks a significant step toward modernizing audit methodologies, aligning them with evolving demands for comprehensive and transparent financial reporting. 
The empirical analysis reveals that ARSV outperforms traditional audit risk proxies with significantly higher explanatory power. Specifically, ARSV achieved a pseudo R2 of 0.786, compared to 0.608 for LNFEE, 0.604 for LNHOUR, and 0.578 for |MJDA|. The Vuong test results further validate ARSV superiority, with Z-statistics of −12.168, −12.492, and −9.775 when compared against LNFEE, LNHOUR, and |MJDA|, respectively. The model incorporating ARSV demonstrated a 62.454 F-value and an Adjusted R2 of 0.599, highlighting its robustness and reliability in audit risk assessment. These quantitative metrics underscore ARSV’s effectiveness in capturing qualitative audit risk dimensions, offering a more precise and informative measure for stakeholders.
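For readers unfamiliar with the Vuong test cited in this abstract, the statistic for two non-nested models can be sketched as below. The inputs are per-observation log-likelihoods from each fitted model; the sample values and the function name `vuong_z` are hypothetical, and the abstract does not report the study's observation-level data or whether any correction term was applied. The sign of Z depends only on which model is listed first.

```python
import math

def vuong_z(loglik_a, loglik_b):
    """Uncorrected Vuong Z-statistic from per-observation log-likelihoods.

    Positive values favor model A, negative values favor model B.
    """
    n = len(loglik_a)
    # Pointwise log-likelihood ratios between the two models.
    m = [a - b for a, b in zip(loglik_a, loglik_b)]
    mean_m = sum(m) / n
    # Population variance of the pointwise ratios.
    var_m = sum((x - mean_m) ** 2 for x in m) / n
    return sum(m) / (math.sqrt(n) * math.sqrt(var_m))

# Hypothetical per-observation log-likelihoods for two competing models.
ll_a = [-1.0, -2.0, -1.5, -0.5]
ll_b = [-1.5, -2.1, -2.5, -0.7]
print(vuong_z(ll_a, ll_b))
```

Under the null hypothesis that both models fit equally well, Z is asymptotically standard normal, so large absolute values indicate that one model fits significantly better.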
Carefully crafted prompts can significantly enhance the accuracy and effectiveness of sentiment classification models. This paper explores the use of prompt engineering and large language models for sentiment analysis of companies’ financial reports. Zero-shot and few-shot prompts are designed to extract sentiment and contextual information. AI-generated synthetic examples were created for the few-shot settings. Human-evaluated results are compared with those of four LLMs. Results show varying performance and output quality among LLMs, influenced by prompt design, report content, and task complexity. The LLMs’ responses varied in length, detail, and style, affecting their readability and usefulness. The paper discusses the implications and limitations of these findings, suggesting future research directions.
Sentiment analysis is a critical task that is highly beneficial to various financial tasks such as stock-price prediction, corporate credit rating, economic report analysis, and investment decision support. Researchers have used various methods to train pretrained language models (PLMs) for these tasks. Although most PLMs have achieved excellent performance, they can be further improved. In this study, we propose a new framework to strengthen numerical understanding, in particular for the FinBERT (Financial Bidirectional Encoder Representations from Transformers) model released in 2019, thus improving model performance in the task of sentiment analysis on financial news sentences. This method selects sentences containing numerical words from financial news articles, preferentially masks those words, and post-trains the PLM. To evaluate the proposed methodology quantitatively, we apply the same post-training to different financial language models and compare the performance before and after the application using Financial Phrasebank, which is a representative benchmark dataset used in financial sentiment analysis. The experimental results show that the best performance is achieved when 50,000 sentences are used to post-train FinBERT, thus confirming the advantage of the proposed methodology for downstream tasks and highlighting the importance of using the correct amount of data. Additionally, we show that applying the proposed method to different language models improves the performance, particularly in low-resource environments with less training data. The findings of this study suggest that the PLM can improve aspects that it does not understand well, and that the PLM performance can be improved by post-training it with task- and domain-appropriate datasets, not only in finance but also in other domains.
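The preferential masking of numerical words described in this abstract can be pictured with a minimal sketch of the masking step alone. The regular expression for "numerical word", the 15% default mask ratio, the `[MASK]` token string, and the function name `mask_numeric_first` are all assumptions; the abstract does not specify these details, and the post-training loop itself is omitted.

```python
import random
import re

# Assumed definition of a numerical token: digits with optional sign, separators, and %.
NUMERIC = re.compile(r"^[+-]?\d[\d,.]*%?$")

def mask_numeric_first(tokens, mask_ratio=0.15, seed=0, mask_token="[MASK]"):
    """Mask a share of tokens, choosing numerical tokens before any others."""
    rng = random.Random(seed)
    n_mask = max(1, round(len(tokens) * mask_ratio))
    numeric = [i for i, tok in enumerate(tokens) if NUMERIC.match(tok)]
    other = [i for i, tok in enumerate(tokens) if not NUMERIC.match(tok)]
    # Shuffle within each group so ties are broken randomly but reproducibly.
    rng.shuffle(numeric)
    rng.shuffle(other)
    # Numerical positions fill the masking budget first; others only if budget remains.
    chosen = set((numeric + other)[:n_mask])
    return [mask_token if i in chosen else tok for i, tok in enumerate(tokens)]

sentence = "Revenue rose 12% to 3,400 million euros".split()
print(mask_numeric_first(sentence, mask_ratio=0.3))
```

The masked sentences would then serve as masked-language-modeling inputs for post-training, with the original tokens as labels.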
Deeply mining and quantifying the sentiment information implied in the text of annual reports can provide investors with a comprehensive understanding of a company's operating conditions, serving as a valuable reference. Because few sentiment analysis models currently target annual report terminology, this paper proposes a sentiment analysis model that extracts contextual semantic features with bidirectional long short-term memory (BiLSTM), extracts local features with a text convolutional neural network (TextCNN), introduces an attention mechanism in both feature channels to strengthen the focus on important information, and finally fuses the dual-channel feature vectors to obtain the sentiment classification results. Experimental results show that the accuracy of the proposed annual report sentiment classification model is better than that of existing sentiment classification models targeting the financial domain.
This study delves into the intricate interplay between gender stereotypes and financial reporting through an aspect-level sentiment analysis approach. Leveraging Big Data comprising 129,251 human face images extracted from 2085 financial reports in Chile, and employing Deep Learning techniques, we uncover the underlying factors influencing the representation of women in financial reports. Our findings reveal that gender stereotypes, combined with external economic factors, significantly shape the portrayal of women in financial reports, overshadowing intentional efforts by companies to influence stakeholder perceptions of financial performance. Notably, economic expansion periods correlate with a decline in women's representation, while economic instability amplifies their portrayal. Furthermore, the financial inclusion of women positively correlates with their presence in financial report images. Our results underscore a bias in image selection within financial reports, diverging from the neutrality principles advocated by the International Accounting Standards Board (IASB). This pioneering study, combining Big Data and Deep Learning, contributes to gender stereotype literature, financial report soft information research, and business impression management research.
This research offers a pioneering investigation into the potential implications of late annual report announcements in the context of BIST100 companies. Annual reports serve as a pivotal source of comprehensive information, enabling stakeholders to evaluate a company's financial health, growth prospects, operational strategies, and potential risks. Recognizing this, our study harnesses artificial intelligence (AI) to scrutinize the timeliness of annual report announcements and their inherent sentiment. Using AI, we constructed a predictive model based on the announcement dates of annual reports between 2009 and 2021. The model estimated the release dates for 2022, facilitating the identification of companies that released their reports later than expected. These delayed reports were then subjected to in-depth text mining, sentiment analysis, and schema analysis. Our text mining process utilized the robust Latent Dirichlet Allocation (LDA) Subject Salient Terms (TST) method, known for its efficacy in revealing concealed topics in large text volumes. Our findings were striking: around 87% of statements in the delayed annual reports reflected a negative sentiment, while only 13% displayed a positive tone. Thus, late annual report announcements tend to have a generally pessimistic outlook, indicating they might indeed be bearers of adverse news. This research offers a unique perspective on the relationship between the timeliness of financial reports and the contained sentiment, ultimately contributing to enhanced transparency and informed decision-making in financial markets. The study underscores the necessity for companies to maintain timely communication and suggests potential areas of interest for future investigations.
In recent years, there has been an increasing interest in text sentiment analysis and speech emotion recognition in finance due to their potential to capture the intentions and opinions of corporate stakeholders, such as managers and investors. A considerable performance improvement in forecasting company financial performance was achieved by taking textual sentiment into account. However, far too little attention has been paid to managerial emotional states and their potential contribution to financial distress prediction. This study seeks to address this problem by proposing a deep learning architecture that uniquely combines managerial emotional states extracted using speech emotion recognition with FinBERT-based sentiment analysis of earnings conference call transcripts. Thus, the obtained information is fused with traditional financial indicators to achieve a more accurate prediction of financial distress. The proposed model is validated using 1278 earnings conference calls of the 40 largest US companies. The findings of this study provide evidence on the essential role of managerial emotions in predicting financial distress, even when compared with sentiment indicators obtained from text. The experimental results also demonstrate the high accuracy of the proposed model compared with state-of-the-art prediction models.
No abstract available
Predicting movements in the stock market is a novel use of sentiment analysis's growing body of knowledge. The purpose of this study is to investigate the potential of NLP for predicting stock market movements by analyzing textual data sources such as news articles, social media posts, as well as earnings reports. The research examines current approaches, applications, and difficulties in sentiment analysis by drawing on extensive surveys and reviews [1], [2]. It also investigates the use of pre-trained models in NLP and the potential biases of such models [6]. Important research findings [3], [17] suggest that NLP-based sentiment analysis models, especially those employing deep learning architectures, show promising results in financial forecasting. There are, however, several difficulties associated with these models. These include the requirement of huge datasets and the elimination of biases. This study has far-reaching ramifications. One benefit is a more nuanced comprehension of the potential and pitfalls of natural language processing for sentiment analysis in the financial markets, which is useful for both analysts and investors. Second, it provides opportunities for more study to enhance the precision and dependability of such models, which ultimately aids in the development of more steady and well-informed monetary judgements.
This study is, the authors believe, a groundbreaking investigation into the impact of cryptocurrency news on the earnings of publicly traded companies. Using advanced Generative AI (GenAI) models and the BERT framework for sentiment analysis, we integrated comprehensive data from the Financial Modeling Prep API. This enabled us to employ a rigorous event study methodology and advanced machine learning algorithms. Valuable insights were derived from the BERT model, shedding light on the reasons behind abnormal returns and facilitating a thorough analysis of material and immaterial impacts. The study’s findings highlight the significant influence of both positive and negative cryptocurrency news on cumulative abnormal returns (CAR), particularly among firms deeply involved in crypto activities. Notably, deliberate news, including official announcements, had a more pronounced impact than unintentional ones on market reactions. This innovative approach provides actionable insights into financial services, investment management, and corporate communication, offering a framework for improving predictive models, investment decisions, and risk management strategies.
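The event-study quantity named in this abstract, the cumulative abnormal return (CAR), can be illustrated with a minimal sketch. The abstract does not say how expected returns are estimated; the market model used here (an ordinary least squares fit of stock returns on market returns) is an assumption, and the function names are hypothetical.

```python
from statistics import mean


def fit_market_model(stock, market):
    """OLS fit of stock = alpha + beta * market over an estimation window.

    Assumption: the market model is used for expected returns; the
    abstract does not state which benchmark the authors applied.
    """
    m_bar, s_bar = mean(market), mean(stock)
    var = sum((m - m_bar) ** 2 for m in market)
    cov = sum((m - m_bar) * (s - s_bar) for m, s in zip(market, stock))
    beta = cov / var
    alpha = s_bar - beta * m_bar
    return alpha, beta


def cumulative_abnormal_return(stock_evt, market_evt, alpha, beta):
    """Sum of (actual - expected) returns over the event window."""
    return sum(s - (alpha + beta * m) for s, m in zip(stock_evt, market_evt))


# Estimation window (illustrative numbers, not data from the study).
alpha, beta = fit_market_model(
    stock=[0.01, 0.03, -0.01, 0.05],
    market=[0.0, 0.01, -0.01, 0.02],
)
# Event window around a hypothetical news release.
car = cumulative_abnormal_return([0.03, 0.04], [0.0, 0.01], alpha, beta)
```

A positive `car` means the stock outperformed what the fitted model would predict from the market's movement over the event window.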
This study addresses the challenges of scarce fraudulent samples, complex data distributions, and the limited adaptability of traditional methods in financial statement fraud detection by proposing a self-supervised learning algorithm. The approach first standardizes multidimensional financial indicators to mitigate scale differences, then employs an encoder to construct latent representations that capture high-order nonlinear relationships among indicators. A reconstruction task is introduced as an auxiliary signal, where a decoder approximates the input and minimizes reconstruction error to enhance the fidelity of representations. In parallel, a classification module distinguishes normal from fraudulent statements, with the model jointly optimizing reconstruction and classification losses to improve both feature completeness and discriminative ability. Experiments on a public financial fraud dataset show that the proposed method significantly outperforms existing baselines on Precision, Recall, F1-Score, and AUC, with particular strength in minority class recognition under imbalanced and limited data. Additional sensitivity experiments demonstrate that the method remains stable and robust across variations in optimizer type and imbalance ratios, confirming its effectiveness in complex financial environments. Overall, the algorithm provides an efficient and reliable pathway for fraud detection and exhibits distinctive advantages in accuracy and adaptability.
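The joint objective described in this abstract (reconstruction error plus a classification loss) can be sketched as follows. The specific loss forms (mean squared error and binary cross-entropy) and the weighting factor `lam` are assumptions for illustration; the abstract does not specify them.

```python
import math


def mse(x, x_hat):
    """Mean squared reconstruction error between input and decoder output."""
    return sum((a - b) ** 2 for a, b in zip(x, x_hat)) / len(x)


def bce(y, p, eps=1e-12):
    """Binary cross-entropy for one label y in {0, 1} and predicted probability p."""
    p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))


def joint_loss(x, x_hat, y, p, lam=1.0):
    """Reconstruction loss plus lam times classification loss.

    Assumption: a simple weighted sum; `lam` is a hypothetical
    hyperparameter, not one reported in the source.
    """
    return mse(x, x_hat) + lam * bce(y, p)


# One standardized indicator vector, its reconstruction, a fraud label,
# and the classifier's predicted fraud probability (illustrative values).
loss = joint_loss(x=[0.0, 0.0], x_hat=[1.0, 1.0], y=1, p=0.5, lam=2.0)
```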
Financial statement fraud remains a substantial risk in environments marked by weak regulatory oversight and information asymmetry. This study develops a decision-centric framework that integrates machine learning, explainable artificial intelligence, and decision curve analysis to improve fraud detection under severe class imbalance. Using 969 firm-year observations from 132 Mongolian firms (2013–2024), we evaluate 21 financial ratios with models including Random Forest, XGBoost, LightGBM, MLP, TabNet, and a Stacking Ensemble trained with SMOTE and class-weighted learning. Performance was assessed using PR-AUC, F1-score, Recall, and DeLong-based significance testing. The Stacking Ensemble achieved the strongest results (PR-AUC = 0.93; F1 = 0.83), outperforming both classical and modern baseline models. Interpretability analyses (SHAP, LIME, and counterfactual explanations) consistently identified leverage, profitability, and liquidity indicators as dominant drivers of fraud risk, supported by a SHAP Stability Index of 0.87. Decision curve analysis showed that calibrated thresholds improved decision efficiency by 7–9% and reduced over-audit costs by 3–4%, while an audit cost simulation estimated annual savings of 80–100 million MNT. Overall, the proposed ML–XAI–DCA framework offers a transparent, interpretable, and cost-efficient approach for enhancing fraud detection in emerging-market contexts with limited textual disclosures.
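Decision curve analysis, mentioned in this abstract, compares models by their net benefit at a chosen probability threshold. The sketch below uses the standard net-benefit definition (true-positive rate minus false-positive rate weighted by the threshold odds); whether the authors used exactly this form is an assumption, and the data are illustrative.

```python
def net_benefit(y_true, y_prob, threshold):
    """Net benefit at a probability threshold (0 < threshold < 1).

    NB = TP/N - (FP/N) * threshold / (1 - threshold)
    Firms with predicted probability >= threshold are flagged for audit.
    """
    n = len(y_true)
    tp = sum(1 for y, p in zip(y_true, y_prob) if p >= threshold and y == 1)
    fp = sum(1 for y, p in zip(y_true, y_prob) if p >= threshold and y == 0)
    return tp / n - fp / n * threshold / (1 - threshold)


# Illustrative labels (1 = fraud) and model probabilities.
labels = [1, 0, 1, 0]
probs = [0.9, 0.6, 0.4, 0.1]
nb_low = net_benefit(labels, probs, threshold=0.2)
nb_mid = net_benefit(labels, probs, threshold=0.5)
```

Plotting net benefit across a range of thresholds yields the decision curve; a higher threshold penalizes each false positive (an unnecessary audit) more heavily.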
The rapid dissemination of information through digital platforms has revolutionized the way we access and consume data, creating conditions that may lead to an increase in financial statement fraud, which jeopardizes the efficient functioning of capital markets. This paper proposes a sophisticated representation learning method to detect financial statement fraud by tracking detailed changes in a firm’s Management Discussion and Analysis (MD&A) documents over time. Unlike traditional word frequency approaches, we start by aligning paragraphs between consecutive disclosures based on their representation-level similarities. Given the paragraph-embedding similarity, we categorize paragraphs into three types: added, deleted and matched. Next, we construct multivariate change trajectory representations based on fraud-relevant word categories, such as sentiment and uncertainties. Finally, we develop a fraud detection model using these word-level change trajectory representations. Extensive experiments on 24 years of financial report data, from 1995 to 2019, show that our representation learning approach significantly improves financial statement fraud detection performance across 7 different backbone machine learning models, consistently outperforming traditional word frequency-based approaches. Our method marks a new paradigm in feature engineering for financial statement fraud detection.
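The paragraph alignment step in this abstract (labelling paragraphs as added, deleted, or matched from embedding similarity) can be sketched as below. The greedy one-to-one matching, the cosine similarity measure, and the 0.8 threshold are all assumptions for illustration; the abstract does not describe the actual matching procedure.

```python
import math


def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def align_paragraphs(prev, curr, threshold=0.8):
    """Greedily pair paragraphs of two consecutive filings by similarity.

    Returns (matched, deleted, added):
      matched: dict mapping index in prev -> index in curr
      deleted: indices in prev with no counterpart in curr
      added:   indices in curr with no counterpart in prev
    """
    pairs = sorted(
        ((cosine(p, c), i, j) for i, p in enumerate(prev) for j, c in enumerate(curr)),
        reverse=True,
    )
    matched, used_curr = {}, set()
    for sim, i, j in pairs:
        if sim < threshold:
            break  # remaining pairs are even less similar
        if i in matched or j in used_curr:
            continue
        matched[i] = j
        used_curr.add(j)
    deleted = [i for i in range(len(prev)) if i not in matched]
    added = [j for j in range(len(curr)) if j not in used_curr]
    return matched, deleted, added


# Toy 2-d "embeddings" standing in for real paragraph vectors.
matched, deleted, added = align_paragraphs(
    prev=[[1.0, 0.0], [0.0, 1.0]],
    curr=[[0.9, 0.1], [-1.0, 0.0]],
)
```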
Financial statement fraud, as a critical risk factor threatening the healthy development of capital markets, has long been a focal point of both academic research and practical concern. In recent years, Large Language Models (LLMs), with their advanced capabilities in text comprehension and logical reasoning, have opened new avenues for financial report analysis. This paper focuses on the financial reports of publicly listed companies and proposes a novel fraud detection method that integrates structured operational indicators with semantic information from key financial report sections. Specifically, the approach introduces SAGE prompt templates and a multi-step reasoning mechanism to guide the model in identifying potential fraudulent risks within texts and generating rationales. Furthermore, task prompts incorporating both operational metrics and textual analysis results are designed to enhance the model's fraud detection capability. Experimental results demonstrate that the proposed method outperforms traditional approaches across key evaluation metrics, including Accuracy, Precision, Recall, and F1-score, thereby validating its effectiveness and superiority in financial statement fraud detection. This study not only offers a new technical solution for identifying financial risks but also explores a viable path for applying LLMs in the financial domain.
With the widespread adoption of Internet‐based AI technologies, addressing financial fraud has become increasingly critical, particularly within the realm of machine learning. In this case, deep learning and natural language processing (NLP) techniques offer powerful means of detecting fraudulent activity by analyzing financial documents, thereby enhancing both the efficiency and precision of such assessments and supporting financial security. In this study, we introduce deep representation learning‐based approaches relying mainly on large language models (LLMs) for identifying fraud in financial statements by examining temporal changes in the Management Discussion and Analysis (MD&A) sections of corporate disclosures. Departing from conventional techniques that rely only on word frequency analysis, we propose DeepFraud that combines time‐evolving financial LLM embeddings, such as FinBERT, FinLlama, and FinGPT embeddings, of paragraphs and uses long short‐term memory (LSTM) to predict frauds via historical textual embeddings. In addition to LLM embeddings, we also integrate (1) time‐evolving word frequencies of words relevant to fraud detection, such as those expressing sentiment or uncertainty, and (2) time‐evolving financial ratios. Trajectories of paragraph‐level embeddings, frequencies, and ratios are used to construct a fraud detection model, which we evaluate against machine learning methods and deep time‐series models. Using 30 years of financial report data (from 1995 to 2024), our experiments demonstrate that DeepFraud on average enhances fraud detection performance across a number of scenarios and on average outperforms the competing approaches as well as conventional word frequency approaches. Our framework introduces a novel direction for deep feature engineering in the field of financial statement fraud detection.
Detecting fraud in financial statements is essential for maintaining transparency and accountability. In this study, we propose a novel dual-level cost-sensitive stacking ensemble framework for financial fraud detection, integrating a range of traditional classifiers with a state-of-the-art attention-based neural model. Our approach applies cost weighting at both the base and meta levels to amplify minority-class detection and pairs it with SMOTE-based oversampling. Additionally, we combine heterogeneous financial ratios and raw accounting variables in a single pipeline, enabling the model to derive new ratio-like features that capture real-world financial complexity. Comprehensive experiments demonstrate that this dual-layer cost weighting, together with SMOTE, significantly enhances fraud detection. Notably, incorporating TabTransformer within the ensemble achieves an AUC of 89.10% on a benchmark dataset, surpassing single-stage cost-sensitive methods and prior studies. These findings underscore how multi-level cost sensitivity, attention-driven modeling, and domain-specific feature integration can effectively tackle skewed data in financial statement fraud. Our framework offers an adaptable and robust solution for class-imbalanced financial contexts and may extend to related anomaly detection scenarios across various domains.
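SMOTE-based oversampling, which this abstract pairs with cost weighting, creates synthetic minority samples by interpolating between a minority point and one of its nearest minority neighbours. The following is a minimal sketch of that interpolation idea only, not the authors' pipeline; the function name, parameters, and toy data are hypothetical.

```python
import random


def smote_sample(minority, k=2, n_new=4, seed=0):
    """Generate n_new synthetic points from a list of minority-class vectors.

    Each synthetic point lies on the line segment between a randomly
    chosen minority point and one of its k nearest minority neighbours.
    Requires at least two minority points.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        x = rng.choice(minority)
        # k nearest neighbours of x among the other minority points
        neighbours = sorted(
            (p for p in minority if p is not x),
            key=lambda p: sum((a - b) ** 2 for a, b in zip(x, p)),
        )[:k]
        nb = rng.choice(neighbours)
        gap = rng.random()  # interpolation factor in [0, 1)
        synthetic.append([a + gap * (b - a) for a, b in zip(x, nb)])
    return synthetic


# Three toy "fraudulent firm" feature vectors (illustrative only).
fraud_rows = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
new_rows = smote_sample(fraud_rows, k=2, n_new=4, seed=0)
```

In practice a maintained implementation such as the one in the imbalanced-learn package would normally be preferred over a hand-written version.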
Financial fraud is a serious worldwide problem that has severely damaged the sustainable growth of financial markets. Nevertheless, it is challenging to identify fraud in highly imbalanced datasets, because non-fraudulent companies far outnumber fraudulent ones. Intelligent financial statement fraud detection systems have therefore been developed to support stakeholders' decision-making. However, most current approaches consider only the quantitative financial statement ratios and make little use of textual information for classification, especially commentary written in Chinese. This paper therefore aims to develop an enhanced system for detecting financial fraud using state-of-the-art deep learning models that combine numerical features derived from financial statements with textual data from the managerial comments in the annual reports of 5130 Chinese listed companies. First, we construct a financial index system that includes both financial and non-financial indices, the latter of which previous research usually excluded. Then, the textual features in the MD&A sections of the Chinese listed companies' annual reports are extracted using word vectors. After that, deep learning models are employed and their performance is compared on numeric data, textual data, and the combination of the two. The empirical results show a large performance improvement of the proposed deep learning methods over traditional machine learning methods: the LSTM and GRU approaches achieve correct classification rates of 94.98% and 94.62% on the test samples, indicating that the textual features extracted from the MD&A section yield promising classification results and substantially reinforce financial fraud detection.
Financial fraud is a huge problem all over the world and has a negative impact on the growth of financial markets. It is challenging to identify fraud in a heavily skewed dataset, since legitimate firms far outnumber fraudulent ones. Algorithms that can intelligently detect financial statement fraud have therefore been created to assist stakeholders in making decisions. This research aims to improve financial statement fraud detection by integrating numerical data with cutting-edge deep learning algorithms. Data preparation, feature extraction, and model training are performed in this specific order. Fraud detection for narratives in annual reports is a part of data preprocessing that includes establishing a library of fraudulent feature terms and clustering yearly reports. Methods for evaluating vocabulary similarity rely on feature extraction. Training DCNN-LSTMAE-AM models begins with feature retrieval. The suggested approach achieves better results than two state-of-the-art methods: LSTM and DCNN. After using the method, accuracy went up by 97.48%.
No abstract available
In this study, we have developed a hybrid financial statement fraud detection model by combining rough set theory and ensemble learning. In this research, we have developed a pre-processing filter utilizing rough set theory to assist the model in selecting the most appropriate features. Additionally, we have also incorporated a feature extracted from the text information of the Management Discussion and Analysis (MD&A) section in the financial report, which effectively captures the positive tone expressed by the management. To address the challenge of imbalanced classes in fraud detection, we have applied the Synthetic Minority Oversampling Technique (SMOTE) algorithm. The optimization procedures significantly improved the performance of our model. We have compared our model to other regular machine learning methods and observed its superiority. The results demonstrate that the integration of rough set theory attribute reduction algorithms has substantially enhanced the performance of the ensemble learning model. Furthermore, the inclusion of the feature extracted from text information has proven to be effective in detecting financial fraud behaviors.
The financial statement analysis of a company serves as a vital instrument in assessing its performance, sustainability, and long-term value creation for stakeholders. In this research, we conduct a detailed financial statement analysis of Hero MotoCorp Ltd., the world’s largest manufacturer of two-wheelers, by integrating conventional ratio analysis with advanced software-enabled techniques, including machine learning (ML) and deep learning (DL). The purpose of this hybrid framework is to not only evaluate historical and current financial performance but to also develop models that can forecast future trends, detect anomalies
No abstract available
Financial statement analysis is a critical component of decision-making for businesses, investors, and financial professionals. To enhance the accuracy and effectiveness of such analysis, this paper introduces the application of an innovative approach known as the Intelligent Swarm Regression ARIMA Model. This advanced model combines the power of swarm intelligence with ARIMA (AutoRegressive Integrated Moving Average) time series forecasting, offering a robust methodology for predicting and analyzing key financial metrics. The study begins by providing an overview of the Intelligent Swarm Regression ARIMA model and its application to financial data. Through a comprehensive analysis of financial statements, including market capitalization, revenue, net income, and other crucial indicators, the model's efficacy in predicting future values is evaluated. Additionally, the paper examines the deviations between predicted and actual financial values, offering insights into the model's accuracy and areas for potential improvement. The findings of this research are invaluable for investors, financial analysts, and companies seeking to optimize their financial performance and strategic decision-making. By leveraging the Intelligent Swarm Regression ARIMA Model, stakeholders can make well-informed choices that lead to better financial outcomes and a competitive advantage in a dynamic economic landscape. This paper represents a significant step forward in the financial analysis, providing a practical methodology and a pathway to enhanced financial decision-making. As the importance of financial data continues to grow, this research offers a promising avenue for achieving financial success and stability.
This paper studies the construction of financial statement analysis and prediction models based on deep learning algorithms. It first proposes a deep learning-based model for the intelligent processing of enterprise financial information. Next, the method transforms non-time indicators into real-time indicators and divides and evaluates financial data using neural networks with 4 hidden layers. Real financial data was used for the evaluation and analysis, and the experimental results showed that the deep learning-based intelligent method focused on abnormal financial data is feasible.
This study presents a novel approach called FinAnalytix, which merges machine learning’s prowess in pattern recognition with financial statement analysis. This integrated algorithm combines deep neural networks and recurrent neural networks for predictive accuracy in stock return analysis, alongside logistic regression and random forest models for robust fraud detection in financial statements. The empirical evidence demonstrates FinAnalytix's effectiveness in identifying abnormal financial patterns and predicting market reactions to earnings announcements. The study utilizes extensive data from listed companies, ensuring a comprehensive and practical application. FinAnalytix represents a significant advancement in the field, providing a dual approach to financial analysis for enhancing investment strategies through accurate stock return forecasts and bolstering financial integrity by detecting fraudulent activities. The simulation in the study is based on the financial data of 100 sample listed companies. This research not only bridges the gap between traditional financial analysis and modern machine learning techniques but also offers a powerful tool for investors and regulatory bodies in navigating the complex financial landscape.
No abstract available
Machine learning (ML) approaches have become effective tools for a variety of analytical tasks, including financial statement analysis, in recent years. This study aims to improve current analytical techniques and offer insights to decision-makers by providing a thorough overview of the application of ML methods to financial statement data. We collected data from a wide range of businesses and processed it before using it in various ML algorithms. We assessed the relative effectiveness of these algorithms in predicting important financial outcomes and spotting patterns in financial data through a series of tests. Our findings show that machine learning models, especially certain algorithms (for example, random forests and neural networks), outperform traditional approaches in terms of accuracy and predictive capacity. The results highlight how ML has the ability to revolutionize financial statement analysis, offering opportunities for improved efficiency, precision, and insights in the financial domain.
The surge of large language models (LLMs) has revolutionized the extraction and analysis of crucial information from a growing volume of financial statements, announcements, and business news. Recognizing named entities to construct structured data poses a significant challenge in analyzing financial documents and is a foundational task for intelligent financial analytics. However, how effective these generic LLMs are, and how they perform under various prompts, is not yet well understood. To fill this gap, we present a systematic evaluation of state-of-the-art LLMs and prompting methods on the financial Named Entity Recognition (NER) problem. Specifically, our experimental results highlight their strengths and limitations, identify five representative failure types, and provide insights into their potential and challenges for domain-specific tasks.
Named Entity Recognition (NER) is a task within Natural Language Processing (NLP) that involves sequence labeling with labels such as person, location, or organization. While general domain NER focuses on identifying general proper nouns, financial NER has its own unique characteristics, including entities in the form of numbers, percentages, and domain-specific terms like stock tickers. Transformer-based Pre-trained Language Models (PLMs) have revolutionized NER by providing contextual understanding through the self-attention mechanism. These PLMs leverage self-supervised learning and can be adapted for specific domains or tasks with minimal labeled data. This study explores the application of PLMs as tokenizers and vectorizers in financial NER, particularly on the FiNER-139 dataset, which is rich in numeric entities. The study compares the performance of BERT, a general domain PLM, with FinBERT and SEC-BERT, two PLMs specifically designed for the financial domain. PLMs, combined with BiLSTM and CRF layers without further fine-tuning, demonstrate remarkable performance, achieving micro F1-Scores ranging from 0.75 to 0.88. Notably, financial domain-specific PLMs outperform general domain models, with SEC-BERT variants exhibiting the best results. The sec-bert-base model stands out as the most effective variant, achieving impressive micro and macro F1-Scores of 0.87 and 0.86, respectively. Variants designed for numeric expressions, like sec-bert-num and sec-bert-shape, underperform sec-bert-base due to labeling constraints. Despite strong performance, some entity types still have F1-Scores below 0.5, even with balanced data and Shannon Entropy analysis. The reasons remain elusive and are areas for future research. Further work will explore datasets with both numeric and textual entity expressions to understand PLM capabilities in financial Named Entity Recognition.
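The abstract above reports both micro and macro F1-Scores, and notes that some entity types score below 0.5 despite a strong overall result. A minimal sketch of how the two averages differ may help in reading such numbers; the entity labels and counts below are hypothetical and are not taken from FiNER-139.

```python
def f1(tp, fp, fn):
    """F1 from true-positive, false-positive, and false-negative counts."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def micro_macro_f1(counts):
    """counts maps each entity label to a (tp, fp, fn) tuple.

    Micro F1 pools the counts over all labels, so frequent labels dominate.
    Macro F1 averages per-label F1, so every label weighs equally.
    """
    macro = sum(f1(*c) for c in counts.values()) / len(counts)
    tp = sum(c[0] for c in counts.values())
    fp = sum(c[1] for c in counts.values())
    fn = sum(c[2] for c in counts.values())
    return f1(tp, fp, fn), macro


# Hypothetical per-label counts: one frequent, one rare entity type.
micro, macro = micro_macro_f1({
    "Revenue": (90, 10, 10),
    "RareTag": (1, 3, 3),
})
```

With these counts the frequent label dominates the micro average, while the rare, poorly recognized label pulls the macro average down.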
This study proposes a hybrid Named Entity Recognition based method for automatic verification of Turkish financial documents. A fine-tuned Bidirectional Encoder Representations from Transformers model is used to extract named entities, supported by rule-based regex for types not covered by the model, such as currency codes and emails. Similarity between summary and full-text sentences is calculated using Simhash, and sentence-level entity matches are used to determine verification accuracy. Spell checker integration is also evaluated. Two datasets—financial and sports—were used to evaluate the method, achieving 91.6% and 79% average verification accuracy, respectively. The results demonstrate that combining machine learning with domain-specific rules can significantly improve verification performance, particularly in low-resource Turkish natural language processing settings.
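The Simhash comparison between summary and full-text sentences mentioned above can be sketched as follows. This is a generic, unweighted Simhash over whitespace tokens with MD5 as the token hash; the tokenization, hash function, and 64-bit width are assumptions, not details from the study.

```python
import hashlib

BITS = 64  # fingerprint width (assumed)


def simhash(text):
    """Return a 64-bit Simhash fingerprint of a sentence."""
    weights = [0] * BITS
    for token in text.lower().split():
        digest = hashlib.md5(token.encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")  # first 64 bits of the hash
        for i in range(BITS):
            weights[i] += 1 if (h >> i) & 1 else -1
    # a bit is set when more tokens voted 1 than 0 at that position
    return sum(1 << i for i, w in enumerate(weights) if w > 0)


def simhash_similarity(a, b):
    """1 minus the normalized Hamming distance between two fingerprints."""
    return 1 - bin(a ^ b).count("1") / BITS


summary = simhash("net profit rose in the third quarter")
full_text = simhash("in the third quarter net profit rose")
score = simhash_similarity(summary, full_text)
```

Because this variant treats a sentence as a bag of tokens, reordering the words leaves the fingerprint unchanged; near-duplicate sentences differ in only a few bits.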
The rapid development of the information space makes the recognition of named entities, particularly financial ones, increasingly urgent. Detecting such entities with high accuracy can significantly improve the quality of data analysis in the financial sector. Therefore, this study aimed to develop and evaluate the effectiveness of a hybrid model that combines modern natural language processing methods for recognizing financial named entities in texts in the Ukrainian language. The following methods are used for this purpose: data analysis, modeling, experimental method, and comparative analysis. CRF, BiLSTM, and BERT models, as well as their combination, are applied for the recognition of named entities. The experiments demonstrated that the hybrid model is effective for recognizing financial named entities and has an advantage over traditional methods. The average performance indicators of the model are: Precision at 94%, Recall at 93%, F1-Score at 94%, and Accuracy at 94%. This emphasizes the appropriateness of using complex models for analyzing domain-specific texts. Research prospects include the improvement of hybrid models and their adaptation to other languages and mixed data.
No abstract available
Based on the needs of financial regulation, and in view of the high proportion of long entities and the rich professional vocabulary in the text of this field, we propose Lexical-Fusion BERT (LF-BERT), a named entity recognition (NER) model suited to this field. LF-BERT obtains character vectors based on BERT, integrates relevant lexical information and inter-lexical correlation information with an attention mechanism and a bidirectional long short-term memory network (BiLSTM), and adopts entity head-end label prediction to realize named entity recognition. Compared with other baseline named entity recognition models, experimental results indicate that LF-BERT obtains the best F1 value on both public and financial regulation field NER datasets, verifying the effectiveness of our method.
No abstract available
We present KPI-BERT, a system which employs novel methods of named entity recognition (NER) and relation extraction (RE) to extract and link key performance indicators (KPIs), e.g. "revenue" or "interest expenses", of companies from real-world German financial documents. Specifically, we introduce an end-to-end trainable architecture that is based on Bidirectional Encoder Representations from Transformers (BERT) combining a recurrent neural network (RNN) with conditional label masking to sequentially tag entities before it classifies their relations. Our model also introduces a learnable RNN-based pooling mechanism and incorporates domain expert knowledge by explicitly filtering impossible relations. We achieve a substantially higher prediction performance on a new practical dataset of German financial reports, outperforming several strong baselines including a competing state-of-the-art span-based entity tagging approach.
Generic pre-trained neural networks may struggle to produce good results in specialized domains like finance and insurance. This is due to a domain mismatch between training data and downstream tasks, as in-domain data are often scarce due to privacy constraints. In this work, we compare different pre-training strategies for LayoutLM. We show that using domain-relevant documents improves results on a named-entity recognition (NER) problem using a novel dataset of anonymized insurance-related financial documents called Payslips. Moreover, we show that we can achieve competitive results using a smaller and faster model.
No abstract available
The rapid digitization of financial markets has resulted in an explosion of unstructured financial data, primarily in the form of textual reports, earnings calls, and regulatory filings. Extracting structured knowledge from these documents is critical for automated risk assessment, quantitative investment strategies, and market surveillance. However, financial texts differ significantly from general domain corpora due to high terminological density, complex nested sentence structures, and subtle semantic dependencies. Traditional named entity recognition and relation extraction models often suffer from overfitting when applied to limited labeled financial datasets, leading to poor generalization on unseen data. This paper proposes a robust entity-relation extraction framework that integrates a Bidirectional Long Short-Term Memory (BiLSTM) network with a Conditional Random Field (CRF) layer, augmented by Fast Gradient Method (FGM) adversarial training. The BiLSTM layer captures long-range semantic dependencies, while the CRF layer ensures the validity of the predicted tag sequences. Crucially, the FGM adversarial training mechanism introduces perturbations to the embedding layer during training, effectively regularizing the model and enhancing its robustness against noise and data sparsity. Experimental results demonstrate that the proposed model achieves superior performance in terms of precision, recall, and F1-score compared to baseline methods, particularly in identifying complex financial entities and their interrelations.
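The FGM perturbation described in this abstract is commonly written as r = epsilon * g / ||g||, where g is the gradient of the loss with respect to the embedding. The sketch below shows only that perturbation step on plain Python lists; the value of `epsilon` and the surrounding training loop (outlined in the comments) are assumptions, since the abstract gives no hyperparameters.

```python
import math


def fgm_perturbation(grad, epsilon=1.0):
    """Scale the embedding gradient to L2 norm epsilon: r = epsilon * g / ||g||."""
    norm = math.sqrt(sum(g * g for g in grad))
    if norm == 0:
        return [0.0] * len(grad)  # no gradient signal, no perturbation
    return [epsilon * g / norm for g in grad]


def perturb_embedding(embedding, grad, epsilon=1.0):
    """Return the adversarially shifted embedding e + r."""
    r = fgm_perturbation(grad, epsilon)
    return [e + d for e, d in zip(embedding, r)]


# Typical use inside one training step (framework-agnostic outline):
#   1. forward and backward pass on the clean batch to obtain grad
#   2. replace embeddings with perturb_embedding(embedding, grad, epsilon)
#   3. forward and backward pass again, accumulating adversarial gradients
#   4. restore the original embeddings, then take the optimizer step
r = fgm_perturbation([3.0, 4.0], epsilon=0.5)
shifted = perturb_embedding([1.0, 1.0], [3.0, 4.0], epsilon=0.5)
```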
Modern AI technologies which exploit the classification and/or prediction capacities of Deep Neural Architectures demonstrate superior performance to traditional approaches in most cases. However, they come with the unavoidable shortcoming of lack of transparency in their outcomes. This attribute renders them unsuitable for big industrial sectors, such as finance, investment management, etc. Specifically, their "black-box" nature makes them unattractive in cases where human understanding in the decision making process is required and may be legally mandatory. In such cases, traditional (i.e., non-deep learning) ML approaches are still preferred, to minimize for example the presence of false positives. In this context, this paper introduces an unsupervised, trustful, bottom-up probabilistic approach for Named Entity Recognition (NER) in financial reports, while in parallel it provides a comparative study on well-known ML-approaches in terms of their performance. The proposed approach builds on the probability of appearance of representative tokens within the given reports and utilizes Kronecker’s Delta and the Total Probability Theorem to construct a probabilistic model that estimates the overall classification probability of a document.
No abstract available
No abstract available
No abstract available
No abstract available
This study aims to improve the identification of potential credit risks in unstructured financial texts. It addresses the core problem of financial text analysis and credit risk assessment by proposing a hybrid model that combines the generative semantic understanding of Generative Pre-trained Transformer-4 (GPT-4) with the enhanced feature extraction of Bidirectional Encoder Representations from Transformers (BERT). To overcome the limitations of traditional methods—such as weak contextual reasoning in long texts, insufficient recognition of industry-specific terminology, and implicit credit risk expressions—the model incorporates a financial dictionary enhancement module and a named entity recognition (NER) component. GPT-4 is leveraged for prompt-based generation to extract latent risk information from complex texts, including annual reports. A dual-model semantic fusion mechanism with attention weighting constructs a multi-level risk assessment system that integrates contextual understanding, industry adaptability, and interpretability. Experiments on multiple publicly available financial datasets and real-world annual reports demonstrate the model’s effectiveness. Results show that the proposed approach outperforms representative baseline models in accuracy, adaptability, and interpretability. This work carries both theoretical and practical significance for research at the intersection of financial technology and natural language processing.
No abstract available
This paper employs Named Entity Recognition (NER) techniques and text analysis methods to examine how the disclosure of specific customer risks influences IPO approval rates by identifying customer names in the risk sections of prospectuses. Evidence from China shows that companies that proactively disclose specific customer risk information are more likely to obtain IPO approval. The conclusion still holds after excluding the weakening effect of the registration-based system on the disclosure of customer names. The result also remains robust after controlling for endogeneity through the Heckman two-stage test and the PSM method. This empirical finding serves as a crucial foundation for incentivizing companies to offer more specific customer risk information and alleviating information asymmetry in the IPO market.
Named Entity Recognition (NER) plays a critical role in extracting relevant information from larger texts. In financial text processing, it enables information extraction from customer transactions, banking statements and regulatory communications. The Kenyan financial domain presents unique challenges due to mixed-language usage, informal token patterns and scarcity of annotated corpora. This study conducts a comparative evaluation of baseline and transformer-based NER models on a newly prepared Kenyan financial text dataset. The evaluated models include a rule-based SpaCy NER baseline, BERT, DistilBERT, FinBERT (domain adapted), AfriBERTa (African language adapted) and XLM RoBERTa (multilingual). Each model was fine-tuned on identical training and validation splits and assessed using Precision, Recall, F1 Score and token-level Accuracy. Experimental results indicate that SpaCy performs poorly (F1 = 0.2674), indicating limited suitability in this domain. BERT and DistilBERT reach a moderate level of performance (F1 = 0.79), while FinBERT and AfriBERTa show improved domain adaptation (F1 = 0.82). The highest accuracy is achieved by XLM RoBERTa (F1 = 0.8480, Accuracy = 0.9774), reflecting the benefit of multilingual contextual representations for Kenyan English and code-switched text. These findings show that using a multilingual, regionally sensitive language model greatly improves financial NER accuracy. The research culminates in recommendations for extending the types of entities.
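The metrics named in the abstract above (Precision, Recall, F1 Score and token-level Accuracy) can be sketched as follows. This is a simplified token-level version that treats "O" as the outside tag. The study may well score at the entity level instead, so the function name and the tagging scheme are assumptions for illustration only.

```python
def token_prf(gold, pred, outside="O"):
    """Token-level precision, recall, F1 and accuracy for two tag sequences."""
    pairs = list(zip(gold, pred))
    # True positive: a non-outside tag predicted exactly right.
    tp = sum(1 for g, p in pairs if g == p and g != outside)
    # False positive: an entity tag predicted where the gold tag differs.
    fp = sum(1 for g, p in pairs if p != outside and g != p)
    # False negative: a gold entity tag that was not reproduced.
    fn = sum(1 for g, p in pairs if g != outside and g != p)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    accuracy = sum(1 for g, p in pairs if g == p) / len(pairs) if pairs else 0.0
    return precision, recall, f1, accuracy
```

Because most tokens in running text carry the outside tag, token-level accuracy tends to sit well above F1, which is consistent with the gap between the two figures reported for XLM RoBERTa.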
We explore the use of Large Language Models (LLMs) for automating the extraction of Key Performance Indicators (KPIs) from diverse financial reports without any additional fine-tuning. We focus on evaluating various proprietary and open-source LLMs to address the joint named entity recognition and relation extraction tasks essential for accurately linking KPIs to their corresponding values and attributes. Our study highlights the technical challenges involved in the extraction process and presents a comprehensive evaluation of the models’ effectiveness. Our results reveal significant insights into handling these LLMs in such a crucial environment and showcase the transformative potential of LLMs in enhancing financial analysis and decision-making.
The rapid expansion of financial markets and the increasing complexity of financial instruments have created an urgent need for intelligent systems capable of understanding and responding to domain-specific inquiries. Traditional question answering (QA) models often fail to generalize beyond narrow datasets and exhibit poor comprehension of financial terminology, leading to suboptimal performance in real-world financial contexts. To address these limitations, this study introduces a domain-adaptive financial QA framework that integrates a financial knowledge graph with a multi-task semantic reasoning mechanism based on the T5 transformer architecture. The proposed system leverages three interconnected sub-tasks, namely named entity recognition (NER), entity linking, and domain-specific answer generation, which are trained collaboratively on a curated dataset titled “FinQ-Fuse,” which consists of over 120,000 annotated financial QA pairs. The NER module utilizes a BiLSTM-CRF architecture, while entity linking combines rule-based filtering with a support vector machine (SVM) classifier. Domain-specific question answering is powered by a fine-tuned T5 model that incorporates contextual and graph-based knowledge. Experimental evaluations conducted on benchmark datasets (FiQA, FinQA, and a custom Chinese financial QA dataset) demonstrate that the proposed framework achieves substantial improvements over baseline models. The system records an accuracy of 88%, a BLEU score of 0.72, and an entity coverage rate of 78%, indicating its superior performance in both linguistic and knowledge-aware tasks. This research validates the effectiveness of integrating semantic reasoning with structured domain knowledge for financial QA systems. By significantly enhancing answer quality, contextual awareness, and entity understanding, the proposed model sets a new benchmark for intelligent financial information retrieval.
The findings hold strong practical implications for investment analysis, customer service automation, and financial education, paving the way for more adaptive and explainable AI systems in the financial sector.
Corporate annual reports serve not only as key carriers of financial data but also as critical tools for management to engage in impression management through linguistic strategies. This study aims to deeply explore the impact of annual report tone, i.e., the emotional sentiment in the Management Discussion and Analysis section, on corporate financing costs. Taking Tesla, Inc., a representative multinational company, as the research subject, the study quantifies the positivity of tone in Tesla’s annual reports from 2019 to 2024 using the Loughran-McDonald financial sentiment word list. For the first time, it combines this with contemporaneous changes in the company’s bond credit spreads and bank loan interest rates to comparatively analyze the pathways through which tone differences affect financing costs. The results show that when management adopts excessively optimistic tone without sufficient performance support, it exacerbates information asymmetry, prompting creditors to demand higher risk premiums and thereby increasing financing costs. In contrast, a prudent and neutral tone helps build market trust and reduce financing costs. This study provides reference for companies in selecting linguistic strategies during information disclosure and offers investors a new perspective for interpreting textual information in annual reports.
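A dictionary-based tone score of the kind described above can be sketched as follows. The abstract does not state the exact formula, so this uses one common definition, (positive - negative) / (positive + negative). The two word sets are tiny illustrative stand-ins and not the actual Loughran-McDonald lists, and the function name is hypothetical.

```python
import re

# Illustrative stand-ins only; the real Loughran-McDonald lists are far larger.
POSITIVE = {"strong", "improve", "improved", "gain", "gains", "success"}
NEGATIVE = {"loss", "losses", "decline", "risk", "adverse", "weak"}


def net_tone(text, positive=POSITIVE, negative=NEGATIVE):
    """Return (pos - neg) / (pos + neg), or 0.0 if no sentiment words occur."""
    tokens = re.findall(r"[a-z]+", text.lower())
    pos = sum(1 for t in tokens if t in positive)
    neg = sum(1 for t in tokens if t in negative)
    total = pos + neg
    return (pos - neg) / total if total else 0.0
```

The score ranges from -1 (only negative words) to 1 (only positive words). Comparing it across years for the same issuer is what would allow tone to be set against credit spreads or loan rates.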
[Purpose] This study empirically examines how the readability of annual reports, as one of their linguistic characteristics, affects the accuracy of analysts’ earnings forecasts. Readable annual reports can enhance users’ understanding and improve information processing efficiency, which may meaningfully influence analysts’ judgments. The primary objective of this study is to explore the relationship between the corporate information environment and analysts’ decision-making behavior. [Methodology] Using annual reports of firms listed on the KOSPI and KOSDAQ markets from 2010 to 2020, this study measures readability through text analysis, specifically utilizing report length and file size as proxies. A regression analysis is conducted to investigate the relationship between annual report readability and analysts’ forecast errors. Additionally, the study analyzes differences between optimistic and pessimistic forecast samples. [Findings] The results indicate that firms with more readable annual reports tend to exhibit significantly higher accuracy in analysts’ earnings forecasts. In the optimistic forecast group, higher readability is associated with smaller forecast errors. Conversely, in the pessimistic group, higher readability correlates with larger forecast errors, suggesting a tendency toward more conservative forecasting. [Implications] This study provides empirical evidence that the non-financial attribute of readability in annual reports has a tangible impact on information users’ decision-making processes. These findings imply that improvements in disclosure practices, particularly through clearer reporting standards or readability-enhancing guidelines, are necessary to strengthen the effectiveness of corporate disclosures.
The increasing interest in stock market investment in Indonesia has highlighted a significant challenge for retail investors: the difficulty of analyzing lengthy and complex corporate annual reports. These documents, essential for fundamental analysis, are often hundreds of pages long and contain detailed narrative sections that require considerable time and effort to comprehend. This research addresses this issue by developing an automatic summarization system using a Large Language Model (LLM) to generate concise and insightful summaries of such reports. The primary objective was to develop and evaluate an LLM-based system specifically adapted for the structure and content of annual reports. The method involved creating a tailored dataset comprising 2,008 narrative text excerpts and their corresponding manual summaries sourced from the annual reports of companies listed on the Indonesia Stock Exchange (IDX). The open-source Llama-3.2-3B-Instruct model was then fine-tuned using the Parameter-Efficient Fine-Tuning (PEFT) technique, specifically Low-Rank Adaptation (LoRA). The research results demonstrated a significant improvement in the model's performance after fine-tuning. Quantitative evaluation using ROUGE metrics showed a relative increase of 18.63% in ROUGE-1, 44.45% in ROUGE-2, and 33.83% in ROUGE-L compared to the base model. Qualitative analysis confirmed that the fine-tuned model was capable of generating informative and relevant summaries aligned with the context of annual report analysis. In conclusion, this study demonstrates that fine-tuning LLMs with document-specific data is an effective approach for specialized tasks such as annual report summarization.
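The ROUGE figures quoted above are relative gains over the base model. A minimal sketch of both calculations is given below, assuming plain whitespace tokenization and no stemming; official ROUGE tooling differs in these details, and the function names are hypothetical.

```python
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n(candidate, reference, n=1):
    """Return (precision, recall, f1) based on clipped n-gram overlap."""
    cand = ngrams(candidate.lower().split(), n)
    ref = ngrams(reference.lower().split(), n)
    overlap = sum((cand & ref).values())  # multiset intersection clips counts
    cand_total = sum(cand.values())
    ref_total = sum(ref.values())
    precision = overlap / cand_total if cand_total else 0.0
    recall = overlap / ref_total if ref_total else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1


def relative_increase(base, tuned):
    """Relative gain in percent of a tuned score over a base score."""
    return (tuned - base) / base * 100.0
```

A relative increase of 18.63% in ROUGE-1 therefore means the fine-tuned score is about 1.19 times the base score, not that the score rose by 18.63 points.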
No abstract available
From the perspective of annual report text information, we study the relationship between the annual report text’s positive tone and corporate green innovation. Taking listed companies from 2010 to 2022 as a sample, we found that the positive tone of the annual report text significantly improves the company’s green innovation while improving the quantity and quality of green innovation. The mechanism test shows that the main channels are easing corporate financing constraints and enhancing external attention. Regarding heterogeneity analysis, we found that the positive annual report text has a more significant effect on corporate green innovation in companies with high economic policy uncertainty and non-heavily polluting industries. Finally, we found that the positive tone of the annual report text can ultimately improve the company’s long-term value through green innovation. Our study has enriched the theoretical research on the annual report text tone and provided empirical evidence for promoting enterprise green innovation.
According to historical data, financial statement fraud is the type of fraud scheme that causes the largest losses compared with other schemes. Fraud theories have been introduced and used to examine factors in detecting financial statement fraud. However, studies that analyze the influence of corporate ethical culture on fraudulent financial statements are still very limited. Thus, the purpose of this research is to analyze the influence of corporate social responsibility disclosure and annual report readability, which reflect the company's ethics in disclosing its performance, on the detection of fraudulent financial statements, while assessing the effect of earnings management as the moderating variable. The study uses secondary data obtained from the Indonesia Stock Exchange website. The population consists of food and beverage manufacturing companies listed on the Indonesia Stock Exchange for the period 2017-2019, since the most common financial statement fraud schemes appear in the manufacturing industry and the food and beverage subsector is the largest among its subsectors. This is a quantitative study. Using purposive sampling, the researchers identified 19 companies that fulfill the criteria; over 3 years of observation, the total number of samples was 57. Multiple linear regression and moderated regression analysis are used to analyze the data. The results indicate that annual report readability, moderated by earnings management, has an influence on detecting financial statement fraud. Meanwhile, annual report readability without the moderating effect of earnings management, and corporate social responsibility disclosure with or without that moderating effect, have no influence on detecting financial statement fraud in listed food and beverage manufacturing companies.
Using a sample of 94,697 US firm‐year observations from 1994 to 2017, we document that annual report complexity is positively and significantly associated with a firm's operating lease ratio. In addition, we find that financially constrained and weakly governed firms with complex financial reports lease more. Finally, by employing a difference‐in‐differences method with the Plain Writing Act 2010 and a regression discontinuity design with eXtensible Business Reporting Language (XBRL) adoption, we find that the positive association is highly likely to be causal. Overall, our study shows that firms with linguistically complex annual reports strategically choose to use leasing as an alternative source of funding.
Purpose: This study aims to conduct an exploratory analysis to investigate whether changes occurred in the readability of sustainability reporting narratives following a reputational crisis. Design/methodology/approach: Drawing from crisis communication and impression management studies, the authors explore whether a reputational crisis may influence management to adapt the readability of narratives. Using the reputational crisis following the Costa Crociere incident in 2012 as a case study, the authors analyse the readability of sustainability reports before and after the crisis through the multidimensional linguistic software Coh-Metrix. Findings: Comparisons across the three main sections of sustainability reporting narratives (letter to stakeholders, environmental performance and social performance) reveal substantial changes in linguistic characteristics after the crisis, especially concerning the sections devoted to environmental performance. However, the different trends of changes among sections do not allow for a clear interpretation of manipulative intent. Originality/value: This study contributes to crisis communication research by focusing on readability, an under-explored yet crucial aspect of crisis communication strategies. It also enhances impression management literature by investigating readability manipulation in post-crisis sustainability reporting narratives. Furthermore, the study offers a methodological contribution by proposing the use of multidimensional linguistic software like Coh-Metrix to explore deep language structure levels not easily detectable with commonly used readability measures.
No abstract available
Based on the incremental information theory, this paper examines the impact of annual report text tone on corporate financing constraints using a fixed-effects model with a sample of listed Chinese A-share enterprises in Shanghai and Shenzhen from 2008 to 2021. The findings show that positive annual report text tone can effectively alleviate firms' financing constraints, which remains robust after considering the self-selection problem of the model, omitted variables and the lagged effects of the independent variables. Further, the paper analyses the mediating and moderating effects of the relationship between the tone of annual report texts and corporate financing constraints. For the mechanism analysis, the paper examines the mechanism role of media attention. In terms of moderating effects, two macro-level variables, namely the level of regional economic development and the quality of the institutional environment; one meso-level variable, industry concentration; and three micro-level variables, namely the quality of internal control, equity structure and life cycle, are selected for analysis. The findings obtained from this paper have implications for the expansion of effective disclosure of annual reports by companies and for investors' proper understanding of the role of incremental textual information in annual reports for investment decisions.
This study develops and validates a multidimensional framework for assessing corporate annual report quality through advanced text analysis, with a comparative focus on Vietnam and China, two dynamic yet institutionally distinct emerging markets. Leveraging a corpus of 60 English-language annual reports (2020–2024) from listed companies across seven industries, we extract 13 textual indicators spanning readability, sentiment, thematic focus, and disclosure volume. These indicators are integrated into an Annual Report Quality Index (ARQI) using a Multi-Criteria Decision Making (MCDM) framework that combines the Entropy Weight method for objective indicator weighting and the PROMETHEE method for nonlinear ranking. A two-tier research design enables both cross-industry characterization and a focused intra-industry banking case study, where three systemically important banks from each country are compared. Results reveal significant disparities: Chinese banks demonstrate consistently higher and more stable ARQI scores, reflecting greater institutional maturity under the China Securities Regulatory Commission (CSRC) framework. Vietnamese banks exhibit greater volatility but a clear upward convergence trend, particularly after 2022, suggesting adaptive responses to evolving disclosure regulations such as Circular 22/2019/TT-BTC. Component-level analysis shows that Chinese reports excel in risk transparency, sentiment stability, and thematic integration, while Vietnamese reports demonstrate emerging strengths in sustainability and digital transformation narratives. An internal consistency test confirms strong alignment between ARQI scores and task group assessments (Pearson’s r = 0.927, p < 0.001), validating the index’s reliability. This study contributes a scalable, replicable methodology for assessing narrative disclosure quality and provides empirical evidence on how institutional contexts shape corporate communication in emerging Asian economies. 
The ARQI framework offers practical utility for investors seeking transparent information, corporate managers aiming to benchmark reporting practices, and regulators striving to enhance market transparency.
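The objective weighting step mentioned above can be illustrated with a minimal sketch of the Entropy Weight method. It assumes the standard formulation: column-normalize the indicator matrix, compute each indicator's Shannon entropy scaled by 1/ln(n), and weight indicators by their divergence 1 - e_j. The study's own normalization and the subsequent PROMETHEE ranking are not reproduced, and the function name is hypothetical.

```python
import math


def entropy_weights(matrix):
    """Entropy weights for a matrix of non-negative scores.

    Rows are alternatives (e.g. reports); columns are indicators.
    """
    n = len(matrix)
    if n < 2:
        raise ValueError("need at least two alternatives")
    m = len(matrix[0])
    k = 1.0 / math.log(n)
    divergences = []
    for j in range(m):
        column = [row[j] for row in matrix]
        total = sum(column)
        entropy = 0.0
        for x in column:
            p = x / total
            if p > 0:  # treat 0 * ln(0) as 0
                entropy -= k * p * math.log(p)
        divergences.append(1.0 - entropy)
    spread = sum(divergences)
    if spread < 1e-12:
        # Every indicator is uniform: fall back to equal weights.
        return [1.0 / m] * m
    return [d / spread for d in divergences]
```

An indicator that takes the same value for every report carries no discriminating information and receives a weight near zero, which is the intended behavior of this weighting scheme.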
When enterprises disclose information, management, as the information supplier, often uses earnings management to whitewash the contents of the annual report in its own interest. This study collects the annual reports of all A-share listed companies from 2010 to 2017, analyzes and quantifies the risk information they contain, and examines the impact of the disclosure level on the quality of corporate information. The results show that the annual report disclosure level is positively correlated with information quality: the more risk disclosure, the higher the information quality. This research helps users of company information make a reasonable assessment of the company's risks and value, and helps improve the company's internal management.
Building on attribution theory, this study investigates the antecedents of corporate annual report text manipulation from the perspective of managerial performance pressure. Using a data set comprising 15,076 samples from companies listed on China's Shanghai and Shenzhen A‐share markets between 2009 and 2021, we reveal that as management faces performance pressure, they tend to increase content about external environmental risk and policy uncertainty in annual reports due to actor–observer bias. When incorporating Confucian cultural factors into the research framework, the study found that the strength of Confucian culture mitigates the correlation between the pressure of management performance and annual report manipulation. However, the level of marketization where a corporation is located can undermine the ethical norms upheld by Confucian culture. In additional analysis, we also discovered that the longer the duration of performance pressure, the greater the extent of manipulation within the annual reports initiated by the management. Interestingly, the social performance pressure generated from ESG performance also positively influences annual report textual manipulation, compared to operational performance pressure. This study explores the impact of management performance pressure on information disclosure decisions from the perspective of performance attribution, not only extending the boundaries of attribution theory but also providing guidance for policymakers to enhance the supervision of annual report disclosures and for capital market investors to optimize investment decisions.
This study investigates the application of supervised learning algorithms for detecting financial statement fraud in annual corporate reports. Financial reporting fraud remains a critical challenge for auditors and regulators, as traditional detection methods, such as ratio analysis and manual auditing, often fail to identify complex anomalies in large datasets. The research aims to evaluate the effectiveness of several machine learning algorithms in improving fraud detection accuracy and reliability. A dataset consisting of 500 annual financial statements from 2020 to 2024, including 50 identified cases of potential fraud, was preprocessed through data cleaning, normalization, and labeling. Algorithms tested include Decision Tree, Support Vector Machine (SVM), Naïve Bayes, K-Nearest Neighbors (K-NN), and Neural Network. The results indicate that Neural Network achieves the highest accuracy (94.5%), followed by SVM (91.6%), while simpler algorithms such as Naïve Bayes and K-NN demonstrate moderate performance. Comparative analysis highlights that ensemble and deep learning models are more capable of capturing complex patterns in financial data, providing a significant advantage over traditional methods. The findings suggest that integrating machine learning into auditing practices can enhance the detection of fraudulent activities, improve decision-making processes, and increase the reliability of audit outcomes. This research underscores the importance of combining advanced computational techniques with professional auditor oversight to ensure accuracy, transparency, and accountability in financial reporting.
This study investigates the impact of adopting International Financial Reporting Standards (IFRS) on the readability of the corporate annual reports of Saudi companies. Data were collected for a sample of 67 companies listed on the Saudi Stock Exchange for the period 2014–2019. Statistical methods such as the independent sample t test, the Wilcoxon matched-pair test, and multiple regression analysis are used to examine the effect of adopting IFRS on the readability of the corporate annual report. The results reveal that the adoption of IFRS has led to a decrease in the readability of the corporate annual report. The results also indicate that the company’s size and profitability have a significant impact on readability, while leverage and the industry in which the company operates do not. Since the annual reports of Saudi companies are published in Arabic, the study is not able to use the most popular readability indexes in the literature, such as the Gunning Fog Index, Flesch–Kincaid Grade Index, and Flesch Reading Ease Index. Instead, the study uses three readability measures appropriate to annual reports prepared in Arabic, namely report length, report size, and the LIX formula. The study contributes to the global debate about the economic consequences of adopting IFRS by examining its impact on the readability of corporate annual reports, considering that this report is the main and official communication tool between the company and its stakeholders. This is the first study to examine the impact of adopting IFRS on the readability of corporate annual reports in Saudi Arabia as one of the emerging markets.
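The LIX measure mentioned above is language-independent because it relies only on word and sentence counts. A minimal sketch follows, assuming the standard definition LIX = words / sentences + 100 * long_words / words, where a long word has more than six characters. The tokenization and sentence-splitting rules below are simplifying assumptions and not the study's procedure, and the function name is hypothetical.

```python
import re


def lix(text):
    """Return the LIX readability score of a text (higher means harder)."""
    # \w+ is Unicode-aware in Python 3, so Arabic words are matched as well.
    words = re.findall(r"\w+", text)
    # Split on Latin sentence punctuation and the Arabic question mark (U+061F).
    sentences = [s for s in re.split(r"[.!?\u061F]+", text) if s.strip()]
    if not words or not sentences:
        return 0.0
    long_words = sum(1 for w in words if len(w) > 6)
    return len(words) / len(sentences) + 100.0 * long_words / len(words)
```

Report length and report size, the other two measures the study uses, need no formula beyond a word count and a file size.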
Over the last decade, especially in the wake of the COVID-19 pandemic, large companies have confronted overwhelming financial challenges, evidenced by revenue shortfalls and, in certain sectors, deficits. Worse, such under-performance poses a serious threat to the legitimacy of the companies’ right to survive and succeed in the context of heightened public scrutiny of corporate activities and results. This paper therefore analyzes 50 CEO statements from annual reports published by 21 Fortune Global 500 U.S. companies that were identified as being in deficit for the year. The study shows how the companies attempted to divert attention from or downplay their deficits via the strategies of Image Enhancement, Deflection, Mitigation and Admission. The findings contribute to understanding the linguistic and rhetorical elements tactically deployed by the companies for impression management, legitimacy repair and crisis communication.
The marital relationship is an important factor affecting the economic decision-making and long-term development of enterprises, and the tone of the annual report, as textual information recording the current operating status and future performance of the enterprise, is an important reflection of corporate behavior when a husband and wife jointly hold shares. This paper uses listed family companies from 2011 to 2021 as an initial sample to study the impact of couples' joint shareholdings on the tone of annual reports. The study finds that when the actual controllers are a husband and wife who jointly hold shares, the annual report’s tone is more positive. Further research finds that this effect is influenced by the enterprise's proportion of fixed assets and by whether the chairman and CEO roles are held by the same person. Combining the marital relationship in the family business with the annual report information of the enterprise helps to explore the overall picture of the enterprise's current and future development under the marital relationship, and provides a new perspective for understanding the decision-making and planning of family businesses.
Purpose: This paper aims to integrate the latent semantic features of annual report text with accounting indicators to construct a financial fraud identification model, and quantitatively analyze the impact of different corporate risks on financial fraud behavior in different industries, providing a reference for identifying financial fraud. Design/methodology/approach: This paper obtains 3,860 corporate annual report samples and accounting indicators from 2001 to 2020 through crawlers and the CSMAR database as its experimental subjects. By integrating latent semantic features with accounting indicators and textual language features, a new indicator system group is constructed. Based on this indicator system group, multiple model identification effects are compared and a stacking-based enterprise financial fraud identification model is constructed. In addition, an econometric model is established to verify the impact of latent semantic features related to enterprises on corporate financial fraud. Findings: The experimental results show that the constructed stacking-based enterprise financial fraud identification model performs better than other machine learning models and can effectively identify financial fraud. The econometric model established for the latent semantic information of annual reports explains the impact of different corporate trends on fraud behavior in different industries. Originality/value: This paper combines the textual latent semantic features of annual reports with accounting indicators, expands the scope of data analysis, introduces the idea of ensemble learning, updates the financial fraud identification algorithm and constructs an econometric model for further analysis, providing a reference for financial fraud identification.
This paper empirically studies the relationship between the cost of equity capital and the tone of listed firms’ annual reports from 2007 to 2019 in China. It finds that the greater the firm’s cost of equity capital, the more positive the tone of the manipulated annual report. In addition, in firms with lower quality of accounting information and a higher degree of industry competition, the impact of the cost of equity capital on the degree of tone manipulation is more significant, indicating that the impact of the cost of equity capital on the positive tone disclosure of the annual report is a deliberate manipulation behavior. The heterogeneity analysis shows that the cost of equity capital has a more significant impact on the tone manipulation of corporate annual reports in non-state-owned firms than in state-owned firms. Further analysis shows that an increase in the cost of equity capital leads to a decline in investor confidence and corporate value, which is also the mechanism through which the cost of equity capital affects annual report tone manipulation.
This study aims to examine the impact of board members with media backgrounds (MBD) on the readability of annual reports. It further tests the effect of MBD in the sub-samples of companies with and without Risk Management Committees (RMC) and those audited by BIG 4 and non-BIG 4 firms. This study utilized a sample of companies listed on the Indonesia Stock Exchange (IDX). Ordinary Least Squares (OLS) regression with clustering by firm was performed in STATA 17.0 to predict the relationship between MBD and annual report readability. Robustness checks were conducted using Coarsened Exact Matching (CEM) analysis. The results indicate that the presence of board members with media backgrounds significantly improves the readability of annual reports. In companies without Risk Management Committees, the positive impact of MBD on readability was significant. However, in companies with an RMC, this effect was less pronounced. Furthermore, the positive relationship between MBD and readability was more significant in companies audited by BIG 4 firms compared to those audited by non-BIG 4 firms. This study offers new insights into the role of board members with media backgrounds in enhancing corporate communication. It examines how MBD affects the readability of annual reports in different sub-samples, including companies with and without Risk Management Committees (RMC), and those audited by BIG 4 firms versus non-BIG 4 firms. The findings highlight the strategic value of MBD in improving the readability of annual reports.
Based on the texts of ‘Management Discussion and Analysis on Financial Status and Business’ (MD&A) collected from 118 American corporate annual reports, this study investigated how the professional discourse is realized from the dimensions of text, genre, professional practice and professional culture, according to the framework of the Critical Genre Analysis (CGA) theory. It was found that (1) In the text dimension, lexicon features regarding vocabulary volume, vocabulary highlighting, readability and sentiment are unique to MD&As; (2) the genre of MD&As corresponds to a rhetorical structure of 3 moves and 10 steps; (3) the professional practice of MD&As is represented by three types of interdiscursivity, that is, shifting, mixing and embedding; (4) the professional culture of MD&As is embodied through identity enculturation, human-oriented value, cooperation awareness and self-serving manner. The results of this study help to better understand the realization of professional discourse and expand the application of CGA in professional practice. The findings presented herein are expected to enhance interdisciplinary research that relates language with corporate performance.
With the proposal of the dual carbon goal, environmental, social and governance (ESG) disclosure has become an important indicator for measuring corporate financial performance (CFP). This study investigates the impact of corporate ESG disclosure and public perception of ESG on CFP within the context of China's A-share market. Utilizing ESG disclosure data from the annual reports of firms listed on the Shanghai and Shenzhen stock exchanges from 2014 to 2023, the study employs the Elaboration Likelihood Model (ELM) to analyze the central route between ESG disclosure and CFP, as well as the peripheral route between public perception of ESG and CFP. Additionally, the moderating effect of social media text features on the central route is examined. The findings reveal a significant inverted U-shaped nonlinear relationship between the degree of ESG disclosure and CFP. Similarly, public perception of ESG also exhibits an inverted U-shaped relationship with CFP. Importantly, negative emotions within social media texts significantly moderate the relationship between ESG disclosure and CFP. Furthermore, the inverted U-shaped effect between ESG disclosure and CFP is more pronounced in firms with lower R&D investment ratios. This study contributes to the literature by exploring the complex interplay between ESG disclosure, public perception, and CFP within the context of social media, offering valuable insights for ESG practices in the electronics equipment manufacturing industry.
We investigate the effect of corporate digitalization capabilities on green innovation among Chinese‐listed firms. Using a panel dataset of 2908 companies from 2011 to 2020, we use textual analysis and entropy weighting on corporate annual reports to construct a yearly corporate digitalization index. Our findings show that corporate digitalization promotes green innovation, as evidenced by patent applications and grants. This relationship is stronger for firms with fewer financial constraints and in provinces with strong intellectual property protection. We also find the national digital policy of the “Internet Plus” strategy has a stronger positive effect on corporate green innovation for corporations with a higher degree of digitalization. Our results are robust to various alternative measures, econometric models, and different samples.
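The entropy-weighting step mentioned in the abstract above can be illustrated with a minimal sketch. It assumes each firm-year has already been reduced to a row of non-negative indicator values (for example, hypothetical keyword counts per digitalization category taken from the annual report text). The function names, the min-max normalisation, and the equal-weight fallback are illustrative assumptions, not details confirmed by the paper.

```python
import math


def entropy_weights(rows):
    """Return (weights, normalised_columns) for a list of indicator rows.

    Indicators whose values vary more across firm-years (lower entropy)
    receive larger weights.
    """
    n = len(rows)
    if n < 2:
        raise ValueError("need at least two rows to compute entropy")
    norm_cols = []
    for col in zip(*rows):
        lo, hi = min(col), max(col)
        span = hi - lo
        # Min-max normalise each indicator to [0, 1]; constant columns become 0.
        norm_cols.append([(v - lo) / span if span else 0.0 for v in col])

    k = 1.0 / math.log(n)
    divergences = []
    for col in norm_cols:
        total = sum(col)
        if total > 0:
            entropy = 0.0
            for v in col:
                p = v / total
                if p > 0:
                    entropy -= k * p * math.log(p)
        else:
            entropy = 1.0  # a constant indicator carries no information
        divergences.append(1.0 - entropy)

    d_sum = sum(divergences)
    if d_sum == 0:
        # Assumption: fall back to equal weights when nothing discriminates.
        weights = [1.0 / len(divergences)] * len(divergences)
    else:
        weights = [d / d_sum for d in divergences]
    return weights, norm_cols


def digitalization_index(rows):
    """Weighted sum of normalised indicators, one score per firm-year."""
    weights, norm_cols = entropy_weights(rows)
    index = [sum(w * v for w, v in zip(weights, row)) for row in zip(*norm_cols)]
    return weights, index


# Hypothetical keyword counts: [AI terms, cloud terms, big-data terms]
counts = [[3, 10, 0], [8, 12, 1], [20, 11, 6]]
weights, index = digitalization_index(counts)
print(weights, index)
```

Weights are derived from the data rather than set by hand, which is the usual motivation for entropy weighting when combining several text-derived indicators into one index.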
The financial domain poses unique challenges for knowledge graph (KG) construction at scale due to the complexity and regulatory nature of financial documents. Despite the critical importance of structured financial knowledge, the field lacks large-scale, open-source datasets capturing rich semantic relationships from corporate disclosures. We introduce an open-source, large-scale financial knowledge graph dataset built from the latest annual SEC 10-K filings of all S&P 100 companies, a comprehensive resource designed to catalyze research in financial AI. We propose a robust and generalizable KG construction framework that integrates intelligent document parsing, table-aware chunking, and schema-guided iterative extraction with a reflection-driven feedback loop. Our system incorporates a comprehensive evaluation pipeline, combining rule-based checks, statistical validation, and LLM-as-a-Judge assessments to holistically measure extraction quality. We support three extraction modes (single-pass, multi-pass, and reflection-agent-based), allowing flexible trade-offs between efficiency, accuracy, and reliability based on user requirements. Empirical evaluations demonstrate that the reflection-agent-based mode consistently achieves the best balance, attaining a 64.8% compliance score against all rule-based policies (CheckRules) and outperforming baseline methods (single-pass and multi-pass) across key metrics such as precision, comprehensiveness, and relevance in LLM-guided evaluations. The utility of our KG pipeline is demonstrated through its flexible extraction modes, coupled with a multi-faceted evaluation methodology. By releasing a high-quality, thoroughly evaluated dataset along with a comprehensive KG construction and evaluation framework, we aim to advance transparency, reproducibility, and innovation in financial KG research. The dataset is publicly available at: https://anonymous.4open.science/r/KG-Financial-Datasets-SP-100-529B/README.md
This work addresses digital document processing and its application in the financial market, with an emphasis on the automated extraction of data from balance sheet documents. Such documents are common in the day-to-day work of this sector, and their data must be extracted precisely given the large number of fields and how often they recur. To perform this function, optical character recognition (OCR), semantic entity recognition (SER), and named entity recognition (NER) were used. The present work is applied research whose main objective is to develop a platform that optimizes the extraction and digitization of data from balance sheet documents, thus reducing errors and increasing operational efficiency in the asset management process. The methodology involves the study of financial projection layouts, the development of an updated automated data extraction model, and validation mechanisms to guarantee result consistency.
Accurate and efficient classification of business documents is essential for organizations to streamline operations and enhance data management. Traditional document classification methods, relying on manual feature extraction and annotation, often struggle with the diversity and complexity of business documents such as invoices, contracts, and financial reports. In this paper, we propose a novel approach that utilizes custom vector embeddings and predefined document classes to address these challenges. Our method leverages the semantic richness of vector embeddings to capture the inherent meanings and relationships within documents, enabling precise classification even in complex and nuanced contexts. By fine-tuning embedding models for retrieval tasks using a proprietary in-house dataset, we ensure the model's robustness and scalability across various document types. We demonstrate through extensive experiments that our approach achieves superior performance compared to traditional methods, offering a scalable, accurate, and interpretable solution for business document classification.
No abstract available
Sentiment analysis aims to identify the sentiment polarity of specific aspects within given sentences or comments, and aspect-based sentiment analysis is considered a fundamental task in sentiment analysis. With practical applications in areas such as product reviews, food delivery evaluations, and public opinion monitoring, sentiment analysis plays a crucial role. This paper focuses on the application of fine-grained sentiment analysis in financial distress prediction (FDP) to enhance early warnings of the management status of companies. In previous studies, there has been a narrow emphasis on using document-level sentiment analysis to extract overall sentiment from text, overlooking the semantic nuances conveyed by sentiments. Therefore, this paper aims to extract fine-grained sentiments from the Management Discussion & Analysis (MD&A) of Chinese listed companies. The proposed model is based on a two-step framework, consisting of an unsupervised aspect-level financial sentiment extraction phase and a model validation phase. Specifically, the former is built on a deep learning model with an attention mechanism, conducting unsupervised aspect extraction, aspect identification, and aspect-level sentiment classification in a sequential manner to obtain fine-grained sentiments. The latter is responsible for evaluating the effectiveness of the newly acquired features on benchmark machine learning models, including SVM, DT, LR, CNN, and DNN. Experimental results reveal that MD&A predominantly covers eight types of aspects, including ownership, business scope, development, capital, sales, management, prizes, and probability. Additionally, it has been observed that fine-grained sentiment features can enhance the performance of FDP. This study represents a significant innovation in existing literature, being the first to introduce aspect-level financial sentiment analysis into the realm of FDP.
No abstract available
Document verification is the process of verifying an original summary document on the original full-text document. Semantic control is very critical in these verification processes. In this study, an automatic document verification system based on Natural Language Processing techniques was designed to semantically check the consistency of the abstract summary produced especially for the original document or documents of the financial type. Verification of abstract summaries on original full-text documents was done through the Transformer-based model. Since the reference documents to be verified in the study belong to the financial type, the Transformer model was created by training with Reuters financial dataset. The proposed Transformer-based semantic document verification approach was tested on the original full-text and summary documents. The full text and summary documents were subjected to data pre-processing and Spell Checker processes. Then, since the summary document will be verified on the full-text document, the sentences most similar to the summary document sentences from the full-text document sentences were determined by using Simhash and Cross Encoder text similarity algorithms. It is a heuristic approach and completes the proposed verification system. Two (experimentally) original full-text document sentences most similar to the summary document sentence were selected. Then, these original full-text document sentences were inputted as training data to the Transformer model. Finally, the transformer model produced an abstract summary of original full-text sentences. In the last stage, the original summary and the summary produced by the Transformer model were compared with both Simhash and Cross Encoder text similarity algorithms in terms of their similarities, and the average document verification accuracy was calculated. 
The proposed Transformer-based semantic document verification approach achieved an average of 84.1% semantic financial document verification accuracy on the financial documents in the Reuters financial dataset. In this study, we present several key contributions to the field of semantic document verification: Firstly, we introduce a Transformer-based model tailored for financial texts, trained on the Reuters financial dataset, which offers enhanced precision in understanding financial language. Secondly, our approach employs advanced Natural Language Processing techniques for deep semantic analysis to verify the consistency of document summaries. Thirdly, we propose a novel hybrid methodology that integrates Transformer models with sentence grouping techniques for generating accurate and informative abstract summaries. These innovations collectively mark a substantial advancement in the automation and precision of document verification processes.
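The sentence-matching step described in this abstract (finding the full-text sentences most similar to each summary sentence) can be sketched for the Simhash part only. This is a simplified illustration: the tokenisation, the MD5-based 64-bit token hash, and the function names are assumptions, and the Cross Encoder scoring the paper also uses is not shown.

```python
import hashlib
import re

BITS = 64  # fingerprint width; matches the 8 hash bytes taken per token


def simhash(text):
    """Compute a 64-bit Simhash fingerprint from word tokens."""
    counts = [0] * BITS
    for token in re.findall(r"\w+", text.lower()):
        digest = hashlib.md5(token.encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")
        for i in range(BITS):
            # Each token votes +1 or -1 on every bit position.
            counts[i] += 1 if (h >> i) & 1 else -1
    fingerprint = 0
    for i, c in enumerate(counts):
        if c > 0:
            fingerprint |= 1 << i
    return fingerprint


def similarity(a, b):
    """1.0 for identical fingerprints, lower as the Hamming distance grows."""
    distance = bin(simhash(a) ^ simhash(b)).count("1")
    return 1.0 - distance / BITS


def top_k_matches(summary_sentence, document_sentences, k=2):
    """Return the k document sentences most similar to the summary sentence."""
    ranked = sorted(
        document_sentences,
        key=lambda s: similarity(summary_sentence, s),
        reverse=True,
    )
    return ranked[:k]


doc = [
    "Net profit rose sharply in the third quarter.",
    "The board approved a new dividend policy.",
    "Weather conditions delayed shipments in March.",
]
print(top_k_matches("Net profit rose in the third quarter.", doc, k=2))
```

The default `k=2` mirrors the two most similar full-text sentences that the abstract reports selecting for each summary sentence.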
Financial event modeling is fundamental to financial investment decisions and risk management, crucial for the stability and growth of financial institutions, and helps ensure the stability and quality of people’s lives. Utilizing state-of-the-art natural language processing techniques for automated financial event extraction addresses the inefficiencies and high costs associated with traditional event identification and modeling, which rely heavily on financial domain experts. However, existing datasets fail to tackle the issues with long documents in practical situations. To address this, we first propose DocFEE, a large-scale Document-level Chinese Financial Event Extraction dataset. It reflects the length of announcement documents and the long-distance dependencies of event arguments in real-world scenarios.
In this comprehensive paper, we present a detailed overview of the Financial Table Of Content extraction shared task series, FinTOC, conducted over a span of five years from 2019 to 2023. This paper serves as a retrospective analysis of the key developments in the field of financial document structure extraction. The FinTOC series, hosted within the framework of the Financial Narrative Processing (FNP) workshop, has been instrumental in shaping the landscape of Natural Language Processing (NLP) in the financial domain. Our analysis delves into the diverse methodologies proposed by participants across all editions, shedding light on the innovative strategies employed to tackle the intricate challenge of extracting structured information from financial documents. We explore the evolution of techniques, from traditional rule-based approaches to cutting-edge deep learning models, showcasing the dynamic nature of NLP advancements. Furthermore, our study investigates the introduction of multilingual datasets by the organizers, highlighting the importance of cross-lingual analysis in financial document processing. We also examine the contributions made by participants in augmenting the training data with external sources, showcasing the collaborative spirit of the NLP community in enhancing the quality and size of the shared training dataset.
No abstract available
No abstract available
Accounting processes are evolving rapidly as organizations integrate artificial intelligence (AI) and natural language processing (NLP) technologies to revolutionize financial document processing. This paper explores the development and application of AI-driven accounting automation systems that harness the power of NLP to extract, analyze, and interpret data from a myriad of financial documents, including invoices, receipts, and regulatory filings. The study investigates the ability of machine learning algorithms to understand context, manage unstructured information, and detect underlying patterns, thereby improving accuracy and reducing manual intervention. Through a comprehensive review of existing methodologies and experimental implementations, the research highlights the transformative impact of these technologies on conventional accounting practices. The integration of NLP not only enhances data extraction efficiency but also supports compliance and risk management by identifying anomalies and inconsistencies in financial records. Moreover, the proposed system offers scalability, adapting to varying data volumes while ensuring real-time processing and precise reporting. By addressing challenges such as data heterogeneity, linguistic ambiguity, and domain-specific terminology, this paper presents a robust framework for implementing AI-driven accounting solutions that optimize operational workflows. The findings indicate that embracing AI and NLP in accounting automation can lead to significant cost reductions, enhanced decision-making, and overall performance improvements. This research paves the way for future advancements in intelligent financial systems, underlining the importance of ongoing innovation and the strategic integration of emerging technologies in the accounting sector. By continuously refining these innovative approaches, organizations can achieve sustainable growth and maintain competitive advantage in a dynamic economic landscape globally.
Financial due diligence requires intensive analysis of vast unstructured documents (e.g., contracts, statements, invoices). However, traditional manual processing is inefficient, costly, and prone to subjectivity, and existing automation solutions primarily focus on single-modal text recognition, lacking the capacity for joint understanding of multimodal features (e.g., layout, seals, table structures) and deep risk reasoning. This study proposes an end-to-end framework based on a Multimodal Large Language Model (MLLM) to bridge this gap. The framework not only performs accurate multimodal information extraction but also integrates domain-specific knowledge (e.g., regulatory clauses) to emulate expert-like reasoning. By constructing a dynamic risk knowledge graph that captures entities and relations across documents, it enables cross-document correlation analysis and anomaly detection. We will validate the framework on curated financial datasets, assessing both its information processing accuracy and risk diagnosis capability. Our contributions are threefold: 1) providing a novel computational linguistics solution that addresses the semantic and pragmatic challenges in financial document understanding; 2) advancing financial AI from perceptual to cognitive intelligence through explainable, knowledge-integrated reasoning; 3) offering a transparent, automated decision-support tool for high-stakes due diligence.
Financial event extraction enables the extraction of comprehensive and accurate information about financial events from documents. This paper explores the current methods for extracting events at the financial document level, which often involve custom-designed networks and processes. We question whether such extensive efforts are truly necessary for this task. Our research is motivated by recent developments in generative event extraction, which have shown success in sentence-level extraction but have yet to be explored for financial document-level extraction. To fill this gap, we propose a generative solution for document-level event extraction, which is more challenging due to the presence of scattered arguments and multiple events. We introduce an encoding scheme to capture entity-to-document level information and a decoding scheme that makes the generative process aware of all relevant contexts. Our results indicate that using our method, a generative-based solution can perform as well as state-of-the-art methods that use a specialized structure for document event extraction, providing an easy-to-use, strong baseline for future research.
Financial document understanding remains a critical challenge for Large Language Models, primarily due to the complex interplay between narrative text and structured numerical tables. Existing Retrieval-Augmented Generation (RAG) systems often treat these modalities in isolation, leading to significant failures in tasks requiring joint reasoning. This study introduces HierFinRAG, a novel hierarchical multimodal framework designed to unify tabular and textual data processing. Our approach employs a Table-Text Graph Neural Network (TTGNN) to explicitly model semantic and structural dependencies between table cells and corresponding text, coupled with a Symbolic–Neural Fusion module that routes queries between a neural generator and a symbolic calculator for precise arithmetic operations. We evaluate the system on the FinQA and FinanceBench datasets, comparing performance against strong baselines including Vanilla RAG and GPT-4o with Code Interpreter. Results demonstrate that HierFinRAG achieves an Exact Match score of 82.5% on FinQA, surpassing the best baseline by 6.5 percentage points, while maintaining a 3.5× faster inference latency than agentic approaches. These findings indicate that integrating hierarchical structural awareness with hybrid reasoning significantly enhances the accuracy and interpretability of financial artificial intelligence systems.
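The idea of routing queries between a neural generator and a symbolic calculator can be illustrated with a small sketch. The routing rule here (try to parse the query as plain arithmetic, otherwise hand it to the generator) is a hypothetical simplification, not the mechanism of HierFinRAG's Symbolic–Neural Fusion module, and `neural_generator` stands in for any callable that answers free-text questions.

```python
import ast
import operator

# Only these binary operators are allowed in the symbolic path.
_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}


def symbolic_eval(expression):
    """Evaluate a plain arithmetic expression without using eval()."""

    def _eval(node):
        if isinstance(node, ast.Expression):
            return _eval(node.body)
        if (
            isinstance(node, ast.Constant)
            and isinstance(node.value, (int, float))
            and not isinstance(node.value, bool)
        ):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](_eval(node.left), _eval(node.right))
        if isinstance(node, ast.UnaryOp) and isinstance(node.op, ast.USub):
            return -_eval(node.operand)
        raise ValueError("unsupported expression")

    return _eval(ast.parse(expression, mode="eval"))


def route(query, neural_generator):
    """Send arithmetic to the calculator and everything else to the generator."""
    try:
        return "symbolic", symbolic_eval(query)
    except (ValueError, SyntaxError):
        return "neural", neural_generator(query)


def stub_generator(query):
    # Placeholder for a real language model call.
    return "generated answer for: " + query


print(route("(120 - 100) / 100", stub_generator))
print(route("Why did revenue grow?", stub_generator))
```

Keeping arithmetic in a deterministic evaluator is one way to avoid the numeric slips that generators can make on multi-step calculations.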
No abstract available
In recent years, knowledge graphs have become vital for financial decision support by allowing structured representation of and reasoning over complex textual data. However, traditional methods such as the Financial Causality Knowledge Graph (FinCaKG) fail in inconsistent or semantically weak settings due to their dependence on raw text strings without ontology alignment. Hence, a Financial Causality Knowledge Graph with Ontology Integration (FinCaKG-Onto) framework is proposed to ensure accurate causality extraction and semantically consistent financial knowledge representation. Initially, financial reports (10-K filings) from the FinCausal benchmark dataset were gathered. After that, a Bidirectional Encoder Representations from Transformers (BERT)-based causality detection module learns to recognize cause-effect spans in financial reports from labeled datasets. Furthermore, the extracted spans were dynamically mapped to standardized financial concepts through entity linking with the Financial Industry Business Ontology (FIBO). Then, a causality bonding mechanism creates explicit cause-effect relations among normalized entities, whereas ontology integration preserves semantic consistency and hierarchical structure. Subsequently, a schema-based organization was applied to allow lightweight reasoning across financial concepts, in which nodes were aligned to their ontology classes and subclasses. Finally, the experimental results showed that the proposed FinCaKG-Onto outperformed FinCaKG, attaining an ontology consistency of 95.6% across large-scale financial reports.
The advent of large language models (LLMs) has initiated much research into their various financial applications. However, in applying LLMs on long documents, semantic relations are not explicitly incorporated, and a full or arbitrarily sparse attention operation is employed. In recent years, progress has been made in Abstract Meaning Representation (AMR), which is a graph-based representation of text to preserve its semantic relations. Since AMR can represent semantic relationships at a deeper level, it can be beneficially utilized by graph neural networks (GNNs) for constructing effective document-level graph representations built upon LLM embeddings to predict target metrics in the financial domain. We propose FLAG: Financial Long document classification via AMR-based GNN, an AMR graph based framework to generate document-level embeddings for long financial document classification. We construct document-level graphs from sentence-level AMR graphs, endow them with specialized LLM word embeddings in the financial domain, apply a deep learning mechanism that utilizes a GNN, and examine the efficacy of our AMR-based approach in predicting labeled target data from long financial documents. Extensive experiments are conducted on a dataset of quarterly earnings calls transcripts of companies in various sectors of the economy, as well as on a corpus of more recent earnings calls of companies in the S&P 1500 Composite Index. We find that our AMR-based approach outperforms fine-tuning LLMs directly on text in predicting stock price movement trends at different time horizons in both datasets. Our work also outperforms previous work utilizing document graphs and GNNs for text classification.
Financial documents are packed with valuable insights but are often long, complex, and hard to interpret—especially under rapidly changing market conditions. This paper introduces Finalyze, an intelligent assistant built using Retrieval-Augmented Generation (RAG) and large language models (LLMs) to automate financial document analysis. Finalyze supports multi-format inputs (PDF, CSV, XLSX), semantically segments content, and enables users to ask natural language questions or obtain contextual summaries. It also integrates real-time stock market sentiment using live news feeds. Compared to existing tools like Bloomberg and AlphaSense, Finalyze offers broader document support, interactive query handling, and real-time analysis—while remaining accessible to individuals and small teams. Experimentally, Finalyze achieves a context-recall of 78.26%, a context relevance of 86.99%, and an interface response time of 1.8-2.3 seconds—demonstrating high accuracy and responsiveness. These results highlight the system's ability to bridge static document data with dynamic financial context, making it a practical tool for analysts, auditors, and researchers.
Most existing event extraction (EE) methods merely extract event arguments within the sentence scope. However, such sentence-level EE methods struggle to handle soaring amounts of documents from emerging applications, such as finance, legislation, health, etc., where event arguments always scatter across different sentences, and even multiple such event mentions frequently co-exist in the same document. To address these challenges, we propose a novel end-to-end model, Doc2EDAG, which can generate an entity-based directed acyclic graph to fulfill the document-level EE (DEE) effectively. Moreover, we reformalize a DEE task with the no-trigger-words design to ease the document-level event labeling. To demonstrate the effectiveness of Doc2EDAG, we build a large-scale real-world dataset consisting of Chinese financial announcements with the challenges mentioned above. Extensive experiments with comprehensive analyses illustrate the superiority of Doc2EDAG over state-of-the-art methods. Data and codes can be found at https://github.com/dolphin-zs/Doc2EDAG.
Document-level financial event extraction (DFEE) is the task of detecting events and extracting the corresponding event arguments in financial documents, which plays an important role in information extraction in the financial domain. This task is challenging because financial documents are generally long and the arguments of one event may be scattered across different sentences. To address this issue, we propose a novel Prior Information Enhanced Extraction framework (PIEE) for DFEE, leveraging prior information from both event types and pre-trained language models. Specifically, PIEE consists of three components: event detection, event argument extraction, and event table filling. In event detection, we identify the event type. Then, the event type is explicitly used for event argument extraction. Meanwhile, the implicit information within language models also provides considerable cues for event argument localization. Finally, all the event arguments are filled into an event table by a set of predefined heuristic rules. To demonstrate the effectiveness of our proposed framework, we participated in the shared task of CCKS2020 Task 4-2: Document-level Event Arguments Extraction. On both Leaderboard A and Leaderboard B, PIEE took first place and significantly outperformed the other systems.
Information extraction from financial document images is crucial in computer vision and NLP: financial data often exists in image or PDF format, and advances in OCR enable organizations to analyze it and make informed business decisions. Tables are among the most prominent structures in financial document images and hold important portions of a document's data, and many deep learning-based methods have been proposed to detect table regions inside document images. The shortcomings of current approaches are that they are limited to detecting the table region and struggle with varying layouts and with preserving the relations among the different attributes of the table. Therefore, in this work, we propose an end-to-end architecture to extract information from financial table images while preserving the row and column structure of the attributes within the table. We divide the task into four modules and generate synthesized data with different augmentation techniques to overcome data scarcity and boost the performance of the pipeline modules. In terms of information extraction, the proposed method achieved 85% accuracy on the target invoice dataset.
The financial and accounting sectors are encountering increased demands to effectively manage large volumes of documents in today’s digital environment. Meeting this demand is crucial for accurate archiving, maintaining efficiency and competitiveness, and ensuring operational excellence in the industry. This study proposes and analyzes machine learning-based pipelines to effectively classify and extract information from scanned and photographed financial documents, such as invoices, receipts, bank statements, etc. It also addresses the challenges associated with financial document processing using deep learning techniques. This research explores several models, including LeNet5, VGG19, and MobileNetV2 for document classification and RoBERTa, LayoutLMv3, and GraphDoc for information extraction. The models are trained and tested on financial documents from previously available benchmark datasets and a new dataset with financial documents in Romanian. Results show MobileNetV2 excels in classification tasks (with accuracies of 99.24% with data augmentation and 93.33% without augmentation), while RoBERTa and LayoutLMv3 lead in extraction tasks (with F1-scores of 0.7761 and 0.7426, respectively). Despite the challenges posed by the imbalanced dataset and cross-language documents, the proposed pipeline shows potential for automating the processing of financial documents in the relevant sectors.
We introduce FinMMDocR, a novel bilingual multimodal benchmark for evaluating multimodal large language models (MLLMs) on real-world financial numerical reasoning. Compared to existing benchmarks, our work delivers three major advancements. (1) Scenario Awareness: 57.9% of 1,200 expert-annotated problems incorporate 12 types of implicit financial scenarios (e.g., Portfolio Management), challenging models to perform expert-level reasoning based on assumptions; (2) Document Understanding: 837 Chinese/English documents spanning 9 types (e.g., Company Research) average 50.8 pages with rich visual elements, significantly surpassing existing benchmarks in both breadth and depth of financial documents; (3) Multi-Step Computation: Problems demand 11-step reasoning on average (5.3 extraction + 5.7 calculation steps), with 65.0% requiring cross-page evidence (2.4 pages average). The best-performing MLLM achieves only 58.0% accuracy, and different retrieval-augmented generation (RAG) methods show significant performance variations on this task. We expect FinMMDocR to drive improvements in MLLMs and reasoning-enhanced methods on complex multimodal reasoning tasks in real-world scenarios.
With the proliferation of digital financial services and digital transactional documents, the volume of data held in invoices, receipts, bank statements, and balance sheets is growing rapidly. Information extraction from these documents has therefore attracted considerable interest. Manual data extraction is time-consuming and prone to human error because the documents come in many formats. This paper covers techniques, tools, and technologies for extracting tables from financial and transactional documents, with a focus on vertical tables and mixed-type data representations. Table extraction means extracting tabular data from a readable document image and transforming it into a structured format (CSV/JSON). The paper discusses extraction methods such as rule-based extraction, optical character recognition (OCR), and machine learning models. It also covers use cases from banking, e-commerce, accounting, and other industries. The paper then discusses ethical and legal implications, including compliance with data privacy laws such as GDPR and HIPAA, and the need for AI systems to be transparent and fair. Finally, it discusses future trends in table extraction, including integration with generative AI, large language models (LLMs), and robotic process automation (RPA), as well as real-time data extraction. Overall, the paper highlights the growing demand for advanced extraction technologies that increase the accuracy, efficiency, and scalability of financial document processing.
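To make the idea of turning a vertical table into a structured format concrete, here is a minimal Python sketch. It is not taken from the paper: the input rows, the function names (`parse_amount`, `vertical_table_to_record`, `record_to_csv`), and the rule for handling mixed-type cells are all illustrative assumptions, and the OCR step is assumed to have already produced (label, value) pairs.

```python
import csv
import io
import json


def parse_amount(text):
    """Return a float for numeric-looking cells, else the stripped string.

    Hypothetical rule for mixed-type cells: drop thousands separators
    and try a float conversion; anything that fails stays a string.
    """
    cleaned = text.strip().replace(",", "")
    try:
        return float(cleaned)
    except ValueError:
        return text.strip()


def vertical_table_to_record(rows):
    """Pivot a vertical (label, value) table into one flat record."""
    return {label.strip(): parse_amount(value) for label, value in rows}


def record_to_csv(record):
    """Serialize one record as a header line plus one data line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(record), lineterminator="\n")
    writer.writeheader()
    writer.writerow(record)
    return buffer.getvalue()


# Made-up OCR output for a vertical invoice table (illustration only).
ocr_rows = [
    ("Invoice No", "INV-001"),
    ("Date", "2024-03-15"),
    ("Total", "1,250.50"),
]

record = vertical_table_to_record(ocr_rows)
as_json = json.dumps(record)
as_csv = record_to_csv(record)
```

In a vertical table each row carries one field, so the pivot is a single dictionary build; a real pipeline would also need cell detection and OCR error handling, which this sketch leaves out.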
Towards the intelligent understanding of table-text data in the finance domain, previous research explores numerical reasoning over table-text content with Question Answering (QA) tasks. A general framework is to extract supporting evidence from the table and text and then perform numerical reasoning over extracted evidence for inferring the answer. However, existing models are vulnerable to missing supporting evidence, which limits their performance. In this work, we propose a novel Semantic-Oriented Hierarchical Graph (SoarGraph) that models the semantic relationships and dependencies among the different elements (e.g., question, table cells, text paragraphs, quantities, and dates) using hierarchical graphs to facilitate supporting evidence extraction and enhance numerical reasoning capability. We conduct our experiments on two popular benchmarks, FinQA and TAT-QA datasets, and the results show that our SoarGraph significantly outperforms all strong baselines, demonstrating remarkable effectiveness.
Relation Extraction (RE) aims to extract semantic relationships in texts from given entity pairs, and has achieved significant improvements. However, in different domains, the RE task can be influenced by various factors. For example, in the financial domain, sentiment can affect RE results, yet this factor has been overlooked by modern RE models. To address this gap, this paper proposes a Sentiment-aware-SDP-Enhanced-Module (SSDP-SEM), a multi-task learning approach for enhancing financial RE. Specifically, SSDP-SEM integrates the RE models with a pluggable auxiliary sentiment perception (ASP) task, enabling the RE models to concurrently navigate their attention weights with the text’s sentiment. We first generate detailed sentiment tokens through a sentiment model and insert these tokens into an instance. Then, the ASP task focuses on capturing nuanced sentiment information through predicting the sentiment token positions, combining both sentiment insights and the Shortest Dependency Path (SDP) of syntactic information. Moreover, this work employs a sentiment attention information bottleneck regularization method to regulate the reasoning process. Our experiment integrates this auxiliary task with several prevalent frameworks, and the results demonstrate that most previous models benefit from the auxiliary task, thereby achieving better results. These findings highlight the importance of effectively leveraging sentiment in the financial RE task.
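The step of inserting sentiment tokens into an instance and then predicting their positions can be illustrated with a small Python sketch. This is only an illustration under strong assumptions, not the SSDP-SEM implementation: a toy lexicon (`TOY_LEXICON`) stands in for the sentiment model, the marker strings `[POS]` and `[NEG]` are hypothetical, and the dependency-path and information-bottleneck components are omitted.

```python
# Toy lexicon standing in for the sentiment model (an assumption).
TOY_LEXICON = {
    "surged": "[POS]",
    "profit": "[POS]",
    "lawsuit": "[NEG]",
    "loss": "[NEG]",
}


def insert_sentiment_tokens(tokens, lexicon=TOY_LEXICON):
    """Insert a sentiment marker before each sentiment-bearing token.

    Returns the augmented token list and the indices of the inserted
    markers, which could serve as labels for an auxiliary
    position-prediction task.
    """
    augmented = []
    positions = []
    for token in tokens:
        tag = lexicon.get(token.lower())
        if tag is not None:
            positions.append(len(augmented))  # index of the marker itself
            augmented.append(tag)
        augmented.append(token)
    return augmented, positions


tokens = "Acme faces a lawsuit after profit surged".split()
augmented, positions = insert_sentiment_tokens(tokens)
```

The returned positions index into the augmented sequence rather than the original one, so they line up with what a model would see after insertion.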
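This survey shows that applications of natural language processing to financial report analysis have evolved into a systematic methodology centered on deep learning and large language models. Current research falls into four main pillars: (1) high-precision information extraction and structured analysis built on deep models; (2) narrative analysis of financial report sentiment and readability and their effects on market signals and disclosure quality; (3) predictive modeling that combines textual information with quantitative financial data, covering fraud detection, compliance early warning, and financial forecasting; and (4) supplementary research on special topics such as the macro regulatory environment and inter-company relationship networks. Overall, the field is moving from traditional word-frequency statistics toward complex multimodal semantic reasoning and intelligent audit decision support.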