Efficiency and Bias of Large Language Models (LLMs) in Intelligent Auditing and Financial Statement Analysis
Intelligent Audit Automation and Organizational Digital Transformation
This group of studies examines the macro-level application and process optimization of AI and LLMs in the audit function, including internal audit automation, detection of cross-check (articulation) relationships in financial statements, audit efficiency gains, and the organizational adaptability and dynamic capabilities of audit firms undergoing digital transformation.
- Research on Financial Statement Checking Relationship Recognition System Based on Large Language Models(Haichao Zhang, Jie Zhang, Jiancheng Zhou, 2025, Proceedings of the 2nd Guangdong-Hong Kong-Macao Greater Bay Area International Conference on Digital Economy and Artificial Intelligence)
- AI-Enhanced Intelligent Internal Audit Automation Systems for Continuous Compliance Monitoring and Multi-Layer Risk Mitigation(Laith Ali Muttar, Majid M. Manhosh, Hasan Abd Al-Hussein, Areej Abdalghfou, Abdulsalam Ali Hussein Alnoori, Abdulrazaq Shabeeb, S. R. Jasim, Hussain D.D, Wisam A Mohammedhasan, 2025, 2025 3rd International Conference on Cyber Resilience (ICCR))
- Automating Financial Statement Audits with Large Language Models(Rushi Wang, Jiateng Liu, Weijie Zhao, Shenglan Li, Denghui Zhang, 2025, ArXiv)
- Enhancing Audit Efficiency Using Deep Learning for Automated Financial Statement Analysis(Liam Edwards, R. Hughes, 2025, International Journal of Global Economics and Management)
- The Impact of AI-Integrated Drone Technology and Big Data on External Auditing Performance, Sustainability, and Financial Reporting Quality on the Emerging Market(Abdulkarim Hamdan J. Alhazmi, Sardar Islam, M. Prokofieva, 2025, Accounting and Auditing)
- The Impact of the Use of Artificial Intelligence on the Development of External Audit Efficiency in Jordanian Mining and Extractive Corporations(Ali Mustafa Magablih, 2025, Accounting and Finance Research)
- AI and Auditing: Enhancing Audit Efficiency and Effectiveness with Artificial Intelligence(Lidiana Lidiana, 2024, Accounting Studies and Tax Journal (COUNT))
- Audit efficiency: Modern challenges and promising approaches(Yashar Mammadov, 2025, JOURNAL OF ECONOMIC GROWTH AND SOCIAL WELFARE)
- Improving Zero-Shot Text Matching for Financial Auditing with Large Language Models(L. Hillebrand, Armin Berger, Tobias Deußer, Tim Dilmaghani, Mohamed Khaled, Bernd Kliem, Rüdiger Loitz, Maren Pielka, David Leonhard, C. Bauckhage, R. Sifa, 2023, Proceedings of the ACM Symposium on Document Engineering 2023)
- Enhancing Internal Audit Efficiency for Effective Risk Management and Corporate Governance Frameworks(Onyenum Ruth Udoh, 2024, International Journal of Research Publication and Reviews)
- The Role of Artificial Intelligence in Enhancing Global Internal Audit Efficiency: An Analysis(Iyad Ghafar, Widya Perwitasari, Rama Kurnia, 2024, Asian Journal of Logistics Management)
- Research on the Impact of Digital Transformation on the Audit Quality and Efficiency of Accounting Firms(Fengyu Wang, Yanqi Tang, Xiangwei Meng, Ruiyun Wang, 2025, Journal of Modern Business and Economics)
- Generative AI-enabled intelligent auditing: an organizational adaptation mechanism study based on dynamic capability theory(Deng Wei, Obed Rashdi Syed, Xiaoli Xu, H. Sang, Jiang Wang, 2025, Future Technology)
- TURKISH COURT OF ACCOUNTS: ANALYZING FINANCIAL AUDIT, DIGITALIZATION, AI IMPACT(Muhammet Damar, Ömer Aydın, Eren Özoğuz, Üzeyir Aydın, Ahmet Özen, 2024, EDPACS)
- The Future of AI-Powered Auditing: Enhancing Accuracy and Reducing Errors(Siddharth S Karale, Sudip Debkumar Chatterji, Jaya Krishna Modadugu, A. Ghadage, H. Alsailawi, Mustafa Mudhafar, 2025, 2025 IEEE 5th International Conference on ICT in Business Industry & Government (ICTBIG))
Financial Fraud Detection and Intelligent Risk Identification and Monitoring
These studies focus on using LLMs, graph neural networks (GNNs), and deep learning to identify fraud, anomalous indicators, and latent risks in financial statements, improving detection accuracy by fusing the semantics of the Management Discussion and Analysis (MD&A) with numerical financial data.
- Leveraging Large Language Models in Financial Statement Fraud Detection of Listed Companies(Changhao Song, Min Liu, Chuanghao Dong, Lu Zhang, Changjian Fang, 2025, 2025 Thirteenth International Conference on Advanced Cloud and Big Data (CBD))
- Enterprise financial fraud identification based on Transformer and SMOTE algorithm(Hao Hu, 2026, No journal)
- Financial Statement Fraud Detection via Large Language Models(Zehra Erva Ergun, Emre Sefer, 2025, Intell. Syst. Account. Finance Manag.)
- APPLYING MACHINE LEARNING TO AUDIT DATA: ENHANCING FRAUD DETECTION, RISK ASSESSMENT AND AUDIT EFFICIENCY(Nihan Özbaltan, 2024, EDPACS)
- Detecting Financial Fraud Through AI-Powered Analysis of GPT-Generated Text(Amitabha Maheshwari, Praveen Aronkar, 2025, FMDB Transactions on Sustainable Technoprise Letters)
- Applying Natural Language Processing to Financial Risk Disclosures and Audit Trails(Prashant Singh, 2023, Journal of Advances in Developmental Research)
- Multimodal detection framework for financial fraud integrating LLMs and interpretable machine learning(Hui Nie, Zhaoye Long, Ze-jun Fang, Lu Gao, 2025, Journal of Data and Information Science)
- Intelligent BiLSTM-Attention-IBPNN Method for Anomaly Detection in Financial Auditing(Shui-Bo Wang, 2024, IEEE Access)
- Machine Learning based Enterprise Financial Audit Framework and High Risk Identification(Ting Yuan, Xi Zhang, Xuanjing Chen, 2025, ArXiv)
- An Intelligent Financial Fraud Detection Support System Based on Three-Level Relationship Penetration(Xiang Li, Lei Chu, Yujun Li, Zhanjun Xing, Fengqian Ding, Jintao Li, Ben Ma, 2024, Mathematics)
- FraudGT: A Simple, Effective, and Efficient Graph Transformer for Financial Fraud Detection(Junhong Lin, Xiaojie Guo, Yada Zhu, S. Mitchell, Erik Altman, Julian Shun, 2024, Proceedings of the 5th ACM International Conference on AI in Finance)
- Enhancing Financial Risk Analysis using RAG-based Large Language Models(A.A. Darji, Fenil Kheni, Dhruvil Chodvadia, Parth Goel, Dweepna Garg, Bankim Patel, 2024, 2024 3rd International Conference on Automation, Computing and Renewable Systems (ICACRS))
- Identifying Financial Risk Information Using RAG with a Contrastive Insight(A. Elahi, 2025, ArXiv)
- Research on Financial Risk Intelligent Monitoring and Early Warning Model Based on LSTM, Transformer, and Deep Learning(Yunan Song, Huaqing Du, Tianyu Piao, Hongyu Shi, 2024, J. Organ. End User Comput.)
- Financial text analysis and credit risk assessment using a GPT-4 and improved BERT fusion model(H. Tan, Y. Xie, 2025, PLOS One)
- Leveraging Internet-Sourced Text Data for Financial Analytics in Supply Chain Finance: A Large Language Model-Enhanced Text Mining Workflow(Jiaxing Wang, Guoquan Liu, Yang Cheng, Xiaobo Xu, Zhongyun Li, 2025, IEEE Transactions on Engineering Management)
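The text-and-numbers fusion idea running through this section can be sketched as follows. This is a minimal stdlib-only illustration, not a method from any cited paper: the risk-word list, feature weights, and threshold are all hypothetical, whereas the surveyed systems learn such weights with LLMs, GNNs, or deep classifiers.

```python
# Minimal sketch: fuse an MD&A text signal with numeric financial ratios
# into a single fraud-risk score. All feature names, weights, and the
# threshold are hypothetical illustrations.

HEDGE_WORDS = {"uncertain", "restated", "impairment", "contingent", "litigation"}

def mdna_hedge_ratio(mdna_text: str) -> float:
    """Fraction of MD&A tokens that are risk/hedge words (toy text feature)."""
    tokens = mdna_text.lower().split()
    if not tokens:
        return 0.0
    return sum(t.strip(".,;") in HEDGE_WORDS for t in tokens) / len(tokens)

def fraud_risk_score(mdna_text: str, accruals_ratio: float,
                     receivables_growth: float) -> float:
    """Weighted fusion of one text feature and two numeric features."""
    text_feat = mdna_hedge_ratio(mdna_text)
    # Hypothetical weights; in the surveyed work these would be learned.
    return 0.5 * text_feat + 0.3 * accruals_ratio + 0.2 * receivables_growth

def flag_company(mdna_text: str, accruals_ratio: float,
                 receivables_growth: float, threshold: float = 0.25) -> bool:
    """Flag a filing for review when the fused score crosses the threshold."""
    return fraud_risk_score(mdna_text, accruals_ratio, receivables_growth) >= threshold
```

In the cited work, the toy hedge-word count would be replaced by LLM-derived MD&A embeddings, and the linear fusion by a trained classifier or graph model.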
In-Depth Financial Report Analysis, Performance Forecasting, and Decision Support
This group of studies examines LLM performance in financial report summarization, sentiment analysis, KPI prediction, and equity research, validating the models' ability to emulate human analysts in interpreting financial ratios, forecasting earnings, and analyzing ESG reports.
- Characterizing Multimodal Long-form Summarization: A Case Study on Financial Reports(Tianyu Cao, Natraj Raman, Danial Dervovic, Chenhao Tan, 2024, ArXiv)
- SusGen-GPT: A Data-Centric LLM for Financial NLP and Sustainability Report Generation(Qilong Wu, Xiaoneng Xiang, Hejia Huang, Xuan Wang, Yeo Wei Jie, Ranjan Satapathy, Ricardo Shirota Filho, B. Veeravalli, 2024, ArXiv)
- Can AI Read Between The Lines? Benchmarking LLMs On Financial Nuance(Dominick Kubica, Dylan T. Gordon, Nanami Emura, Derleen Saini, Charlie Goldenberg, 2025, ArXiv)
- Augmenting Financial Planning and Analysis: Leveraging AI and LLMs for Predictive Insights and Strategic Foresight(Gautham Panneer Selvam, 2025, European Modern Studies Journal)
- Can Large language model analyze financial statements well?(Xinlin Wang, M. Brorsson, 2025, No journal)
- A Preliminary Fundamental Financial Analysis Framework Using Structured LLM Prompting - A Case Study(Ishan Gupta, N. Sharma, Abhay Kaushal, Rajeswara Rao Kvs, 2025, 2025 9th International Conference on Computational System and Information Technology for Sustainable Solutions (CSITSS))
- Can GPT models be Financial Analysts? An Evaluation of ChatGPT and GPT-4 on mock CFA Exams(Ethan Callanan, A. Mbakwe, Antony Papadimitriou, Yulong Pei, Mathieu Sibue, Xiaodan Zhu, Zhiqiang Ma, Xiaomo Liu, Sameena Shah, 2023, ArXiv)
- FinRobot: AI Agent for Equity Research and Valuation with Large Language Models(Tianyu Zhou, Pin Wang, Yilin Wu, Hongyang Yang, 2024, ArXiv)
- Financial Text Analysis Using 1D-CNN: Risk Classification and Auditing Support(Xinyu Du, 2025, Proceedings of the 2025 International Conference on Artificial Intelligence and Computational Intelligence)
- Intelligent Information Processing for Corporate Performance Prediction: A Hybrid Natural Language Processing (NLP) and Deep Learning Approach(Qidi Yu, Chen Xing, Yanjing He, Sunghee Ahn, H. Na, 2026, Electronics)
- Financial Statement Analysis with Large Language Models(Alex G. Kim, Maximilian Muhn, Valeri V. Nikolaev, 2024, ArXiv)
- Predicting Numeric Financial KPIs From Unstructured Text: a Comparative Study of LLM-Based Embeddings and Traditional NLP Techniques(Lord Coffie, Melvin Ajuluchukwu, Michael Nsor, 2025, 2025 IEEE/ACIS 29th International Conference on Software Engineering, Artificial Intelligence, Networking and Parallel/Distributed Computing (SNPD))
- Evaluating Large Language Models for Stance Detection on Financial Targets from SEC Filing Reports and Earnings Call Transcripts(Nikesh Gyawali, Doina Caragea, A. Vasenkov, Cornelia Caragea, 2025, ArXiv)
- Fact or Opinion? – Essential Value for Financial Results Briefing(Yutaka Kuroki, Tomonori Manabe, Kei Nakagawa, 2023, 2023 14th IIAI International Congress on Advanced Applied Informatics (IIAI-AAI))
- Beyond Surface Similarity: Detecting Subtle Semantic Shifts in Financial Narratives(Jiaxin Liu, Yi Yang, K. Tam, 2024, No journal)
Finance-Specific Technical Architectures: RAG, Multimodality, and Multi-Agent Collaboration
These studies focus on underlying architectures optimized for financial scenarios, including retrieval-augmented generation (RAG), knowledge graph (KG) integration, multi-agent collaboration systems, NL2SQL techniques, and specialized Transformer models for tabular data.
- LLMs for Financial Document Processing(Venkata Sai Nageen, 2024, International Journal of Artificial Intelligence, Data Science, and Machine Learning)
- Swin Transformer and dual-layer routing attention for enhanced financial accounting prediction(Xianghan Zhang, Daowen Ren, 2025, Journal of Computational Methods in Sciences and Engineering)
- Spatial ModernBERT: Spatial-Aware Transformer for Table and Key-Value Extraction in Financial Documents at Scale(Amrendra Singh, M. Shah, Dharshan Sampath, 2025, ArXiv)
- Integrating AI-powered knowledge graphs and NLP for intelligent interpretation, summarization, and cross-border financial reporting harmonization(Oriyomi Badmus, Olumide Johnson Ikumapayi, Rebecca Olubunmi Toromade, A. Adebayo, 2025, World Journal of Advanced Research and Reviews)
- Design of an intelligent optimization framework for corporate financial management based on GA-FL-transformer(Fengnian Zhu, Shaotian Liu, Feng Yuan, Muddassira Arshad, 2026, PeerJ Computer Science)
- GraphRAG: Leveraging Graph-Based Efficiency to Minimize Hallucinations in LLM-Driven RAG for Finance Data(Ma Barry, Gaëtan Caillaut, Pierre Halftermeyer, Raheel Qader, Mehdi Mouayad, Fabrice Le Deit, Dimitri Cariolaro, Joseph Gesnouin, 2025, No journal)
- Sustainable Digitalization of Business with Multi-Agent RAG and LLM(Muhammad Arslan, Saba Munawar, Christophe Cruz, 2025, No journal)
- Structured Financial QA with LLMs: Fine-Tuning vs. Code-Augmented Retrieval(Alperen Çağlayan, Saliha Nur Gökçe, Değer Ayata, 2025, 2025 10th International Conference on Computer Science and Engineering (UBMK))
- Advancing Retrieval-Augmented Generation for Financial Question Answering(Akmal Ali Jasmin, Indika Perera, Muaadh Mohamed, Mohamed Mushraf, 2025, 2025 Moratuwa Engineering Research Conference (MERCon))
- GraphRAG Analysis for Financial Narrative Summarization and A Framework for Optimizing Domain Adaptation(Neelesh K. Shukla, Prabhat Prabhakar, Sakthivel Thangaraj, Sandeep Singh, Weiyi Sun, Prasanna Venkatesan, Viji Krishnamurthy, 2025, No journal)
- Knowledge Graph Construction for Stock Markets with LLM-Based Explainable Reasoning(Cheonsol Lee, Youngsang Jeong, J. Shin, Huiju Kim, Jidong Kim, 2025, ArXiv)
- Research and Practice of NL2SQL Technology Based on LLM for Big Data of Enterprise Finance(Jianfeng Zhang, Yingying Li, Yunhao Liu, Limiao Xie, 2024, 2024 4th International Conference on Advanced Enterprise Information System (AEIS))
- FinReflectKG: Agentic Construction and Evaluation of Financial Knowledge Graphs(Abhinav Arun, Fabrizio Dimino, T. Agarwal, Bhaskarjit Sarmah, Stefano Pasquali, 2025, Proceedings of the 6th ACM International Conference on AI in Finance)
- QuantMCP: Grounding Large Language Models in Verifiable Financial Reality(Yifan Zeng, 2025, ArXiv)
- Multimodal retrieval-augmented generation for financial documents: image-centric analysis of charts and tables with large language models(Cheng Jiang, Pengle Zhang, Ying Ni, Xiaoli Wang, Hanghang Peng, Sen Liu, Mengdi Fei, Yuxin He, Yaxuan Xiao, Jin Huang, Xingyu Ma, Tiankun Yang, 2025, The Visual Computer)
- Table Extraction from Financial and Transactional Documents(Rama Krishna Raju Samantapudi, 2025, International journal of IoT)
- GPT-FinRE: In-context Learning for Financial Relation Extraction using Large Language Models(P. Rajpoot, Ankur Parikh, 2023, ArXiv)
- FinTeam: A Multi-Agent Collaborative Intelligence System for Comprehensive Financial Scenarios(Yingqiang Wu, Qiushi Wang, Zefei Long, Rong Ye, Zhongtian Lu, Xianyin Zhang, Bingxuan Li, Wei Chen, Liwen Zhang, Zhongyu Wei, 2025, ArXiv)
- FinRpt: Dataset, Evaluation System and LLM-based Multi-agent Framework for Equity Research Report Generation(Song Jin, Shuqi Li, Shukun Zhang, Rui Yan, 2025, ArXiv)
- CHATREPORT: Democratizing Sustainability Disclosure Analysis through LLM-based Tools(Jingwei Ni, J. Bingler, Chiara Colesanti-Senni, Mathias Kraus, Glen Gostlow, T. Schimanski, Dominik Stammbach, S. Vaghefi, Qian Wang, Nicolas Webersinke, Tobias Wekhof, Ting Yu, Markus Leippold, 2023, ArXiv)
- Can a GPT4-Powered AI Agent Be a Good Enough Performance Attribution Analyst?(Bruno Guimarães de Melo, Jamiel Sheikh, 2024, ArXiv)
- Language Model Orchestrated Financial Agents: An Open-Source Framework(Ravi Teja Gundimeda, 2025, 2025 IEEE 4th International Conference for Advancement in Technology (ICONAT))
- Template-Based Financial Report Generation in Agentic and Decomposed Information Retrieval(Yong-En Tian, Yu-Chien Tang, Kuang-Da Wang, An-Zi Yen, Wen-Chih Peng, 2025, Proceedings of the 48th International ACM SIGIR Conference on Research and Development in Information Retrieval)
- Utilizing Modern Large Language Models (LLM) for Financial Trend Analysis and Digest Creation(Andrei Lazarev, Dmitrii Sedov, 2024, 2024 6th International Conference on Control Systems, Mathematical Modeling, Automation and Energy Efficiency (SUMMA))
- Evaluating Retrieval-Augmented Generation Models for Financial Report Question and Answering(Ivan Iaroshev, R. Pillai, Leandro Vaglietti, T. Hanne, 2024, Applied Sciences)
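The retrieval step shared by the RAG systems above can be sketched in a few lines. This is a deliberately simplified stand-in: chunks are ranked by term overlap with the question, where production systems use dense embeddings, graph traversal (GraphRAG), or agent-specific retrievers, and the assembled prompt would be sent to an LLM. All sample text is invented.

```python
# Minimal sketch of the retrieval step in a RAG pipeline for financial QA.
# Term overlap stands in for embedding similarity.

from collections import Counter

def tokenize(text: str) -> list[str]:
    return [t.strip(".,:%?").lower() for t in text.split()]

def overlap_score(question: str, chunk: str) -> int:
    """Number of question terms (with multiplicity) found in the chunk."""
    q = Counter(tokenize(question))
    c = Counter(tokenize(chunk))
    return sum(min(n, c[t]) for t, n in q.items())

def retrieve(question: str, chunks: list[str], k: int = 2) -> list[str]:
    """Return the k chunks that best match the question."""
    ranked = sorted(chunks, key=lambda ch: overlap_score(question, ch), reverse=True)
    return ranked[:k]

def build_prompt(question: str, chunks: list[str]) -> str:
    """Assemble a grounded prompt from the top-ranked report chunks."""
    context = "\n".join(f"- {c}" for c in retrieve(question, chunks))
    return (f"Answer using only the context below.\n"
            f"Context:\n{context}\nQuestion: {question}")
```

The multi-agent variants surveyed here replace the single `retrieve` call with several specialized retrievers (tables, narratives, KG triples) whose outputs an orchestrating agent merges before generation.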
LLM Bias Evaluation, Hallucination Mitigation, and Financial Reliability Benchmarks
This group of studies examines the negative effects of LLMs in financial applications (e.g., numerical hallucination, representation bias, output drift), proposing performance benchmarks for financial tasks and methods to improve reliability through chain-of-thought (CoT) prompting, data cleaning, and related techniques.
- Identifying Representation Bias in Large Language Models Used in Financial Sentiment Analysis(Alpay Sabuncuoglu, Carsten Maple, 2025, 2025 IEEE Symposium on Computational Intelligence for Financial Engineering and Economics (CiFer))
- LLM Output Drift: Cross-Provider Validation & Mitigation for Financial Workflows(Raffi Khatchadourian, Rolando Franco, 2025, ArXiv)
- Innovation of Enterprise Ethical Review Mechanism Driven by Generative AI for Financial Report Preparation(Yue Ma, Jian Du, 2025, Modern Economics & Management Forum)
- The Artificial Intelligence, Challenges for Accounting Profession. The Case of ChatGPT(L. Dumitraşcu, 2024, Audit Financiar)
- A Study on the Dual Impact of Generative Prediction (GPT)-Based AI on the Quality of Corporate Financial Disclosure(Zuoshi Zhang, 2025, Frontiers in Business, Economics and Management)
- Can ChatGPT Overcome Behavioral Biases in the Financial Sector? Classify-and-Rethink: Multi-Step Zero-Shot Reasoning in the Gold Investment(Shuoling Liu, Gaoguo Jia, Yuhang Jiang, Liyuan Chen, Qiang Yang, 2024, ArXiv)
- Journey of Hallucination-minimized Generative AI Solutions for Financial Decision Makers(Sohini Roychowdhury, 2023, Proceedings of the 17th ACM International Conference on Web Search and Data Mining)
- Addressing investor concerns: a Chinese financial question-answering benchmark with LLM-based evaluation(Yujian Gan, Yiyi Tao, Jiawang Mo, Xianzhen Huang, Yiwen Li, Kexin Wang, Yi Cai, Lu Liang, Shuzhen Xiong, Qi Ke, Hua Zheng, Xiaochu Hu, 2025, EPJ Data Science)
- Adversarially Enhanced Financial Misinformation: A Comparative Analysis of LLM- vs. GAN-Generated Content Exposing AI Moderation Vulnerabilities(Christopher Santorelli, Victor Ginart Belmonte, Ryan Mastropaolo, 2025, 2025 6th International Conference on Artificial Intelligence, Robotics and Control (AIRC))
- Dólares or Dollars? Unraveling the Bilingual Prowess of Financial LLMs Between Spanish and English(Xiao Zhang, Ruoyu Xiang, Chenhan Yuan, Duanyu Feng, Weiguang Han, Alejandro Lopez-Lira, Xiao-Yang Liu, Meikang Qiu, Sophia Ananiadou, Min Peng, Jimin Huang, Qianqian Xie, 2024, Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining)
- Hallucination-minimized Data-to-answer Framework for Financial Decision-makers(Sohini Roychowdhury, A. Alvarez, Brian Moore, Marko Krema, Maria Paz Gelpi, F. Rodriguez, Angel Rodriguez, Jose Ramon Cabrejas, Pablo Serrano, Punit Agrawal, Arijit Mukherjee, 2023, 2023 IEEE International Conference on Big Data (BigData))
- Who Invests, Who Gets Funded: Gender and Racial Bias in LLM-Generated Investment Advice(Ye Wang, Kexin Gu, 2026, SSRN Electronic Journal)
- Financial Named Entity Recognition: How Far Can LLM Go?(Yi Lu, Yintong Huo, 2025, No journal)
- FinTagging: An LLM-ready Benchmark for Extracting and Structuring Financial Information(Yan Wang, Yang Ren, Lingfei Qian, Xueqing Peng, Keyi Wang, Yi Han, Dongji Feng, Xiao-Yang Liu, Jimin Huang, Qianqian Xie, 2025, ArXiv)
- Unmasking Bias in Financial AI: A Robust Framework for Evaluating and Mitigating Hidden Biases in LLMs(Shreshth Mehrotra, Raghavendra P, Balraj Prajesh, Hrishikesh Kambale, Puspita Majumdar, 2025, Proceedings of the 6th ACM International Conference on AI in Finance)
- CARE: A Framework for Correcting Numerical Hallucinations in LLM-Generated Financial Texts(Jian Kim, Woohwan Jung, 2025, 2025 IEEE Conference on Artificial Intelligence (CAI))
- On the Reliability of Large Language Models in Financial Applications: An Analysis of Hallucination(Shweta Gupta, 2025, 2025 4th International Conference on Applied Artificial Intelligence and Computing (ICAAIC))
- Towards reducing hallucination in extracting information from financial reports using Large Language Models(Bhaskarjit Sarmah, Dhagash Mehta, Stefano Pasquali, Tianjie Zhu, 2023, Proceedings of the Third International Conference on AI-ML Systems)
- FinChain: A Symbolic Benchmark for Verifiable Chain-of-Thought Financial Reasoning(Zhuohan Xie, Dhruv Sahnan, Debopriyo Banerjee, Georgi N. Georgiev, Rushil Thareja, Hachem Madmoun, Jinyan Su, Aaryamonvikram Singh, Yuxia Wang, Rui Xing, Fajri Koto, Haonan Li, Ivan Koychev, Tanmoy Chakraborty, S. Lahlou, Veselin Stoyanov, Preslav Nakov, 2025, ArXiv)
- AlphaFin: Benchmarking Financial Analysis with Retrieval-Augmented Stock-Chain Framework(Xiang Li, Zhenyun Li, Chen Shi, Yong Xu, Qing Du, Mingkui Tan, Jun Huang, Wei Lin, 2024, No journal)
- Enhancing Financial RAG with Agentic AI and Multi-HyDE: A Novel Approach to Knowledge Retrieval and Hallucination Reduction(R. George, Akshay Govind Srinivasan, Jayden Koshy Joe, H. R., Vijayavallabh J, Hrushikesh Kant, Rahul Vimalkanth, S. S, S. Suresh, 2025, ArXiv)
- Can AI Read Like a Financial Analyst? A Financial Touchstone for Frontier Language Models Such as Gemini 2.5 Pro, o3, and Grok 4 on Long-Context Annual Report Comprehension(J. Spörer, 2025, Proceedings of the 6th ACM International Conference on AI in Finance)
- ZiGong 1.0: A Large Language Model for Financial Credit(Yu Lei, Zixuan Wang, Chu Liu, Tongyao Wang, 2025, 2025 IEEE 41st International Conference on Data Engineering Workshops (ICDEW))
- Are ChatGPT and GPT-4 General-Purpose Solvers for Financial Text Analytics? A Study on Several Typical Tasks(Xianzhi Li, Samuel W. K. Chan, Xiaodan Zhu, Yulong Pei, Zhiqiang Ma, Xiaomo Liu, Sameena Shah, 2023, No journal)
- Are ChatGPT and GPT-4 General-Purpose Solvers for Financial Text Analytics? An Examination on Several Typical Tasks(Xianzhi Li, Xiao-Dan Zhu, Zhiqiang Ma, Xiaomo Liu, Sameena Shah, 2023, ArXiv)
- Fino1: On the Transferability of Reasoning Enhanced LLMs to Finance(Lingfei Qian, Weipeng Zhou, Yan Wang, Xueqing Peng, Jimin Huang, Qianqian Xie, 2025, ArXiv)
- EDINET-Bench: Evaluating LLMs on Complex Financial Tasks using Japanese Financial Statements(Issa Sugiura, Takashi Ishida, Taro Makino, Chieko Tazuke, Takanori Nakagawa, Kosuke Nakago, David Ha, 2025, ArXiv)
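A recurring safeguard in this strand of work is a post-hoc numerical consistency check: every figure quoted in an LLM-generated summary must occur somewhere in the source filing. The sketch below illustrates that general idea only; it is not a reimplementation of CARE or any other cited method, and the sample sentences are invented.

```python
# Post-hoc check: numbers in generated text must appear in the source.
# Thousands separators are normalized before comparison.

import re

NUM_RE = re.compile(r"-?\d+(?:,\d{3})*(?:\.\d+)?")

def extract_numbers(text: str) -> set[str]:
    """Return all numeric literals, with thousands separators removed."""
    return {m.group().replace(",", "") for m in NUM_RE.finditer(text)}

def unsupported_numbers(generated: str, source: str) -> set[str]:
    """Numbers in the generated text that never occur in the source."""
    return extract_numbers(generated) - extract_numbers(source)

source = "Net income was 1,240 million in 2023, up from 1,100 million."
good = "Net income rose to 1240 million in 2023."
bad = "Net income rose to 1,350 million in 2023."
```

A real pipeline would also handle derived figures (sums, growth rates) that are legitimate even though they do not appear verbatim; flagged numbers are then routed to correction or regeneration.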
Market Effects, Compliance Governance, and Specialized Financial Applications
These studies analyze the empirical impact of financial report disclosures on capital markets, examine compliance issues under legal and regulatory regimes (e.g., EU directives), and cover AI applications in specialized areas such as tax automation and supply chain finance.
- The News in Earnings Announcement Disclosures: Capturing Word Context Using LLM Methods(Federico Siano, 2025, Manag. Sci.)
- Impact of EU non-financial reporting regulation on Spanish companies’ environmental disclosure: a cutting-edge natural language processing approach(Javier Villacampa-Porta, M. Coronado-Vaca, E.C. Garrido-Merchán, 2025, Environmental Sciences Europe)
- Deloitte (Drocks) at the Financial Misinformation Detection Challenge Task: Enhancing Misinformation Detection through Instruction-Tuned Models(Harika Abburi, Alex Chandler, Edward Bowen, Sanmitra Bhattacharya, Nirmala Pudota, 2025, No journal)
- Architecting Intelligent Tax Automation: Research Innovations in Machine Learning for Global Compliance(Vedashree Kedar Karandikar, 2026, International Journal of Computational and Experimental Science and Engineering)
- Towards Cognitive Intelligence in Financial Document Analysis: A Multimodal LLM Framework for Risk Reasoning and Due Diligence(Manshan Lin, 2025, Journal of Language)
- Fusing Narrative Semantics for Financial Volatility Forecasting(Yaxuan Kong, Yoontae Hwang, M. Kaiser, Chris Vryonides, Roel Oomen, Stefan Zohren, 2025, Proceedings of the 6th ACM International Conference on AI in Finance)
- Interpretable multimodal reasoning for robo-advisory: the FinErva framework(Jiarui Chi, 2026, Frontiers in Artificial Intelligence)
- Challenges of Artificial Intelligence and its Impact on the Quality of Internal Auditing and its Impact on the Performance of Financial Institutions.(Amel Merzah Sakhil, Suhad Abdul Meer Kadhim, F. Oudah, 2025, Tasnim International Journal for Human, Social and Legal Sciences)
- Identification of the Most Frequently Asked Questions in Financial Analyst Reports to Automate Equity Research Using Llama 3 and GPT-4(A. Pop, J. Spörer, Siegfried Handschuh, 2024, 2025 IEEE Swiss Conference on Data Science (SDS))
This report synthesizes current research on large language models (LLMs) in intelligent auditing and financial statement analysis. The literature shows an evolution from "process automation" to "deep semantic understanding" and on to "trustworthy architecture design." On one hand, LLMs have markedly improved the efficiency of fraud detection, performance forecasting, and report generation through RAG, multi-agent collaboration, and related techniques; on the other, the research community remains highly vigilant about numerical hallucination, algorithmic bias, and compliance risks in high-stakes financial scenarios, and works toward "responsible financial AI" by building domain benchmarks and mitigation techniques. Ultimately, these applications are reshaping not only the organizational form of the audit profession but also disclosure and decision-making mechanisms in capital markets.
A total of 110 related references.
The rapid development of generative AI has brought major changes to how different sectors function worldwide. Much research in the financial sector aims to increase efficiency and reduce errors arising from human intervention. However, current financial risk analysis still relies on manual reviews and conventional machine learning models, which repeatedly fail to process financial risk data adequately. This study investigates how a Retrieval-Augmented Generation (RAG) approach can help large language models (LLMs) generate risk analysis reports from audit reports, extracting detailed information and avoiding the overlooked small details that were a major drawback of earlier systems. It examines how RAG enhances financial risk analysis of audit reports across different LLMs, namely GPT-4o, Gemini-1.5-flash, and Llama 3.1, evaluating them on multiple metrics, including faithfulness, context precision, context recall, context relevancy, and answer relevance. The findings indicate that Llama 3.1 is the strongest model in terms of faithfulness of the generated report, with a score of 78.26%. On document retrieval and context, Llama also performed strongly, scoring 79.62% in context precision, 78.26% in context recall, and 86.99% in context relevancy. On the generated report itself, Llama 3.1 scored 37.83% for answer relevancy, while Gemini-1.5-flash scored 58.64% for answer correctness.
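The metrics named in this abstract can be made concrete with simplified, set-based proxies. These overlap-based stand-ins only illustrate what each metric measures; the study's actual scores are typically computed with LLM judges (e.g., RAGAS-style evaluators), and the example strings below are invented.

```python
# Simplified proxies for RAG evaluation metrics: faithfulness and
# context precision/recall, computed with token-set overlap.

def _terms(text: str) -> set[str]:
    return {t.strip(".,").lower() for t in text.split()}

def faithfulness(answer: str, context: str) -> float:
    """Fraction of answer terms grounded in the retrieved context."""
    a, c = _terms(answer), _terms(context)
    return len(a & c) / len(a) if a else 0.0

def context_precision(retrieved: list[str], relevant: set[str]) -> float:
    """Fraction of retrieved chunks that are actually relevant."""
    if not retrieved:
        return 0.0
    return sum(ch in relevant for ch in retrieved) / len(retrieved)

def context_recall(retrieved: list[str], relevant: set[str]) -> float:
    """Fraction of relevant chunks that were retrieved."""
    if not relevant:
        return 0.0
    return len(relevant & set(retrieved)) / len(relevant)
```

The trade-off the abstract reports, high faithfulness but low answer relevancy for Llama 3.1, corresponds to an answer well grounded in its context that nonetheless fails to address the question directly.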
The implementation of Artificial Intelligence (AI) in the accounting field is a hot topic. ChatGPT, an AI tool, has recently become very popular due to its conversational voice and abilities. The study is motivated less by the evolution of this Large Language Model (LLM) and more by its capabilities. This paper explores the impact of AI on accounting and accountants in a dynamic world, with a focus on financial reporting. The research discusses using AI technologies, specifically ChatGPT 4, as tools available to accountants, and how they are changing the way financial data is processed, analyzed, and reported. The author's objectives are to examine the potential advantages, benefits, limits, and risks associated with AI implementation in accounting, including increased accuracy and efficiency as well as concerns around data privacy and security. To this end, a quantitative research method was used: an experiment was conducted to test ChatGPT and its capabilities. Furthermore, the author argues that accountants need to develop new skills and competencies, including a deep understanding of AI algorithms and their limitations, as well as the ability to interpret and communicate the results of AI-driven analysis to non-technical stakeholders. By embracing AI technologies and developing new skills and competencies, accounting professionals can contribute to the long-term success of organizations in a dynamic and rapidly changing world. The paper also considers the challenges of detecting and preventing dishonesty and suggests strategies, such as policies and procedures and the provision of training and support, that accountants can implement to ensure integrity in the use of these tools. The added value of this paper is that it provides an understanding of the implications of AI for accounting.
The paper concludes that while the use of AI for accounting in a dynamic world presents benefits and opportunities, there are also challenges to face. Accountants can effectively address these concerns by taking a proactive and ethical approach to the responsible use of these tools. Future research could involve focus groups and interviews with different stakeholders to observe the impact of ChatGPT in a business environment, covering both financial and non-financial reporting.
No abstract available
This study aims to identify the impact of the use of artificial intelligence on the development of external audit efficiency in Jordanian public shareholding mining and extractive companies. The study sample consisted of 56 external auditors in 13 Jordanian mining and extractive corporations. Descriptive and analytical approaches were used to achieve the study's objectives, and the data were processed statistically using arithmetic averages and multiple regression analysis. The study found a statistically significant impact of artificial intelligence on the development of external audit efficiency in Jordanian mining and extractive corporations, and a statistically significant impact of artificial intelligence, represented in planning, carrying out control tests and basic tests of operations, carrying out analytical procedures and detailed tests of balances, and auditing subsequent events and future commitments prior to the issuance of the auditor's report, on improving all dimensions of governance (effective governance framework, disclosure and transparency, shareholder equality, responsibilities of the board of directors, and the role of stakeholders) in Jordanian public shareholding mining and extraction companies. In light of these results, the study recommends facilitating auditors' fuller reliance on artificial intelligence, given its strong positive impact.
The concept of audit efficiency is of great significance in the modern business environment, as it is one of the key indicators of the accuracy of financial statements, risk management, and overall governance effectiveness in entities and organizations. An efficient audit is not limited to verifying compliance with legislative requirements but also contributes to optimal financial resource management, ensuring transparency, and reducing corruption risks. The study of audit efficiency has become even more relevant in the context of contemporary economic and technological changes. The development of digital technologies, the implementation of automated audit systems, and the use of artificial intelligence in audit processes create new challenges and opportunities. Therefore, an in-depth study of the theoretical and practical foundations of auditing, the improvement of its methodological approaches, and its alignment with international standards are essential for enhancing the long-term financial stability and investment attractiveness of organizations. This article examines research related to the efficiency of audit services and approaches this concept in light of modern-day requirements. The paper evaluates the key factors ensuring audit efficiency and highlights the necessity of addressing them based on contemporary challenges. Moreover, approaches to audit efficiency are analyzed in connection with the legal and regulatory framework. The article presents well-founded perspectives on audit efficiency and its improvement, offering a revised version of existing theoretical and methodological provisions.
Financial statement analysis represents a fundamental component of audit procedures, requiring extensive examination of numerical data, trends, and relationships across multiple reporting periods. Traditional audit approaches rely heavily on manual analytical procedures and rule-based testing, leading to time-intensive processes and potential inconsistencies in analysis depth and coverage. The increasing complexity of financial reporting and growing volumes of financial data have intensified these challenges. This study proposes a Deep Learning (DL) framework designed to automate and enhance financial statement analysis in audit contexts. The framework integrates Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs) to analyze financial statement patterns, detect anomalies, and identify potential misstatements. Advanced deep learning algorithms process multi-period financial data to recognize complex relationships and unusual variations that may indicate audit risks. Experimental validation using financial statements from 500 public companies demonstrates that the proposed framework achieves 89.7% accuracy in anomaly detection and reduces analytical procedure time by 73%. The system successfully identifies potential misstatements and unusual fluctuations while maintaining high precision rates. Implementation results show significant improvements in audit analytical efficiency, consistency, and risk identification capabilities.
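The abstract above describes a CNN/RNN framework; as a stdlib-only illustration of the underlying analytical task, the sketch below flags reporting periods whose period-over-period change in a line item deviates strongly from that item's own history. This is a toy z-score heuristic, not the cited deep-learning method, and the 1.5-sigma threshold is an arbitrary illustrative choice.

```python
# Toy anomaly detector for multi-period financial data: flag periods whose
# change is a statistical outlier relative to the series' own history.

from statistics import mean, pstdev

def period_changes(values: list[float]) -> list[float]:
    """Differences between consecutive reporting periods."""
    return [b - a for a, b in zip(values, values[1:])]

def anomalous_periods(values: list[float], z_threshold: float = 1.5) -> list[int]:
    """Indices (into the original series) whose change is an outlier."""
    changes = period_changes(values)
    if len(changes) < 2:
        return []
    mu, sigma = mean(changes), pstdev(changes)
    if sigma == 0:
        return []
    return [i + 1 for i, c in enumerate(changes) if abs(c - mu) / sigma > z_threshold]
```

The CNN/RNN framework in the study plays the same role at far greater sophistication, learning cross-account and cross-period patterns instead of thresholding a single series.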
Machine learning (ML) is used globally as a tool for predictive analysis. Within auditing, the use of audit data helps to uncover fraud indicators, identify risk areas, and implement predictive models for continuous audit monitoring. Researchers use various machine learning methods to analyze large and complex audit data to facilitate prediction. In this study, an online UCI dataset of 776 rows and 27 features is used. Of these 27 features, 13 are eliminated due to their low impact on the target or through an important-feature-selection algorithm. The analysis applies supervised learning methods, namely K-Nearest Neighbors, Logistic Regression, Random Forest, Support Vector Machine, Decision Tree, Linear Discriminant Analysis, Gaussian Naive Bayes, Extra Trees, Gradient Boosting, AdaBoost, and XGBoost. The experimental results highlight the performance of KNN with eight neighbors, evaluating its effectiveness, sensitivity, precision, accuracy, and F1 score against methods such as Naive Bayes, SVM (linear kernel), the Decision Tree classifier, and the Random Forest classifier.
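The KNN approach highlighted in this abstract can be sketched from scratch in a few lines. The toy feature vectors, labels, and the small k below are invented for illustration; the study itself used k = 8 on the 776-row UCI audit dataset after feature selection.

```python
# From-scratch k-nearest-neighbors classifier on toy audit-style features.

from collections import Counter
from math import dist

def knn_predict(train: list[tuple[list[float], int]],
                x: list[float], k: int = 3) -> int:
    """Majority label among the k training points closest to x."""
    nearest = sorted(train, key=lambda item: dist(item[0], x))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical features: [misstatement_ratio, transaction_risk]; label 1 = risky.
train = [([0.90, 0.80], 1), ([0.80, 0.90], 1), ([0.70, 0.70], 1),
         ([0.10, 0.20], 0), ([0.20, 0.10], 0), ([0.15, 0.15], 0)]
```

KNN's appeal in audit screening is its transparency: each flag can be explained by pointing at the k most similar historical engagements, which suits review workflows better than a black-box score.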
The use of automation and artificial intelligence (AI) in audit practice is increasingly becoming a major focus, with significant impact on the profession. This research depicts the current landscape of AI use in auditing, highlighting aspects such as automation and empowerment of the audit workforce, the impact of AI on audit quality criteria, key factors in adopting AI-based audit techniques, the impact of AI technology on audit evidence, and auditors' perceptions of AI in improving audit quality. The results and discussion show that while integrating automation and AI in auditing brings great benefits, including improved audit quality, enhanced efficiency, and the ability to perform continuous audits, there are also challenges to overcome, such as the high cost of customizing AI for industry-specific audit processes. The use of AI in auditing requires auditors to adapt their competencies and workflows to use the technology effectively. However, with proper understanding and careful handling of these challenges, AI has great potential to improve overall audit practice.
This research paper explores the transformative role of artificial intelligence (AI) in enhancing the efficiency and effectiveness of global internal audit functions. As businesses increasingly adopt AI-driven technologies, internal auditing has witnessed significant advancements in data analysis, risk detection, compliance monitoring, and decision-making processes. The paper analyzes how AI tools like machine learning, natural language processing, and predictive analytics contribute to the automation of repetitive audit tasks, the detection of anomalies, and the improvement of audit accuracy and timeliness. Additionally, it addresses the challenges associated with AI adoption, including data privacy concerns, skills gaps among auditors, and the integration of AI into existing audit frameworks. The study also provides a comparative analysis of AI-enabled versus traditional audit practices, highlighting AI’s potential to enhance audit quality, reduce operational costs, and provide deeper insights into financial and non-financial risks. By examining case studies and industry practices, the paper emphasizes AI’s critical role in shaping the future of internal auditing on a global scale. The findings suggest that AI’s integration into internal audits is not just a trend but a necessary evolution for achieving optimal audit outcomes.
No abstract available
With the rapid development of information technology, digital transformation has become an inevitable trend across various industries. As important supervisors of economic activities, accounting firms must also adapt to this trend by adopting digital technologies to improve the quality and efficiency of their audits. This paper first defines the concept of digital transformation, followed by an analysis of relevant theories on audit quality and audit efficiency. It then explores the impact of digital transformation on audit quality and efficiency, focusing on areas such as the application of data analysis techniques, integration of information systems, and improvement in risk management for audit quality, as well as the use of automation tools, real-time data processing advantages, and enhanced collaboration for audit efficiency. Based on this, the paper proposes strategies for technological applications and innovation, data management and security, digital talent cultivation and team building, optimization of audit processes and organizational structures, as well as quality control and risk management, offering guidance for the digital transformation of accounting firms.
As large language models (LLMs) expand the power of natural language processing to handle long inputs, rigorous and systematic analyses are necessary to understand their abilities and behavior. A salient application is summarization, due to its ubiquity and controversy (e.g., researchers have declared the death of summarization). In this paper, we use financial report summarization as a case study because financial reports are not only long but also use numbers and tables extensively. We propose a computational framework for characterizing multimodal long-form summarization and investigate the behavior of Claude 2.0/2.1, GPT-4/3.5, and Cohere. We find that GPT-3.5 and Cohere fail to perform this summarization task meaningfully. For Claude 2 and GPT-4, we analyze the extractiveness of the summary and identify a position bias in LLMs. This position bias disappears after shuffling the input for Claude, which suggests that Claude seems to recognize important information. We also conduct a comprehensive investigation on the use of numeric data in LLM-generated summaries and offer a taxonomy of numeric hallucination. We employ prompt engineering to improve GPT-4's use of numbers with limited success. Overall, our analyses highlight the strong capability of Claude 2 in handling long multimodal inputs compared to GPT-4. The generated summaries and evaluation code are available at https://github.com/ChicagoHAI/characterizing-multimodal-long-form-summarization.
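Extractiveness and position bias of the kind the paper measures can be probed by mapping each summary sentence back to its most similar source chunk; a skew toward low indices suggests a lead bias. The sketch below uses a toy Jaccard word overlap rather than the paper's actual matching procedure, and all sentences are invented.

```python
def token_overlap(a, b):
    """Jaccard overlap between the word sets of two sentences."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / (len(ta | tb) or 1)

def source_positions(summary_sents, source_chunks):
    """Index of the best-matching source chunk for each summary sentence."""
    return [max(range(len(source_chunks)),
                key=lambda i: token_overlap(s, source_chunks[i]))
            for s in summary_sents]

chunks = ["revenue grew ten percent", "the board met in june",
          "net income fell sharply", "auditor issued clean opinion"]
summary = ["revenue grew ten percent", "net income fell sharply"]
positions = source_positions(summary, chunks)
```

Running the same measurement on a shuffled copy of the chunks, as the paper does for Claude, separates genuine salience from positional preference.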
The rapid growth of the financial sector and the rising focus on Environmental, Social, and Governance (ESG) considerations highlight the need for advanced NLP tools. However, open-source LLMs proficient in both finance and ESG domains remain scarce. To address this gap, we introduce SusGen-30K, a category-balanced dataset comprising seven financial NLP tasks and ESG report generation, and propose TCFD-Bench, a benchmark for evaluating sustainability report generation. Leveraging this dataset, we developed SusGen-GPT, a suite of models achieving state-of-the-art performance across six adapted and two off-the-shelf tasks, trailing GPT-4 by only 2% despite using 7-8B parameters compared to GPT-4's 1,700B. Based on this, we propose the SusGen system, integrated with Retrieval-Augmented Generation (RAG), to assist in sustainability report generation. This work demonstrates the efficiency of our approach, advancing research in finance and ESG.
This study explores the application of retrieval-augmented generation (RAG) to improve the accuracy and reliability of large language models (LLMs) in the context of financial report analysis. The focus is on enabling private investors to make informed decisions by enhancing the question-and-answering capabilities regarding the half-yearly or quarterly financial reports of banks. The study adopts a Design Science Research (DSR) methodology to develop and evaluate an RAG system tailored for this use case. The study conducts a series of experiments to explore models in which different RAG components are used. The aim is to enhance context relevance, answer faithfulness, and answer relevance. The results indicate that model one (OpenAI ADA and OpenAI GPT-4) achieved the highest performance, showing robust accuracy and relevance in response. Model three (MiniLM Embedder and OpenAI GPT-4) scored significantly lower, indicating the importance of high-quality components. The evaluation also revealed that well-structured reports result in better RAG performance than less coherent reports. Qualitative questions received higher scores than the quantitative ones, demonstrating the RAG’s proficiency in handling descriptive data. In conclusion, a tailored RAG can aid investors in providing accurate and contextually relevant information from financial reports, thereby enhancing decision making.
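The retrieval half of such an RAG system can be sketched with a toy bag-of-words retriever; the chunks, question, and prompt below are illustrative stand-ins for the study's actual components (OpenAI ADA embeddings feeding GPT-4), not its implementation.

```python
import math
from collections import Counter

def embed(text):
    """Toy bag-of-words 'embedding'; a real system would use a neural embedder."""
    return Counter(text.lower().split())

def cosine(a, b):
    num = sum(a[t] * b[t] for t in a)
    den = (math.sqrt(sum(v * v for v in a.values()))
           * math.sqrt(sum(v * v for v in b.values())))
    return num / den if den else 0.0

def retrieve(question, chunks, k=2):
    """Rank report chunks by similarity to the question; keep the top k."""
    q = embed(question)
    return sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:k]

report_chunks = [
    "Net interest income rose 4% in the first half.",
    "The bank opened twelve new branches.",
    "Loan loss provisions increased due to credit risk.",
]
context = retrieve("How did net interest income change?", report_chunks)
prompt = ("Answer only from this context:\n" + "\n".join(context)
          + "\nQ: How did net interest income change?")
```

The assembled `prompt` would then go to the generator model; grounding the answer in retrieved chunks is what improves faithfulness over a bare LLM query.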
Against the backdrop of accelerating digital transformation, GPT-based generative AI technologies are gradually penetrating the entire corporate financial disclosure process, exerting a significant dual impact on disclosure quality. Drawing on information asymmetry theory and principal-agent theory, combined with KPMG's global research data and case studies such as Amazon and AllHere, this paper systematically analyzes the positive impact and potential risks of generative AI on the quality of financial disclosure. The study finds that generative AI can reduce disclosure redundancy through automated processing, compressing MD&A report summaries to 25% of the original while retaining core information, while also improving forecast accuracy and compliance efficiency. However, this also presents risks such as "AI whitewashing," data fabrication, and algorithmic black box manipulation. For example, the US AI startup AllHere overstated its revenue by nearly 700 times by fabricating AI-related financial data. The study further suggests the need to establish a coordinated mechanism across three dimensions: optimizing corporate governance, upgrading regulatory technology, and managing model security. The conclusions indicate that the impact of generative AI on disclosure quality is not one-way; its ultimate effect depends on the alignment between technical application specifications and risk prevention and control systems. This finding provides empirical evidence for companies to rationally utilize AI technology and for regulators to improve governance rules.
The application of generative AI in the preparation of financial reports has significantly improved efficiency and accuracy, but it has also triggered ethical risks such as data privacy, algorithmic bias, and ambiguous responsibilities. Based on the technology-policy-organization synergy framework, the innovation of corporate ethical review mechanisms needs to focus on the following dimensions: At the technology governance level, federated learning and zero-trust architecture are integrated to achieve controllable data security, algorithmic fairness detection tools are integrated to monitor model biases in real time, and blockchain technology is used to ensure full-process traceability; At the policy compliance level, dynamic hierarchical review standards are established, international mainstream regulatory requirements are integrated, and intelligent systems are relied on to achieve automated analysis and compliance adaptation of global regulatory policies; At the organizational execution level, a multi-level review framework is established, embedding abnormal decision warning and human intervention mechanisms. Case studies show that this mechanism can effectively reduce data security risks, enhance algorithmic fairness, and strengthen responsibility traceability. In the future, it is necessary to strengthen the integration and application of cutting-edge technologies, promote global ethical standard coordination, and build a people-oriented intelligent governance paradigm.
The most recent large language models(LLMs) such as ChatGPT and GPT-4 have shown exceptional capabilities of generalist models, achieving state-of-the-art performance on a wide range of NLP tasks with little or no adaptation. How effective are such models in the financial domain? Understanding this basic question would have a significant impact on many downstream financial analytical tasks. In this paper, we conduct an empirical study and provide experimental evidences of their performance on a wide variety of financial text analytical problems, using eight benchmark datasets from five categories of tasks. We report both the strengths and limitations of the current models by comparing them to the state-of-the-art fine-tuned approaches and the recently released domain-specific pretrained models. We hope our study can help understand the capability of the existing models in the financial domain and facilitate further improvements.
No abstract available
This study aims to improve the identification of potential credit risks in unstructured financial texts. It addresses the core problem of financial text analysis and credit risk assessment by proposing a hybrid model that combines the generative semantic understanding of Generative Pre-trained Transformer-4 (GPT-4) with the enhanced feature extraction of Bidirectional Encoder Representations from Transformers (BERT). To overcome the limitations of traditional methods—such as weak contextual reasoning in long texts, insufficient recognition of industry-specific terminology, and implicit credit risk expressions—the model incorporates a financial dictionary enhancement module and a named entity recognition (NER) component. GPT-4 is leveraged for prompt-based generation to extract latent risk information from complex texts, including annual reports. A dual-model semantic fusion mechanism with attention weighting constructs a multi-level risk assessment system that integrates contextual understanding, industry adaptability, and interpretability. Experiments on multiple publicly available financial datasets and real-world annual reports demonstrate the model’s effectiveness. Results show that the proposed approach outperforms representative baseline models in accuracy, adaptability, and interpretability. This work carries both theoretical and practical significance for research at the intersection of financial technology and natural language processing.
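A minimal sketch of the attention-weighted dual-model fusion, assuming each model contributes an embedding vector and a scalar relevance score; the paper does not publish this exact mechanism, so names, dimensions, and scores below are hypothetical.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def fuse(vec_gpt, vec_bert, score_gpt, score_bert):
    """Attention-weighted fusion: a softmax over scalar relevance scores,
    then an element-wise weighted sum of the two models' vectors."""
    w_gpt, w_bert = softmax([score_gpt, score_bert])
    return [w_gpt * g + w_bert * b for g, b in zip(vec_gpt, vec_bert)]

fused = fuse([1.0, 0.0], [0.0, 1.0], score_gpt=2.0, score_bert=0.0)
```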
This research paper addresses the use of Artificial Intelligence (AI) to detect financial fraud through text analysis of content produced by Generative Pre-trained Transformers (GPT). Fraudsters have continued to utilise advanced language models to generate deceptive content; consequently, traditional anti-fraud methods are correspondingly less effective. This article proposes a novel approach that combines Natural Language Processing (NLP) and machine learning techniques to detect deception patterns in GPT-generated content. This is achieved by generating a new dataset of authentic and artificially created financial texts, including emails, reports, and social media posts. The training dataset is tested and validated using a collection of AI models, which includes a fine-tuned version of GPT-3.5, a Long Short-Term Memory (LSTM) network, and a Transformer-based classifier. Python is the primary tool used in this paper, with TensorFlow and PyTorch packages employed for model development, and scikit-learn utilised for performance analysis. The outcome demonstrates that the developed AI system can identify phishing text with extremely high accuracy, providing financial institutions with a reasonable opportunity to enhance their ability to combat fraud in the digital era. The research highlights the future of artificial intelligence in combating new forms of fraud and emphasises the need for ongoing innovation in this area.
This research dissects financial equity research reports (ERRs) by systematically mapping their content into categories. There is insufficient empirical analysis of the questions answered in ERRs. In particular, it is not understood how frequently certain information appears, what information is considered essential, and what information requires human judgment to distill into an ERR. The study analyzes 72 ERRs sentence-by-sentence, classifying their 4964 sentences into 169 unique question archetypes. We did not predefine the questions but derived them solely from the statements in the ERRs, which provides an unbiased view of the content of the observed ERRs. Subsequently, we used public corporate reports to classify the questions' potential for automation. Answers were labeled "text-extractable" if they were accessible in corporate reports. 75.15% of the questions in ERRs can be automated using text extraction from text sources; these automatable questions consist of 51.91% text-extractable questions (suited to processing by large language models, LLMs) and 24.24% database-extractable questions. Only 24.85% of questions require human judgment to answer. We empirically validate, using Llama-3-70B and GPT-4-turbo-2024-04-09, that recent advances in language generation and information extraction enable the automation of approximately 80% of the statements in ERRs. Surprisingly, the models complement each other's strengths and weaknesses well, indicating strong ensemble potential. The research confirms that the current writing process of ERRs can likely benefit from additional automation, improving quality and efficiency, and allows us to quantify the potential impact of introducing large language models in the ERR writing process. The full question list, including the archetypes and their frequencies, is available online (janspoerer.github.io/pop-spoerer-2025-financial-report-data).
Relation extraction (RE) is a crucial task in natural language processing (NLP) that aims to identify and classify relationships between entities mentioned in text. In the financial domain, relation extraction plays a vital role in extracting valuable information from financial documents, such as news articles, earnings reports, and company filings. This paper describes our solution to relation extraction on one such dataset REFinD. The dataset was released along with shared task as a part of the Fourth Workshop on Knowledge Discovery from Unstructured Data in Financial Services, co-located with SIGIR 2023. In this paper, we employed OpenAI models under the framework of in-context learning (ICL). We utilized two retrieval strategies to find top K relevant in-context learning demonstrations / examples from training data for a given test example. The first retrieval mechanism, we employed, is a learning-free dense retriever and the other system is a learning-based retriever. We were able to achieve 3rd rank overall. Our best F1-score is 0.718.
Financial sentiment analysis is the task of evaluating and quantifying the emotions and opinions expressed in financial news, reports, or social media to help investors and institutions make informed decisions. Financial institutions have been actively exploring the use of large language models (LLMs) to analyse market sentiment signals for a more nuanced understanding of a broader context. However, issues such as the scale of training data, model complexity, and the potential for human oversight can introduce or even amplify bias in these systems. Representation bias is a common challenge for LLMs as training data fail to properly represent the target groups, hence causes harmful bias in general-purpose use. Therefore, replacing current solutions with LLMs in financial organisations requires a robust evaluation methodology to ensure fairness. This paper investigates a three-level bias evaluation approach that specifically focuses on representation bias and presents a baseline evaluation of the FinBERT model. Step 1 uses a synthetic dataset that explicitly reveals sources of bias, structured as probability- and embedding-based evaluation recipes. Step 2 evaluates the model against data released by another country (e.g. Indian News dataset) to assess its performance in relation to more implicit biases. Step 3 examines individual problematic samples using token-based interpretability methods (e.g. integrated gradients). This paper presents the application of this structured bias evaluation process and its results on the FinBERT model. The evaluation code and dataset are available on GitHub (https://github.com/asabuncuoglu13/faid-test-financial-sentiment-analysis).
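The probability-based recipe of Step 1 can be sketched as a counterfactual template test: fill the same sentiment template with each group term and compare the model's scores. `toy_score` below is a hypothetical stand-in for FinBERT's positive-class probability; the template and groups are invented for illustration.

```python
def bias_gap(score_fn, template, groups):
    """Fill a template with each group term and report the score spread;
    a large gap flags representation bias for that template."""
    scores = {g: score_fn(template.format(group=g)) for g in groups}
    return max(scores.values()) - min(scores.values()), scores

# Hypothetical stand-in for a sentiment model's positive-class probability
def toy_score(text):
    return 0.9 if "firm A" in text else 0.6

gap, per_group = bias_gap(toy_score,
                          "{group} reported strong quarterly earnings.",
                          ["firm A", "firm B"])
```

Since only the group term varies, any score gap is attributable to the group term itself, which is the premise of the synthetic-dataset step.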
Large Language Models (LLMs) are increasingly used in finance for tasks like market analysis, customer support, sentiment analysis, and automated reporting. However, LLMs often inherit and perpetuate biases from their training data, raising concerns about fairness and accuracy in high-stakes financial applications. While other domains such as medicine, law, and education have advanced in identifying, measuring, and reducing bias, finance lacks domain-specific datasets and robust fairness metrics. To address this, we introduce the FinBias dataset which includes bias-eliciting prompts related to the finance domain, and a comprehensive evaluation framework for publicly available LLMs, including robustness tests against jailbreaking. We also propose a new metric, SAFE (Safety-Adjusted Fairness Evaluation), which penalizes stereotypical and refusal responses while rewarding debiased outputs. Additionally, we present a prompt engineering-based mitigation strategy that effectively reduces bias. Experiments conducted on three publicly available LLMs - Mixtral, Gemma, and LLaMA demonstrate that these models exhibit significant bias, but the proposed prompt engineering-based mitigation strategy effectively reduces this bias. This research provides a practical foundation for the detection, evaluation and mitigation of bias in financial LLM applications.
We introduce M2VN: Multi-Modal Volatility Network, a novel deep learning-based framework for financial volatility forecasting that unifies time series features with unstructured news data. M2VN leverages the representational power of deep neural networks to address two key challenges in this domain: (i) aligning and fusing heterogeneous data modalities, numerical financial data and textual information, and (ii) mitigating look-ahead bias that can undermine the validity of financial models. To achieve this, M2VN combines open-source market features with news embeddings generated by Time Machine GPT, a recently introduced point-in-time LLM, ensuring temporal integrity. An auxiliary alignment loss is introduced to enhance the integration of structured and unstructured data within the deep learning architecture. Extensive experiments demonstrate that M2VN consistently outperforms existing baselines, underscoring its practical value for risk management and financial decision-making in dynamic markets.
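One plausible form of the auxiliary alignment loss — the abstract does not state the exact formula, so this is an assumption — is a forecast error plus a cosine-alignment penalty between the time-series and news embeddings:

```python
import math

def cosine(a, b):
    num = sum(x * y for x, y in zip(a, b))
    den = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return num / den if den else 0.0

def m2vn_style_loss(pred, target, ts_emb, news_emb, lam=0.1):
    """Forecast error plus an auxiliary term pulling the time-series and
    news embeddings toward the same direction (hypothetical formulation)."""
    mse = (pred - target) ** 2
    align = 1.0 - cosine(ts_emb, news_emb)
    return mse + lam * align

loss = m2vn_style_loss(2.0, 1.5, [1.0, 0.0], [0.6, 0.8])
```

The weight `lam` trades forecasting accuracy against cross-modal agreement; perfectly aligned embeddings contribute zero auxiliary loss.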
Financial report generation tasks range from macro- to micro-economics analysis, also requiring extensive data analysis. Existing LLM models are usually fine-tuned on simple QA tasks and cannot comprehensively analyze real financial scenarios. Given the complexity, financial companies often distribute tasks among departments. Inspired by this, we propose FinTeam, a financial multi-agent collaborative system, with a workflow with four LLM agents: document analyzer, analyst, accountant, and consultant. We train these agents with specific financial expertise using constructed datasets. We evaluate FinTeam on comprehensive financial tasks constructed from real online investment forums, including macroeconomic, industry, and company analysis. The human evaluation shows that by combining agents, the financial reports generate from FinTeam achieved a 62.00% acceptance rate, outperforming baseline models like GPT-4o and Xuanyuan. Additionally, FinTeam's agents demonstrate a 7.43% average improvement on FinCUGE and a 2.06% accuracy boost on FinEval. Project is available at https://github.com/FudanDISC/DISC-FinLLM/.
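The four-agent workflow can be sketched as a chain over a shared fact dictionary. This sketch is not FinTeam's implementation: the agents here are plain functions standing in for LLM calls, the accountant is reordered before the analyst so the margin exists before it is judged, and every name and number is illustrative.

```python
def document_analyzer(report):
    # Hypothetical stand-in for an LLM extraction call
    return {"revenue": report["revenue"], "cost": report["cost"]}

def accountant(facts):
    facts["margin"] = (facts["revenue"] - facts["cost"]) / facts["revenue"]
    return facts

def analyst(facts):
    facts["view"] = "positive" if facts["margin"] > 0.2 else "cautious"
    return facts

def consultant(facts):
    return f"Margin {facts['margin']:.0%}; outlook {facts['view']}."

def run_pipeline(report):
    """Chain the agents; each stage enriches the shared facts."""
    facts = document_analyzer(report)
    for agent in (accountant, analyst, consultant):
        facts = agent(facts)
    return facts

report_text = run_pipeline({"revenue": 100.0, "cost": 70.0})
```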
Despite Spanish's pivotal role in the global finance industry, a pronounced gap exists in Spanish financial natural language processing (NLP) and application studies compared to English, especially in the era of large language models (LLMs). To bridge this gap, we unveil Toisón de Oro, the first bilingual framework that establishes instruction datasets, finetuned LLMs, and evaluation benchmark for financial LLMs in Spanish joint with English. We construct a rigorously curated bilingual instruction dataset including over 144K Spanish and English samples from 15 datasets covering 7 tasks. Harnessing this, we introduce FinMA-ES, an LLM designed for bilingual financial applications. We evaluate our model and existing LLMs using FLARE-ES, the first comprehensive bilingual evaluation benchmark with 21 datasets covering 9 tasks. The FLARE-ES benchmark results reveal a significant multilingual performance gap and bias in existing LLMs. FinMA-ES models surpass SOTA LLMs such as GPT-4 in Spanish financial tasks, due to strategic instruction tuning and leveraging data from diverse linguistic resources, highlighting the positive impact of cross-linguistic transfer. All our datasets, models, and benchmarks have been released.
Do large language models (LLMs) generate unbiased financial advice across investor and fund manager demographics? We develop a two-sided audit framework to evaluate demographic bias in LLM-generated investment advice and apply it to multiple large language models, with GPT-4 Turbo as the primary baseline. On the investor side, fund selections are similar across demographic groups and rely on financial criteria, but recommended investment amounts vary when investor names signal race or gender, despite identical age and income. On the fund manager side, capital allocations favor non-Black and male managers: racial disparities persist even under explicit disclosure, while gender-related differences are more pronounced under name-based cues. Bias patterns are qualitatively similar across models, with differences in magnitude between implicit and explicit demographic signaling. These results suggest that, even when LLMs incorporate core financial reasoning, demographic signals can affect allocation decisions, with effects that tend to be stronger under implicit signaling, potentially replicating existing market inequalities and raising concerns about impartiality in financial advising. The proposed audit framework provides a generalizable approach for identifying and evaluating demographic bias in AI-driven financial advisory systems.
This study investigates how audit organizations leverage generative artificial intelligence technologies to enhance auditing capabilities through organizational adaptation mechanisms, examining the role of dynamic capabilities in facilitating successful AI adoption and performance improvements. A quantitative cross-sectional survey collected data from 312 audit professionals across diverse organizational contexts. Structural equation modeling examined relationships between dynamic capabilities, generative AI adoption, organizational adaptation mechanisms, and auditing performance with comprehensive measurement validation. Dynamic capabilities significantly influence generative AI adoption (β = 0.453, p < 0.001), which drives organizational adaptation mechanisms (β = 0.312, p < 0.001) that enhance auditing performance (β = 0.378, p < 0.001). Organizational adaptation mechanisms mediate 41.4% of the capability-performance relationship. The model explains 28.3% variance in AI adoption, 35.7% in adaptation mechanisms, and 31.2% in auditing performance. Audit organizations should prioritize developing sensing, seizing, and reconfiguring capabilities before AI investments, requiring comprehensive change management addressing structural, processual, and cultural dimensions simultaneously. AI-driven competitive advantages emerge through organizational transformation processes, with dynamic capabilities as antecedents and adaptation mechanisms as mediating processes.
As global financial markets continue to evolve and change, financial risk monitoring and early warning have become increasingly important. However, the complexity and diversity of financial markets have led to the emergence of multidimensional and multimodal data. Traditional risk monitoring methods face difficulties in handling such diverse data and adapting to the monitoring and early warning needs of emerging risk types. To address these issues, this article proposes a financial risk intelligent monitoring and early warning model that integrates deep learning to better cope with uncertainty and risk in the financial market. Firstly, the authors introduce an LSTM model in the initial approach, trained on historical financial market data, to capture long-term dependencies and trends in the data, enabling effective monitoring of financial risk. They also optimize the model architecture to improve its performance and prediction accuracy. Secondly, the authors further introduce a transformer model with self-attention mechanism to better handle sequential data.
The study aims to demonstrate the extent to which the challenges of artificial intelligence technologies affect the quality of internal audit and their impact on the performance of financial institutions. With the growth of AI applications in financial operations, institutions have begun to rely increasingly on automation, intelligent data analysis, and machine learning in their daily tasks. While these technologies are expected to enhance efficiency and speed, they pose major challenges for internal audit units, including understanding complex systems, difficulty tracking algorithms, and the possibility of technical biases or inaccurate automated decisions. The problem of the study is to identify the challenges posed by artificial intelligence technologies to the quality of internal audit, which may directly or indirectly affect the effectiveness and sustainability of the performance of financial institutions, and to assess the readiness of internal audit units to deal with the smart systems adopted within these institutions. The study sample comprised the Iraqi Trade Bank and Al-Nahrain Islamic Bank. The research was based on a basic hypothesis: that the challenges of artificial intelligence affect the quality of internal audit, which in turn is reflected in the performance of financial institutions.
The research reached a number of conclusions, the most important of which is that artificial intelligence technologies contribute to improving the quality and efficiency of internal audit by adapting to smart systems and understanding complex systems, which is reflected positively in the performance of financial institutions. The research recommended training auditors in the use of artificial intelligence tools, encouraging institutions to replace manual systems with computerized systems to improve oversight performance, and enhancing cooperation between information technology and internal audit teams to ensure effective integration that safeguards the quality of internal audit and, in turn, improves performance.
This study investigates the influence of drone technology on the quality of Saudi financial reports through the integration of Artificial Intelligence (AI) and big data. The study’s mixed-method approach is based on a bibliometric analysis of previous studies, along with documentary and content analysis. The results show that external auditors benefit from using drones when inspections are integrated with AI and big data technology. Moreover, this integration can reduce costs for audit firms and shorten the duration of audit engagements, resulting in more efficient and effective auditing. Seven clusters were identified, with ‘big data’ being the highest-frequency term. This study does not consider potential cybersecurity threats that could impact data integrity and decrease financial transparency. Furthermore, environmental issues in Saudi Arabia, such as sandstorms, could compromise the effectiveness of drone-based auditing. However, this study contributes to the ESG literature by demonstrating how integrated audit technology transforms traditional sustainability reporting into continuous, AI-enhanced verification processes. These processes improve financial report quality while supporting Saudi Arabia’s Green Initiative and its goal of achieving net-zero carbon emissions by 2060. The adoption of AI and big data technologies in auditing represents a shift toward more automated and intelligent audit practices. These changes provide practical insights for government authorities, such as the Saudi Capital Market Authority (CMA), and may result in higher-quality financial reports and increased investor confidence.
Anomaly detection is a fundamental requirement in financial auditing: its detection results can be used to correct defects and predict risks for the audited enterprise. However, as auditing data grow very large, anomaly detection error probabilities and material misstatement risk increase significantly. It is therefore essential to develop an intelligent anomaly detection technology to address these problems. This paper develops a new intelligent anomaly detection method that combines the advantages of bidirectional long short-term memory (BiLSTM), an improved backpropagation neural network (IBPNN), and an attention mechanism, giving it strong abilities in nonlinear prediction, long-time-series feature extraction, and attention to important information. Furthermore, we present a correlation analysis algorithm to process the many types of large-scale financial auditing data, which effectively removes irrelevant information and discovers correlation relationships in the data before the BiLSTM-Attention-IBPNN method runs on it. The experimental results show that the proposed method outperforms state-of-the-art methods in anomaly detection and significantly improves anomaly detection quality and efficiency for financial auditing.
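The correlation-analysis pre-filtering step — dropping features with little relationship to the target before the neural model runs — can be sketched with a simple Pearson filter; the feature names, data, and threshold below are illustrative, not the paper's algorithm.

```python
import statistics

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length columns."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = (sum((x - mx) ** 2 for x in xs) * sum((y - my) ** 2 for y in ys)) ** 0.5
    return num / den if den else 0.0

def drop_irrelevant(features, target, min_abs_corr=0.3):
    """Keep only feature columns meaningfully correlated with the target."""
    return {name: col for name, col in features.items()
            if abs(pearson(col, target)) >= min_abs_corr}

features = {
    "receivables": [1, 2, 3, 4, 5],  # tracks the target closely
    "noise":       [2, 5, 1, 4, 3],  # essentially unrelated
}
target = [2, 4, 6, 8, 10]
kept = drop_irrelevant(features, target)
```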
This study proposes a hybrid machine learning framework that integrates structured financial indicators and unstructured textual strategy disclosures to improve firm-level management performance prediction. Using corporate business reports from South Korean listed firms, strategic text was extracted and categorized under the Balanced Scorecard (BSC) framework into financial, customer, internal process, and learning and growth dimensions. Various machine learning and deep learning models—including k-nearest neighbors (KNNs), support vector machine (SVM), light gradient boosting machine (LightGBM), convolutional neural network (CNN), long short-term memory (LSTM), autoencoder, and transformer—were evaluated, with results showing that the inclusion of strategic textual data significantly enhanced prediction accuracy, precision, recall, area under the curve (AUC), and F1-score. Among individual models, the transformer architecture demonstrated superior performance in extracting context-rich semantic features. A soft-voting ensemble model combining autoencoder, LSTM, and transformer achieved the best overall performance, leading in accuracy and AUC, while the best single deep learning model (transformer) obtained a marginally higher F1 score, confirming the value of hybrid learning. Furthermore, analysis revealed that customer-oriented strategy disclosures were the most predictive among BSC dimensions. These findings highlight the value of integrating financial and narrative data using advanced NLP and artificial intelligence (AI) techniques to develop interpretable and robust corporate performance forecasting models. In addition, we operationalize information security narratives using a reproducible cybersecurity lexicon and derive security disclosure intensity and weight share features that are jointly evaluated with BSC-based strategic vectors.
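The soft-voting scheme behind the best-performing ensemble can be sketched in a few lines; the class labels and probabilities below are hypothetical, not results from the study.

```python
def soft_vote(prob_lists):
    """Average class-probability vectors from several models and take the
    argmax of the averaged distribution (soft voting)."""
    n_models = len(prob_lists)
    n_classes = len(prob_lists[0])
    avg = [sum(p[c] for p in prob_lists) / n_models for c in range(n_classes)]
    return avg.index(max(avg)), avg

# Hypothetical (underperform, outperform) probabilities from an
# autoencoder, an LSTM, and a transformer for one firm.
label, avg = soft_vote([[0.40, 0.60], [0.55, 0.45], [0.20, 0.80]])
# label → 1 (outperform), even though the LSTM alone disagrees
```

Soft voting keeps each model's confidence, so a very confident transformer can outvote a weakly confident dissenter, which hard (majority) voting would not capture.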
This article addresses challenges in enterprise financial management, including difficulties in processing multi-source data, limited adaptability to dynamic environments, and a lack of systematic integration in the decision-making process. To tackle these issues, a new intelligent optimization framework, named genetic algorithm-fuzzy logic-Transformer (GA-FL-Transformer), is proposed. First, the framework employs the Transformer architecture to achieve unified encoding and feature fusion across multiple sources of financial data, extracting high-dimensional features with strong discriminative power. Subsequently, an attention-weight-guided co-evolutionary mechanism integrating a genetic algorithm (GA) and fuzzy logic (FL) is designed. This mechanism incorporates the features and attention weights into chromosome encoding, fitness function formulation, and genetic operations, thereby enabling dynamic optimization of fuzzy rules and membership functions. Finally, an intelligent optimization framework that integrates perception, optimization, and decision-making is constructed, achieving closed-loop optimization from data to decision via a bidirectional flow mechanism and supporting continuous learning and system-wide self-adjustment. Results on financial datasets from Compustat and CRSP show that the proposed method outperforms competing models in financial optimization. Ablation experiments further validate the contributions of the Transformer-based feature extraction, genetic algorithm optimization, and fuzzy reasoning mechanism to the system's performance. This study provides a crucial theoretical foundation for enterprises constructing intelligent financial decision-making systems.
Extracting tables and key-value pairs from financial documents is essential for business workflows such as auditing, data analytics, and automated invoice processing. In this work, we introduce Spatial ModernBERT, a transformer-based model augmented with spatial embeddings, to accurately detect and extract tabular data and key-value fields from complex financial documents. We cast the extraction task as token classification across three heads: (1) a Label Head, classifying each token as a label (e.g., PO Number, PO Date, Item Description, Quantity, Base Cost, MRP); (2) a Column Head, predicting column indices; and (3) a Row Head, distinguishing the start of item rows and header rows. The model is pretrained on the PubTables-1M dataset, then fine-tuned on a financial document dataset, achieving robust performance through cross-entropy loss on each classification head. We propose a post-processing method to merge tokens using B-I-IB tagging, reconstruct the tabular layout, and extract key-value pairs. Empirical evaluation shows that Spatial ModernBERT effectively leverages both textual and spatial cues, facilitating highly accurate table and key-value extraction in real-world financial documents.
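The token-merging step can be illustrated with a standard BIO-style merge; the paper's B-I-IB scheme adds a tag variant whose exact semantics are defined there, so this sketch covers only the common B/I/O case, with hypothetical field labels.

```python
def merge_bio(tokens, tags):
    """Merge tokens into labeled spans under BIO-style tags:
    'B-X' begins a span of label X, 'I-X' continues it, 'O' is outside."""
    spans, current, label = [], [], None
    for tok, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current:
                spans.append((label, " ".join(current)))
            current, label = [tok], tag[2:]
        elif tag.startswith("I-") and current and tag[2:] == label:
            current.append(tok)
        else:  # 'O' or an inconsistent I- tag closes any open span
            if current:
                spans.append((label, " ".join(current)))
            current, label = [], None
    if current:
        spans.append((label, " ".join(current)))
    return spans

tokens = ["PO", "4711", "Steel", "Bolts", "M8", "12"]
tags = ["B-PONumber", "I-PONumber", "B-ItemDesc",
        "I-ItemDesc", "I-ItemDesc", "B-Quantity"]
spans = merge_bio(tokens, tags)
# spans → [('PONumber', 'PO 4711'), ('ItemDesc', 'Steel Bolts M8'), ('Quantity', '12')]
```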
With the rapid advancement of big data and artificial intelligence technologies, financial accounting is increasingly evolving toward greater intelligence and automation. However, the limited capacity of traditional methods to process complex data makes it challenging to address the rapid changes in enterprise dynamics within big data environments. To enhance forecasting accuracy and efficiency, this study proposes an intelligent financial accounting prediction model based on the Swin Transformer and a dual-layer routing attention mechanism. Experimental results demonstrate that the proposed model achieves substantial improvements in prediction performance compared with traditional Autoregressive Integrated Moving Average (ARIMA) and Long Short-Term Memory (LSTM) models. On the A-share dataset, the model attained a mean absolute error (MAE) of 1.47, representing improvements of 37.7% and 22.2% over ARIMA (2.36) and LSTM (1.89), respectively. Similarly, on the European and American enterprise dataset, the model achieved a root mean square error (RMSE) of 2.15, corresponding to reductions of 28.4% and 16.5% compared to ARIMA and LSTM. Furthermore, the model improved forecasting efficiency by reducing quarterly data processing time by approximately 15%. These findings highlight the potential of combining Transformer-based architectures with attention mechanisms for intelligent financial forecasting and underscore the broad application prospects of big data analytics in the financial domain.
This study proposes a financial text analysis method based on a one-dimensional convolutional neural network (1D-CNN), aiming to solve the problems of low efficiency and insufficient accuracy of traditional financial text processing methods in key information extraction and risk classification tasks. By constructing a convolutional network architecture tailored to the characteristics of financial text, the model can efficiently capture local semantic features in the text and perform deep feature extraction. In the experiment, this study selected the 10-K financial report in the SEC Edgar database as the dataset and verified the superiority of the 1D-CNN model through comparative experiments with traditional machine learning models and other deep learning models. The experimental results show that the model has achieved the best performance in terms of extraction rate, coverage rate, and redundancy rate, and also shows high accuracy and robustness in risk classification tasks. In addition, by testing the performance of the model under different noise levels, this study further analyzes the stability and limitations of the 1D-CNN model in the face of data perturbations. The results show that although the performance of the model is reduced in a noisy environment, the overall anti-interference ability is strong, which is suitable for financial text analysis in actual complex scenarios. This study provides an effective technical solution for intelligent financial text processing. It not only theoretically verifies the feasibility of 1D-CNN in financial text analysis but also provides an important reference for building a smarter and more efficient financial management system in the future.
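The core operation of the 1D-CNN described above is a sliding dot product over the embedded token sequence. A minimal sketch (valid-mode cross-correlation plus ReLU, with toy numbers in place of learned kernel weights):

```python
def conv1d(signal, kernel):
    """Valid-mode 1-D cross-correlation, the basic operation of a 1D-CNN
    layer: slide the kernel along the sequence and take dot products."""
    k = len(kernel)
    return [sum(signal[i + j] * kernel[j] for j in range(k))
            for i in range(len(signal) - k + 1)]

def relu(xs):
    return [max(0.0, x) for x in xs]

# An edge-detector kernel fires where adjacent feature values jump,
# mimicking how a 1D-CNN picks up local semantic patterns in text features.
feature_map = relu(conv1d([0.0, 0.0, 1.0, 1.0, 0.0], [-1.0, 1.0]))
# feature_map → [0.0, 1.0, 0.0, 0.0]
```

A trained model stacks many such kernels over word-embedding channels and pools the resulting feature maps before classification.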
This research presents a high-performance AI framework for intelligent internal audit automation, integrating deep learning, semantic policy embeddings, and reinforcement-driven risk optimization. Leveraging fine-tuned transformer models and structured enterprise datasets, the system was evaluated across diverse audit domains including financial transactions, procurement logs, HR records, and vendor management data. The model consistently achieved detection accuracies exceeding 98.53%, with F1-scores reaching 0.94 and compliance alignment scores up to 0.94. Compared to traditional rule-based methods, the enhanced framework reduced control oversight errors by over 25% and significantly improved interpretability through SHAP-based explanations and anomaly heatmaps. The integration of contextual text embeddings, numerical audit features, and dynamic control evaluation enabled transaction-level compliance analysis and proactive risk detection. Unlike resource-intensive neural pipelines, the model maintained sub-2 second training cycles, ensuring deployment feasibility within ERP-integrated enterprise systems. The architecture supports cross-platform generalization and is extensible to various operational domains without requiring structural reengineering.
Artificial intelligence technology has brought new opportunities for financial fraud detection, but traditional methods face two major challenges: data imbalance and reliance on a single feature modality. This study therefore proposes an intelligent detection framework that integrates an improved synthetic minority over-sampling technique (SMOTE) with a Transformer. The sample distribution is optimized through dynamic weight adjustment, and multi-modal feature fusion is realized via a multi-head attention mechanism to improve detection performance. Experimental results show that the multi-modal fusion method achieves an F1 score of 0.92 and an area under the curve of 0.85, 8%-20% higher than single-modal methods. Dynamic weighting raises the recall of minority samples to 0.85 and the accuracy on boundary samples to 0.79, while reducing model variance to 0.028. With Transformer features selected at 150 dimensions, the F1 score reaches 0.92 and the expert score is 4.8. The research provides a solution with both theoretical innovation and engineering practicability for intelligent monitoring of financial risks, promoting the transformation of financial supervision from "post-inspection" to "pre-warning".
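The over-sampling side can be sketched as SMOTE-style interpolation. Note the hedge: real SMOTE interpolates toward one of the k nearest neighbors, whereas this minimal version (with made-up 2-d fraud features) interpolates toward a random other minority sample.

```python
import random

def smote_samples(minority, n_new, seed=0):
    """SMOTE-style oversampling sketch: synthesize minority-class points by
    interpolating between one sample and a randomly chosen other minority
    sample (real SMOTE restricts the partner to the k nearest neighbors)."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        a = rng.choice(minority)
        b = rng.choice([m for m in minority if m is not a])
        lam = rng.random()  # interpolation factor in [0, 1)
        synthetic.append([ai + lam * (bi - ai) for ai, bi in zip(a, b)])
    return synthetic

# Toy 2-d feature vectors for the rare fraud class.
fraud = [[0.9, 0.1], [0.8, 0.3], [0.95, 0.2]]
new_points = smote_samples(fraud, 4)
```

Each synthetic point lies on a segment between two real fraud samples, so the oversampled class stays inside the region the minority data already occupy.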
Financial fraud detection is a critical application area within the broader domains of cybersecurity and intelligent financial analytics. With the growing volume and complexity of digital transactions, the traditional rule-based and shallow learning models often fall short in detecting sophisticated fraud patterns. This study addresses the challenge of accurately identifying fraudulent financial activities, especially in highly imbalanced datasets where fraud instances are rare and often masked by legitimate behavior. The existing models also lack interpretability, limiting their utility in regulated financial environments. Experiments were conducted on three benchmark datasets: IEEE-CIS Fraud Detection, European Credit Card Transactions, and PaySim Mobile Money Simulation, each representing diverse transaction behaviors and data distributions. The proposed methodology integrates a transformer-based encoder, multi-teacher knowledge distillation, and a symbolic belief–desire–intention (BDI) reasoning layer to combine deep feature extraction with interpretable decision making. The novelty of this work lies in the incorporation of cognitive symbolic reasoning into a high-performance learning architecture for fraud detection. The performance was assessed using key metrics, including the F1-score, AUC, precision, recall, inference time, and model size. Results show that the proposed transformer–BDI model outperformed traditional and state-of-the-art baselines across all datasets, achieving improved fraud detection accuracy and interpretability while remaining computationally efficient for real-time deployment.
Auditing financial documents is a very tedious and time-consuming process. As of today, it can already be simplified by employing AI-based solutions to recommend relevant text passages from a report for each legal requirement of rigorous accounting standards. However, these methods need to be fine-tuned regularly, and they require abundant annotated data, which is often lacking in industrial environments. Hence, we present ZeroShotALI, a novel recommender system that leverages a state-of-the-art large language model (LLM) in conjunction with a domain-specifically optimized transformer-based text-matching solution. We find that a two-step approach of first retrieving a number of best matching document sections per legal requirement with a custom BERT-based model and second filtering these selections using an LLM yields significant performance improvements over existing approaches.
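The first, retrieval step of such a two-step pipeline can be sketched with cosine similarity over section embeddings; the 3-d vectors and section names here are toy stand-ins for the custom BERT encoder's output, and the second LLM-filtering step is omitted.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def top_k(query_vec, section_vecs, k=2):
    """Step 1 of the two-step approach: rank report sections by embedding
    similarity to a legal requirement and keep the top-k candidates;
    step 2 would pass each candidate to an LLM for a final yes/no filter."""
    ranked = sorted(section_vecs.items(),
                    key=lambda kv: cosine(query_vec, kv[1]), reverse=True)
    return [name for name, _ in ranked[:k]]

sections = {"s1": [1.0, 0.0, 0.0], "s2": [0.9, 0.1, 0.0], "s3": [0.0, 1.0, 0.0]}
candidates = top_k([1.0, 0.05, 0.0], sections, k=2)
# candidates → ['s1', 's2']
```

Retrieving a small candidate set first keeps the expensive LLM call off the vast majority of sections, which is what makes the two-step design efficient.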
The adoption of artificial intelligence (AI) technologies is significantly improving monitoring functions while also transforming audit functions by providing increased precision, audit scalability, and real-time monitoring capabilities. In this paper, we propose an AI-based audit methodology that incorporates data gathering, data cleansing, machine learning application, and anomaly detection to streamline error-prone audit processes and increase audit accuracy. A multi-stage model was built and tested in five industry sectors, demonstrating better anomaly detection and audit efficiency in every sector tested. Measured with a newly designed Audit Enhancement Index (AEI), the AI technologies proved substantially more efficient than traditional auditing methods in data-rich industries. A detailed workflow diagram and summary chart are provided to visually demonstrate the system's advantages over traditional methods. AI can redefine auditing from a post hoc examination of information to an ongoing, intelligent reevaluation of real-time data streams. This research has the potential to significantly advance automation in auditing and to reposition auditors as strategists empowered by instantaneous AI data analysis.
Financial fraud is a serious challenge in a rapidly evolving digital economy that places increasing demands on detection systems, yet traditional methods are often limited to the dimensional information of the corporations themselves and are insufficient to deal with the complexity and dynamics of modern financial fraud. This study introduces a novel intelligent financial fraud detection support system that leverages a three-level relationship penetration (3-LRP) method to decode complex fraudulent networks and enhance prediction accuracy. It integrates fuzzy rough density-based feature selection (FRDFS), which optimizes feature screening in noisy financial environments and significantly improves the system's reliability and performance, with fuzzy deterministic soft voting (FDSV), which combines transformer-based deep tabular networks with conventional machine learning classifiers. An empirical analysis using a real financial dataset from Chinese small and medium-sized enterprises (SMEs) demonstrates the effectiveness of the proposed method. This research enriches the financial fraud detection literature and provides practical insights for risk management professionals, introducing a comprehensive framework for early warning and proactive risk management in digital finance.
Fraud detection plays a crucial role in the financial industry, preventing significant financial losses. Traditional rule-based systems and manual audits often struggle with the evolving nature of fraud schemes and the vast volume of transactions. Recent advances in machine learning, particularly graph neural networks (GNNs), have shown promise in addressing these challenges. However, GNNs still face limitations in learning intricate patterns, effectively utilizing edge attributes, and maintaining efficiency on large financial graphs. To address these limitations, we introduce FraudGT, a simple, effective, and efficient graph transformer (GT) model specifically designed for fraud detection in financial transaction graphs. FraudGT leverages edge-based message passing gates and an edge attribute-based attention bias to enhance its ability to discern important transactional features and differentiate between normal and fraudulent transactions. Our model achieves state-of-the-art performance in detecting fraudulent activities while demonstrating high throughput and significantly lower latency compared to existing methods. We validate the effectiveness of FraudGT through extensive experiments on multiple large-scale synthetic financial datasets. FraudGT consistently outperforms other models, achieving 7.8–17.8% higher F1 scores, while delivering an average of 2.4× greater throughput and reduced latency. Our code and datasets are available at https://github.com/junhongmit/FraudGT.
Global indirect tax compliance for large-scale digital commerce platforms has become a complex, high-stakes systems problem due to jurisdictional fragmentation, frequently changing regulations, and rapidly expanding heterogeneous product catalogs. Rule-based tax engines, although auditable and deterministic, fail to scale in this setting: their authoring processes are fragile, maintenance is expensive, and their semantic knowledge of product data is limited. This article provides a detailed design of an intelligent tax automation system built on machine learning-based item-to-tax prediction services, supported by confidence-aware orchestration, human-in-the-loop protection, and explainability features appropriate to regulated financial settings. The framework uses transformer-based language models, trained on large-scale multilingual commerce data, to predict tax classifications directly from item titles, descriptions, and structured taxonomy cues. Instead of relying on fixed mappings, the system learns semantic associations between product representations and jurisdiction-specific tax treatments, allowing it to correctly process long-tail, ambiguous, and newly added items. Predictions carry calibrated confidence scores indicating whether a transaction can be safely automated, sent to policy validation, or escalated for expert scrutiny. This selective automation model balances operational efficiency against regulatory risk, compliance integrity, and scale. The architecture is deployed as a controlled machine learning system combining continuous monitoring, auditability, and feedback-driven retraining pipelines.
Experience from large-scale deployments shows that such systems can substantially reduce the effort required for manual rule formulation and scrutiny, increase the accuracy of classification into thousands of categories, and deliver a quantifiable financial effect, without compromising transparency to auditors and other regulatory stakeholders. The article positions intelligent, ML-driven tax automation as a feasible and responsible alternative to legacy rule-based systems in global compliance.
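The confidence-aware orchestration described above can be sketched as simple threshold routing; the thresholds, tier names, and tax labels below are illustrative, not the article's.

```python
def route(prediction, confidence, auto_threshold=0.95, review_threshold=0.70):
    """Confidence-aware orchestration sketch: auto-apply high-confidence tax
    classifications, send mid-confidence ones to policy validation, and
    escalate low-confidence ones to a human expert."""
    if confidence >= auto_threshold:
        return ("automate", prediction)
    if confidence >= review_threshold:
        return ("policy_validation", prediction)
    return ("expert_review", prediction)

tier, pred = route("standard_rate", 0.98)
# tier → 'automate'
tier, pred = route("exempt", 0.40)
# tier → 'expert_review'
```

In practice the confidences must be calibrated (e.g. via temperature scaling or isotonic regression) before thresholds like these carry any regulatory meaning, which is why the article stresses calibration alongside routing.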
We investigate whether large language models (LLMs) can successfully perform financial statement analysis in a way similar to a professional human analyst. We provide standardized and anonymous financial statements to GPT-4 and instruct the model to analyze them to determine the direction of firms' future earnings. Even without narrative or industry-specific information, the LLM outperforms financial analysts in its ability to predict earnings changes directionally. The LLM exhibits a relative advantage over human analysts in situations when the analysts tend to struggle. Furthermore, we find that the prediction accuracy of the LLM is on par with a narrowly trained state-of-the-art ML model. LLM prediction does not stem from its training memory. Instead, we find that the LLM generates useful narrative insights about a company's future performance. Lastly, our trading strategies based on GPT's predictions yield a higher Sharpe ratio and alphas than strategies based on other models. Our results suggest that LLMs may take a central role in analysis and decision-making.
With the widespread adoption of Internet-based AI technologies, addressing financial fraud has become increasingly critical, particularly within the realm of machine learning. In this case, deep learning and natural language processing (NLP) techniques offer powerful means of detecting fraudulent activity by analyzing financial documents, thereby enhancing both the efficiency and precision of such assessments and supporting financial security. In this study, we introduce deep representation learning-based approaches relying mainly on large language models (LLMs) for identifying fraud in financial statements by examining temporal changes in the Management Discussion and Analysis (MD&A) sections of corporate disclosures. Departing from conventional techniques that rely only on word frequency analysis, we propose DeepFraud, which combines time-evolving financial LLM embeddings of paragraphs, such as FinBERT, FinLlama, and FinGPT embeddings, and uses long short-term memory (LSTM) to predict frauds via historical textual embeddings. In addition to LLM embeddings, we also integrate (1) time-evolving frequencies of words relevant to fraud detection, such as those expressing sentiment or uncertainty, and (2) time-evolving financial ratios. Trajectories of paragraph-level embeddings, frequencies, and ratios are used to construct a fraud detection model, which we evaluate against machine learning methods and deep time-series models. Using 30 years of financial report data (from 1995 to 2024), our experiments demonstrate that DeepFraud on average enhances fraud detection performance across a number of scenarios and outperforms the competing approaches as well as conventional word frequency approaches. Our framework introduces a novel direction for deep feature engineering in the field of financial statement fraud detection.
Driven by the internet wave, the auditing industry faces unprecedented challenges and opportunities. Traditional audit methods increasingly reveal weaknesses in processing vast amounts of data, requiring new technologies to maximize audit efficiency and accuracy. As an epoch-making innovation in natural language processing, Large Language Models (LLMs) demonstrate unequalled performance in text parsing, semantic detection, and text generation, opening up approaches for the forward-looking intelligent reform of auditing. This paper investigates how LLMs can be applied to the intelligent detection of checking relationships in financial statements to enhance the efficiency and accuracy of auditing. Combining an audit knowledge base with Retrieval-Augmented Generation (RAG) technologies, we introduce a multi-agent system for intelligent checking-relationship detection. The LLMs automatically identify and verify checking relationships between financial statements, dramatically improving audit efficiency and quality; the audit knowledge base provides environmental context and data support, while RAG enhances the power of analysis. This paper demonstrates, through experiments, that these technologies can support intelligent checking-relationship detection, ushering in an era of intelligent auditing.
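The checking relationships themselves are deterministic accounting identities, which is what makes LLM-based verification testable. A minimal rule-based sketch of what is being checked (the two rules and the field names are illustrative; the paper's system derives such relationships via its knowledge base and LLM agents):

```python
def check_relationships(stmt, tolerance=0.5):
    """Verify canonical checking (cross-footing) relationships between
    financial statement line items, allowing a small rounding tolerance."""
    rules = {
        "balance_sheet_identity":
            abs(stmt["total_assets"]
                - (stmt["total_liabilities"] + stmt["total_equity"])) <= tolerance,
        "gross_profit":
            abs(stmt["gross_profit"]
                - (stmt["revenue"] - stmt["cost_of_sales"])) <= tolerance,
    }
    return {name: ("pass" if ok else "fail") for name, ok in rules.items()}

stmt = {"total_assets": 500.0, "total_liabilities": 300.0, "total_equity": 200.0,
        "revenue": 120.0, "cost_of_sales": 70.0, "gross_profit": 55.0}
results = check_relationships(stmt)
# balance sheet identity holds; gross profit (55 vs 120 - 70 = 50) is flagged
```

The appeal of the LLM approach is locating and mapping these line items across heterogeneous statement layouts, a step this hard-coded sketch assumes away.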
Financial statement auditing is essential for stakeholders to understand a company's financial health, yet current manual processes are inefficient and error-prone. Even with extensive verification procedures, auditors frequently miss errors, leading to inaccurate financial statements that fail to meet stakeholder expectations for transparency and reliability. To this end, we harness large language models (LLMs) to automate financial statement auditing and rigorously assess their capabilities, providing insights on their performance boundaries in the scenario of automated auditing. Our work introduces a comprehensive benchmark using a curated dataset combining real-world financial tables with synthesized transaction data. In the benchmark, we developed a rigorous five-stage evaluation framework to assess LLMs' auditing capabilities. The benchmark also challenges models to map specific financial statement errors to corresponding violations of accounting standards, simulating real-world auditing scenarios through test cases. Our testing reveals that current state-of-the-art LLMs successfully identify financial statement errors when given historical transaction data. However, these models demonstrate significant limitations in explaining detected errors and citing relevant accounting standards. Furthermore, LLMs struggle to execute complete audits and make necessary financial statement revisions. These findings highlight a critical gap in LLMs' domain-specific accounting knowledge. Future research must focus on enhancing LLMs' understanding of auditing principles and procedures. Our benchmark and evaluation framework establish a foundation for developing more effective automated auditing tools that will substantially improve the accuracy and efficiency of real-world financial statement auditing.
Financial statement fraud, as a critical risk factor threatening the healthy development of capital markets, has long been a focal point of both academic research and practical concern. In recent years, Large Language Models (LLMs), with their advanced capabilities in text comprehension and logical reasoning, have opened new avenues for financial report analysis. This paper focuses on the financial reports of publicly listed companies and proposes a novel fraud detection method that integrates structured operational indicators with semantic information from key financial report sections. Specifically, the approach introduces SAGE prompt templates and a multi-step reasoning mechanism to guide the model in identifying potential fraudulent risks within texts and generating rationales. Furthermore, task prompts incorporating both operational metrics and textual analysis results are designed to enhance the model's fraud detection capability. Experimental results demonstrate that the proposed method outperforms traditional approaches across key evaluation metrics, including Accuracy, Precision, Recall, and F1-score, thereby validating its effectiveness and superiority in financial statement fraud detection. This study not only offers a new technical solution for identifying financial risks but also explores a viable path for applying LLMs in the financial domain.
A Preliminary Fundamental Financial Analysis Framework Using Structured LLM Prompting - A Case Study
Financial analysts routinely calculate standard ratios for company evaluation, often using inconsistent Excel templates that require manual updates and lack interactive visualization capabilities. This paper presents a web-based financial analysis platform that automates preliminary analysis through 15 fundamental financial ratios across five categories (liquidity, solvency, profitability, efficiency, and risk assessment), providing structured inputs to large language models (LLMs) for intelligent insights. Our research demonstrates that LLMs generate significantly more accurate analysis when provided with pre-calculated, contextually rich metrics rather than raw financial statements, achieving 73% higher relevance scores, 81% better risk identification, and 65% more accurate comparative analysis. The platform, built with a modular architecture supporting any state-of-the-art LLM API (GPT-4, Claude, Gemini), processes CSV data to calculate metrics and generate interactive dashboards with AI-powered commentary. We validate the framework through comprehensive comparative analysis of companies with contrasting business models, showing how structured inputs enable nuanced, context-aware insights that adapt to specific financial situations. The tool significantly reduces the time required for analysis while ensuring computational consistency and providing institutional-quality interpretation that would typically require senior analyst expertise.
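The pre-calculation step is straightforward to sketch. Below are a few standard ratios of the kind such a platform might compute before prompting an LLM; the field names and this four-ratio subset are assumptions, not the platform's actual 15-ratio schema.

```python
def basic_ratios(fin):
    """Compute a handful of standard financial ratios from a flat dict of
    statement line items (hypothetical field names)."""
    return {
        "current_ratio": fin["current_assets"] / fin["current_liabilities"],
        "debt_to_equity": fin["total_debt"] / fin["total_equity"],
        "net_margin": fin["net_income"] / fin["revenue"],
        "asset_turnover": fin["revenue"] / fin["total_assets"],
    }

fin = {"current_assets": 400.0, "current_liabilities": 200.0,
       "total_debt": 300.0, "total_equity": 600.0,
       "net_income": 90.0, "revenue": 900.0, "total_assets": 1200.0}
ratios = basic_ratios(fin)
# ratios → {'current_ratio': 2.0, 'debt_to_equity': 0.5,
#           'net_margin': 0.1, 'asset_turnover': 0.75}
```

Serializing a dict like `ratios` into the prompt, instead of raw statements, is exactly the "structured input" the paper credits for the accuracy gains.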
The surge of large language models (LLMs) has revolutionized the extraction and analysis of crucial information from a growing volume of financial statements, announcements, and business news. Recognizing named entities to construct structured data is a significant challenge in analyzing financial documents and a foundational task for intelligent financial analytics. However, how effective these generic LLMs are, and how they perform under various prompts, is not yet well understood. To fill this gap, we present a systematic evaluation of state-of-the-art LLMs and prompting methods on the financial Named Entity Recognition (NER) problem. Our experimental results highlight their strengths and limitations, identify five representative failure types, and provide insights into their potential and challenges for domain-specific tasks.
Financial due diligence requires intensive analysis of vast unstructured documents (e.g., contracts, statements, invoices). However, traditional manual processing is inefficient, costly, and prone to subjectivity, and existing automation solutions primarily focus on single-modal text recognition, lacking the capacity for joint understanding of multimodal features (e.g., layout, seals, table structures) and deep risk reasoning. This study proposes an end-to-end framework based on a Multimodal Large Language Model (MLLM) to bridge this gap. The framework not only performs accurate multimodal information extraction but also integrates domain-specific knowledge (e.g., regulatory clauses) to emulate expert-like reasoning. By constructing a dynamic risk knowledge graph that captures entities and relations across documents, it enables cross-document correlation analysis and anomaly detection. We will validate the framework on curated financial datasets, assessing both its information processing accuracy and risk diagnosis capability. Our contributions are threefold: 1) providing a novel computational linguistics solution that addresses the semantic and pragmatic challenges in financial document understanding; 2) advancing financial AI from perceptual to cognitive intelligence through explainable, knowledge-integrated reasoning; 3) offering a transparent, automated decision-support tool for high-stakes due diligence.
This study examines the information content of textual disclosures in firms’ earnings announcements. Using a large language model (LLM) to capture information in both words and word context, I show that the news in earnings press releases (i) explains three times more variation in short-window stock returns than a host of textual measures based on dictionary and non-LLM machine learning methods; (ii) doubles the R2 of an array of financial statement surprises, modeled with conventional regression or machine learning approaches; and (iii) accounts for a large fraction of immediate price revisions within just five minutes of release. LLM-modeled conference calls further enhance R2 by one fourth compared with press releases and financial surprises. Textual disclosures are more informative when earnings are less persistent and during periods of aggregate uncertainty. Most news arises from text describing numbers, at the beginning of the disclosure, and including novel contents. These findings highlight the role of firms’ textual disclosures in moving stock prices and advance our understanding of how investors utilize corporate disclosures. This paper was accepted by Suraj Srinivasan, accounting.
The financial domain poses unique challenges for knowledge graph (KG) construction at scale due to the complexity and regulatory nature of financial documents. Despite the critical importance of structured financial knowledge, the field lacks large-scale, open-source datasets capturing rich semantic relationships from corporate disclosures. We introduce an open-source, large-scale financial knowledge graph dataset built from the latest annual SEC 10-K filings of all S&P 100 companies, a comprehensive resource designed to catalyze research in financial AI. We propose a robust and generalizable KG construction framework that integrates intelligent document parsing, table-aware chunking, and schema-guided iterative extraction with a reflection-driven feedback loop. Our system incorporates a comprehensive evaluation pipeline, combining rule-based checks, statistical validation, and LLM-as-a-Judge assessments to holistically measure extraction quality. We support three extraction modes (single-pass, multi-pass, and reflection-agent-based), allowing flexible trade-offs between efficiency, accuracy, and reliability based on user requirements. Empirical evaluations demonstrate that the reflection-agent-based mode consistently achieves the best balance, attaining a 64.8% compliance score against all rule-based policies (CheckRules) and outperforming baseline methods (single-pass & multi-pass) across key metrics such as precision, comprehensiveness, and relevance in LLM-guided evaluations. The utility of our KG pipeline is demonstrated through its flexible extraction modes, coupled with a multi-faceted evaluation methodology. By releasing a high-quality, thoroughly evaluated dataset along with a comprehensive KG construction & evaluation framework, we aim to advance transparency, reproducibility, and innovation in financial KG research. The dataset is publicly available at: https://anonymous.4open.science/r/KG-Financial-Datasets-SP-100-529B/README.md
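The reflection-driven feedback loop described above can be sketched in a few lines. The schema, the CheckRules stand-in, and the extractor below are all hypothetical placeholders for the paper's components (a real extractor would be an LLM call):

```python
# Sketch of a reflection-driven extraction loop (all names hypothetical):
# an extractor proposes (subject, relation, object) triples, rule-based
# checks flag violations, and the feedback drives another pass.

SCHEMA_RELATIONS = {"subsidiary_of", "audited_by", "reports_segment"}

def check_rules(triples):
    """Return violations: triples whose relation falls outside the schema."""
    return [t for t in triples if t[1] not in SCHEMA_RELATIONS]

def extract(chunk, feedback=None):
    # Placeholder for an LLM call; a real system would prompt the model
    # with the chunk plus any rule violations from the previous pass.
    triples = [("AcmeCo", "subsidiary_of", "AcmeHoldings"),
               ("AcmeCo", "ceo", "J. Doe")]  # second triple violates schema
    if feedback:  # reflection pass: drop the flagged triples
        triples = [t for t in triples if t not in feedback]
    return triples

def reflect_extract(chunk, max_passes=3):
    feedback = None
    for _ in range(max_passes):
        triples = extract(chunk, feedback)
        feedback = check_rules(triples)
        if not feedback:          # all triples pass the rule checks
            return triples
    return [t for t in triples if t not in feedback]

kg = reflect_extract("Acme Co is a wholly owned subsidiary of Acme Holdings ...")
```

In this toy run the first pass emits an out-of-schema triple, the rule check flags it, and the second pass returns only compliant triples, mirroring the single-pass versus reflection-agent trade-off.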
The stock market is inherently complex, with interdependent relationships among companies, sectors, and financial indicators. Traditional research has largely focused on time-series forecasting and single-company analysis, relying on numerical data for stock price prediction. While such approaches can provide short-term insights, they are limited in capturing relational patterns, competitive dynamics, and explainable investment reasoning. To address these limitations, we propose a knowledge graph schema specifically designed for the stock market, modeling companies, sectors, stock indicators, financial statements, and inter-company relationships. By integrating this schema with large language models (LLMs), our approach enables multi-hop reasoning and relational queries, producing explainable and in-depth answers to complex financial questions. Figure 1 illustrates the system pipeline, detailing the flow from data collection and graph construction to LLM-based query processing and answer generation. We validate the proposed framework through practical case studies on Korean listed companies, demonstrating its capability to extract insights that are difficult or impossible to obtain from traditional database queries alone. The results highlight the potential of combining knowledge graphs with LLMs for advanced investment analysis and decision support.
In the era of artificial intelligence and fintech, improving the efficiency of financial analysis is essential for financial service providers. This article proposes a novel large language model-enhanced text mining workflow that leverages Internet-sourced text information to efficiently analyze supply chain finance business without requiring programming skills. We conduct a case study on the Chinese market for new energy buses—a rapidly growing sector due to government incentives and the push for sustainable urban transportation—using data from bidding websites and financial statements. The experimental results demonstrate that our LLM-enhanced workflow outperforms traditional methods, showcasing increased efficiency and practicality, especially for non-programming employees in supply chain financial services.
No abstract available
ABSTRACT Purpose This study aims to integrate large language models (LLMs) with interpretable machine learning methods to develop a multimodal data-driven framework for predicting corporate financial fraud, addressing the limitations of traditional approaches in long-text semantic parsing, model interpretability, and multisource data fusion, thereby providing regulatory agencies with intelligent auditing tools. Design/methodology/approach Analyzing 5,304 Chinese listed firms’ annual reports (2015-2020) from the CSMAD database, this study leverages the Doubao LLMs to generate chunked summaries and 256-dimensional semantic vectors, developing textual semantic features. It integrates 19 financial indicators, 11 governance metrics, and linguistic characteristics (tone, readability) with fraud prediction models optimized through a group of Gradient Boosted Decision Tree (GBDT) algorithms. SHAP value analysis in the final model reveals the risk transmission mechanism by quantifying the marginal impacts of financial, governance, and textual features on fraud likelihood. Findings The study found that LLMs effectively distill lengthy annual reports into semantic summaries, while GBDT algorithms (AUC > 0.850) outperform the traditional Logistic Regression model in fraud detection. Multimodal fusion improved performance by 7.4%, with financial, governance, and textual features providing complementary signals. SHAP analysis revealed financial distress, governance conflicts, and narrative patterns (e.g., tone anchoring, semantic thresholds) as key fraud indicators, highlighting managerial intent in report language. Research limitations This study identifies three key limitations: 1) lack of interpretability for semantic features, 2) absence of granular fraud-type differentiation, and 3) unexplored comparative validation with other deep learning methods. Future research will address these gaps to enhance fraud detection precision and model transparency. 
Practical implications The developed semantic-enhanced evaluation model provides a quantitative tool for assessing listed companies’ information disclosure quality and enables practical implementation through its derivative real-time monitoring system. This advancement significantly strengthens capital market risk early-warning capabilities, offering actionable insights for securities regulation. Originality/value This study presents three key innovations: 1) a novel “chunking-summarization-embedding” framework for efficient semantic compression of lengthy annual reports (30,000 words); 2) demonstration of LLMs’ superior performance in financial text analysis, outperforming traditional methods by 19.3%; 3) a novel “language-psychology-behavior” triad model for analyzing managerial fraud motives.
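The "chunking-summarization-embedding" idea can be illustrated with a minimal sketch. The summarizer and the hashing embedder below are toy stand-ins (the paper uses the Doubao LLM and 256-dimensional semantic vectors); only the pipeline shape is meant to carry over:

```python
import math

def chunk(text, size=2000):
    # Split a long annual report into fixed-size character chunks.
    return [text[i:i + size] for i in range(0, len(text), size)]

def summarize(chunk_text):
    # Placeholder for an LLM summarization call; here we just keep
    # the first sentence of the chunk as a stand-in.
    return chunk_text.split(".")[0]

def embed(text, dim=256):
    # Toy hashing embedding standing in for a 256-d semantic vector.
    v = [0.0] * dim
    for tok in text.lower().split():
        v[hash(tok) % dim] += 1.0
    n = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / n for x in v]

def report_vector(report):
    summaries = [summarize(c) for c in chunk(report)]
    vecs = [embed(s) for s in summaries]
    # Mean-pool chunk vectors into one document-level feature vector,
    # ready to be concatenated with financial and governance features.
    return [sum(col) / len(vecs) for col in zip(*vecs)]

vec = report_vector("Revenue grew 12% in 2020. Margins were stable. " * 200)
```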
No abstract available
This paper provides a comparative study of two major large language model (LLM) strategies, instruction-based fine-tuning and retrieval-augmented generation (RAG), for corporate financial question answering. A modular AI pipeline is designed where each financial domain (e.g. liquidity, investment, credit analysis) is treated as an independent QA module that operates on structured company-level data. The dataset includes Turkish public company financial statements from 2008 to 2025. The fine-tuned variant based on LLaMA 3 8B is trained on domain-specific prompts in Turkish using LoRA adapters. The RAG-based variant utilizes a vector search engine to retrieve relevant financial passages. In addition, tabular reasoning is integrated using pandas to provide dynamic, code-based access to structured data and is developed as a more advanced version (ragenh). To evaluate the quality of generated answers, a set of metrics is applied that captures semantic similarity, numerical accuracy, and directional consistency, such as ROUGE-L, BERTScore, and domain-specific number and trend alignment scores. The results obtained from these metrics show that while the fine-tuned models perform well in interpretive and trend-based tasks, ragenh outperforms both baselines in ground-truth and opinion-based reasoning. This work provides a scalable framework for building interpretable financial assistants in under-resourced language environments by combining modular QA design, hybrid architectures, and custom evaluation. The findings contribute to developing robust, context-aware LLM applications for financial decision support.
In this paper, we introduce the Financial-STS task, a financial domain-specific NLP task designed to measure the nuanced semantic similarity between pairs of financial narratives. These narratives originate from the financial statements of the same company but correspond to different periods, such as year-over-year comparisons. Measuring the subtle semantic differences between these paired narratives enables market stakeholders to gauge changes over time in the company's financial and operational situations, which is critical for financial decision-making. We find that existing pretrained embedding models and LLM embeddings fall short in discerning these subtle financial narrative shifts. To address this gap, we propose an LLM-augmented pipeline specifically designed for the Financial-STS task. Evaluation on a human-annotated dataset demonstrates that our proposed method outperforms existing methods trained on classic STS tasks and generic LLM embeddings.
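Why generic embeddings fall short on Financial-STS is easy to demonstrate: two year-over-year narratives with opposite meanings can still score as highly similar under surface-level representations. A toy bag-of-words cosine (standing in for a generic embedding; not the paper's method) makes the point:

```python
import math
from collections import Counter

def embed(text):
    # Stand-in representation (token counts); the paper finds generic
    # embeddings insufficient and augments them with LLM-extracted cues.
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Paired narratives from consecutive periods (invented examples).
y2022 = "Gross margin improved on favorable product mix."
y2023 = "Gross margin declined on unfavorable product mix."
sim = cosine(embed(y2022), embed(y2023))
```

The two sentences state opposite margin movements yet score around 0.71, which is exactly the kind of subtle narrative shift the proposed LLM-augmented pipeline is designed to discern.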
Conventional methods of financial document analysis have relied on official numerical indicators, overlooking the vast narrative data in company reports, regulatory submissions, and sector statements. With recent progress in domain-informed large language models (LLMs), machines have become capable of processing and encoding financial documents with better accuracy than at any previous point. This paper presents a multimodal LLM architecture that synthesizes the narrative content of financial reports and couples it with a time-series predictor for downstream tasks that do not otherwise use the narrative, namely forecasting and control. Case studies from the energy and insurance sectors show that LLM-based document processing increases prediction accuracy and improves comprehension, contributing to risk identification, regulatory harmonization, and decision-making. The findings highlight the value of LLMs as a methodological roadmap for financial document processing beyond the banking sector.
Currently, research on natural-language-to-SQL (NL2SQL) generation mainly focuses on generic datasets, aiming to build models that can parse natural language queries and automatically generate SQL statements. However, this generic exploration often ignores the complexity and idiosyncrasies of intra-enterprise data, such as industry-specific terminology, data structure differences, and security compliance requirements. As a result, existing NL2SQL technology covers only basic query requirements in practical applications and is difficult to integrate deeply into enterprise business scenarios. This paper aims to fill this research gap by focusing on the customized application of NL2SQL technology in specific internal enterprise environments, combining an LLM with RAG, memory engineering, and agent feedback to design and implement an NL2SQL system for internal enterprise use. After verification on a practical project, the system’s ability to generate SQL in an enterprise financial data environment improves from 54% to about 70%, and the accuracy of multi-round dialog is further improved. This system enables a seamless connection between natural language and enterprise databases, providing strong support for enterprise digital transformation.
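A minimal sketch of the retrieval half of such a system, with an invented two-table schema and a keyword-overlap retriever standing in for the vector store (the prompt wording, table names, and generation step are all hypothetical):

```python
# Enterprise NL2SQL sketch: retrieve only the relevant table DDL and
# prepend it to the prompt the SQL-generating model would receive.

SCHEMA_DOCS = {
    "ledger": "CREATE TABLE ledger(entry_id INT, account TEXT, amount REAL, entry_date TEXT)",
    "vendors": "CREATE TABLE vendors(vendor_id INT, name TEXT, region TEXT)",
}

def retrieve_schema(question):
    # Toy retriever: keyword overlap with table names and column tokens.
    q = question.lower()
    return [ddl for name, ddl in SCHEMA_DOCS.items()
            if name in q or any(col in q for col in ddl.lower().split())]

def build_prompt(question):
    context = "\n".join(retrieve_schema(question))
    return f"-- Schema:\n{context}\n-- Question: {question}\n-- SQL:"

prompt = build_prompt("total amount in the ledger by account")
```

Scoping the prompt to the retrieved schema is what lets the approach scale past generic NL2SQL benchmarks: the model never sees irrelevant, possibly confidential tables.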
Financial results briefings offer a two-way communication channel between investors and management and provide qualitative information. Although the importance of this qualitative information has been recognized, there has yet to be an investigation into the extent to which such information is included in financial results briefings and how corporate characteristics and performance affect it. Therefore, in this study, we classify the text of financial results briefings into “facts” or “opinions” using GPT-3.5-turbo (ChatGPT), a large language model (LLM), and evaluate zero-shot and few-shot classification performance against manual labels assigned by three experts. Then, we measure the proportion of “facts” and “opinions” in each briefing and clarify to what extent the numbers of opinions and facts are influenced by company size and performance. We find that, on average, 38% of statements in the briefings contain opinions. Companies with smaller ordinary income margins tend to have a higher percentage of opinion statements, and companies with higher market value relative to book value have a higher volume of opinion sentences. The study highlights the importance of financial results briefings as a communication channel providing subjective opinions and explaining financial results and future outlook.
No abstract available
Large Language Models (LLMs) have made remarkable progress, surpassing human performance on several benchmarks in domains such as mathematics and coding. A key driver of this progress has been the development of benchmark datasets. In contrast, the financial domain poses higher entry barriers due to its demand for specialized expertise, and benchmarks remain relatively scarce compared to those in mathematics or coding. We introduce EDINET-Bench, an open-source Japanese financial benchmark designed to evaluate LLMs on challenging tasks such as accounting fraud detection, earnings forecasting, and industry classification. EDINET-Bench is constructed from ten years of annual reports filed by Japanese companies. These tasks require models to process entire annual reports and integrate information across multiple tables and textual sections, demanding expert-level reasoning that is challenging even for human professionals. Our experiments show that even state-of-the-art LLMs struggle in this domain, performing only marginally better than logistic regression in binary classification tasks such as fraud detection and earnings forecasting. Our results show that simply providing reports to LLMs in a straightforward setting is not enough. This highlights the need for benchmark frameworks that better reflect the environments in which financial professionals operate, with richer scaffolding such as realistic simulations and task-specific reasoning support to enable more effective problem solving. We make our dataset and code publicly available to support future research.
With the proliferation of digital financial services and digital transactional documents, data volumes are vastly increasing, including invoices, receipts, bank statements, and balance sheets. Information extraction from these documents has accordingly garnered keen interest. Manual data extraction is time-consuming and prone to human error, as the documents come in many formats. This paper covers techniques, tools, and technology for extracting tables from financial and transactional documents, specifically in the case of vertical tables and in the presence of mixed-type data representations. Table extraction means extracting tabular data from a readable or image-based document and transforming it into a structured format (CSV/JSON). The paper discusses extraction methods such as rule-based extraction, optical character recognition (OCR), and machine learning models. It also covers use cases from banking, e-commerce, and accounting, among other industries. The paper then discusses ethical and legal implications such as GDPR, HIPAA, and compliance with data privacy laws, and how AI systems should remain transparent and fair. Last but not least, future trends in table extraction, including the integration of generative AI and large language models (LLMs), robotic process automation (RPA), and real-time data extraction, are discussed. This paper presents the growing demand for advanced extraction technologies to increase financial document processing accuracy, efficiency, and scalability.
As of 2025, Generative Artificial Intelligence (GenAI) has become a central tool for productivity across industries. Beyond text generation, GenAI now plays a critical role in coding, data analysis, and research workflows. As large language models (LLMs) continue to evolve, it is essential to assess the reliability and accuracy of their outputs, especially in specialized, high-stakes domains like finance. Most modern LLMs transform text into numerical vectors, which are used in operations such as cosine similarity searches to generate responses. However, this abstraction process can lead to misinterpretation of emotional tone, particularly in nuanced financial contexts. While LLMs generally excel at identifying sentiment in everyday language, these models often struggle with the nuanced, strategically ambiguous language found in earnings call transcripts. Financial disclosures frequently embed sentiment in hedged statements, forward-looking language, and industry-specific jargon, making it difficult even for human analysts to interpret consistently, let alone AI models. This paper presents findings from the Santa Clara Microsoft Practicum Project, led by Professor Charlie Goldenberg, which benchmarks the performance of Microsoft's Copilot, OpenAI's ChatGPT, Google's Gemini, and traditional machine learning models for sentiment analysis of financial text. Using Microsoft earnings call transcripts, the analysis assesses how well LLM-derived sentiment correlates with market sentiment and stock movements and evaluates the accuracy of model outputs. Prompt engineering techniques are also examined to improve sentiment analysis results. Visualizations of sentiment consistency are developed to evaluate alignment between tone and stock performance, with sentiment trends analyzed across Microsoft's lines of business to determine which segments exert the greatest influence.
Retrieval Augmented Generation (RAG) systems show promise for financial question answering, yet high accuracy on benchmarks such as FinanceBench (19% baseline, 32% updated) remains challenging [1] [8]. This paper presents a systematic, multistage approach to significantly improve the performance of the RAG pipeline for financial QA. We first established a robust curated baseline using Gemini-2.0, the Docling parser, Google's text-embedding-004, and a vector database, achieving an initial accuracy of 43%. Subsequent architectural and component-wise optimizations were then iteratively implemented. First, a metadata filtering strategy, which utilizes a fine-tuned NER model to extract company names and years from queries, improved accuracy to 72%, demonstrating that targeted retrieval can simulate the benefits of a single-store per-filing approach [1]. Second, a hybrid chunking technique, which preserves the structure of the document and utilizes tokenization-sensitive refinements, further increased the accuracy to 80%. Third, the implementation of a Hybrid Search mechanism, combining dense and sparse retrieval methods, advanced performance to 84%. Finally, LLM-based query expansion, which transforms user queries into answer formats, yielded a final accuracy of 88%. This research demonstrates that a carefully designed RAG pipeline, incorporating intelligent metadata filtering, layout-aware chunking, advanced similarity search, and query semantics enhancement, substantially improves financial QA, significantly outperforming existing baselines.
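The metadata-filter plus hybrid-search idea can be sketched with toy components. The regex-based company/year extraction stands in for the fine-tuned NER model, the corpus is invented, and the dense scores are supplied by hand rather than computed from embeddings:

```python
import re

# Invented mini-corpus of filing snippets with metadata.
DOCS = [
    {"id": 1, "company": "AcmeCo", "year": 2021, "text": "AcmeCo 2021 revenue rose to $5.2B"},
    {"id": 2, "company": "AcmeCo", "year": 2022, "text": "AcmeCo 2022 revenue rose to $6.0B"},
    {"id": 3, "company": "OtherInc", "year": 2022, "text": "OtherInc 2022 revenue fell"},
]

def extract_filters(query):
    # Stand-in for the fine-tuned NER model: regex year, known-name lookup.
    year = re.search(r"\b(19|20)\d{2}\b", query)
    company = next((d["company"] for d in DOCS
                    if d["company"].lower() in query.lower()), None)
    return company, int(year.group()) if year else None

def sparse_score(query, text):
    # Keyword-overlap proxy for a sparse (BM25-style) retriever.
    q, t = set(query.lower().split()), set(text.lower().split())
    return len(q & t) / len(q)

def hybrid_search(query, dense_scores):
    company, year = extract_filters(query)
    hits = [d for d in DOCS
            if (company is None or d["company"] == company)
            and (year is None or d["year"] == year)]
    # Blend sparse overlap with (precomputed, toy) dense scores.
    return max(hits, key=lambda d: 0.5 * sparse_score(query, d["text"])
                                   + 0.5 * dense_scores.get(d["id"], 0.0))

best = hybrid_search("What was AcmeCo revenue in 2022?",
                     dense_scores={1: 0.7, 2: 0.9, 3: 0.4})
```

Filtering first shrinks the candidate set to the right filing before any similarity scoring happens, which is how targeted retrieval simulates a single-store-per-filing setup.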
The Financial Planning & Analysis (FP&A) profession is at a critical turning point, shifting from being a back-end reporting function to becoming a forward-looking predictive engine for strategic insight. This shift is driven by the adoption of artificial intelligence (AI), specifically machine learning (ML) and large language models (LLMs). Traditional FP&A processes, constrained by manual intervention and past-centric data, are making way for sophisticated analytical paradigms. ML models now reveal subtle patterns in large sets of data, across internal metrics and external signals, to make highly sophisticated, probabilistic projections. At the same time, LLMs are transforming unstructured data analysis, revealing actionable insight from financial statements and market research. This fusion of technologies makes forecasts more accurate, allows for real-time anomaly detection, and gives deep strategic context. Yet successful adoption is not merely a technical upgrade; it is a strategic imperative dependent on strong data infrastructure, rigorous model governance, and a focus on upskilling human capital. This paper offers a complete roadmap for leaders to successfully embark on this complicated journey, making their finance function a genuine force of strategic foresight and competitive edge.
In specialized domains, humans often compare new problems against similar examples, highlight nuances, and draw conclusions instead of analyzing information in isolation. When applying reasoning in specialized contexts with LLMs on top of a RAG, the pipeline can capture contextually relevant information, but it is not designed to retrieve comparable cases or related problems. While RAG is effective at extracting factual information, its outputs in specialized reasoning tasks often remain generic, reflecting broad facts rather than context-specific insights. In finance, it results in generic risks that are true for the majority of companies. To address this limitation, we propose a peer-aware comparative inference layer on top of RAG. Our contrastive approach outperforms baseline RAG in text generation metrics such as ROUGE and BERTScore in comparison with human-generated equity research and risk.
In the face of global economic uncertainty, financial auditing has become essential for regulatory compliance and risk mitigation. Traditional manual auditing methods are increasingly limited by large data volumes, complex business structures, and evolving fraud tactics. This study proposes an AI-driven framework for enterprise financial audits and high-risk identification, leveraging machine learning to improve efficiency and accuracy. Using a dataset from the Big Four accounting firms (EY, PwC, Deloitte, KPMG) from 2020 to 2025, the research examines trends in risk assessment, compliance violations, and fraud detection. The dataset includes key indicators such as audit project counts, high-risk cases, fraud instances, compliance breaches, employee workload, and client satisfaction, capturing both audit behaviors and AI's impact on operations. To build a robust risk prediction model, three algorithms - Support Vector Machine (SVM), Random Forest (RF), and K-Nearest Neighbors (KNN) - are evaluated. SVM uses hyperplane optimization for complex classification, RF combines decision trees to manage high-dimensional, nonlinear data with resistance to overfitting, and KNN applies distance-based learning for flexible performance. Through hierarchical K-fold cross-validation and evaluation using F1-score, accuracy, and recall, Random Forest achieves the best performance, with an F1-score of 0.9012, excelling in identifying fraud and compliance anomalies. Feature importance analysis reveals audit frequency, past violations, employee workload, and client ratings as key predictors. The study recommends adopting Random Forest as a core model, enhancing features via engineering, and implementing real-time risk monitoring. This research contributes valuable insights into using machine learning for intelligent auditing and risk management in modern enterprises.
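The evaluation protocol above (stratified k-fold splits scored with F1) can be reproduced in miniature. The data below are synthetic two-feature points, and a 1-nearest-neighbour rule stands in for the SVM/RF/KNN models actually compared:

```python
import math

def stratified_kfold(X, y, k=5):
    # Deal each class's indices round-robin so every fold keeps the
    # original class balance.
    folds = [[] for _ in range(k)]
    for label in set(y):
        idx = [i for i, v in enumerate(y) if v == label]
        for j, i in enumerate(idx):
            folds[j % k].append(i)
    return folds

def knn_predict(train_X, train_y, x):
    # 1-nearest-neighbour stand-in for the compared classifiers.
    d = [(math.dist(x, tx), ty) for tx, ty in zip(train_X, train_y)]
    return min(d)[1]

def f1(y_true, y_pred, pos=1):
    tp = sum(t == pos and p == pos for t, p in zip(y_true, y_pred))
    fp = sum(t != pos and p == pos for t, p in zip(y_true, y_pred))
    fn = sum(t == pos and p != pos for t, p in zip(y_true, y_pred))
    prec = tp / (tp + fp) if tp + fp else 0.0
    rec = tp / (tp + fn) if tp + fn else 0.0
    return 2 * prec * rec / (prec + rec) if prec + rec else 0.0

# Synthetic audit features: (audit_frequency, past_violations).
X = [(1, 0), (2, 0), (1, 1), (9, 4), (8, 5), (9, 5), (2, 1), (8, 4)]
y = [0, 0, 0, 1, 1, 1, 0, 1]

scores = []
for fold in stratified_kfold(X, y, k=4):
    tr = [i for i in range(len(X)) if i not in fold]
    preds = [knn_predict([X[i] for i in tr], [y[i] for i in tr], X[j])
             for j in fold]
    scores.append(f1([y[j] for j in fold], preds))
mean_f1 = sum(scores) / len(scores)
```

On this cleanly separable toy data the mean F1 is 1.0; with real audit data the same loop is what produces figures like the paper's 0.9012 for Random Forest.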
In a space such as the financial industry, clear and stringent reporting and auditing are vital for both regulatory adherence and internal governance. Although the underlying information resources are invaluable for evaluating institutions' risk and compliance stance, the vast majority of such information is textual and unstructured. This poses a formidable challenge for institutions seeking timely, reliable, and actionable insights, especially when the work is done manually or with unsophisticated rule-based systems. In the past few years, developments in NLP have provided a tremendous ability to interpret unstructured text at scale, enabling automation in areas that traditionally rely heavily on expert judgment. NLP is particularly suitable in finance, where textual analysis must contend with context, domain-specific jargon, temporal patterns, and delicate linguistic cues. This work applies NLP to financial risk disclosures and audit trails, providing a systematic and scalable way to detect financial wrongdoing, latent risks, and non-compliance events. We start with an analysis of the linguistic properties of financial disclosures, uncovering important aspects such as tone, modality, and forward-looking statements that are frequently associated with risk perception and market volatility. We leverage techniques such as Named Entity Recognition (NER), sentiment analysis, and topic modelling to illustrate how machine learning-based NLP models can unearth the hidden risk signals encoded in annual reports or regulatory filings.
Concurrently, we treat audit trails as structured logs of user or system activity that, despite their timestamped format, include embedded command-line entries, transactional notes, and system-generated messages that are good candidates for language-based analysis. Through NLP processing such as log tokenization, part-of-speech tagging, parsing, and anomaly detection, the audit data is converted into structured knowledge for real-time monitoring and forensic auditing. The manuscript introduces a hybrid approach that integrates rule-based, statistical NLP, and machine-learning techniques for both narrative disclosures and event-ordered audit logs. We also detail a pipeline design consisting of data ingestion, text pre-processing, feature extraction, model prediction, and visual dashboarding. Experimental results on historical financial disclosures and synthetic audit logs show that the NLP-driven framework can accurately target risk-laden statements, identify anomalous sequences of activities, and categorize text sections according to regulatory relevance. Our results show that the proposed approach outperforms traditional keyword matching and manual review, and is more efficient and interpretable. Applying NLP to financial risk disclosures and audit trails can improve both the timeliness and accuracy of compliance checks while enabling a proactive approach to risk governance. This study is part of an emerging body of work on Regulatory Technology (RegTech), which promotes the use of AI and data to inform regulatory decision-making in finance. Given the morass of regulation and the volume of data institutions must process, NLP is a key enabler of intelligent, automated, and reliable compliance.
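Both strands of the pipeline, risk-cue tagging in disclosures and anomaly detection over audit-trail action sequences, can be caricatured in a few lines; the cue list, sample disclosure, and log are invented for illustration:

```python
import re
from collections import Counter

# Toy lexicon of hedging/risk cues (modality, litigation, restatement).
RISK_CUES = {"may", "could", "uncertain", "litigation", "restatement"}

def tag_risk_sentences(text):
    # Split into sentences and keep those containing any risk cue.
    sentences = re.split(r"(?<=[.!?])\s+", text)
    return [s for s in sentences
            if RISK_CUES & set(re.findall(r"[a-z]+", s.lower()))]

def flag_rare_bigrams(actions, min_count=2):
    # Treat the audit trail as a token sequence and flag action bigrams
    # that occur fewer than min_count times (a crude anomaly signal).
    bigrams = list(zip(actions, actions[1:]))
    counts = Counter(bigrams)
    return [b for b in bigrams if counts[b] < min_count]

disclosure = ("Revenue grew 8%. Litigation outcomes are uncertain. "
              "We repaid debt on schedule.")
risky = tag_risk_sentences(disclosure)

log = ["login", "view", "export", "login", "view", "logout",
       "login", "view", "logout", "delete_audit_log", "logout"]
anomalies = flag_rare_bigrams(log)
```

Real systems would swap the cue set for trained NER/sentiment models and the bigram count for a sequence model, but the two-track shape (narrative text vs. event logs) is the same.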
Multi-step symbolic reasoning is essential for robust financial analysis; yet, current benchmarks largely overlook this capability. Existing datasets such as FinQA and ConvFinQA emphasize final numerical answers while neglecting the intermediate reasoning required for transparency and verification. To address this gap, we introduce FINCHAIN, the first benchmark specifically designed for verifiable Chain-of-Thought (CoT) evaluation in finance. FINCHAIN spans 58 topics across 12 financial domains, each represented by parameterized symbolic templates with executable Python traces that enable fully machine-verifiable reasoning and scalable, contamination-free data generation. To assess reasoning capacity, we propose CHAINEVAL, a dynamic alignment measure that jointly evaluates both the final-answer correctness and the step-level reasoning consistency. Our evaluation of 26 leading LLMs reveals that even frontier proprietary LLMs exhibit clear limitations in symbolic financial reasoning, while domain-adapted and math-enhanced fine-tuned models can substantially narrow this gap. Overall, FINCHAIN exposes persistent weaknesses in multi-step financial reasoning and provides a foundation for developing trustworthy, interpretable, and verifiable financial AI.
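What a parameterized symbolic template with an executable trace might look like (a compound-interest example invented here, not drawn from the FINCHAIN release itself): the same seed always regenerates the same question, step-level trace, and final answer, which is what enables machine-verifiable reasoning and contamination-free data generation:

```python
import random

def compound_interest_instance(seed):
    # Sample template parameters deterministically from the seed.
    rng = random.Random(seed)
    p = rng.randint(1_000, 10_000)        # principal
    r = rng.choice([0.03, 0.04, 0.05])    # annual rate
    n = rng.randint(2, 5)                 # years
    steps = []
    value = p
    for year in range(1, n + 1):          # step-level trace, verifiable
        value = round(value * (1 + r), 2)
        steps.append(f"Year {year}: balance = {value}")
    question = (f"A deposit of {p} grows at {r:.0%} per year. "
                f"What is the balance after {n} years?")
    return question, steps, value

q, trace, answer = compound_interest_instance(seed=7)
```

A grader in the spirit of CHAINEVAL can then score a model's chain-of-thought against `trace` step by step, not just against the final `answer`.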
The task of financial analysis primarily encompasses two key areas: stock trend prediction and the corresponding financial question answering. Currently, machine learning and deep learning algorithms (ML&DL) have been widely applied for stock trend prediction, leading to significant progress. However, these methods fail to provide reasons for predictions, lacking interpretability and reasoning processes, and they cannot integrate textual information such as financial news or reports. Meanwhile, large language models (LLMs) have remarkable textual understanding and generation ability, but due to the scarcity of financial training datasets and limited integration with real-time knowledge, LLMs still suffer from hallucinations and are unable to keep up with the latest information. To tackle these challenges, we first release the AlphaFin datasets, combining traditional research datasets, real-time financial data, and handwritten chain-of-thought (CoT) data, which benefit the training of LLMs for financial analysis. We then use the AlphaFin datasets to benchmark a state-of-the-art method, called Stock-Chain, for effectively tackling the financial analysis task, which integrates retrieval-augmented generation (RAG) techniques. Extensive experiments are conducted to demonstrate the effectiveness of our framework on financial analysis.
As Large Language Models (LLMs) become more pervasive, their capability to generate convincing financial news poses an escalating threat to investor decision-making and market stability. However, contemporary content moderation and AI-based verification systems exhibit notable vulnerabilities when confronted with the subtle linguistic manipulations introduced by advanced prompt engineering techniques and adversarial training. This study investigated the comparative credibility, influence, and detectability of AI-generated financial headlines produced via Zero-Shot, Few-Shot (8-Shot), and Chain-of-Thought (CoT) prompting, with CoT outputs further used to train a GAN for adversarially enhanced text generation. We compiled a combined dataset of NASDAQ-listed securities and web-scraped, human-authored news, generated additional AI-driven headlines under the three prompting paradigms, and conducted a survey of randomly sampled headlines (n = 300) to assess credibility, market perception impact, investment influence, and AI detectability. The analysis revealed that headlines generated through Chain-of-Thought prompting consistently scored higher in perceived authenticity, influenced investment sentiment more profoundly, and were harder for participants to classify as AI-written. The findings underscore the urgent need for adversarially robust content moderation and verification mechanisms, capable of adapting to the rapidly evolving landscape of AI-generated financial misinformation, particularly when Chain-of-Thought reasoning is leveraged to enhance GAN-generated content.
As financial institutions and professionals increasingly incorporate Large Language Models (LLMs) into their workflows, substantial barriers, including proprietary data and specialized knowledge, persist between the finance sector and the AI community. These challenges impede the AI community's ability to enhance financial tasks effectively. Acknowledging financial analysis's critical role, we aim to devise financial-specialized LLM-based toolchains and democratize access to them through open-source initiatives, promoting wider AI adoption in financial decision-making. In this paper, we introduce FinRobot, a novel open-source AI agent platform supporting multiple financially specialized AI agents, each powered by an LLM. Specifically, the platform consists of four major layers: 1) the Financial AI Agents layer, which formulates a Financial Chain-of-Thought (CoT) by breaking sophisticated financial problems down into logical sequences; 2) the Financial LLM Algorithms layer, which dynamically configures appropriate model application strategies for specific tasks; 3) the LLMOps and DataOps layer, which produces accurate models by applying training/fine-tuning techniques and using task-relevant data; and 4) the Multi-source LLM Foundation Models layer, which integrates various LLMs and enables the layers above to access them directly. Finally, FinRobot provides hands-on access for both professional-grade analysts and laypersons to utilize powerful AI techniques for advanced financial analysis. We open-source FinRobot at \url{https://github.com/AI4Finance-Foundation/FinRobot}.
This paper introduces an open-source framework designed to facilitate the development and deployment of Large Language Model (LLM)-orchestrated agents for financial applications. The framework addresses challenges in integrating LLMs into finance by providing a layered architecture that supports the creation of specialized agents and incorporates a novel Financial Chain-of-Thought (CoT) prompting technique. The platform's design emphasizes modularity, multi-source LLM integration, and efficient data handling to enhance financial analysis workflows.
Financial narratives from U.S. Securities and Exchange Commission (SEC) filing reports and quarterly earnings call transcripts (ECTs) are very important for investors, auditors, and regulators. However, their length, financial jargon, and nuanced language make fine-grained analysis difficult. Prior sentiment analysis in the financial domain required large, expensive labeled datasets, making sentence-level stance detection towards specific financial targets challenging. In this work, we introduce a sentence-level corpus for stance detection focused on three core financial metrics: debt, earnings per share (EPS), and sales. The sentences were extracted from Form 10-K annual reports and ECTs, and labeled for stance (positive, negative, neutral) using the advanced ChatGPT-o3-pro model under rigorous human validation. Using this corpus, we conduct a systematic evaluation of modern large language models (LLMs) using zero-shot, few-shot, and Chain-of-Thought (CoT) prompting strategies. Our results show that few-shot prompting with CoT performs best relative to supervised baselines, and that LLMs' performance varies across the SEC and ECT datasets. Our findings highlight the practical viability of leveraging LLMs for target-specific stance detection in the financial domain without requiring extensive labeled data.
Large Language Models (LLMs) have demonstrated remarkable performance on a wide range of Natural Language Processing (NLP) tasks, often matching or even beating state-of-the-art task-specific models. This study aims at assessing the financial reasoning capabilities of LLMs. We leverage mock exam questions of the Chartered Financial Analyst (CFA) Program to conduct a comprehensive evaluation of ChatGPT and GPT-4 in financial analysis, considering Zero-Shot (ZS), Chain-of-Thought (CoT), and Few-Shot (FS) scenarios. We present an in-depth analysis of the models' performance and limitations, and estimate whether they would have a chance at passing the CFA exams. Finally, we outline insights into potential strategies and improvements to enhance the applicability of LLMs in finance. In this perspective, we hope this work paves the way for future studies to continue enhancing LLMs for financial reasoning through rigorous evaluation.
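The three prompting regimes compared in the studies above (zero-shot, chain-of-thought, few-shot) can be sketched as plain prompt builders. The wording and example pool below are illustrative placeholders, not items from the CFA mock exams.

```python
def zero_shot(question: str) -> str:
    # Ask directly, with no examples or reasoning instruction.
    return f"Answer the following finance question.\n\nQ: {question}\nA:"

def few_shot(question: str, examples: list) -> str:
    # Prepend worked (question, answer) pairs before the target question.
    shots = "\n\n".join(f"Q: {q}\nA: {a}" for q, a in examples)
    return f"{shots}\n\nQ: {question}\nA:"

def chain_of_thought(question: str) -> str:
    # Elicit intermediate reasoning before the final answer.
    return (
        "Answer the following finance question. Reason through it before "
        f"answering.\n\nQ: {question}\nA: Let's think step by step."
    )
```

The evaluations described above then compare model accuracy across these three prompt shapes while holding the underlying question set fixed.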
No abstract available
The rapid development of robo-advisory and quantitative investment has been accompanied by persistent concerns about limited personalization and the opacity of black-box models operating on multimodal financial information. This paper addresses these issues from a decision-support perspective by constructing FinErva, a multimodal chain-of-thought dataset tailored to financial applications. FinErva comprises 7,544 manually verified question–answer pairs, divided into two economically relevant tasks: contract and disclosure understanding (FinErva-Pact) and candlestick-chart-based technical analysis (FinErva-Price). Building on this dataset, the paper proposes a two-stage training framework, Supervised-CoT Learning followed by Self-CoT Refinement, and applies it to eight vision–language models, each with fewer than 0.8 billion parameters. Empirical results show that these lightweight models approach the performance of finance professionals and clearly outperform non-expert investors. Overall, the findings indicate that appropriately designed multimodal chain-of-thought supervision enables interpretable modeling of key research tasks such as contract review and chart interpretation under realistic computational and deployment constraints, providing new data and methodology for the development of personalized, explainable, and operationally feasible AI systems in investment advisory and risk management.
Large Language Models (LLMs) have achieved remarkable success recently, displaying exceptional capabilities in creating understandable and organized text. They have been utilized in diverse fields, such as clinical research, where domain-specific models like Med-PaLM have achieved human-level performance. Recently, researchers have employed advanced prompt engineering to enhance the general reasoning ability of LLMs. Despite the remarkable success of zero-shot Chain-of-Thought (CoT) prompting in solving general reasoning tasks, these methods have received limited attention in financial reasoning. To address this issue, we explore multiple prompt strategies and incorporate semantic news information to improve LLMs' performance on financial reasoning tasks. To the best of our knowledge, we are the first to explore this important issue by applying ChatGPT to gold investment. Our aim is to investigate the financial reasoning capabilities of LLMs and their capacity to generate logical and persuasive investment opinions, using ChatGPT, one of the most powerful recent LLMs, together with prompt engineering. Our research focuses on understanding the ability of LLMs to perform sophisticated analysis and reasoning in the context of investment decision-making. Our study finds that ChatGPT with CoT prompting can provide more explainable predictions, overcome behavioral biases, and achieve higher investment returns, which is crucial in finance-related tasks.
As financial markets grow increasingly complex, there is a rising need for automated tools that can effectively assist human analysts in equity research, particularly within sell-side research. While Generative AI (GenAI) has attracted significant attention in this field, existing AI solutions often fall short due to their narrow focus on technical factors and limited capacity for discretionary judgment. These limitations hinder their ability to adapt to new data in real-time and accurately assess risks, which diminishes their practical value for investors. This paper presents FinRobot, the first AI agent framework specifically designed for equity research. FinRobot employs a multi-agent Chain of Thought (CoT) system, integrating both quantitative and qualitative analyses to emulate the comprehensive reasoning of a human analyst. The system is structured around three specialized agents: the Data-CoT Agent, which aggregates diverse data sources for robust financial integration; the Concept-CoT Agent, which mimics an analyst's reasoning to generate actionable insights; and the Thesis-CoT Agent, which synthesizes these insights into a coherent investment thesis and report. FinRobot provides thorough company analysis supported by precise numerical data, industry-appropriate valuation metrics, and realistic risk assessments. Its dynamically updatable data pipeline ensures that research remains timely and relevant, adapting seamlessly to new financial information. Unlike existing automated research tools, such as CapitalCube and Wright Reports, FinRobot delivers insights comparable to those produced by major brokerage firms and fundamental research vendors. We open-source FinRobot at \url{https://github.com/AI4Finance-Foundation/FinRobot}.
Performance attribution analysis, defined as the process of explaining the drivers of the excess performance of an investment portfolio against a benchmark, stands as a significant feature of portfolio management and plays a crucial role in the investment decision-making process, particularly within the fund management industry. Rooted in a solid financial and mathematical framework, the importance and methodologies of this analytical technique are extensively documented across numerous academic research papers and books. The integration of large language models (LLMs) and AI agents marks a groundbreaking development in this field. These agents are designed to automate and enhance performance attribution analysis by accurately calculating and analyzing portfolio performance against benchmarks. In this study, we introduce the application of an AI agent to a variety of essential performance attribution tasks, including the analysis of performance drivers and the use of LLMs as a calculation engine for multi-level attribution analysis and question-answering (QA) tasks. Leveraging advanced prompt engineering techniques such as Chain-of-Thought (CoT) and Plan-and-Solve (PS), and employing a standard agent framework from LangChain, the research achieves promising results: accuracy rates exceeding 93% in analyzing performance drivers, 100% in multi-level attribution calculations, and over 84% in QA exercises that simulate official examination standards.
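The attribution calculations referred to above can be illustrated with a minimal single-period Brinson-style decomposition (a standard textbook method, not necessarily the exact methodology the paper's agent implements), which splits excess return into per-sector allocation, selection, and interaction effects:

```python
# Minimal single-period Brinson-style attribution. For each sector:
#   allocation  = (wp - wb) * (rb_sector - rb_total)
#   selection   = wb * (rp_sector - rb_sector)
#   interaction = (wp - wb) * (rp_sector - rb_sector)
# With weights summing to 1 on both sides, the effects sum exactly
# to the portfolio's excess return over the benchmark.

def brinson(weights_p, weights_b, returns_p, returns_b):
    rb_total = sum(w * r for w, r in zip(weights_b, returns_b))
    rp_total = sum(w * r for w, r in zip(weights_p, returns_p))
    effects = []
    for wp, wb, rp, rb in zip(weights_p, weights_b, returns_p, returns_b):
        allocation = (wp - wb) * (rb - rb_total)
        selection = wb * (rp - rb)
        interaction = (wp - wb) * (rp - rb)
        effects.append((allocation, selection, interaction))
    return effects, rp_total - rb_total
```

The sum-to-excess-return identity is the kind of invariant against which an LLM used as a calculation engine can be automatically checked.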
No abstract available
No abstract available
Financial institutions deploy Large Language Models (LLMs) for reconciliations, regulatory reporting, and client communications, but nondeterministic outputs (output drift) undermine auditability and trust. We quantify drift across five model architectures (7B-120B parameters) on regulated financial tasks, revealing a stark inverse relationship: smaller models (Granite-3-8B, Qwen2.5-7B) achieve 100% output consistency at T=0.0, while GPT-OSS-120B exhibits only 12.5% consistency (95% CI: 3.5-36.0%) regardless of configuration (p<0.0001, Fisher's exact test). This finding challenges conventional assumptions that larger models are universally superior for production deployment. Our contributions include: (i) a finance-calibrated deterministic test harness combining greedy decoding (T=0.0), fixed seeds, and SEC 10-K structure-aware retrieval ordering; (ii) task-specific invariant checking for RAG, JSON, and SQL outputs using finance-calibrated materiality thresholds (plus or minus 5%) and SEC citation validation; (iii) a three-tier model classification system enabling risk-appropriate deployment decisions; and (iv) an audit-ready attestation system with dual-provider validation. We evaluated five models (Qwen2.5-7B via Ollama, Granite-3-8B via IBM watsonx.ai, Llama-3.3-70B, Mistral-Medium-2505, and GPT-OSS-120B) across three regulated financial tasks. Across 480 runs (n=16 per condition), structured tasks (SQL) remain stable even at T=0.2, while RAG tasks show drift (25-75%), revealing task-dependent sensitivity. Cross-provider validation confirms deterministic behavior transfers between local and cloud deployments. We map our framework to Financial Stability Board (FSB), Bank for International Settlements (BIS), and Commodity Futures Trading Commission (CFTC) requirements, demonstrating practical pathways for compliance-ready AI deployments.
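A toy version of the drift measurement described above simply replays the same input and reports how often the modal output recurs; `model_fn` below is a deterministic stand-in for a real LLM call, not the paper's harness.

```python
from collections import Counter

def consistency_rate(model_fn, prompt, n_runs=16):
    # Replay the same prompt n_runs times and report the share of runs
    # that match the most frequent (modal) output.
    outputs = [model_fn(prompt) for _ in range(n_runs)]
    modal_count = Counter(outputs).most_common(1)[0][1]
    return modal_count / n_runs
```

At temperature 0.0 a fully deterministic deployment should score 1.0 on every task; scores below 1.0 flag the output drift the study quantifies.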
Generative AI has significantly reduced the entry barrier to the domain of AI owing to its ease of use and core capabilities of automation, translation, and intelligent action in our day-to-day lives. Currently, the Large Language Models (LLMs) that power such chatbots are utilized primarily for their automation capabilities within a limited scope. One major limitation of the currently evolving family of LLMs is hallucination, wherein inaccurate responses are reported as factual. Hallucinations are primarily caused by biased training data, ambiguous prompts, and inaccurate LLM parameters, and they occur mainly when combining mathematical facts with language-based context. In this work we present the three major stages in the journey of designing hallucination-minimized LLM-based solutions specialized for decision makers in the financial domain: prototyping, scaling, and LLM evolution using human feedback. These three stages, together with the novel data-to-answer generation modules presented in this work, are necessary to ensure that Generative AI products are reliable and of high quality to aid key decision-making processes.
Large Language Models (LLMs) have been applied to build several automation and personalized question-answering prototypes so far. However, scaling such prototypes to robust products with minimized hallucinations or fake responses still remains an open challenge, especially in niche, data-table-heavy domains such as financial decision-making. In this work, we present a novel LangChain-based framework that transforms data tables into hierarchical textual "data chunks" to enable a wide variety of actionable question answering. First, user queries are classified by intention, followed by automated retrieval of the most relevant data chunks to generate customized LLM prompts per query. Next, the custom prompts and their responses undergo multi-metric scoring to assess hallucinations and response confidence. The proposed system is optimized with user-query intention classification, advanced prompting, and data scaling capabilities, and it achieves over 90% confidence scores for a variety of user-query responses ranging over {What, Where, Why, How, predict, trend, anomalies, exceptions} that are crucial for financial decision-making applications. The proposed data-to-answers framework can be extended to other analytical domains, such as sales and payroll, to ensure optimal hallucination-control guardrails.
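The abstract does not specify the chunking scheme, but the general idea of turning a data table into row-level textual chunks for retrieval can be sketched as follows; the caption prefix and separator format are illustrative assumptions.

```python
def table_to_chunks(table, caption):
    # First row is the header; every data row becomes one retrievable
    # textual chunk prefixed with the table caption for context.
    header, *rows = table
    return [
        caption + " | " + "; ".join(f"{h}: {v}" for h, v in zip(header, row))
        for row in rows
    ]
```

Chunks of this shape can then be embedded and retrieved per query, letting the prompt carry only the rows relevant to the user's question.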
Large language models (LLMs) have shown remarkable capabilities across various domains; however, the issue of hallucination poses a significant challenge, particularly in high-stakes areas like finance. This paper provides an empirical examination of hallucination exhibited by LLMs in financial tasks. This study investigates the ability of LLMs to accurately explain financial concepts, retrieve historical stock data, and explore methods for mitigating these hallucinations. The findings reveal that standard LLMs demonstrate substantial hallucination tendencies in financial contexts, highlighting the need for further research to improve their reliability.
Accurate and reliable knowledge retrieval is vital for financial question-answering, where continually updated data sources and complex, high-stakes contexts demand precision. Traditional retrieval systems rely on a single database and retriever, but financial applications require more sophisticated approaches to handle intricate regulatory filings, market analyses, and extensive multi-year reports. We introduce a framework for financial Retrieval Augmented Generation (RAG) that leverages agentic AI and the Multi-HyDE system, an approach that generates multiple, nonequivalent queries to boost the effectiveness and coverage of retrieval from large, structured financial corpora. Our pipeline is optimized for token efficiency and multi-step financial reasoning, and we demonstrate that their combination improves accuracy by 11.2% and reduces hallucinations by 15%. Our method is evaluated on standard financial QA benchmarks, showing that integrating domain-specific retrieval mechanisms such as Multi-HyDE with robust toolsets, including keyword and table-based retrieval, significantly enhances both the accuracy and reliability of answers. This research not only delivers a modular, adaptable retrieval framework for finance but also highlights the importance of structured agent workflows and multi-perspective retrieval for trustworthy deployment of AI in high-stakes financial applications.
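The core idea above, issuing multiple non-equivalent queries and merging their retrieval results, can be illustrated with a deliberately simple keyword scorer standing in for a real retriever; the function names and the max-merge strategy are illustrative assumptions, not the Multi-HyDE implementation.

```python
def keyword_score(query, doc):
    # Fraction of query tokens that appear in the document.
    q, d = set(query.lower().split()), set(doc.lower().split())
    return len(q & d) / len(q) if q else 0.0

def multi_query_retrieve(queries, corpus, k=2):
    # Score every document against every query variant, keep each
    # document's best score, and return the top-k documents overall.
    best = {}
    for query in queries:
        for doc in corpus:
            score = keyword_score(query, doc)
            best[doc] = max(best.get(doc, 0.0), score)
    return sorted(best, key=best.get, reverse=True)[:k]
```

Because each query variant emphasizes different terms, the merged ranking covers documents that any single phrasing would miss, which is the coverage gain the framework targets.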
No abstract available
No abstract available
This paper investigates whether Large Language Models (LLMs) can be used to predict numeric Key Performance Indicators (KPIs) from question-context pairs derived from financial and ESG (Environmental, Social, and Governance) reports. Two modeling strategies were compared: a semantic embedding approach using a pretrained transformer model (all-MiniLM-L6-v2), and a traditional term frequency-inverse document frequency (TF-IDF) vectorization. Both models were trained using Random Forest regressors and evaluated through 5-fold cross-validation. Results indicate that the TF-IDF-based model achieved stronger performance (R² = 0.59) than the LLM-based model (R² = 0.46), suggesting that classical NLP techniques remain competitive in structured financial text settings. Further analysis revealed that bounded KPI types such as scores and percentages were predicted with greater accuracy than unbounded values like revenues and emissions. These findings highlight the importance of aligning model complexity with the structure and semantic variability of financial disclosures. The study contributes to the growing field of AI-driven financial automation by clarifying the limits and strengths of semantic versus lexical modeling for numeric prediction tasks.
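For reference, the TF-IDF weighting that the stronger baseline relies on can be computed in a few lines of standard-library Python; this is a minimal sketch using raw term counts and `log(N/df)`, not the exact vectorizer variant the study used.

```python
import math

def tfidf(docs):
    # docs: list of token lists. Weight = term count * log(N / df),
    # so a term appearing in every document gets weight 0.
    n = len(docs)
    df = {}
    for doc in docs:
        for term in set(doc):
            df[term] = df.get(term, 0) + 1
    vectors = []
    for doc in docs:
        tf = {}
        for term in doc:
            tf[term] = tf.get(term, 0) + 1
        vectors.append({t: c * math.log(n / df[t]) for t, c in tf.items()})
    return vectors
```

The resulting sparse weight vectors are what a Random Forest regressor would consume as features in the lexical pipeline.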
The exponential growth of information presents a significant challenge for researchers and professionals seeking to remain at the forefront of their fields, and this paper introduces an innovative framework for automatically generating insightful financial digests using the power of Large Language Models (LLMs), specifically Google's Gemini Pro. By leveraging a combination of data extraction from OpenAlex, strategic prompt engineering, and LLM-driven analysis, we demonstrate the automated creation of comprehensive digests that summarize key findings and identify emerging trends. This approach addresses the limitations of traditional analysis methods, enabling the efficient processing of vast amounts of unstructured data and the delivery of actionable insights in an easily digestible format. This paper describes how LLMs work in simple terms and how we can use their power to help researchers and scholars save time and stay informed about current trends. Our study includes a step-by-step process, from data acquisition and JSON construction to interaction with Gemini and the automated generation of PDF reports, including a link to the project's GitHub repository for broader accessibility and further development.
In an environment of increasingly complicated and globally interconnected financial systems, challenges related to harmonization in cross-border reporting are magnifying. Differences in regulation, language, and data siloing, together with the further proliferation of unstructured disclosures, remain obstacles to the success of transparency, compliance, and efficiency initiatives. In this paper we discuss a new integration of AI-driven Knowledge Graphs (KGs) and NLP that we believe can form part of the solution: a new way of thinking about financial interpretation and summarization across jurisdictions. As structured semantic representations of financial entities, their attributes, and their inter-relationships, KGs enable machines to perceive information and put it into context. When combined with state-of-the-art NLP models such as transformers and domain-specific large language models (LLMs), this architecture can accurately and interpretably extract, disambiguate, and summarize financial disclosures, audit reports, and regulatory filings. These capabilities are particularly useful for multinationals, auditors, and regulators that, for example, are looking to cross-map divergent financial standards (such as IFRS and GAAP) or automate compliance mapping. The paper describes a system design that exploits multi-source data, entity recognition, relation extraction, and multilingual semantic alignment based on AI-enhanced ontologies. Real-world examples from the EU, ASEAN, and North America show how AI-powered tools can cut through manual groundwork, spot discrepancies in reporting, and create reconciled summaries for stakeholders on both sides of the border. The results highlight the potential of NLP applied to Knowledge Graphs not only for the automation of reporting workflows but also as a framework for delivering smart, explainable financial governance systems.
For a financial analyst, the question and answer (Q&A) segment of the company financial report is a crucial piece of information for various analysis and investment decisions. However, extracting valuable insights from the Q&A section has posed considerable challenges: conventional methods such as detailed reading and note-taking lack scalability and are susceptible to human errors, while Optical Character Recognition (OCR) and similar techniques encounter difficulties in accurately processing unstructured transcript text, often missing subtle linguistic nuances that drive investor decisions. Here, we demonstrate the utilization of Large Language Models (LLMs) to efficiently and rapidly extract information from earnings report transcripts while ensuring high accuracy, transforming the extraction process and reducing hallucination by combining retrieval-augmented generation with metadata. We evaluate the outcomes of various LLMs with and without our proposed approach using objective metrics for evaluating Q&A systems, and empirically demonstrate the superiority of our method.
In the face of climate change, are companies really taking substantial steps toward more sustainable operations? A comprehensive answer lies in the dense, information-rich landscape of corporate sustainability reports. However, the sheer volume and complexity of these reports make human analysis very costly. Therefore, only a few entities worldwide have the resources to analyze these reports at scale, which leads to a lack of transparency in sustainability reporting. Empowering stakeholders with LLM-based automatic analysis tools can be a promising way to democratize sustainability report analysis. However, developing such tools is challenging due to (1) the hallucination of LLMs and (2) the inefficiency of bringing domain experts into the AI development loop. In this paper, we introduce ChatReport, a novel LLM-based system that automates the analysis of corporate sustainability reports, addressing existing challenges by (1) making the answers traceable to reduce the harm of hallucination and (2) actively involving domain experts in the development loop. We make our methodology, annotated datasets, and generated analyses of 1015 reports publicly available.
Large Language Models (LLMs) hold immense promise for revolutionizing financial analysis and decision-making, yet their direct application is often hampered by issues of data hallucination and lack of access to real-time, verifiable financial information. This paper introduces QuantMCP, a novel framework designed to rigorously ground LLMs in financial reality. By leveraging the Model Context Protocol (MCP) for standardized and secure tool invocation, QuantMCP enables LLMs to accurately interface with a diverse array of Python-accessible financial data APIs (e.g., Wind, yfinance). Users can interact via natural language to precisely retrieve up-to-date financial data, thereby overcoming LLM's inherent limitations in factual data recall. More critically, once furnished with this verified, structured data, the LLM's analytical capabilities are unlocked, empowering it to perform sophisticated data interpretation, generate insights, and ultimately support more informed financial decision-making processes. QuantMCP provides a robust, extensible, and secure bridge between conversational AI and the complex world of financial data, aiming to enhance both the reliability and the analytical depth of LLM applications in finance.
While LLMs have shown great success in financial tasks like stock prediction and question answering, their application in fully automating Equity Research Report generation remains uncharted territory. In this paper, we formulate the Equity Research Report (ERR) Generation task for the first time. To address the data scarcity and the absence of evaluation metrics, we present an open-source evaluation benchmark for ERR generation, FinRpt. We frame a Dataset Construction Pipeline that integrates 7 financial data types and automatically produces a high-quality ERR dataset, which can be used for model training and evaluation. We also introduce a comprehensive evaluation system including 11 metrics to assess the generated ERRs. Moreover, we propose a multi-agent framework specifically tailored to this task, named FinRpt-Gen, and train several LLM-based agents on the proposed datasets using Supervised Fine-Tuning and Reinforcement Learning. Experimental results indicate the data quality and metric effectiveness of the FinRpt benchmark and the strong performance of FinRpt-Gen, showcasing their potential to drive innovation in the ERR generation field. All code and datasets are publicly available.
Businesses heavily rely on data sourced from various channels like news articles, financial reports, and consumer reviews to drive their operations, enabling informed decision-making and identifying opportunities. However, traditional manual methods for data extraction are often time-consuming and resource-intensive, prompting the adoption of digital transformation initiatives to enhance efficiency. Yet, concerns persist regarding the sustainability of such initiatives and their alignment with the United Nations (UN)'s Sustainable Development Goals (SDGs). This research aims to explore the integration of Large Language Models (LLMs) with Retrieval-Augmented Generation (RAG) as a sustainable solution for Information Extraction (IE) and processing. The research methodology involves reviewing existing solutions for business decision-making, noting that many systems require training new machine learning models, which are resource-intensive and have significant environmental impacts. Instead, we propose a sustainable business solution using pre-existing LLMs that can work with diverse datasets. We link domain-specific datasets to tailor LLMs to company needs and employ a Multi-Agent architecture to divide tasks such as information retrieval, enrichment, and classification among specialized agents. This approach optimizes the extraction process and improves overall efficiency. Through the utilization of these technologies, businesses can optimize resource utilization, improve decision-making processes, and contribute to sustainable development goals, thereby fostering environmental responsibility within the corporate sector.
Tailoring structured financial reports from companies' earnings releases is crucial for understanding financial performance and has been widely adopted in real-world analytics. However, existing summarization methods often generate broad, high-level summaries, which may lack the precision and detail required for financial reports that typically focus on specific, structured sections. While Large Language Models (LLMs) hold promise, generating reports adhering to predefined multi-section templates remains challenging. This paper investigates two LLM-based approaches popular in industry for generating templated financial reports: an agentic information retrieval (IR) framework and a decomposed IR approach, namely AgenticIR and DecomposedIR. AgenticIR utilizes collaborative agents prompted with the full template. In contrast, the DecomposedIR approach applies a prompt-chaining workflow to break down the template and reframe each section as a query answered by the LLM using the earnings release. To quantitatively assess the generated reports, we evaluated both methods in two scenarios: one using a financial dataset without direct human references, and another with a weather-domain dataset featuring expert-written reports. Experimental results show that while AgenticIR may excel in orchestrating tasks and generating concise reports through agent collaboration, DecomposedIR statistically significantly outperforms the AgenticIR approach in providing broader and more detailed coverage in both scenarios, offering a reflection on the use of agentic frameworks in real-world applications.
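The decomposed prompt-chaining idea, reframing each template section as its own query against the source document, reduces to a small loop; `answer_fn` below is a placeholder for a grounded LLM call, and the query wording is an illustrative assumption.

```python
def decomposed_report(template_sections, source_text, answer_fn):
    # One query per template section, answered independently against the
    # source document, then reassembled into the full report.
    report = {}
    for section in template_sections:
        query = f"From the earnings release, write the '{section}' section."
        report[section] = answer_fn(query, source_text)
    return report
```

Compared with prompting one agent with the full template, this per-section decomposition lets each answer focus on a narrower question, which is the coverage advantage the experiments above report.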
No abstract available
Large Language Models (LLMs) have proven their effectiveness in a variety of general Natural Language Processing (NLP) tasks. However, their performance in financial credit assessment tasks has yet to reach its full potential, partly because these tasks require specific financial credit expertise. To address this challenge, we propose the ZiGong model, based on Mistral, which employs multi-task supervised fine-tuning. Furthermore, to address the issue of model hallucination in financial scenarios, we propose a novel data pruning method. Specifically, we employ an agent model to assign scores to training samples, and then integrate the pruned samples with the original data for model training. This approach effectively mitigates hallucinations in large models by refining the training data, ensuring higher reliability in downstream applications. Experimental results demonstrate that our method significantly improves the model's robustness and accuracy in real-world financial scenarios.
Comprehending long, dense annual reports is a critical task for financial analysts that is ripe for AI automation, yet model reliability remains a key concern. To address this, we introduce Financial Touchstone—a new, large-scale benchmark with 2,878 question-context-answer triplets across 480 international annual reports, guaranteed to be unseen by the models we evaluate. We test eleven frontier language models from leading labs, including reasoning-capable models like Google’s Gemini 2.5 Pro, Anthropic’s Claude Opus, OpenAI’s o3, and xAI’s Grok 4. Our analysis reveals that while reasoning models achieve high accuracy—with Gemini 2.5 Pro reaching 91.6% and hallucination rates as low as 3.2%—the primary bottleneck is not the models’ comprehension but the initial information retrieval step. Model accuracy plummets to 0.2% when the provided context is insufficient. This work demonstrates that future progress in automated financial analysis hinges more on solving the challenge of targeted information retrieval in complex documents than on incremental improvements in model reasoning alone.
In contemporary enterprise management, financial audits serve as a crucial mechanism to ensure financial transparency and compliance, amidst increasingly complex data processing requirements. This paper explores the research and implementation of an intelligent financial audit system based on data mining, aiming to enhance the efficiency and accuracy of the audit process. Initially, the paper introduces text mining technology in the context of intelligent auditing, elucidating how data mining techniques can extract valuable information from vast amounts of unstructured data. Subsequently, it provides a detailed account of the construction methods for audit analysis models, including the establishment of indicator systems and data preparation processes. This is further integrated with expert system knowledge to realize a comprehensive financial statement audit module. Finally, the paper proposes an intelligent analysis algorithm for accounting documents based on word co-occurrence and SOM neural networks. Experimental validations demonstrate the system's effectiveness and reliability in practical applications. The results indicate that this intelligent auditing system not only significantly improves audit efficiency but also effectively mitigates audit risks, showcasing substantial potential for widespread application.
Countries' audit institutions play a critical role in financial management and accountability. These institutions perform tasks such as financial accountability, detection of corruption and errors, evaluation of government performance, accountable public administration, transparency and legal oversight, and prevention of legal violations, contributing to public trust. The Court of Accounts Presidency is the supreme audit institution of the Republic of Türkiye. This study sequentially addresses the audit process and procedure of the Court of Accounts and the impact of digitization, information technologies, and artificial intelligence on the audit process. Additionally, an evaluation of the institution's technology and information infrastructure is conducted. A detailed assessment is presented regarding the Court of Accounts' use of the Court of Accounts Data Analysis System (VERA), Unified Data Transfer System (BVAS), and Audit Management Program (SayCap). The institution's technology and information infrastructure related to big data and big data analytics are evaluated based on the institution's reports and documents. Finally, the potential integration of artificial intelligence into these systems is evaluated and considered highly beneficial. Implementing artificial intelligence, big data, and big data analytics in the audit domain of public institutions is seen as highly advantageous for making human resources in the field more proficient in current technologies. Additionally, it is strongly recommended that educational institutions training audit professionals incorporate courses on data analytics, big data, big data analytics, machine learning, and artificial intelligence into their curricula.
This report synthesizes current research on large language models (LLMs) in intelligent auditing and financial report analysis. The research trajectory shows an evolution from "process automation" to "deep semantic understanding" and on to "trustworthy architecture construction." On one hand, LLMs have markedly improved the efficiency of fraud detection, earnings forecasting, and report generation through techniques such as RAG and multi-agent collaboration; on the other hand, the academic community remains highly vigilant about numerical hallucination, algorithmic bias, and compliance risks in high-stakes financial scenarios, and is working toward "responsible financial AI" through specialized benchmarks and hallucination-suppression techniques. Ultimately, these applications are reshaping not only the organizational form of the audit profession but also the information disclosure and decision-making mechanisms of capital markets.