Efficiency and Bias of Large Language Models (LLMs) in Intelligent Auditing and Financial Report Analysis
Intelligent Audit Automation and Financial Fraud Detection
This group of studies focuses on direct applications of LLMs in audit practice, covering the progression from invoice processing and tie-out (checking-relationship) verification to complex financial fraud detection. The research emphasizes using AI for continuous auditing, real-time anomaly monitoring, and automated generation of audit workpapers, aiming to raise audit efficiency and reduce the risk of human oversight.
- Multimodal detection framework for financial fraud integrating LLMs and interpretable machine learning(Hui Nie, Zhaoye Long, Ze-jun Fang, Lu Gao, 2025, Journal of Data and Information Science)
- Using Large Language Models to Support the Audit Process in the Accountability of Interim Managers in Notary Offices(Myke Valadão, Natalia Freire, Mateus de Paula, Lucas Almeida, Leonardo C. Marques, 2025, No journal)
- Artificial Intelligence in Auditing: Opportunities and Challenges for Top Firms in Bangalore(Dr. M. Sumathy, Salman Ahmed, 2025, International Journal of Advanced Research in Science, Communication and Technology)
- INTELLIGENT AUDIT: HOW ARTIFICIAL INTELLIGENCE IS REWRITING THE RULES OF THE FINANCIAL WORLD(Popel Serhii, 2025, Grail of Science)
- Application and Exploration of Artificial Intelligence Technology in Audit Risk Identification(Weicheng Pan, 2025, Journal of Social Science and Humanities)
- Automation of accounting and auditing processes: The potential and risks of AI application(I. A. NAUGOL’NOVA, V. Kuznetsov, 2025, Finance and Credit)
- Auditing AI-Generated Financial Statements: Navigating New Challenges(Modupe James, 2023, SSRN Electronic Journal)
- AI and Auditing: Enhancing Audit Efficiency and Effectiveness with Artificial Intelligence(Lidiana Lidiana, 2024, Accounting Studies and Tax Journal (COUNT))
- One-Class Classifiers Ensembles for Detecting Fund Misuse Problems within Financial Auditing(Chaoxian Feng, Haotian Wu, Zhe Li, Hongru Lu, Changjian Fang, Zhiang Wu, 2024, 2024 Twelfth International Conference on Advanced Cloud and Big Data (CBD))
- Applying Natural Language Processing to Financial Risk Disclosures and Audit Trails(Prashant Singh, 2023, Journal of Advances in Developmental Research)
- Automating Financial Statement Audits with Large Language Models(Rushi Wang, Jiateng Liu, Weijie Zhao, Shenglan Li, Denghui Zhang, 2025, ArXiv)
- Financial Statement Fraud Detection via Large Language Models(Zehra Erva Ergun, Emre Sefer, 2025, Intell. Syst. Account. Finance Manag.)
- Research on Financial Statement Checking Relationship Recognition System Based on Large Language Models(Haichao Zhang, Jie Zhang, Jiancheng Zhou, 2025, Proceedings of the 2nd Guangdong-Hong Kong-Macao Greater Bay Area International Conference on Digital Economy and Artificial Intelligence)
- Large Language Models-Based Robots for Enterprise Financial Statement Auditing(Yuan Yang, Haichao Zhang, Weidong Shen, Haoxuan Chen, Yi Cao, Liangyu Zhao, 2024, 2024 6th International Conference on Machine Learning, Big Data and Business Intelligence (MLBDBI))
- AI-Driven Continuous Auditing and Real-Time Anomaly Detection to Strengthen Financial Statement Reliability and Regulatory Transparency Frameworks(2025, International Research Journal of Modernization in Engineering Technology & Science)
- Auditing AI-Generated Financial Statements(Modupe James, 2023, SSRN Electronic Journal)
- ChatGPT and the Financial Statement Audit: Can Staff Level Employees at Audit Clients Use ChatGPT to Answer Auditor Inquiries?(Gregory P. Tapis, Julie Ravenscraft, J. Naegle, C. E. Keller, K. Church, 2026, Journal of Emerging Technologies in Accounting)
- Towards a Conversational Invoice Issuance LLM-Based Agent(Runze Nie, Hao Wu, Lan Ma, Zhenyu Liu, Zhigang Wang, Ping Zhang, 2024, 2024 7th International Conference on Algorithms, Computing and Artificial Intelligence (ACAI))
- ASSESSING THE INTEGRATION OF CHATGPT IN IT AUDITS THAT SUPPORT FINANCIAL STATEMENT AUDITS(A. R. Otero, 2024, SSRN Electronic Journal)
- Harnessing Artificial Intelligence for Auditing and Assurance: Challenges, Opportunities, and Policy Directions in India(Vandana Gupta, 2025, International Journal of Global Research Innovations & Technology)
- The Integration of Artificial Intelligence in Modern Auditing Practices Transforming Auditing through Technological Advancements(Aman Deep Singh, 2025, Journal of Advances in Developmental Research)
- Detecting Bugs with Substantial Monetary Consequences by LLM and Rule-based Reasoning(Brian Zhang, Zhuo Zhang, 2024, Advances in Neural Information Processing Systems 37)
- Blazing a New Trail in ERP Integration with NLP and Generative AI through APIs: a fraud examination perspective(Alessio Faccia, F. Manni, Vishal Pandey, Luigi Pio Leonardo Cavaliere, 2023, Proceedings of the 2023 8th International Conference on Information Systems Engineering)
- NEXT-GENERATION INTELLIGENT AUDIT: INNOVATIVE TRANSFORMATION AND STRATEGIC EVOLUTION OF FINANCIAL CONTROL THROUGH AI, XAI, AND AUTONOMOUS DIGITAL PLATFORMS(Serhii Popel, 2025, VIII International Scientific and Practical Conference «SCIENTIFIC PRACTICE: MODERN AND CLASSICAL RESEARCH METHODS»)
- Impact of AI on Auditing: Transforming Assurance Services in the Digital Age(Anmar Noori Dawood, A. Almagtome, 2025, 2025 International Conference on Frontier Technologies and Solutions (ICFTS))
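The tie-out checks this group automates are, at bottom, deterministic arithmetic identities over extracted figures. A minimal sketch of that pattern, with invented field names and amounts (not taken from any cited paper), might look like:

```python
# Hypothetical sketch: deterministic tie-out ("checking relationship") validation
# over figures an LLM extractor might emit. Field names and values are invented.
from decimal import Decimal

statement = {
    "total_assets": Decimal("1250.0"),
    "total_liabilities": Decimal("730.0"),
    "total_equity": Decimal("520.0"),
}

# Each rule maps a named checking relationship to its discrepancy.
RULES = {
    "assets = liabilities + equity":
        lambda s: s["total_assets"] - (s["total_liabilities"] + s["total_equity"]),
}

def run_tie_outs(s, tolerance=Decimal("0.5")):
    """Return the rules whose absolute discrepancy exceeds the rounding tolerance."""
    return {name: diff for name, rule in RULES.items()
            if abs(diff := rule(s)) > tolerance}

violations = run_tie_outs(statement)  # empty here: the statement ties out
```

The point of the design is that the LLM handles extraction and natural-language interaction, while arithmetic identities stay in code, where they are exact and auditable.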
Deep Financial Report Analysis, Numerical Reasoning, and Cross-Modal Alignment
These studies examine LLMs' ability to handle unstructured filings (e.g., 10-Ks, MD&A), structured data (XBRL, blockchain), and multimodal information. The focus is on KPI extraction, multi-step numerical reasoning, sentiment analysis, and auditing the consistency between on-chain data and disclosed information.
- Integrating Domain Knowledge for Financial QA: A Multi-Retriever RAG Approach with LLMs(Yukun Zhang, Stefan Elbl Droguett, Samyak Jain, 2025, ArXiv)
- FinQAPT: Empowering Financial Decisions with End-to-End LLM-driven Question Answering Pipeline(Kuldeep Singh, Simerjot Kaur, Charese H. Smiley, 2024, Proceedings of the 5th ACM International Conference on AI in Finance)
- Predicting Numeric Financial KPIs From Unstructured Text: a Comparative Study of LLM-Based Embeddings and Traditional NLP Techniques(Lord Coffie, Melvin Ajuluchukwu, Michael Nsor, 2025, 2025 IEEE/ACIS 29th International Conference on Software Engineering, Artificial Intelligence, Networking and Parallel/Distributed Computing (SNPD))
- Evaluation of Generative AI Q&A Chatbot Chained to Optical Character Recognition Models for Financial Documents(Yu Qiu, Venkata Duvvuri, Pratibha Yadavalli, Neal Prasad, 2024, Proceedings of the 2024 8th International Conference on Machine Learning and Soft Computing)
- Interpreting LLMs as Credit Risk Classifiers: Do Their Feature Explanations Align with Classical ML?(Saeed AlMarri, Kristof Juhasz, Mathieu Ravaut, Gautier Marti, Hamdan Al Ahbabi, Ibrahim M. Elfadel, 2025, ArXiv)
- GLCM: A Multimodal Framework for Credit Rating With Chain-of-Thought Reasoning(Sihan Hu, Yisi Wang, Zhongliang Yang, Yumeng Shi, Linna Zhou, 2025, IEEE Signal Processing Letters)
- Construction of Financial Risk Assessment Model Based on Text Mining and LLM Architecture(Zhenglin Li, M. Sui, Cheng Yang, Sa Liu, Bo Guan, Xinjin Li, 2025, Proceedings of the 2025 2nd International Conference on Digital Economy and Computer Science)
- MarketSenseAI 2.0: Enhancing Stock Analysis through LLM Agents(George Fatouros, Kostas C. Metaxas, John Soldatos, Manos Karathanassis, 2025, ArXiv)
- GPT-InvestAR: Enhancing Stock Investment Strategies through Annual Report Analysis with Large Language Models(Udit Gupta, 2023, ArXiv)
- Large Language Models for Corporate Financial Distress Prediction: Overview and Exploration(Wei Fu, 2025, Highlights in Business, Economics and Management)
- From fiction to fact: the growing role of generative AI in business and finance(Boyang Chen, Zongxiao Wu, Ruoran Zhao, 2023, Journal of Chinese Economic and Business Studies)
- A persona-based framework for teaching financial statement analysis with Generative AI: pedagogical design and practical application(Mirna Ibrahim, 2025, Advances in Economics Education)
- Addressing investor concerns: a Chinese financial question-answering benchmark with LLM-based evaluation(Yujian Gan, Yiyi Tao, Jiawang Mo, Xianzhen Huang, Yiwen Li, Kexin Wang, Yi Cai, Lu Liang, Shuzhen Xiong, Qi Ke, Hua Zheng, Xiaochu Hu, 2025, EPJ Data Science)
- Credit risk prediction and heterogeneity analysis for SMEs based on large language models and multimodal data fusion(Chuanhe Shen, Wenjing Pan, Xuan Shen, 2025, Complex & Intelligent Systems)
- Lightweight VLM for Financial Sentiment Analysis and Investment Decision Support(Zanyu Fang, 2025, Proceedings of the 2025 International Conference on Economic Management and Big Data Application)
- XBRL Agent: Leveraging Large Language Models for Financial Report Analysis(Shijie Han, Haoqiang Kang, Bo Jin, Xiao-Yang Liu, Steve Y. Yang, 2024, Proceedings of the 5th ACM International Conference on AI in Finance)
- Leveraging Large Language Models to Bridge On-chain and Off-chain Transparency in Stablecoins(Yuexin Xiang, Yuchen Lei, SM Mahir Shazeed Rish, Yuanzhe Zhang, Qin Wang, Tsz Hon Yuen, Jiangshan Yu, 2025, ArXiv)
- The Impact of Blockchain-Generative AI Integration: A Study on Financial Reporting with a Special Reference to Accuracy, Efficiency, and Trust(M. Hossain, Arif M. Rana, 2026, IUBAT Review)
- LLM-Powered Information Extraction for the Dairy Financial Domain: Tackling Data Scarcity and Ambiguity(Chunyan An, Yuying Huang, Qiang Yang, Siyu Yuan, Zhixu Li, 2025, Proceedings of the 34th ACM International Conference on Information and Knowledge Management)
- Generative AI for Automated Financial Reporting and Narrative Generation(Nesrat Abdelouahab, Aouni Mohammed Seghir, Abdelkamel Maamri, 2025, Journal of Ecohumanism)
- Fine-Tuning Vision-Language Models for Markdown Conversion of Financial Tables in Malaysian Audited Financial Reports(Jing Tan, En Jun Choong, Ethan Jeremiah Chitty, Yan Pheng Choo, John Hsin Yang Wong, Chern Cheah, 2025, ArXiv)
- Large Language Models as Financial Data Annotators: A Study on Effectiveness and Efficiency(Toyin Aguda, Suchetha Siddagangappa, Elena Kochkina, Simerjot Kaur, Dongsheng Wang, Charese H. Smiley, Sameena Shah, 2024, ArXiv)
- Towards reducing hallucination in extracting information from financial reports using Large Language Models(Bhaskarjit Sarmah, Dhagash Mehta, Stefano Pasquali, Tianjie Zhu, 2023, Proceedings of the Third International Conference on AI-ML Systems)
- EFSA: Towards Event-Level Financial Sentiment Analysis(Tianyu Chen, Yiming Zhang, Guoxin Yu, Dapeng Zhang, Li Zeng, Qing He, Xiang Ao, 2024, No journal)
- Evaluating Large Language Models (LLMs) in Financial NLP: A Comparative Study on Financial Report Analysis(Md Talha Mohsin, 2025, ArXiv)
- Evaluating Large Language Models for Stance Detection on Financial Targets from SEC Filing Reports and Earnings Call Transcripts(Nikesh Gyawali, Doina Caragea, A. Vasenkov, Cornelia Caragea, 2025, ArXiv)
- Enhancing Financial Risk Analysis using RAG-based Large Language Models(A.A. Darji, Fenil Kheni, Dhruvil Chodvadia, Parth Goel, Dweepna Garg, Bankim Patel, 2024, 2024 3rd International Conference on Automation, Computing and Renewable Systems (ICACRS))
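The multi-step numerical reasoning these papers target can be illustrated with a small worked example: derive a KPI from extracted line items, then cross-check it against a figure disclosed in the narrative text. The line items and the disclosed margin below are invented for the sketch.

```python
# Invented line items standing in for values extracted from a filing.
items = {"revenue": 1200.0, "cost_of_sales": 700.0, "opex": 300.0}

# Multi-step derivation: two intermediate results feed the final ratio.
gross_profit = items["revenue"] - items["cost_of_sales"]    # 500.0
operating_income = gross_profit - items["opex"]             # 200.0
operating_margin = operating_income / items["revenue"]      # ~0.1667

# Consistency check against a (hypothetical) margin stated in the MD&A text.
disclosed_margin = 0.1667
consistent = abs(operating_margin - disclosed_margin) < 1e-3
```

Benchmarks in this area score exactly this kind of chain: each intermediate value must be right for the final answer and the consistency verdict to be right.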
Reliability-Enhancing Technical Architectures: RAG and Multi-Agent Collaboration
This line of work addresses LLMs' limitations in the financial domain through architectural innovation. Core techniques include retrieval-augmented generation (RAG), multi-agent frameworks, knowledge-graph integration, and human-in-the-loop designs, aiming to ensure the accuracy, timeliness, and traceability of outputs.
- A Hybrid Retrieval-Generative AI Framework for FinTech Document Handling and Compliance Tracking(Siddhartha Chatterjee, Sudeshna Dey, Soumitra De, Jonti Deuri, 2026, International Journal of Innovative Science and Research Technology)
- FinArena: A Human-Agent Collaboration Framework for Financial Market Analysis and Forecasting(Congluo Xu, Zhaobin Liu, Ziyang Li, 2025, ArXiv)
- Fin-Rag A Rag System for Financial Documents(K. E. Kannammal, Mr. Anirudh R K, Kuzhali Tamizhiniyal P, G. G, Adrinath C, 2025, International Journal of Innovative Science and Research Technology)
- Integrating Retrieval-Augmented Generation and Large Language Models for Financial Question Answering(Yu-Jen Chen, Ping Chen, Tzu-Chia Tung, Yung-Chien Chou, Yucheng Chu, Sian-Wun Du, Wei-Chien Wang, C. Fuh, Chung-Ming Yang, 2025, 2025 10th International Conference on Intelligent Informatics and Biomedical Sciences (ICIIBMS))
- Language Model Orchestrated Financial Agents: An Open-Source Framework(Ravi Teja Gundimeda, 2025, 2025 IEEE 4th International Conference for Advancement in Technology (ICONAT))
- GRAF: A SCALABLE AND AUDITABLE GENERATIVE AI FRAMEWORK FOR AUTOMATED REGULATORY INTELLIGENCE IN U.S. FINANCIAL MARKETS(Viswatej Seela, 2025, INTERNATIONAL JOURNAL OF COMPUTER ENGINEERING AND TECHNOLOGY)
- Improving Zero-Shot Text Matching for Financial Auditing with Large Language Models(L. Hillebrand, Armin Berger, Tobias Deußer, Tim Dilmaghani, Mohamed Khaled, Bernd Kliem, Rüdiger Loitz, Maren Pielka, David Leonhard, C. Bauckhage, R. Sifa, 2023, Proceedings of the ACM Symposium on Document Engineering 2023)
- The Role of Retrieval-Augmented Generation (RAG) in Financial Document Processing: Automating Compliance and Reporting(Nihar Malali, 2025, International Journal of Management Technology)
- Sparse Attention Combined with RAG Technology for Financial Data Analysis(Zhaoyan Zhang, Kaixian Xu, Yu Qiao, Alan Wilson, 2025, Journal of Computer Science Research)
- Leveraging LLAMA for Financial Chatbots: Domain-Specific Fine-Tuning and Performance Evaluation(L. Saikrishna, S. Narayana, P. C. Sriyan, 2025, AVE Trends in Intelligent Management Letters)
- FinCARDS: Card-Based Analyst Reranking for Financial Document Question Answering(Yixi Zhou, Fan Zhang, Yu Chen, Haipeng Zhang, Preslav Nakov, Zhuohan Xie, 2026, ArXiv)
- GraphRAG Analysis for Financial Narrative Summarization and A Framework for Optimizing Domain Adaptation(Neelesh K. Shukla, Prabhat Prabhakar, Sakthivel Thangaraj, Sandeep Singh, Weiyi Sun, Prasanna Venkatesan, Viji Krishnamurthy, 2025, No journal)
- Progressive Knowledge Distillation and Numerical Reasoning Enhancement for Financial Report Question Answering(Ruonan Fang, Chao Yang, Wei Li, Xin Lin, Pingping Li, Yiman Wu, Xinyan Liu, 2025, Electronics)
- Hybrid RAG-LLM Framework For Intelligent Supplier Risk Assessment In Global Supply Chains(Sujith Vadakkepati, 2025, Journal of International Crisis and Risk Communication Research)
- The Evolution of Financial Analysis: From Manual Methods to AI and AI Agents(Z. Yordanova, Y. Hristozov, 2025, ECONOMICS)
- QuantMCP: Grounding Large Language Models in Verifiable Financial Reality(Yifan Zeng, 2025, ArXiv)
- Evaluating Retrieval-Augmented Generation Models for Financial Report Question and Answering(Ivan Iaroshev, R. Pillai, Leandro Vaglietti, T. Hanne, 2024, Applied Sciences)
- FinReflectKG: Agentic Construction and Evaluation of Financial Knowledge Graphs(Abhinav Arun, Fabrizio Dimino, T. Agarwal, Bhaskarjit Sarmah, Stefano Pasquali, 2025, Proceedings of the 6th ACM International Conference on AI in Finance)
- FinRobot: An Open-Source AI Agent Platform for Financial Applications using Large Language Models(Hongyang Yang, Boyu Zhang, Neng Wang, Chengkai Guo, Xiaoli Zhang, Likun Lin, Junlin Wang, Tianyu Zhou, Mao Guan, Runjia Zhang, Chris Wang, 2024, ArXiv)
- Knowledge-augmented Financial Market Analysis and Report Generation(Yue Chen, Feifan Wu, Jingwei Wang, Hao Qian, Ziqi Liu, Zhiqiang Zhang, Jun Zhou, Meng Wang, 2024, No journal)
- AlphaFin: Benchmarking Financial Analysis with Retrieval-Augmented Stock-Chain Framework(Xiang Li, Zhenyun Li, Chen Shi, Yong Xu, Qing Du, Mingkui Tan, Jun Huang, Wei Lin, 2024, No journal)
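The RAG pattern common to this group is: retrieve the most relevant document chunks, then constrain the model to answer only from them. A minimal stand-in sketch, using token overlap where real systems use dense embeddings (the chunks and query are invented):

```python
# Minimal RAG retrieval sketch: rank chunks by token overlap with the query,
# then build a grounded prompt. Token overlap stands in for dense retrieval.
import re

def tokenize(text):
    return set(re.findall(r"\w+", text.lower()))

def top_k(query, chunks, k=2):
    q = tokenize(query)
    return sorted(chunks, key=lambda c: len(q & tokenize(c)), reverse=True)[:k]

chunks = [
    "Revenue for FY2024 was 1.2bn, up 8 percent year over year.",
    "The board declared no dividend for the period.",
    "Fees paid to the external auditor totalled 2.1m.",
]
context = top_k("What was revenue in FY2024?", chunks)
prompt = "Answer using only this context:\n" + "\n".join(context)
```

Grounding the prompt in retrieved passages is also what makes the output traceable: every claim can be checked against the chunks that were actually supplied.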
Model Bias Identification, Hallucination Governance, and Audit Trust Mechanisms
This group probes LLMs' failure modes in high-stakes financial settings, including hallucination, semantic drift, various cognitive biases (gender, positional, and representation bias), and the auditors' trust paradox toward AI. The studies propose targeted risk-mitigation strategies and deterministic verification frameworks.
- A RAG-Based Evaluation of Large Language Models for Financial Insights(C. Bhowmik, Monika Bansal, Nivedita Palia, Isha Gupta, Deepika Rawat, 2025, 2025 International Conference on Digital Innovations for Sustainable Solutions (ICDISS))
- Hallucination-minimized Data-to-answer Framework for Financial Decision-makers(Sohini Roychowdhury, A. Alvarez, Brian Moore, Marko Krema, Maria Paz Gelpi, F. Rodriguez, Angel Rodriguez, Jose Ramon Cabrejas, Pablo Serrano, Punit Agrawal, Arijit Mukherjee, 2023, 2023 IEEE International Conference on Big Data (BigData))
- Journey of Hallucination-minimized Generative AI Solutions for Financial Decision Makers(Sohini Roychowdhury, 2023, Proceedings of the 17th ACM International Conference on Web Search and Data Mining)
- Variance-Aware LLM Annotation for Strategy Research: Sources, Diagnostics, and a Protocol for Reliable Measurement(Arnaldo Camuffo, Alfonso Gambardella, Saeid Kazemi, Jakub Malachowski, Abhinav Pandey, 2025, ArXiv)
- Unmasking Bias in Financial AI: A Robust Framework for Evaluating and Mitigating Hidden Biases in LLMs(Shreshth Mehrotra, Raghavendra P, Balraj Prajesh, Hrishikesh Kambale, Puspita Majumdar, 2025, Proceedings of the 6th ACM International Conference on AI in Finance)
- Systematic Evaluation of Long-Context LLMs on Financial Concepts(Lavanya Gupta, Saket Sharma, Yiyun Zhao, 2024, ArXiv)
- OmniEval: An Omnidirectional and Automatic RAG Evaluation Benchmark in Financial Domain(Shuting Wang, Jiejun Tan, Zhicheng Dou, Ji-Rong Wen, 2024, ArXiv)
- Quantifying Semantic Shift in Financial NLP: Robust Metrics for Market Prediction Stability(Zhongtian Sun, Chenghao Xiao, Anoushka Harit, Jongmin Yu, 2025, Proceedings of the 6th ACM International Conference on AI in Finance)
- Towards Explainable and Reliable AI in Finance(Albi Isufaj, Pablo Mollá, Helmut Prendinger, 2025, ArXiv)
- On the Reliability of Large Language Models in Financial Applications: An Analysis of Hallucination(Shweta Gupta, 2025, 2025 4th International Conference on Applied Artificial Intelligence and Computing (ICAAIC))
- LLM Output Drift: Cross-Provider Validation & Mitigation for Financial Workflows(Raffi Khatchadourian, Rolando Franco, 2025, ArXiv)
- Evaluating Company-specific Biases in Financial Sentiment Analysis using Large Language Models(Kei Nakagawa, Masanori Hirano, Yugo Fujimoto, 2024, 2024 IEEE International Conference on Big Data (BigData))
- Responsible Innovation: A Strategic Framework for Financial LLM Integration(Ahmadreza Tavasoli, Maedeh Sharbaf, Seyed Mohamad Madani, 2025, ArXiv)
- Replayable Financial Agents: A Determinism-Faithfulness Assurance Harness for Tool-Using LLM Agents(Raffi Khatchadourian, 2026, ArXiv)
- The Trust Paradox: Analyzing Auditor Reliance on Hallucinating Generative AI Models in Internal Control Testing(Aduragbemi Joshua Olaseinde, Bolanle Busirat Azeez, 2022, International Journal of Artificial Intelligence Engineering and Transformation)
- Deficiency of Large Language Models in Finance: An Empirical Examination of Hallucination(Haoqiang Kang, Xiao-Yang Liu, 2023, ArXiv)
- Adaptive Trust Metrics for Multi-LLM Systems: Enhancing Reliability in Regulated Industries(Tejaswini Bollikonda, 2026, ArXiv)
- Tracing Positional Bias in Financial Decision-Making: Mechanistic Insights from Qwen2.5(Fabrizio Dimino, Krati Saxena, Bhaskarjit Sarmah, Stefano Pasquali, 2025, Proceedings of the 6th ACM International Conference on AI in Finance)
- A Financial Brain Scan of the LLM(Hui Chen, Antoine Didisheim, Luciano Somoza, Hanqing Tian, 2025, ArXiv)
- Accuracy and Bias Mitigation in GenAI / LLM-based Financial Underwriting and Clinical Summarization Systems(Praveen Kumar, Shailendra Bade, 2024, International Journal of Science and Research (IJSR))
- Argument Quality Annotation and Gender Bias Detection in Financial Communication through Large Language Models(Alaa Alhamzeh, Mays Al Rebdawi, 2025, ArXiv)
- Identifying Representation Bias in Large Language Models Used in Financial Sentiment Analysis(Alpay Sabuncuoglu, Carsten Maple, 2025, 2025 IEEE Symposium on Computational Intelligence for Financial Engineering and Economics (CiFer))
- Innovation of Enterprise Ethical Review Mechanism Driven by Generative AI for Financial Report Preparation(Yue Ma, Jian Du, 2025, Modern Economics & Management Forum)
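One simple, deterministic guard against the numeric hallucination this group studies is a grounding check: flag any number in the generated answer that never appears in the source context. The sketch below is illustrative and not the method of any single cited paper; the texts are invented.

```python
# Grounding check for numeric hallucination: numbers in the answer that the
# source context never mentions are flagged for review. Illustrative only.
import re

NUM = re.compile(r"\d+(?:\.\d+)?")

def unsupported_numbers(answer, context):
    """Return the set of numbers in `answer` absent from `context`."""
    return set(NUM.findall(answer)) - set(NUM.findall(context))

ctx = "Net income was 42.5 million in 2024, up from 39.1 million in 2023."
grounded = unsupported_numbers("Net income rose to 42.5 million in 2024.", ctx)
fabricated = unsupported_numbers("Net income rose to 44.0 million.", ctx)
```

Checks like this are cheap enough to run on every generation, which is why several of the frameworks above layer deterministic verification on top of the model rather than trusting its output directly.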
Industry Benchmarks, Regulatory Compliance, and Macro-Governance Frameworks
This part of the literature focuses on building rigorous evaluation suites (e.g., FinMaster, BizFinBench) and on macro-governance perspectives. It covers LLM applications in taxation and compliance checking, as well as AI's far-reaching impact on the accounting profession's digital transformation, agency costs, and fiscal accountability.
- LARGE LANGUAGE MODELS EMPOWERING COMPLIANCE CHECKS AND REPORT GENERATION IN AUDITING(2024, World Journal of Information Technology)
- RAG-Augmented Payroll Intelligence Stack: A Zero-Trust, Policy-Verified Architecture for Compliant and Explainable Payroll Automation(Raghuveer Yerneni, 2025, 2025 13th International Conference on Intelligent Systems and Embedded Design (ISED))
- E2E Process Automation Leveraging Generative AI and IDP-Based Automation Agent: A Case Study on Corporate Expense Processing(Cheonsu Jeong, Seongmin Sim, Hyoyoung Cho, Sungsu Kim, Byounggwan Shin, 2025, ArXiv)
- Architecting Intelligent Tax Automation: Research Innovations in Machine Learning for Global Compliance(Vedashree Kedar Karandikar, 2026, International Journal of Computational and Experimental Science and Engineering)
- The impact of large language models on accounting and future application scenarios(WenYi Li, Wenyu Liu, Mengya Deng, Xin Liu, Lingbing Feng, 2025, Journal of Accounting Literature)
- FinChain: A Symbolic Benchmark for Verifiable Chain-of-Thought Financial Reasoning(Zhuohan Xie, Dhruv Sahnan, Debopriyo Banerjee, Georgi N. Georgiev, Rushil Thareja, Hachem Madmoun, Jinyan Su, Aaryamonvikram Singh, Yuxia Wang, Rui Xing, Fajri Koto, Haonan Li, Ivan Koychev, Tanmoy Chakraborty, S. Lahlou, Veselin Stoyanov, Preslav Nakov, 2025, ArXiv)
- FinMaster: A Holistic Benchmark for Mastering Full-Pipeline Financial Workflows with LLMs(Junzhe Jiang, Chang Yang, Aixin Cui, Sihan Jin, Ruiyu Wang, Bo Li, Xiao Huang, Dongning Sun, Xinrun Wang, 2025, ArXiv)
- FinAuditing: A Financial Taxonomy-Structured Multi-Document Benchmark for Evaluating LLMs(Yan Wang, Keyi Wang, Shanshan Yang, J. Patel, Jeff Zhao, Fengran Mo, Xueqing Peng, Lingfei Qian, Jimin Huang, Guojun Xiong, Xiao-Yang Liu, Jian-Yun Nie, 2025, ArXiv)
- Evaluating Large Language Models on Financial Report Summarization: An Empirical Study(Xinqi Yang, Scott D. Zang, Yong Ren, Dingjie Peng, Zheng Wen, 2024, ArXiv)
- Towards Automated Regulatory Compliance Verification in Financial Auditing with Large Language Models(Armin Berger, L. Hillebrand, David Leonhard, Tobias Deußer, Thiago Bell Felix de Oliveira, T. Khameneh, Mohamed Khaled, Bernd Kliem, Rüdiger Loitz, Christian Bauckhage, R. Sifa, 2023, 2023 IEEE International Conference on Big Data (BigData))
- AI-Driven Proactive Monitoring: Mitigating Agency Costs and Financial Risk(Raden Agrosamdhyo, 2025, Proceeding of the International Conference on Global Education and Learning)
- Artificial Intelligence in Auditing: A Bibliometric Analysis of Research Trends, Collaboration, and Key Themes (2000-2025)(Zimeng Guo, 2025, Advances in Economics, Management and Political Sciences)
- Revolutionizing Accounting and Finance Practices with Large Language Models: A Comprehensive Review of Applications and Implications(Zhihong Luo, Jing Cui, Meijiazi Yang, 2025, Journal of Statistics and Economics)
- ENHANCING FISCAL ACCOUNTABILITY AND AUDITABILITY: A FRAMEWORK FOR DEPLOYING GENERATIVE AI PROCESS AGENTS IN PUBLIC SECTOR FINANCIAL ERPS(Arunkumar Yadava, Harshini Gadam, Rajender Chilukala, 2025, Lex localis - Journal of Local Self-Government)
- AI-driven data governance in banking: Leveraging large language models for compliance and risk management(Rajesh Kamisetty, Raj Nagamangalam, 2025, World Journal of Advanced Research and Reviews)
- BloombergGPT: Revolutionizing Finance with Large Language Models(M. K. Keshri, 2025, International Journal of Advanced Research in Science, Communication and Technology)
- The evolution of accounting and auditing in the era of digital technologies: the role of cloud services and process automation(S. Matchuk, Valentyna Havrylenko, I. Lukanovska, T. Kharkhalis, Yana Ostapenko, 2024, Salud, Ciencia y Tecnología - Serie de Conferencias)
- BizFinBench: A Business-Driven Real-World Financial Benchmark for Evaluating LLMs(Guilong Lu, Xuntao Guo, Rongjunchen Zhang, Wenqiao Zhu, Ji Liu, 2025, ArXiv)
- ZiGong 1.0: A Large Language Model for Financial Credit(Yu Lei, Zixuan Wang, Chu Liu, Tongyao Wang, 2025, 2025 IEEE 41st International Conference on Data Engineering Workshops (ICDEW))
- Application of Startup Success Prediction Models and Business Document Extraction Using Large Language Models to Enhance Due Diligence Efficiency(Vito Christian Samudra, Dicky Prima Satya, 2024, 2024 11th International Conference on Advanced Informatics: Concept, Theory and Application (ICAICTA))
- From reactive to proactive policy implement method through intelligent enterprise matching(Yongkang Duan, Guangyu Zhao, Qian Geng, P. Ji, Jian Jin, 2026, Enterprise Information Systems)
- Evaluating Large Language Models for Financial Reasoning: A CFA-Based Benchmark Study(Xuan Yao, Qianteng Wang, Xinbo Liu, Ke-wei Huang, 2025, ArXiv)
- PROSPECTS FOR USING ARTIFICIAL INTELLIGENCE TO PREDICT FINANCIAL TRENDS: FROM TIME SERIES MODELS TO NEWS AND LLM-ORIENTED CONTOURS(Natalia S. Plaskova, Viktoriia I. Demina, 2026, EKONOMIKA I UPRAVLENIE: PROBLEMY, RESHENIYA)
- CARE: A Framework for Correcting Numerical Hallucinations in LLM-Generated Financial Texts(Jian Kim, Woohwan Jung, 2025, 2025 IEEE Conference on Artificial Intelligence (CAI))
- Large Language Models in the Market: A Study on Financial Forecasting and Stock Interpretation(Soumyajit Hazra, Saptarshi Banerjee, S. Karmakar, Malay Gangopadhay, 2025, 2025 International Conference on Artificial Intelligence for Computing, Astronomy and Renewable Energy (AICARE))
- Leveraging Large Language Models for Sentiment Analysis and Investment Strategy Development in Financial Markets(Yejoon Mun, Namhyoung Kim, 2025, J. Theor. Appl. Electron. Commer. Res.)
- Adversarially Enhanced Financial Misinformation: A Comparative Analysis of LLM- vs. GAN-Generated Content Exposing AI Moderation Vulnerabilities(Christopher Santorelli, Victor Ginart Belmonte, Ryan Mastropaolo, 2025, 2025 6th International Conference on Artificial Intelligence, Robotics and Control (AIRC))
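At their core, the benchmarks above reduce to a scoring loop: compare model outputs to gold answers item by item and report per-task accuracy. A sketch of that loop, with task names and records invented for illustration:

```python
# Illustrative benchmark scoring loop: per-task exact-match accuracy over
# (task, prediction, gold) triples. Task names and data are invented.
from collections import defaultdict

def score(records):
    hits, totals = defaultdict(int), defaultdict(int)
    for task, pred, gold in records:
        totals[task] += 1
        hits[task] += int(pred.strip().lower() == gold.strip().lower())
    return {t: hits[t] / totals[t] for t in totals}

records = [
    ("compliance_check", "non-compliant", "non-compliant"),
    ("compliance_check", "compliant", "non-compliant"),
    ("numeric_qa", "1.2bn", "1.2bn"),
]
accuracy = score(records)
```

Real suites score thousands of items per task and often replace exact match with task-specific metrics, but the per-task breakdown is what lets them localize where a model fails.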
This report synthesizes the full spectrum of research on LLMs in intelligent auditing and financial report analysis. The field is shifting from "tool substitution" to "systemic reconstruction": on one hand, RAG, agent architectures, and cross-modal alignment have markedly improved LLM efficiency in automated auditing, fraud detection, and complex financial report parsing; on the other, academia and practice are systematically confronting model hallucination, algorithmic bias, and the audit trust paradox by building rigorous industry benchmarks (e.g., FinAuditing) and governance frameworks. The ultimate goal is an explainable, highly reliable financial intelligence ecosystem that meets regulatory requirements.
A total of 123 related publications.
This article aims to enhance the digital and intelligent capabilities of enterprise financial statement audits by proposing and constructing optimization strategies and frameworks for audit robots based on large language models. It uses large language models to design an audit data robot, an audit workpaper robot, and an audit analysis robot for human-computer interaction, thereby lowering the technical threshold for auditors and improving the handling of personalized audit requirements. The audit data robot extracts data intelligently through natural-language commands; the audit workpaper robot automates the preparation of workpapers through intent recognition and tool-library selection; and the audit analysis robot raises the level of decision-making with visualization and intelligent analysis techniques. Audit robots built on large language models can effectively address existing issues in current audits and improve audit efficiency and the level of audit intelligence.
The auditing of financial documents, historically a labor-intensive process, stands on the precipice of transformation. AI-driven solutions have made inroads into streamlining this process by recommending pertinent text passages from financial reports to align with the legal requirements of accounting standards. However, a glaring limitation remains: these systems commonly fall short in verifying whether the recommended excerpts indeed comply with the specific legal mandates. Hence, in this paper, we probe the efficiency of publicly available Large Language Models (LLMs) in the realm of regulatory compliance across different model configurations. We place particular emphasis on comparing cutting-edge open-source LLMs, such as Llama-2, with their proprietary counterparts like OpenAI's GPT models. This comparative analysis leverages two custom datasets provided by our partner PricewaterhouseCoopers (PwC) Germany. We find that the open-source Llama-2 70 billion model demonstrates outstanding performance in detecting non-compliance or true negative occurrences, outperforming all of its proprietary counterparts. Nevertheless, proprietary models such as GPT-4 perform the best in a broad variety of scenarios, particularly in non-English contexts.
With the driving power of the internet wave, the auditing industry faces unprecedented challenges and opportunities. Traditional audit methods increasingly reveal weaknesses in processing vast amounts of data, which calls for new technologies to maximize the efficiency and accuracy of audits. As an epoch-making innovation in natural language processing, Large Language Models (LLMs) demonstrate unequalled performance in text parsing, semantic detection, and text generation, opening up approaches to a forward-looking, intelligent reform of auditing. This paper studies how LLMs can be applied to intelligent checking-relationship detection in financial statements to enhance the efficiency and accuracy of auditing. Combining an audit knowledge base with Retrieval-Augmented Generation (RAG) technologies, we introduce a multi-agent system for intelligent checking-relationship detection in financial statements. LLMs can automatically identify and verify checking relationships between financial statements, dramatically improving audit efficiency and quality. Environmental context and data support are provided through the audit knowledge base, and the RAG technologies enhance the power of the analysis. The paper demonstrates through experiments that these technologies can support intelligent checking-relationship detection, ushering auditing into an intelligent-audit era.
Collecting labeled datasets in finance is challenging due to the scarcity of domain experts and the high cost of employing them. While Large Language Models (LLMs) have demonstrated remarkable performance in data annotation tasks on general-domain datasets, their effectiveness on domain-specific datasets remains under-explored. To address this gap, we investigate the potential of LLMs as efficient data annotators for extracting relations in financial documents. We compare the annotations produced by three LLMs (GPT-4, PaLM 2, and MPT Instruct) against expert annotators and crowdworkers. We demonstrate that the current state-of-the-art LLMs can be sufficient alternatives to non-expert crowdworkers. We analyze models using various prompts and parameter settings and find that customizing the prompts for each relation group by providing specific examples belonging to those groups is paramount. Furthermore, we introduce a reliability index (LLM-RelIndex) used to identify outputs that may require expert attention. Finally, we perform an extensive time, cost, and error analysis and provide recommendations for the collection and usage of automated annotations in domain-specific settings.
Startups face extreme uncertainty and high failure rates, posing challenges for investors in identifying promising ventures. This research, based on a case study and interviews at a prominent Indonesian corporate venture capital firm, explores the due diligence process, typically taking 4–6 weeks depending on data completeness. Using Large Language Model (LLM) and Machine Learning (ML) technologies developed with the Team Data Science Process (TDSP) methodology, the research aims to enhance due diligence efficiency. Key development steps include data integration, ML model creation for startup success classification, and the integration of OpenAI's GPT-4 and Google Search APIs for comprehensive business analysis. The system's dashboard offers features such as pitch deck, financial, market trends, competitor, and founding team analyses, along with startup success prediction using the XGBoost model. This model, deployed via Flask, demonstrated consistent results through cross-validation. Customer acceptance testing, conducted with eight experienced startup investors, yielded a high satisfaction rate of 4.50 out of 5.00, indicating strong approval of the system's effectiveness.
Auditing financial documents is a very tedious and time-consuming process. As of today, it can already be simplified by employing AI-based solutions to recommend relevant text passages from a report for each legal requirement of rigorous accounting standards. However, these methods need to be fine-tuned regularly, and they require abundant annotated data, which is often lacking in industrial environments. Hence, we present ZeroShotALI, a novel recommender system that leverages a state-of-the-art large language model (LLM) in conjunction with a domain-specifically optimized transformer-based text-matching solution. We find that a two-step approach of first retrieving a number of best matching document sections per legal requirement with a custom BERT-based model and second filtering these selections using an LLM yields significant performance improvements over existing approaches.
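The two-step retrieve-then-filter architecture described above can be sketched with stand-in components: stage 1 ranks report sections per legal requirement (token overlap standing in for the custom BERT matcher), and stage 2 filters candidates with an LLM judgment, mocked here as a callable. Sections, requirement, and the judge are invented for the sketch.

```python
# Two-step retrieve-then-filter sketch. Stage 1: rank sections by overlap with
# the requirement (stand-in for a learned matcher). Stage 2: keep only
# candidates an LLM judge accepts (mocked as a plain callable here).
import re

def _tokens(text):
    return set(re.findall(r"\w+", text.lower()))

def retrieve(requirement, sections, k=3):
    req = _tokens(requirement)
    return sorted(sections, key=lambda s: len(req & _tokens(s)), reverse=True)[:k]

def two_step(requirement, sections, llm_judge, k=3):
    return [s for s in retrieve(requirement, sections, k)
            if llm_judge(requirement, s)]

sections = [
    "Note 12 discloses lease liabilities and their maturity profile.",
    "The company sponsors a community art programme.",
]
requirement = "Disclose the maturity profile of lease liabilities."
judge = lambda req, sec: "lease" in sec  # a real system would call an LLM here
matches = two_step(requirement, sections, judge)
```

The division of labor is the point: the cheap first stage keeps the candidate set small, so the expensive LLM call in the second stage runs only a handful of times per requirement.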
Financial statement auditing is essential for stakeholders to understand a company's financial health, yet current manual processes are inefficient and error-prone. Even with extensive verification procedures, auditors frequently miss errors, leading to inaccurate financial statements that fail to meet stakeholder expectations for transparency and reliability. To this end, we harness large language models (LLMs) to automate financial statement auditing and rigorously assess their capabilities, providing insights on their performance boundaries in the scenario of automated auditing. Our work introduces a comprehensive benchmark using a curated dataset combining real-world financial tables with synthesized transaction data. In the benchmark, we developed a rigorous five-stage evaluation framework to assess LLMs' auditing capabilities. The benchmark also challenges models to map specific financial statement errors to corresponding violations of accounting standards, simulating real-world auditing scenarios through test cases. Our testing reveals that current state-of-the-art LLMs successfully identify financial statement errors when given historical transaction data. However, these models demonstrate significant limitations in explaining detected errors and citing relevant accounting standards. Furthermore, LLMs struggle to execute complete audits and make necessary financial statement revisions. These findings highlight a critical gap in LLMs' domain-specific accounting knowledge. Future research must focus on enhancing LLMs' understanding of auditing principles and procedures. Our benchmark and evaluation framework establish a foundation for developing more effective automated auditing tools that will substantially improve the accuracy and efficiency of real-world financial statement auditing.
The banking sector faces governance, compliance, and risk management challenges due to evolving financial regulation and high volumes of sensitive data. Traditional rule-based systems struggle with real-time monitoring and anomaly detection, leading to inefficiencies and compliance risks. This paper discusses how Large Language Models (LLMs) can enable banking data governance by automating regulatory compliance, risk assessment, and fraud detection. LLMs allow intelligent data classification, predictive analytics, and real-time auditing in line with standards such as GDPR, Basel III, and AML directives. They offer a transformative solution for secure and transparent financial operations, albeit with challenges such as data privacy, model bias, and explainability. Drawing on real case studies, this research discusses how AI-based data governance can provide banks with improved security, compliance with regulatory mandates, and operational effectiveness.
The auditing process in notary offices in Brazil is hindered by inefficiencies, high costs, and the complexity of manual procedures. To address these challenges, we propose a system that leverages the capabilities of Large Language Models (LLMs), specifically LLaMA2-7B and Falcon-7B, to automate critical information extraction from diverse document types. The system detects anomalous monetary values and unauthorized services, linking them to corresponding dates and beneficiaries to provide a detailed overview of financial discrepancies. Integrating advanced Natural Language Processing (NLP) techniques into auditing workflows enhances fraud detection, reduces operational costs, and improves accuracy. With a BLEU score above 0.67, the proposed system demonstrates significant potential to streamline auditing operations. Key benefits include assisting court analysts in identifying fraud cases, optimizing public resource management by eliminating unjustified expenses, and potentially increasing court revenues to reinvest in public services.
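The anomaly-flagging step described above, linking outlier monetary values back to their dates and beneficiaries, can be illustrated with a simple statistical sketch; the records, threshold, and z-score rule below are invented for illustration and are not the paper's method:

```python
# Minimal sketch: flag extracted monetary values that lie far from the
# distribution of past entries, keeping date/beneficiary context attached.
from statistics import mean, stdev

records = [
    {"date": "2024-01-10", "beneficiary": "Notary A", "value": 120.0},
    {"date": "2024-01-11", "beneficiary": "Notary B", "value": 135.0},
    {"date": "2024-01-12", "beneficiary": "Notary A", "value": 110.0},
    {"date": "2024-01-13", "beneficiary": "Notary C", "value": 4200.0},
    {"date": "2024-01-14", "beneficiary": "Notary B", "value": 125.0},
]

def flag_anomalies(records, z_threshold=1.5):
    values = [r["value"] for r in records]
    mu, sigma = mean(values), stdev(values)
    # Keep whole records so each flag stays linked to a date and beneficiary.
    return [r for r in records if abs(r["value"] - mu) / sigma > z_threshold]

flagged = flag_anomalies(records)
for r in flagged:
    print(f'{r["date"]}: {r["beneficiary"]} -> {r["value"]}')
```

A production audit pipeline would condition the threshold on service type and period rather than using one global z-score.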
Purpose: This paper examines the transformative impact of large language models (LLMs) on accounting practices and explores future application scenarios. Through a systematic literature review, it highlights the potential of LLMs to enhance efficiency, transparency and innovation across areas such as financial reporting, ESG disclosure, financial analysis and risk management. Additionally, it identifies key challenges, including data quality, privacy and the need for domain-specific adaptations, while proposing actionable strategies to address them. By forecasting advanced applications like intelligent knowledge bases and automated operations, this study provides a roadmap for integrating LLMs into accounting, driving progress and sustainability in the industry.
Design/methodology/approach: This study adopts a systematic literature review methodology to explore the impact and future applications of LLMs in accounting. It identifies key research areas by analyzing over 50 high-quality studies selected through extensive keyword searches, Boolean queries and backward and forward citation analyses of seminal works. The review is structured around eight thematic areas, including financial reporting, ESG disclosure and risk management. By synthesizing findings, the study develops a comprehensive framework for understanding the transformative potential of LLMs while addressing associated challenges, such as data security and specialization, to guide future research and practical applications in accounting.
Findings: The study reveals that LLMs significantly enhance efficiency, transparency and innovation in accounting by automating processes like financial reporting, ESG disclosure and risk management. They enable advanced applications such as intelligent knowledge bases, budget optimization and automated contract management. However, challenges remain, including the need for high-quality data, domain-specific model training, interdisciplinary talent development and robust data security measures. The findings underscore LLMs' potential to transform accounting practices while emphasizing the importance of theoretical frameworks and strategic planning to address these challenges and fully realize their benefits in driving industry progress and sustainability.
Practical implications: The study highlights practical pathways for integrating LLMs into accounting, emphasizing their potential to automate processes, enhance decision-making and improve operational efficiency. Organizations can leverage LLMs for tasks such as financial reporting, ESG analysis and risk management, reducing manual effort and increasing accuracy. Practical implications include the need for targeted training of LLMs in accounting-specific contexts, robust data governance to ensure quality and security and developing interdisciplinary skills among accounting professionals. By addressing these areas, organizations can harness LLMs to drive innovation, streamline operations and achieve sustainable growth in a rapidly evolving business environment.
Originality/value: This study provides a comprehensive and systematic analysis of the transformative impact of LLMs on accounting, addressing gaps in fragmented research and limited practical insights. It uniquely integrates theoretical perspectives with practical applications, offering a structured framework for understanding LLMs' role across multiple accounting domains. By identifying key challenges and proposing actionable strategies, the paper delivers original value to both researchers and practitioners, fostering innovation and guiding the integration of LLMs into accounting practices. Its forward-looking approach offers a valuable resource for advancing knowledge and shaping the future of accounting in the digital age.
BloombergGPT represents a significant advancement in applying Large Language Models to the financial domain. This article examines how this specialized variant leverages natural language processing capabilities to transform financial operations across multiple applications. Developed by Bloomberg, this decoder-only language model is trained on an extensive corpus of financial texts and general-purpose datasets, enabling superior performance on finance-specific tasks while maintaining competence in general NLP benchmarks. The article explores its diverse applications, including economic news summarization, market analysis, research report generation, virtual assistance, fraud detection, and trading strategy optimization. While offering substantial benefits in efficiency, accuracy, and cost reduction, BloombergGPT also faces important challenges related to data quality, hallucinations, regulatory compliance, and explainability that must be addressed for responsible implementation in the precision-critical financial industry.
The rapid advancement of large language models (LLMs) presents significant opportunities for financial applications, yet systematic evaluation in specialized financial contexts remains limited. This study presents the first comprehensive evaluation of state-of-the-art LLMs using 1,560 multiple-choice questions from official mock exams across Levels I-III of the CFA Program, one of the most rigorous professional certifications globally, which mirrors real-world financial analysis complexity. We compare models distinguished by core design priorities: multi-modal and computationally powerful, reasoning-specialized and highly accurate, and lightweight efficiency-optimized. We assess models under zero-shot prompting and through a novel Retrieval-Augmented Generation (RAG) pipeline that integrates official CFA curriculum content. The RAG system achieves precise domain-specific knowledge retrieval through hierarchical knowledge organization and structured query generation, significantly enhancing reasoning accuracy in professional financial certification evaluation. Results reveal that reasoning-oriented models consistently outperform others in zero-shot settings, while the RAG pipeline provides substantial improvements, particularly for complex scenarios. Comprehensive error analysis identifies knowledge gaps as the primary failure mode, with minimal impact from text readability. These findings provide actionable insights for LLM deployment in finance, offering practitioners evidence-based guidance for model selection and cost-performance optimization.
Large Language Models (LLMs) have transformed financial research by providing sophisticated analysis of complex textual data. This research examines the performance differences among GPT-4, Gemini Pro, Claude, and DeepSeek through a Retrieval-Augmented Generation (RAG) framework designed to interpret financial news. We built a Streamlit-based application integrating LangChain with HuggingFace embeddings, FAISS vector stores, and Gemini APIs to create a system that mirrors real-world financial analysis requirements. Multiple metrics, including contextual accuracy, relevance, factual consistency, latency, and cost-efficiency, serve as evaluation criteria for the models. The outcomes of the other models are extrapolated from Gemini's complete implementation using extensive documentation, public benchmarks, and qualitative evaluation methods. The evaluation shows that Gemini delivers better contextual relevance and faster inference, while GPT-4 demonstrates stronger coherence and more precise factual accuracy. This work presents actionable guidelines for choosing appropriate LLMs for financial research, alongside an analysis of how RAG systems boost model effectiveness.
With the widespread adoption of Internet‐based AI technologies, addressing financial fraud has become increasingly critical, particularly within the realm of machine learning. Deep learning and natural language processing (NLP) techniques offer powerful means of detecting fraudulent activity by analyzing financial documents, thereby enhancing both the efficiency and precision of such assessments and supporting financial security. In this study, we introduce deep representation learning‐based approaches relying mainly on large language models (LLMs) for identifying fraud in financial statements by examining temporal changes in the Management Discussion and Analysis (MD&A) sections of corporate disclosures. Departing from conventional techniques that rely only on word frequency analysis, we propose DeepFraud, which combines time‐evolving financial LLM embeddings of paragraphs, such as FinBERT, FinLlama, and FinGPT embeddings, and uses long short‐term memory (LSTM) networks to predict fraud from historical textual embeddings. In addition to LLM embeddings, we also integrate (1) time‐evolving frequencies of words relevant to fraud detection, such as those expressing sentiment or uncertainty, and (2) time‐evolving financial ratios. Trajectories of paragraph‐level embeddings, frequencies, and ratios are used to construct a fraud detection model, which we evaluate against machine learning methods and deep time‐series models. Using 30 years of financial report data (from 1995 to 2024), our experiments demonstrate that DeepFraud enhances fraud detection performance across a number of scenarios and on average outperforms both the competing approaches and conventional word frequency approaches. Our framework introduces a novel direction for deep feature engineering in the field of financial statement fraud detection.
Large language models (LLMs) are increasingly used to support the analysis of complex financial disclosures, yet their reliability, behavioral consistency, and transparency remain insufficiently understood in high-stakes settings. This paper presents a controlled evaluation of five transformer-based LLMs applied to question answering over the Business sections of U.S. 10-K filings. To capture complementary aspects of model behavior, we combine human evaluation, automated similarity metrics, and behavioral diagnostics under standardized and context-controlled prompting conditions. Human assessments indicate that models differ in their average performance across qualitative dimensions such as relevance, completeness, clarity, conciseness, and factual accuracy, though inter-rater agreement is modest, reflecting the subjective nature of these criteria. Automated metrics reveal systematic differences in lexical overlap and semantic similarity across models, while behavioral diagnostics highlight variation in response stability and cross-prompt alignment. Importantly, no single model consistently dominates across all evaluation perspectives. Together, these findings suggest that apparent performance differences should be interpreted as relative tendencies under the tested conditions rather than definitive indicators of general reliability. The results underscore the need for evaluation frameworks that account for human disagreement, behavioral variability, and interpretability when deploying LLMs in financially consequential applications.
This paper examines the challenges and solutions related to accuracy and bias in Generative AI (GenAI) and Large Language Models (LLMs) when applied to financial underwriting and clinical summarization. We compare and contrast the unique issues in these domains, explore current mitigation strategies, and propose novel approaches to enhance the reliability and fairness of AI-driven decision-making in these critical sectors. Through comprehensive analysis of recent research and case studies, we demonstrate the potential of these technologies to revolutionize both industries while highlighting the crucial need for ongoing vigilance and innovation in addressing accuracy and bias concerns.
eXtensible Business Reporting Language (XBRL) has attained the status of the global de facto standard for business reporting. However, its complexity poses significant barriers to interpretation and accessibility. In this paper, we present the first evaluation of large language models' (LLMs) performance in analyzing XBRL reports. Our study identifies LLMs' limitations in the comprehension of financial domain knowledge and mathematical calculation in the context of XBRL reports. To address these issues, we propose enhancement methods using external tools under the agent framework, referred to as XBRL-Agent, which invokes retrievers and calculators. Extensive experiments on two tasks - the Domain Query Task (which tested 500 XBRL term explanations and 50 domain questions) and the Numeric Type Query Task (which tested 1,000 financial math tests and 50 numeric queries) - demonstrate substantial performance improvements, with accuracy increasing by up to 17% for the domain task and 42% for the numeric type task. This work not only explores the potential of LLMs for analyzing XBRL reports but also augments the reliability and robustness of such analysis, although there is still much room for improvement in mathematical calculations.
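The tool-delegation pattern behind XBRL-Agent, offloading domain-term lookups to a retriever and arithmetic to a calculator, can be sketched as below; the glossary contents, routing rule, and function names are hypothetical, not the paper's implementation:

```python
# Sketch of an agent that routes the LLM's known weak spots to external tools:
# XBRL tag definitions go to a retriever, arithmetic goes to a calculator.
import re

GLOSSARY = {  # stand-in for the paper's retriever backend
    "us-gaap:Revenues": "Total revenue recognised from contracts with customers.",
    "us-gaap:Liabilities": "Total obligations owed by the entity.",
}

def retriever_tool(term):
    return GLOSSARY.get(term, "term not found")

def calculator_tool(expression):
    # Evaluate simple arithmetic exactly, instead of trusting LLM mental math.
    if not re.fullmatch(r"[\d\s+\-*/().]+", expression):
        raise ValueError("unsupported expression")
    return eval(expression)  # input restricted to arithmetic characters above

def answer(query):
    # Crude router: an XBRL tag triggers retrieval, a numeric span triggers math.
    tag = re.search(r"us-gaap:\w+", query)
    if tag:
        return retriever_tool(tag.group())
    expr = re.search(r"[\d][\d\s+\-*/().]*", query)
    if expr:
        return calculator_tool(expr.group().strip())
    return "delegate to base LLM"

print(answer("Define us-gaap:Revenues"))
print(answer("What is 1250.5 - 980.2"))
```

A real agent would let the LLM itself emit tool calls rather than using a regex router, but the division of labor is the same.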
The application of generative AI in the preparation of financial reports has significantly improved efficiency and accuracy, but it has also triggered ethical risks such as data privacy, algorithmic bias, and ambiguous responsibility. Based on a technology-policy-organization synergy framework, innovation in corporate ethical review mechanisms needs to focus on the following dimensions. At the technology governance level, federated learning and zero-trust architectures are combined to achieve controllable data security, algorithmic fairness detection tools monitor model biases in real time, and blockchain technology ensures full-process traceability. At the policy compliance level, dynamic hierarchical review standards are established, mainstream international regulatory requirements are integrated, and intelligent systems enable automated analysis of, and compliance adaptation to, global regulatory policies. At the organizational execution level, a multi-level review framework is established, embedding abnormal-decision warning and human intervention mechanisms. Case studies show that this mechanism can effectively reduce data security risks, enhance algorithmic fairness, and strengthen responsibility traceability. Future work should strengthen the integration of cutting-edge technologies, promote the coordination of global ethical standards, and build a people-oriented intelligent governance paradigm.
In recent years, Large Language Models (LLMs) have demonstrated remarkable versatility across various applications, including natural language understanding, domain-specific knowledge tasks, etc. However, applying LLMs to complex, high-stakes domains like finance requires rigorous evaluation to ensure reliability, accuracy, and compliance with industry standards. To address this need, we conduct a comprehensive and comparative study on three state-of-the-art LLMs, GLM-4, Mistral-NeMo, and LLaMA3.1, focusing on their effectiveness in generating automated financial reports. Our primary motivation is to explore how these models can be harnessed within finance, a field demanding precision, contextual relevance, and robustness against erroneous or misleading information. By examining each model's capabilities, we aim to provide an insightful assessment of their strengths and limitations. Our paper offers benchmarks for financial report analysis, encompassing proposed metrics such as ROUGE-1, BERT Score, and LLM Score. We introduce an innovative evaluation framework that integrates both quantitative metrics (e.g., precision, recall) and qualitative analyses (e.g., contextual fit, consistency) to provide a holistic view of each model's output quality. Additionally, we make our financial dataset publicly available, inviting researchers and practitioners to leverage, scrutinize, and enhance our findings through broader community engagement and collaborative improvement. Our dataset is available on huggingface.
Annual Reports of publicly listed companies contain vital information about their financial health, which can help assess the potential impact on the firm's stock price. These reports are comprehensive in nature, running to, and sometimes exceeding, 100 pages. Analysing these reports is cumbersome even for a single firm, let alone the whole universe of firms. Over the years, financial experts have become proficient in extracting valuable information from these documents relatively quickly, but this requires years of practice and experience. This paper aims to simplify the process of assessing Annual Reports across firms by leveraging the capabilities of Large Language Models (LLMs). The insights generated by the LLM are compiled into a quant-styled dataset and augmented with historical stock price data. A machine learning model is then trained with the LLM outputs as features. The walk-forward test results show promising outperformance relative to S&P 500 returns. This paper intends to provide a framework for future work in this direction; to facilitate this, the code has been released as open source.
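The walk-forward test mentioned above trains on an expanding window of past years and evaluates on the following year, so the test period never leaks into training. A minimal sketch, with placeholder years and no actual model:

```python
# Walk-forward evaluation: repeatedly fit on all data up to year t,
# then test on year t+1, with an expanding training window.
def walkforward_splits(years, min_train=3):
    # Yield (train_years, test_year) pairs; training always precedes testing.
    for i in range(min_train, len(years)):
        yield years[:i], years[i]

years = [2018, 2019, 2020, 2021, 2022, 2023]
splits = list(walkforward_splits(years))
for train, test in splits:
    print(f"train on {train[0]}-{train[-1]}, test on {test}")
```

Unlike k-fold cross-validation, this respects temporal order, which is essential when features are derived from dated filings and prices.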
MarketSenseAI is a novel framework for holistic stock analysis which leverages Large Language Models (LLMs) to process financial news, historical prices, company fundamentals and the macroeconomic environment to support decision making in stock analysis and selection. In this paper, we present the latest advancements on MarketSenseAI, driven by rapid technological expansion in LLMs. Through a novel architecture combining Retrieval-Augmented Generation and LLM agents, the framework processes SEC filings and earnings calls, while enriching macroeconomic analysis through systematic processing of diverse institutional reports. We demonstrate a significant improvement in fundamental analysis accuracy over the previous version. Empirical evaluation on S&P 100 stocks over two years (2023-2024) shows MarketSenseAI achieving cumulative returns of 125.9% compared to the index return of 73.5%, while maintaining comparable risk profiles. Further validation on S&P 500 stocks during 2024 demonstrates the framework's scalability, delivering a 33.8% higher Sortino ratio than the market. This work marks a significant advancement in applying LLM technology to financial analysis, offering insights into the robustness of LLM-driven investment strategies.
Financial sentiment analysis is the task of evaluating and quantifying the emotions and opinions expressed in financial news, reports, or social media to help investors and institutions make informed decisions. Financial institutions have been actively exploring the use of large language models (LLMs) to analyse market sentiment signals for a more nuanced understanding of a broader context. However, issues such as the scale of training data, model complexity, and the potential for human oversight can introduce or even amplify bias in these systems. Representation bias is a common challenge for LLMs, as training data fail to properly represent the target groups, hence causing harmful bias in general-purpose use. Therefore, replacing current solutions with LLMs in financial organisations requires a robust evaluation methodology to ensure fairness. This paper investigates a three-level bias evaluation approach that specifically focuses on representation bias and presents a baseline evaluation of the FinBERT model. Step 1 uses a synthetic dataset that explicitly reveals sources of bias, structured as probability- and embedding-based evaluation recipes. Step 2 evaluates the model against data released by another country (e.g. an Indian news dataset) to assess its performance in relation to more implicit biases. Step 3 examines individual problematic samples using token-based interpretability methods (e.g. integrated gradients). This paper presents the application of this structured bias evaluation process and its results on the FinBERT model. The evaluation code and dataset are available on GitHub (https://github.com/asabuncuoglu13/faid-test-financial-sentiment-analysis).
Emerging techniques in computer science make it possible to "brain scan" large language models (LLMs), identify the plain-English concepts that guide their reasoning, and steer them while holding other factors constant. We show that this approach can map LLM-generated economic forecasts to concepts such as sentiment, technical analysis, and timing, and compute their relative importance without reducing performance. We also show that models can be steered to be more or less risk-averse, optimistic, or pessimistic, which allows researchers to correct or simulate biases. The method is transparent, lightweight, and replicable for empirical research in the social sciences.
This study explores the application of retrieval-augmented generation (RAG) to improve the accuracy and reliability of large language models (LLMs) in the context of financial report analysis. The focus is on enabling private investors to make informed decisions by enhancing the question-and-answering capabilities regarding the half-yearly or quarterly financial reports of banks. The study adopts a Design Science Research (DSR) methodology to develop and evaluate an RAG system tailored for this use case. The study conducts a series of experiments to explore models in which different RAG components are used. The aim is to enhance context relevance, answer faithfulness, and answer relevance. The results indicate that model one (OpenAI ADA and OpenAI GPT-4) achieved the highest performance, showing robust accuracy and relevance in response. Model three (MiniLM Embedder and OpenAI GPT-4) scored significantly lower, indicating the importance of high-quality components. The evaluation also revealed that well-structured reports result in better RAG performance than less coherent reports. Qualitative questions received higher scores than the quantitative ones, demonstrating the RAG’s proficiency in handling descriptive data. In conclusion, a tailored RAG can aid investors in providing accurate and contextually relevant information from financial reports, thereby enhancing decision making.
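The core RAG flow evaluated in this study (chunk the report, retrieve the most relevant chunk, prepend it as context for the generator) can be sketched as follows; simple word-overlap scoring stands in for the embedding models compared in the study, and the report text is invented:

```python
# Minimal RAG retrieval sketch: split a report into chunks, pick the chunk
# most relevant to the question, and assemble the generator prompt.
def chunk(text, size=11):
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]

def retrieve(question, chunks):
    # Overlap scoring stands in for ADA/MiniLM-style embedding similarity.
    q = set(question.lower().split())
    return max(chunks, key=lambda c: len(q & set(c.lower().split())))

def build_prompt(question, report):
    context = retrieve(question, chunk(report))
    return f"Context: {context}\nQuestion: {question}\nAnswer:"

report = ("Net interest income rose 8 percent in the first half year. "
          "Operating costs were flat. The bank expects stable margins.")
prompt = build_prompt("How did net interest income change", report)
print(prompt)
```

The study's finding that embedder quality drives answer quality maps directly onto the `retrieve` step: a weak retriever hands the generator the wrong context, and no downstream model can recover from that.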
This paper investigates whether Large Language Models (LLMs) can be used to predict numeric Key Performance Indicators (KPIs) from question-context pairs derived from financial and ESG (Environmental, Social, and Governance) reports. Two modeling strategies were compared: a semantic embedding approach using a pretrained transformer model (all-MiniLM-L6-v2), and a traditional term frequency-inverse document frequency (TF-IDF) vectorization. Both models were trained using Random Forest regressors and evaluated through 5-fold cross-validation. Results indicate that the TF-IDF-based model achieved stronger performance ($\mathrm{R}^{2}=0.59$) than the LLM-based model ($\mathrm{R}^{2}=0.46$), suggesting that classical NLP techniques remain competitive in structured financial text settings. Further analysis revealed that bounded KPI types such as Scores and Percentages were predicted with greater accuracy than unbounded values like revenues and emissions. These findings highlight the importance of aligning model complexity with the structure and semantic variability of financial disclosures. The study contributes to the growing field of AI-driven financial automation by clarifying the limits and strengths of semantic versus lexical modeling for numeric prediction tasks.
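Since the comparison above hinges on TF-IDF remaining competitive with embeddings, a compact sketch of the TF-IDF weighting itself may help; it uses the smoothed IDF variant common in libraries such as scikit-learn, and the toy corpus is illustrative only:

```python
# Stdlib-only TF-IDF: term frequency times smoothed inverse document frequency.
from collections import Counter
from math import log

def tfidf(docs):
    tokenized = [d.lower().split() for d in docs]
    n = len(docs)
    df = Counter(t for doc in tokenized for t in set(doc))
    # Smoothed IDF: log((1 + n) / (1 + df)) + 1 keeps every weight positive.
    idf = {t: log((1 + n) / (1 + df[t])) + 1 for t in df}
    return [{t: c * idf[t] for t, c in Counter(doc).items()} for doc in tokenized]

docs = [
    "revenue increased due to higher sales",
    "emissions decreased due to efficiency",
    "revenue and emissions both reported",
]
vectors = tfidf(docs)
# "due" appears in two documents, so it is down-weighted relative to "sales".
print(vectors[0]["due"], vectors[0]["sales"])
```

The resulting sparse vectors can feed a Random Forest regressor directly, which is the lexical pipeline the study found strongest on structured disclosures.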
No abstract available
Financial arguments play a critical role in shaping investment decisions and public trust in financial institutions. Nevertheless, assessing their quality remains poorly studied in the literature. In this paper, we examine the capabilities of three state-of-the-art LLMs, GPT-4o, Llama 3.1, and Gemma 2, in annotating argument quality within financial communications, using the FinArgQuality dataset. Our contributions are twofold. First, we evaluate the consistency of LLM-generated annotations across multiple runs and benchmark them against human annotations. Second, we introduce an adversarial attack designed to inject gender bias, allowing us to analyse model responses and assess the models' fairness and robustness. Both experiments are conducted across three temperature settings to assess their influence on annotation stability and alignment with human labels. Our findings reveal that LLM-based annotations achieve higher inter-annotator agreement than their human counterparts, though the models still exhibit varying degrees of gender bias. We provide a multifaceted analysis of these outcomes and offer practical recommendations to guide future research toward more reliable, cost-effective, and bias-aware annotation methodologies.
Financial decision-making hinges on the analysis of relevant information embedded in the enormous volume of documents in the financial domain. To address this challenge, we developed FinQAPT, an end-to-end pipeline that streamlines the identification of relevant financial reports based on a query, extracts pertinent context, and leverages Large Language Models (LLMs) to perform downstream tasks. To evaluate the pipeline, we experimented with various techniques to optimize the performance of each module using the FinQA dataset. We introduced a novel clustering-based negative sampling technique to enhance context extraction and a novel prompting method called Dynamic N-shot Prompting to boost the numerical question-answering capabilities of LLMs. At the module level, we achieved state-of-the-art accuracy on FinQA, attaining an accuracy of 80.6%. However, at the pipeline level, we observed decreased performance due to challenges in extracting relevant context from financial reports. We conducted a detailed error analysis of each module and the end-to-end pipeline, pinpointing specific challenges that must be addressed to develop a robust solution for handling complex financial tasks.
The growing adoption of large language models (LLMs) in finance exposes high-stakes decision-making to subtle, underexamined positional biases. The complexity and opacity of modern model architectures compound this risk. We present the first unified framework and benchmark that not only detects and quantifies positional bias in binary financial decisions but also pinpoints its mechanistic origins within open-source Qwen2.5-instruct models (1.5B–14B). Our empirical analysis covers a novel, finance-authentic dataset revealing that positional bias is pervasive, scale-sensitive, and prone to resurfacing under nuanced prompt designs and investment scenarios, with recency and primacy effects revealing new vulnerabilities in risk-laden contexts. Through transparent mechanistic interpretability, we map how and where bias emerges and propagates within the models to deliver actionable, generalizable insights across prompt types and scales. By bridging domain-specific audit with model interpretability, our work provides a new methodological standard for both rigorous bias diagnosis and practical mitigation, establishing essential guidance for responsible and trustworthy deployment of LLMs in financial systems.
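A position-swap probe of the kind such audits rely on can be sketched simply: ask each binary question twice with the option order swapped, and count how often the answer tracks the slot rather than the content. The `biased_model` stub below is invented for illustration and is not the paper's Qwen2.5 setup:

```python
# Positional-bias probe: a position-consistent model picks the same *content*
# under both orderings; a positionally biased one picks the same *slot*.
def biased_model(option_a, option_b):
    # Stub with a primacy bias: always prefers whatever is listed first.
    return option_a

def positional_bias_rate(model, pairs):
    flips = 0
    for x, y in pairs:
        first = model(x, y)
        swapped = model(y, x)
        if first != swapped:  # answer changed when only the order changed
            flips += 1
    return flips / len(pairs)

pairs = [("buy", "sell"), ("hold", "sell"), ("buy", "hold")]
print(positional_bias_rate(biased_model, pairs))
```

The paper goes further by localizing where in the network the slot preference arises; this harness only measures the behavioral symptom.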
Large Language Models (LLMs) are increasingly used in finance for tasks like market analysis, customer support, sentiment analysis, and automated reporting. However, LLMs often inherit and perpetuate biases from their training data, raising concerns about fairness and accuracy in high-stakes financial applications. While other domains such as medicine, law, and education have advanced in identifying, measuring, and reducing bias, finance lacks domain-specific datasets and robust fairness metrics. To address this, we introduce the FinBias dataset, which includes bias-eliciting prompts related to the finance domain, and a comprehensive evaluation framework for publicly available LLMs, including robustness tests against jailbreaking. We also propose a new metric, SAFE (Safety-Adjusted Fairness Evaluation), which penalizes stereotypical and refusal responses while rewarding debiased outputs. Additionally, we present a prompt engineering-based mitigation strategy. Experiments conducted on three publicly available LLMs (Mixtral, Gemma, and LLaMA) demonstrate that these models exhibit significant bias, but that the proposed prompt engineering-based mitigation strategy effectively reduces it. This research provides a practical foundation for the detection, evaluation, and mitigation of bias in financial LLM applications.
This study evaluates the sentiment of financial texts using large language models (LLMs) and empirically determines whether LLMs exhibit company-specific biases in sentiment analysis. Specifically, we examine how general knowledge about firms affects LLMs' sentiment measurement of texts. First, we compare the sentiment scores that LLMs assign to financial texts when the company name is explicitly included in the prompt versus when it is not, and define company-specific bias as the difference between these scores. Next, we construct an economic model to theoretically evaluate the impact of sentiment bias on investor behavior; the model shows how investments driven by biased LLMs, if they become widespread, can distort stock prices. Finally, we conduct an empirical analysis using Japanese financial text data to examine the relationship between firm-specific sentiment bias, corporate characteristics, and stock performance.
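The bias definition above, the score difference between a named and an anonymized prompt, can be sketched with a toy scorer. `sentiment` is a hypothetical lexicon stand-in for an LLM call, and the lexicon entry for the firm name "acme" is a deliberate assumption simulating prior knowledge about the firm leaking into the score.

```python
def sentiment(text):
    # Toy lexicon scorer standing in for an LLM sentiment call; the entry for
    # "acme" simulates firm-level prior knowledge leaking into the score.
    lexicon = {"strong": 0.5, "growth": 0.5, "loss": -0.5, "acme": 0.3}
    return sum(lexicon.get(w, 0.0) for w in text.lower().split())

def company_specific_bias(template, company):
    """Bias = score with the company named minus score with it anonymized."""
    named = sentiment(template.format(firm=company))
    anonymized = sentiment(template.format(firm="the firm"))
    return named - anonymized

bias = company_specific_bias("{firm} reported strong quarterly growth", "Acme")
print(round(bias, 2))  # positive: the name alone shifts the measured sentiment
```

A zero difference across many texts would indicate the model scores the text itself; a systematic nonzero difference is exactly the company-specific bias the study quantifies.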
No abstract available
The task of financial analysis primarily encompasses two key areas: stock trend prediction and the corresponding financial question answering. Machine learning and deep learning (ML&DL) algorithms have been widely applied to stock trend prediction and have made significant progress. However, these methods provide no reasons for their predictions, lacking interpretability and an explicit reasoning process, and they cannot integrate textual information such as financial news or reports. Meanwhile, large language models (LLMs) have remarkable text understanding and generation abilities, but due to the scarcity of financial training datasets and limited integration with real-time knowledge, they still suffer from hallucinations and cannot keep up with the latest information. To tackle these challenges, we first release the AlphaFin datasets, which combine traditional research datasets, real-time financial data, and handwritten chain-of-thought (CoT) data, and which benefit the training of LLMs for financial analysis. We then use the AlphaFin datasets to benchmark a state-of-the-art method, called Stock-Chain, which integrates retrieval-augmented generation (RAG) techniques to tackle the financial analysis task effectively. Extensive experiments demonstrate the effectiveness of our framework on financial analysis.
Crafting a convincing financial market analysis report necessitates a wealth of market information and the expertise of financial analysts, posing a highly challenging task. While large language models (LLMs) have enabled the automated generation of financial market analysis text, they still face issues such as hallucinations, errors in financial knowledge, and insufficient capability to reason about complex financial problems, which limits the quality of the generation. To tackle these shortcomings, we propose a novel task and a retrieval-augmented framework grounded in a financial knowledge graph (FKG). The proposed framework is compatible with commonly used instruction-tuning methods. Experiments demonstrate that our framework, coupled with a small-scale language model fine-tuned with instructions, can significantly enhance the logical consistency and quality of the generated analysis texts, outperforming both large-scale language models and other retrieval-augmented baselines.
Purpose: This study examines the transformation of financial decision-making through the adoption of artificial intelligence, focusing on the shift from conventional AI systems to AI agents and agentic AI. It differentiates between automated analytical tools and autonomous, goal-oriented systems that increasingly assume decision-making authority within financial operations. Design/Methodology/Approach: Employing a qualitative multi-method approach (semi-structured expert interviews, industry report synthesis, in-depth case studies, and a comparative performance evaluation), this research investigates AI agent implementation across SMEs, pharmaceutical analytics, and ERP-integrated corporate finance. Theoretically, it extends foundational models including the Efficient Market Hypothesis (EMH), Behavioral Finance, and the Adaptive Markets Hypothesis (AMH) by embedding the dynamic, learning-driven nature of AI agents into financial decision logic. Findings: The results indicate that AI agents introduce novel forms of informational asymmetry, enhance bias mitigation through adaptive modeling, and give rise to emergent decision structures via multi-agent interactions. These dynamics challenge core assumptions of market rationality and static efficiency. Practically, the study offers a structured framework for AI agent integration, emphasizing explainability, hybrid human-AI governance, and risk-specific safeguards to navigate ethical and regulatory constraints. The proposed conceptual taxonomy and cross-industry implementation roadmap reposition agentic AI as a strategic transformation, reshaping how financial institutions process data, execute judgments, and regulate algorithmic autonomy.
The rapid development of generative AI has brought major changes to the way many sectors operate worldwide. Much research has been done in the financial sector to increase efficiency and reduce errors from human intervention. However, current financial risk analysis relies on manual reviews and conventional machine learning models, which repeatedly fail to process financial risk data adequately. This study investigates how a Retrieval-Augmented Generation (RAG) approach can help Large Language Models (LLMs) generate risk analysis reports for audit reports, extracting detailed information and avoiding the overlooking of small details, a major drawback of earlier systems. The study covers how RAG enhances the financial risk analysis of audit reports using different LLMs, namely GPT-4o, Gemini-1.5-flash, and Llama3.1, and evaluates their performance across multiple metrics, including faithfulness, context precision, context recall, context relevancy, and answer relevance. The findings show that Llama3.1 performs best on the faithfulness of the generated report, with a score of 78.26%. In document and context retrieval, Llama also performed strongly, scoring 79.62% in context precision, 78.26% in context recall, and 86.99% in context relevancy. For the generated report, Llama3.1 scored 37.83% for answer relevancy, and Gemini-1.5-flash scored 58.64% for answer correctness.
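In their simplest set-based form, the retrieval metrics named above can be sketched as below. Evaluation frameworks for RAG typically compute these with LLM judgments over statements, so this is only an illustrative approximation with hypothetical chunk names.

```python
def context_precision(retrieved, relevant):
    """Fraction of retrieved chunks that are actually relevant to the query."""
    return sum(1 for c in retrieved if c in relevant) / len(retrieved)

def context_recall(retrieved, relevant):
    """Fraction of relevant chunks that the retriever actually found."""
    return sum(1 for c in relevant if c in retrieved) / len(relevant)

retrieved = ["going-concern note", "revenue table", "cover page"]
relevant = {"going-concern note", "revenue table", "related-party note"}
print(context_precision(retrieved, relevant))  # 2 of 3 retrieved are relevant
print(context_recall(retrieved, relevant))     # 2 of 3 relevant were found
```

High precision with low recall means the retriever is conservative but misses evidence; the reverse means it floods the LLM's context with noise, which is exactly the trade-off the reported scores summarize.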
Large language models (LLMs) offer strategy researchers powerful tools for annotating text at scale, but treating LLM-generated labels as deterministic overlooks substantial instability. Grounded in content analysis and generalizability theory, we diagnose five variance sources: construct specification, interface effects, model preferences, output extraction, and system-level aggregation. Empirical demonstrations show that minor design choices, such as prompt phrasing and model selection, can shift outcomes by 12-85 percentage points. Such variance threatens not only reproducibility but econometric identification: annotation errors correlated with covariates bias parameter estimates regardless of average accuracy. We develop a variance-aware protocol specifying sampling budgets, aggregation rules, and reporting standards, and delineate scope conditions where LLM annotation should not be used. These contributions transform LLM-based annotation from ad hoc practice into auditable measurement infrastructure.
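One concrete piece of such a variance-aware protocol is aggregating repeated annotations and reporting agreement alongside the majority label, rather than treating a single LLM output as deterministic. A minimal sketch, with hypothetical labels:

```python
from collections import Counter

def aggregate_annotations(labels):
    """Majority label plus agreement rate, so downstream analysis can weight
    or filter low-stability annotations instead of treating them as fixed."""
    counts = Counter(labels)
    label, n = counts.most_common(1)[0]
    return label, n / len(labels)

# Five repeated LLM annotations of the same passage (hypothetical):
runs = ["risk", "risk", "opportunity", "risk", "risk"]
label, agreement = aggregate_annotations(runs)
print(label, agreement)  # risk 0.8
```

Reporting the agreement rate makes annotation instability visible to reviewers instead of burying it inside a single point estimate.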
Traditional supplier risk scoring systems rely on structured data and predetermined rules, often failing to capture emerging threats embedded in unstructured sources such as news media, regulatory filings, and environmental disclosures. A hybrid framework combining Retrieval-Augmented Generation (RAG) with Large Language Models (LLMs) addresses this gap by retrieving contextually relevant documents and synthesizing evidence-based risk evaluations. The framework integrates enterprise resource planning data with external intelligence streams to assess geopolitical, financial, compliance, and sustainability risks across supplier networks. Real-time adaptation capabilities enable procurement organizations to respond dynamically to evolving threat landscapes. Each risk decision includes transparent provenance metadata, ensuring auditability and regulatory compliance. The framework addresses critical challenges, including algorithmic bias, interpretability requirements, and human oversight through structured governance protocols. By combining retrieval precision with generative reasoning capabilities, this system represents a transformative advancement in procurement risk mitigation, delivering proactive intelligence while maintaining enterprise-grade transparency and ethical deployment standards for responsible artificial intelligence integration in supply chain operations.
Faced with the challenges of accelerating growth in unstructured text data and increasing risk concealment in the financial market, this study constructs a financial risk assessment system that combines text mining with large language models (LLMs). The system forms an end-to-end architecture encompassing data, knowledge, models, services, and governance. It collects multi-source text, constructs a risk knowledge graph, and extracts key events and sentiment signals, which are then integrated into an LLM framework of retrieval-augmented generation (RAG) and multi-feature fusion to achieve credit risk prediction and default probability estimation. The experiments rely on the FinBen Lending Club dataset (2024) and include comparative experiments, ablation studies, error analysis, and stability tests. The model outperforms traditional structured models and plain-text models on key evaluation indicators such as F1, MCC, and PR AUC. Under market environment changes, cross-industry migration, and anti-interference scenarios, the model's stability and compliance performance are outstanding. The study designs an intelligent risk identification solution for financial institutions that makes the identification process explainable, traceable, and auditable, with significant theoretical and practical implications for risk governance and decision support in banks, securities, insurance, and regulatory authorities.
Information extraction is a critical technology for intelligent analysis and risk assessment in the dairy financial domain. However, real-world applications face three major challenges: the complexity and diversity of entity-relation types, significant data imbalance, and ambiguity in textual expressions. Traditional methods often fail to capture rare patterns, struggle with vague mentions, and exhibit poor generalization in low-resource settings. To address these issues, we propose a novel framework that integrates large language models (LLMs) with targeted data augmentation and agent-based retrieval-augmented generation (RAG). Our approach builds on the BaiChuan2 model, which is first adapted to the dairy finance domain via secondary pretraining. We introduce a two-stage data augmentation strategy: the first stage uses ChatGPT to generate pseudo-samples for rare types, and the second stage refines model weaknesses based on prediction-guided feedback. These augmented datasets are used to fine-tune the model through prompt-based supervised learning with LoRA. To further enhance robustness, we incorporate an agent-based RAG module for completing vague or underspecified entities by retrieving external contextual knowledge. Extensive experiments demonstrate that our framework achieves state-of-the-art performance, with improved F1+ scores of 0.876 and 0.824 for entity recognition and relation extraction, respectively. The RAG component boosts entity completion accuracy to 0.802 while reducing retrieval latency by over 6x, showcasing both the effectiveness and practicality of our method in real-world dairy financial applications.
Artificial intelligence (AI) has increasingly become integrated into audit work, potentially increasing efficiency, accuracy, and the detection of risk. This study uses bibliometric analysis to explore how academic research in this field has developed from 2000 to 2025. Based on 584 journal articles collected from the Web of Science Core Collection, the paper answers three research questions: (1) How has the number of papers and their geographic distribution changed? (2) Who are the most influential researchers and journals? (3) What are the key research topics? Using the tool VOSviewer, this study conducted co-authorship, co-citation, and keyword co-occurrence analyses. The results show a rapid increase in AI-audit research after 2018, especially in countries like the USA, China, and England. Influential papers mainly focus on AI ethics, audit automation, and human-AI collaboration. This study identifies four main research themes: (1) Emerging technologies and intelligent systems, including the use of blockchain, ChatGPT, and cloud platforms in auditing; (2) Machine learning and analytical techniques, focusing on prediction and data-driven decision-making; (3) Audit quality and organisational impact, which looks at performance and governance improvements; and (4) Ethics, fairness, and AI governance, which discusses risks like bias and lack of transparency in AI tools, often analysed under the concept of explainable AI.
Despite the availability of pro-business policies, many micro, small, and medium-sized enterprises (MSMEs) struggle to effectively utilise these resources due to limited capacity. This study proposes a policy-enterprise matching framework that transforms the ‘enterprises seeking policies’ approach into a ‘policies seeking enterprises’ mechanism. The mechanism uses large language models (LLMs) as a tool for extracting policy restrictive clauses and generating compliance-checking strategies, while machine learning methods are used for enterprise risk prediction. The matching framework broadens the research perspective of policy analysis and risk prediction, and provides government staff with decision support based on risk prediction.
Interpreting LLMs as Credit Risk Classifiers: Do Their Feature Explanations Align with Classical ML?
Large Language Models (LLMs) are increasingly explored as flexible alternatives to classical machine learning models for classification tasks through zero-shot prompting. However, their suitability for structured tabular data remains underexplored, especially in high-stakes financial applications such as financial risk assessment. This study conducts a systematic comparison between zero-shot LLM-based classifiers and LightGBM, a state-of-the-art gradient-boosting model, on a real-world loan default prediction task. We evaluate their predictive performance, analyze feature attributions using SHAP, and assess the reliability of LLM-generated self-explanations. While LLMs are able to identify key financial risk indicators, their feature importance rankings diverge notably from LightGBM, and their self-explanations often fail to align with empirical SHAP attributions. These findings highlight the limitations of LLMs as standalone models for structured financial risk prediction and raise concerns about the trustworthiness of their self-generated explanations. Our results underscore the need for explainability audits, baseline comparisons with interpretable models, and human-in-the-loop oversight when deploying LLMs in risk-sensitive financial environments.
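Divergence between a SHAP-based feature ranking and an LLM's self-reported ranking can be quantified with Spearman's rank correlation. A minimal sketch with hypothetical credit-risk features and ranks, not the study's actual data:

```python
def spearman_rho(rank_a, rank_b):
    """Spearman correlation between two rankings given as feature -> rank maps
    (1 = most important); assumes the same features and no tied ranks."""
    n = len(rank_a)
    d2 = sum((rank_a[f] - rank_b[f]) ** 2 for f in rank_a)
    return 1 - 6 * d2 / (n * (n * n - 1))

# Hypothetical rankings: one from LightGBM SHAP values, one self-reported by an LLM.
shap_rank = {"debt_to_income": 1, "credit_history": 2, "loan_amount": 3, "employment": 4}
llm_rank = {"debt_to_income": 2, "credit_history": 1, "loan_amount": 4, "employment": 3}
print(spearman_rho(shap_rank, llm_rank))  # moderate agreement between the rankings
```

A coefficient near 1 would indicate the LLM's self-explanations track the empirical attributions; values well below 1, as the study reports, flag explanations that cannot be trusted at face value.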
Audit risk identification is a core task in the audit planning stage, and its results directly affect the accuracy and reliability of audit conclusions. Under the traditional audit model, constrained by tight timelines and limited technical means, auditors often have difficulty mining and verifying massive volumes of raw data in a timely and comprehensive manner. However, with the powerful natural language processing capabilities of large language models (LLMs) and the automated execution features of intelligent agents (Agents), auditors are now able to quickly filter and deeply analyze various structured and unstructured data, accurately identify key audit risks, and further improve and systematically construct the audit evidence chain. This paper focuses on analyzing the collaborative working mechanism of LLM and Agent technologies in audit risk identification, designs an intelligent audit risk identification methodology that integrates LLM and Agent, and discusses its application value in multi-source data analysis, evidence chain construction, and other aspects.
The article examines the profound transformation of financial control under the influence of artificial intelligence. The transition from classical auditing to intelligent auditing is presented as a shift not only in technology but also in philosophy: from retrospective verification to predictive and preventive financial governance. Intelligent audit combines machine learning, explainable AI (XAI), large language models (LLMs), and autonomous digital platforms into a holistic ecosystem. Special attention is paid to the Audit-as-a-Service (AaaS) model, which democratizes access to high-quality audit for small and medium-sized enterprises (SMEs). The study explores both opportunities and threats, from real-time anomaly detection and transparency to data security risks, false positives, and ethical dilemmas. Ultimately, intelligent audit is framed as a new social contract of trust, where algorithms and humans collaborate to safeguard not only numbers but also values.
With the increasing complexity of the business environment and the evolution of information disclosure tools, financial distress prediction (FDP) is gradually transforming from structured data-driven approaches to semantic information fusion. Traditional models rely on financial ratios and statistical indicators, which makes it difficult to capture risk propensity in “soft signals” such as management tone and textual metaphors. Large language models (LLMs), with their excellent semantic modeling and inference generation capabilities, provide a new perspective for text-driven FDP systems. This paper systematically reviews the application paths of LLMs in FDP, focusing on variable construction and model construction. Three types of representative text features are summarized: emotional tone, semantic embedding, and generative variables. The modeling analysis covers LLMs as categorical predictive models and their fusion patterns in multimodal integrated systems. In addition, this work points out remaining challenges, such as scarce data labels, non-interpretable models, high system deployment costs, and missing compliance mechanisms, which call for evolution toward credible, transparent, and adaptable intelligent early-warning systems driven by synergistic multidisciplinary effort. This work provides a cutting-edge reference for constructing intelligent risk control systems and developing financial regulatory technology.
The global indirect tax compliance of large-scale digital commerce platforms has become a complex, high-stakes systems problem due to jurisdictional fragmentation, frequently changing regulations, and rapidly expanding heterogeneous product catalogs. Rule-based tax engines, although auditable and deterministic, fail to scale in this situation because their authoring processes are fragile, maintenance is expensive, and their semantic understanding of product data is limited. This article provides a detailed design of an intelligent tax automation system built on machine learning item-to-tax prediction services, supported by confidence-aware orchestration, human-in-the-loop protections, and explainability features appropriate for regulated financial settings. The framework uses transformer-based language models trained on large-scale, multilingual commerce data to predict tax classifications directly from item titles, descriptions, and structured taxonomy cues. Instead of using fixed mappings, the system learns semantic associations between product representations and jurisdiction-specific tax treatments, allowing it to correctly process long-tail, ambiguous, and newly added items. Predictions come with calibrated confidence scores that indicate whether transactions can be safely automated, sent to policy validation, or escalated to expert scrutiny. This selective automation model balances operational efficiency against regulatory risk while preserving compliance integrity at scale. The architecture is deployed as a controlled machine learning system combining continuous monitoring, auditability, and feedback-driven retraining pipelines.
Experience from large-scale deployments shows that such systems can substantially reduce the effort required for manual rule formulation and scrutiny, increase classification accuracy across thousands of categories, and deliver a quantifiable financial effect, without compromising transparency to auditors and other regulatory stakeholders. The article positions intelligent, ML-driven tax automation as a feasible and responsible alternative to legacy rule-based systems in global compliance settings.
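The confidence-aware orchestration described above reduces, at its core, to threshold-based routing of predictions. A minimal sketch; the thresholds and outcome names are illustrative assumptions, not values from the article:

```python
def route_prediction(confidence, automate_at=0.95, validate_at=0.70):
    """Route a tax-classification prediction by its calibrated confidence:
    high confidence is automated, mid confidence goes to policy validation,
    low confidence is escalated to a human expert."""
    if confidence >= automate_at:
        return "automate"
    if confidence >= validate_at:
        return "policy_validation"
    return "expert_review"

for conf in (0.99, 0.80, 0.40):
    print(conf, route_prediction(conf))
```

Tuning the two thresholds is the operational lever: raising `automate_at` trades throughput for lower regulatory risk, which is the efficiency/compliance balance the article emphasizes.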
Purpose This study aims to integrate large language models (LLMs) with interpretable machine learning methods to develop a multimodal data-driven framework for predicting corporate financial fraud, addressing the limitations of traditional approaches in long-text semantic parsing, model interpretability, and multisource data fusion, thereby providing regulatory agencies with intelligent auditing tools. Design/methodology/approach Analyzing 5,304 Chinese listed firms’ annual reports (2015-2020) from the CSMAD database, this study leverages the Doubao LLMs to generate chunked summaries and 256-dimensional semantic vectors, developing textual semantic features. It integrates 19 financial indicators, 11 governance metrics, and linguistic characteristics (tone, readability) with fraud prediction models optimized through a group of Gradient Boosted Decision Tree (GBDT) algorithms. SHAP value analysis in the final model reveals the risk transmission mechanism by quantifying the marginal impacts of financial, governance, and textual features on fraud likelihood. Findings The study found that LLMs effectively distill lengthy annual reports into semantic summaries, while GBDT algorithms (AUC > 0.850) outperform the traditional Logistic Regression model in fraud detection. Multimodal fusion improved performance by 7.4%, with financial, governance, and textual features providing complementary signals. SHAP analysis revealed financial distress, governance conflicts, and narrative patterns (e.g., tone anchoring, semantic thresholds) as key fraud indicators, highlighting managerial intent in report language. Research limitations This study identifies three key limitations: 1) lack of interpretability for semantic features, 2) absence of granular fraud-type differentiation, and 3) unexplored comparative validation with other deep learning methods. Future research will address these gaps to enhance fraud detection precision and model transparency.
Practical implications The developed semantic-enhanced evaluation model provides a quantitative tool for assessing listed companies’ information disclosure quality and enables practical implementation through its derivative real-time monitoring system. This advancement significantly strengthens capital market risk early warning capabilities, offering actionable insights for securities regulation. Originality/value This study presents three key innovations: 1) A novel “chunking-summarization-embedding” framework for efficient semantic compression of lengthy annual reports (30,000 words); 2) Demonstration of LLMs’ superior performance in financial text analysis, outperforming traditional methods by 19.3%; 3) A novel “language-psychology-behavior” triad model for analyzing managerial fraud motives.
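The first stage of the "chunking-summarization-embedding" framework, splitting a long report into overlapping windows before each chunk is summarized and embedded, can be sketched as follows; the chunk and overlap sizes are assumptions, not the study's parameters.

```python
def chunk_words(words, chunk_size=512, overlap=64):
    """Split a tokenized report into overlapping word windows so each chunk
    fits an LLM context and adjacent chunks share boundary context."""
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(words[start:start + chunk_size])
        if start + chunk_size >= len(words):
            break
    return chunks

report = ["w%d" % i for i in range(30000)]  # a 30,000-word annual report
chunks = chunk_words(report)
print(len(chunks))  # 67 overlapping chunks
```

Each chunk would then be summarized and embedded independently, with the per-chunk vectors aggregated into the document-level semantic features fed to the GBDT models.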
In a space such as the financial industry, clear and stringent reporting and auditing are vital for both regulatory adherence and internal governance. Although disclosures and audit records are invaluable for evaluating institutions' risk and compliance stance, the vast majority of such information is textual and unstructured. This poses a formidable challenge for institutions that try to extract timely, reliable, and actionable insights, especially when the work is done manually or with unsophisticated rule-based systems. In the past few years, developments in NLP have provided a tremendous ability to interpret unstructured text at scale, enabling automation in areas that traditionally rely heavily on expert judgment. NLP is particularly suitable for finance applications, where textual analysis must deal with context, domain-specific jargon, temporal patterns, and delicate linguistic cues. This work studies NLP for processing financial risk disclosures and audit trails, providing a systematic and scalable way to detect financial wrongdoing, latent risks, and non-compliance events. We start with an analysis of the linguistic properties of financial disclosures, uncovering important aspects such as tone, modality, and forward-looking statements that are frequently associated with risk perception and market volatility. We leverage techniques such as Named Entity Recognition (NER), sentiment analysis, and topic modelling to illustrate how machine learning-based NLP models can unearth the hidden risk signals encoded in annual reports or regulatory filings.
Concurrently, we consider audit trails: structured logs of user or system activity that, despite their timestamped format, include embedded command-line strings, transactional notes, and system-generated messages that are good candidates for language-based analysis. Through NLP processing such as log tokenization, part-of-speech tagging, parsing, and anomaly detection, the audit data is converted into structured knowledge for real-time monitoring and forensic auditing. The manuscript introduces a hybrid approach integrating rule-based, statistical NLP, and machine-learning techniques for both narrative disclosures and event-ordered logs. We also detail a pipeline design consisting of data ingestion, text pre-processing, feature extraction, model prediction, and visual dashboarding. Experimental results on historical financial disclosures and synthetic audit logs show that the NLP-driven framework can accurately target risk-laden statements, identify anomalous sequences of activities, and categorize text sections according to regulatory relevance. Our results show that the proposed approach outperforms traditional keyword matching and manual review, and is more efficient and interpretable. Applying NLP to financial risk disclosures and audit trails can improve both the timeliness and accuracy of compliance checks while enabling a proactive approach to risk governance. This study is part of an emerging body of work on Regulatory Technology (RegTech), which promotes the use of AI and data to inform regulatory decision-making in finance. Given the morass of regulation and the volume of data institutions must process, NLP is a key enabler of intelligent, automated, and reliable compliance.
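As one illustrative fragment of the anomaly-detection stage, rare event types in an audit log can be flagged by frequency. The real pipeline combines rule-based, statistical, and ML techniques, so this stand-in, with hypothetical event names, only conveys the idea:

```python
from collections import Counter

def flag_rare_events(events, min_count=2):
    """Flag audit-log event types seen fewer than min_count times; rare
    one-off actions are candidates for forensic review."""
    counts = Counter(events)
    return sorted({e for e in events if counts[e] < min_count})

log = ["login", "view_report", "login", "export_all_records",
       "view_report", "login", "delete_audit_trail"]
print(flag_rare_events(log))  # ['delete_audit_trail', 'export_all_records']
```

Frequency-based flags like this would feed the dashboarding stage alongside the sequence-level and linguistic signals described above.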
The advent of Large Language Models (LLMs), such as GPT-4 and BERT, is transforming accounting and finance by enabling intelligent automation, advanced data analysis, and real-time decision support. This paper provides a comprehensive review of recent applications and implications of LLMs in the accounting and finance domains. Specifically, it examines how LLMs enhance financial reporting through automated data processing and narrative generation; improve financial decision-making and forecasting via intelligent analysis and predictive modeling; and increase audit efficiency by enabling compliance checks, anomaly detection, and risk identification. This study contributes to the understanding of how LLMs are reshaping accounting workflows and professional roles, offering a foundation for future research and practical implementation. It also calls for the development of responsible AI governance frameworks to ensure the trustworthy, transparent, and sustainable integration of LLMs in accounting and finance practices.
Financial statement analysis requires students to understand that identical financial data yield different insights depending on the analytical perspective adopted. This paper presents a pedagogical framework that leverages GenAI to teach financial statement analysis through four distinct professional personas: equity analyst, credit analyst, internal manager, and auditor. Building on stakeholder theory (Freeman 2010) and situated learning theory (Lave and Wenger 1991), we develop comprehensive prompt sequences that guide students through persona-specific analytical processes. Using Walmart Inc. and Netflix Inc. as contrasting case studies, a mature retailer versus a growth-oriented technology firm, we demonstrate how each persona prioritizes different metrics, asks different questions, and reaches different conclusions from identical financial statements. The framework provides educators with ready-to-implement prompt hierarchies progressing from basic ratio calculation to complex interpretative analysis. This structured approach addresses documented challenges in accounting and finance education, including the difficulty novice learners face in understanding contextual financial analysis (Grimm and Blazovich 2016) and the need for scalable methods to provide individualized feedback (Warren et al. 2025). The framework contributes to scholarship on technology-enhanced learning by demonstrating how GenAI can scaffold professional reasoning development while maintaining pedagogical rigor.
No abstract available
Public sector organizations now face challenges of fiscal transparency, accountability, and audit reliability. This paper presents a comprehensive framework for deploying Generative AI (GenAI) process agents in public-sector financial Enterprise Resource Planning (ERP) systems, targeting anomaly detection, streamlined auditing, and improved compliance reporting. The research uses mixed methods, combining quantitative analysis of ERP transaction data (p < 0.005) with qualitative interviews of financial administrators. The results show that AI and automation brought significant improvements in error-rate reduction, audit completion time, and compliance accuracy. The framework is built around XAI models, LLMs, and federated learning, deployed ethically under government data governance policies. A synthetic ERP dataset provides empirical evidence that GenAI-driven agents can improve fiscal accountability by 22% and audit efficiency by 35%. The proposed model promises to strengthen transparency in electronic governance systems and support sustainable public finance management.
Financial statements are the cornerstone of many analyses, from loan applications to evidence gathering by legal firms, and they exert significant influence on these institutions' decisions. Streamlining the processing of these statements, whether digital or hard copy, is a pivotal objective for banks and similar firms. This research explores the integration of Optical Character Recognition (OCR) and generative AI for automating the extraction of crucial financial data from bank statement images. Furthermore, we design an architecture that enables generic analysis across multiple types of financial documents by utilizing a classification model tailored to categorize bank statement documents. This facilitates seamless data preparation for subsequent analysis or model training. Emphasizing precision and efficiency, we investigate OCR model architectures designed specifically to enhance text extraction accuracy from low-resolution bank statement images. The study evaluates two OCR model architectures; the FSRCNN-based model performs best, achieving above 93% OCR accuracy. Additionally, we analyze a generative AI-based Q&A chatbot to simplify analysis for novice users.
Generative artificial intelligence (GenAI) is rapidly being embedded into corporate reporting workflows, yet its implications for financial reporting quality and auditability remain insufficiently understood. This paper examines how GenAI models can be used to automate financial reporting narratives—such as Management’s Discussion and Analysis (MD&A) and risk disclosures—and evaluates their effects on disclosure quality, transparency, and assurance. The study employs an experimental mixed‑methods design, comparing human‑authored, GenAI‑generated, and human‑edited GenAI narratives based on a large sample of corporate reports. Text‑analytic techniques (readability indices, sentiment and topic analysis, and red‑flag indicators) are combined with explainable AI methods to assess both the content produced by GenAI and the traceability of underlying decision processes. The findings indicate that GenAI can substantially improve readability and linguistic consistency while reducing boilerplate, but also introduces new risks related to hallucinated details, optimistic bias, and potential masking of earnings‑management signals. Explainability tools partially mitigate these concerns by providing auditable evidence of how inputs shape outputs, yet do not fully resolve issues of accountability and professional scepticism. Overall, the paper contributes empirical evidence and a governance framework for responsibly integrating GenAI into financial reporting and auditing, offering practical guidance for preparers, auditors, and regulators seeking to harness automation without compromising reliability or trust.
No abstract available
This qualitative study investigates the transformative potential of integrating blockchain and generative AI in financial reporting, specifically examining impacts on accuracy, efficiency, and trust. Based on a comprehensive review of literature from 2020 to 2025, this paper synthesizes current academic understanding. The study aims to determine the role of blockchain in ensuring data integrity and auditability, assess AI's capacity for automating processes and enhancing analytical capabilities, and explore the combined impact of these technologies on stakeholder trust. The findings indicate that blockchain's inherent immutability and transparency significantly improve the accuracy of financial data. Simultaneously, generative AI enhances efficiency by automating tasks and providing real-time insights. However, the effect on trust is complex, as blockchain's transparency contrasts with the opacity of certain AI algorithms, underscoring the need for explainable AI (XAI). Agency Theory and the Resource-Based View provide theoretical support for the argument that this integration can improve financial reporting quality and efficiency. The study emphasizes the importance of transparent and well-governed applications to fully realize the benefits of this technological convergence. Policy recommendations include the development of adaptable regulatory frameworks, the promotion of standardization, and investment in education to effectively manage this evolving technological context. IUBAT Review—A Multidisciplinary Academic Journal, 8(2): 239-263
Artificial intelligence (AI) has brought significant changes in many fields, including the audit sector. Generative AI, a type of AI, leverages deep learning models to generate human-like content like images and words. ChatGPT, a type of generative AI, is a language model capable of revolutionizing information technology (IT) audits in support of financial statement (FS) audits. This paper highlights and examines advantages and disadvantages of incorporating ChatGPT into IT audits that support FS audits. ChatGPT offers benefits such as efficiency, precision, and speed, which are vital in the IT audit process. However, it also comes with challenges that need to be addressed. This paper contributes to the growing body of knowledge related to the use of ChatGPT in IT audits that support FS audits. The paper further provides insights into how audit firms can effectively adopt ChatGPT to improve the quality, efficiency, and effectiveness of their IT and FS audits.
This paper presents a case study of end-to-end (E2E) automation of corporate financial expense processing by combining generative AI (GenAI) and intelligent document processing (IDP) technologies with automation agents and shows the automation of intelligent tasks in a modern digital transformation environment. Although conventional RPA is effective in automating repetitive, rule-based, and simple tasks, it has limitations in handling unstructured data, responding to exceptions, and making complex decisions. In this study, we designed and implemented a four-step integration process, including automatic recognition of proofs such as receipts through OCR/IDP, item classification based on policy database, intelligent judgment support for exceptional situations through GenAI (LLMs), and human final decision and system learning (human-in-the-loop) through automation agents. As a result of the application to Company S, a large Korean company, quantitative effects such as reducing the processing time of branch receipt expenses by more than 80%, reducing error rates, and improving compliance rates were confirmed, as well as qualitative effects such as improving work accuracy and consistency, increasing employee satisfaction, and supporting data-based decision-making. In addition, the system learns from human judgment and continuously improves its ability to automatically handle exceptions, creating a virtuous cycle. This study empirically demonstrates that the organic combination of GenAI, IDP, and an automation agent overcomes the limitations of existing automation and is effective in realizing E2E automation of complex corporate tasks. In addition, it suggests the possibility of expansion to various business areas such as accounting, human resources, and purchasing in the future, as well as the development direction of AI-based hyperautomation. 
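The human-in-the-loop stage of the four-step pipeline above can be sketched as a simple routing rule. This is a toy illustration, not Company S's implementation; the confidence threshold and field names are hypothetical.

```python
def route_expense(item, auto_threshold=0.85):
    """Auto-approve confident, policy-clean classifications; escalate the rest
    to a human reviewer whose decision later feeds back into system learning."""
    if item["policy_flag"] or item["confidence"] < auto_threshold:
        return "human_review"
    return "auto_approve"

queue = [
    {"id": 1, "confidence": 0.97, "policy_flag": False},  # routine receipt
    {"id": 2, "confidence": 0.62, "policy_flag": False},  # ambiguous OCR output
    {"id": 3, "confidence": 0.99, "policy_flag": True},   # exceeds policy limit
]
decisions = {e["id"]: route_expense(e) for e in queue}
```

The virtuous cycle the paper describes corresponds to retraining the classifier on the human reviewer's decisions, so fewer items fall below the threshold over time.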
This research delves into integrating Natural Language Processing (NLP) and Generative AI within Enterprise Resource Planning (ERP) systems to bolster fraud prevention measures. The study commences by thoroughly mapping the fraud tree taxonomy with existing ERP applications to identify areas susceptible to fraudulent activities, encompassing corruption, asset misappropriation, and financial statement fraud. The research proposes the utilisation of specific NLP and Generative AI APIs tailored to address these areas, effectively building upon these identified vulnerabilities. By establishing standardised criteria for API development, the research provides a comprehensive roadmap for the accounting and finance profession to adopt and implement these advanced technologies to combat fraud more efficiently. The suggested roadmap encompasses crucial stages, including evaluating organisational requirements, assessing API providers, seamlessly integrating APIs into ERP systems, conducting thorough testing, and establishing robust monitoring and governance mechanisms. The findings underscore the tremendous potential of NLP and Generative AI in fortifying fraud prevention endeavours within ERP systems while highlighting the interdisciplinary nature of this research, which amalgamates insights from ERP systems, fraud detection, NLP, and Generative AI. The study also encourages future empirical investigations to validate and refine the proposed solutions within real-world contexts.
Generative Artificial Intelligence (AI), such as ChatGPT by OpenAI, has revolutionized the business world, with benefits including improved accessibility, efficiency, and cost reduction. This article reviews recent developments of generative AI in business and finance, summarizes its practical applications, provides examples of the latest generative AI tools, and demonstrates that generative AI can revolutionize data analysis in industry and academia. To test the ability of generative AI to support decision-making in financial markets, we use ChatGPT to capture corporate sentiment towards environmental policy by inputting text extracted from corporate financial statements. Our results demonstrate that the sentiment scores generated by ChatGPT can predict firms' risk-management capabilities and stock return performance. This study also highlights the potential challenges and limitations associated with generative AI. Finally, we propose several questions for future research at the intersection of generative AI with business and finance.
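As a stand-in for the paper's ChatGPT-based scoring (which requires API access), a toy lexicon scorer illustrates the shape of a sentiment score over disclosure text; the word lists below are invented for illustration and are not the paper's method.

```python
# Invented, illustration-only word lists; the paper instead prompts ChatGPT.
POSITIVE = {"committed", "sustainable", "renewable", "reduce", "improve"}
NEGATIVE = {"litigation", "penalty", "noncompliance", "risk", "fine"}

def sentiment_score(text):
    """Crude environmental-sentiment score in [-1, 1]:
    (positive hits - negative hits) / total hits, 0.0 when no hits."""
    words = [w.strip(".,;:") for w in text.lower().split()]
    pos = sum(w in POSITIVE for w in words)
    neg = sum(w in NEGATIVE for w in words)
    return 0.0 if pos + neg == 0 else (pos - neg) / (pos + neg)

score = sentiment_score("We are committed to sustainable operations and renewable energy.")  # 1.0
```

An LLM-based scorer replaces the fixed lexicon with contextual judgment, which is exactly what lets it pick up sentiment that keyword lists miss.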
Background: In the domain of corporate governance, the separation of ownership and control generates significant agency conflicts, primarily manifesting as Earnings Management (EM). Traditional reactive auditing methods fail to detect manipulation concealed within unstructured data, leading to high agency costs and diminished stakeholder trust. Objective: This study proposes an "AI Proactive Monitoring Model" utilizing Generative Artificial Intelligence to fundamentally enhance the monitoring mechanisms of Agency Theory. Methods: The research employs a qualitative conceptual framework analysis. It synthesizes Agency Theory with the Technology Acceptance Model (TAM) and Systemic Risk Theory to construct a novel strategic governance model. Results: The proposed model shifts governance from periodic sampling to real-time, continuous analysis of total data populations. By cross-referencing structured financial data with unstructured communications (e.g., emails, contracts), the system generates "Risk Narratives" that contextualize anomalies and flag opportunistic behavior immediately. Conclusion: The integration of AI significantly reduces information asymmetry and moral hazard by creating a "panopticon" effect. However, successful implementation requires distinct regulatory frameworks to manage the systemic risks associated with algorithmic reliance.
Large language models (LLMs) have been quickly adopted in the financial services industry, allowing for sophisticated automation in document analysis, client service, compliance monitoring, and decision support. However, hallucinations, explainability issues, privacy concerns, and restricted access to valuable institutional information limit their use in regulated financial settings. Financial organizations need systems that ensure factual accuracy, traceability, and regulatory compliance, not merely fluent responses. This chapter describes an expert system for FinTech document intelligence based on Retrieval-Augmented Generation (RAG) that combines conventional term-based search with semantic vector retrieval to guarantee dependable and auditable autonomous reasoning. The full architecture is described: the document ingestion pipeline, controlled prompt development, hybrid retrieval mechanism, document structuring approaches, embedding generation, encryption model, and assessment methodology. Practical applications in financial services are discussed in depth, including fraud identification, credit risk assessment, and regulatory adherence. A roadmap for future research and policy development is also presented, along with ethical issues and regulatory harmonization. Thanks to the system's human-in-the-loop validation, financial specialists can examine, confirm, and override AI-generated insights when needed. The platform facilitates regulatory examinations and improves transparency by keeping thorough derivation information and audit records for each generated answer. This approach bridges the gap between the strict governance requirements of real-world economic ecosystems and advanced generative intelligence.
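The hybrid retrieval idea, blending term-based search with vector similarity, can be sketched with bag-of-words vectors. This is a simplified analogue, not the chapter's system: a real pipeline would use BM25 and learned embeddings, and the mixing weight here is arbitrary.

```python
from collections import Counter
from math import sqrt

def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[t] * b[t] for t in a)
    na = sqrt(sum(v * v for v in a.values()))
    nb = sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def hybrid_score(query, doc, alpha=0.5):
    """Blend exact term overlap (stand-in for BM25) with vector similarity
    (stand-in for embedding retrieval); alpha is an arbitrary mixing weight."""
    q, d = query.lower().split(), doc.lower().split()
    overlap = len(set(q) & set(d)) / len(set(q))
    return alpha * overlap + (1 - alpha) * cosine(Counter(q), Counter(d))

docs = [
    "quarterly credit risk assessment for corporate clients",
    "holiday schedule for branch offices",
]
best = max(docs, key=lambda d: hybrid_score("credit risk report", d))
```

Keeping the term-based component alongside the vector component is what makes retrieval auditable: an exact keyword match is easy to explain to an examiner, while embedding similarity alone is not.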
No abstract available
No abstract available
Munshi (2024) describes how ChatGPT is transforming and disrupting the audit profession. Although some studies have examined the use of ChatGPT by auditors, little research has focused on its use by clients in financial statement audits. Given the importance of inquiry in the audit process, erroneous or unsupported client responses generated by ChatGPT could lead to inefficiencies and compromise audit quality. We investigate whether ChatGPT 4.0 can provide plausible responses to audit fieldwork inquiries and explore the implications of clients using AI to answer auditor questions. Our preliminary results show ChatGPT can generate plausible responses to basic inquiries and provides guidance for answering more complex questions. This paper highlights the potential for artificial intelligence to disrupt traditional audit procedures and stresses the need for auditors to adapt inquiry methods to mitigate risks associated with AI-generated responses. These findings offer significant implications for practitioners and present new research avenues for academics.
Auditing is entering an era where generative artificial intelligence (AI) models are increasingly assisting in tasks such as internal control testing. This paper examines the “trust paradox” auditors face when relying on AI systems that can hallucinate, that is, produce plausible yet fabricated information. We combine qualitative and quantitative methods to investigate how auditors use and trust generative AI in evaluating internal controls. Interviews with audit professionals reveal both enthusiasm about AI’s efficiency and deep concern over its reliability. In an experimental simulation, we find that while an AI model can efficiently analyze vast control data and identify issues, it also generates false outputs (hallucinations) that could mislead auditors. A survey of practitioners further shows a cautious approach: most auditors are willing to use AI suggestions only with verification, balancing the benefits of automation against the risk of error. Our analysis highlights that over-reliance on AI without skepticism can undermine audit quality, yet under-utilizing AI forfeits potential improvements. We discuss strategies to resolve this paradox, including maintaining professional skepticism, implementing AI output validation controls, and enhancing model transparency. The study contributes actionable insights for audit firms and standard-setters on integrating generative AI into internal control testing in a responsible, trust-balanced manner.
Subject. The article discusses artificial intelligence (AI) implementation in accounting and auditing and risks that should be taken into account to successfully integrate modern technologies into professional practice. Objectives. The aim is to identify opportunities and risks inherent in the use of artificial intelligence in the automation of accounting and auditing processes, to justify ways to enhance efficiency and minimize associated risks. Methods. The study employs methods of risk systematization and classification, the theoretical analysis of existing scientific works on the use of AI in auditing and accounting. Results. We systematized risks associated with AI implementation in accounting and auditing, developed methodological recommendations for their mitigation, including strategic, legal, technical, and organizational aspects. The paper notes that professional competencies of accountants and auditors should be adapted to the new challenges and requirements arising from the use of AI. Conclusions. The study provides a deeper understanding of risks associated with the use of AI in accounting and auditing, and proposes approaches for their effective management. The developed recommendations have practical significance for organizations implementing AI, and can be used to improve qualifications of professionals in this field.
The use of automation and artificial intelligence (AI) in audit practice is increasingly becoming a major focus, with significant impact on the profession. This research depicts the current landscape of AI use in auditing, highlighting aspects such as automation and empowerment of the audit workforce, the impact of AI on audit quality criteria, key factors in adopting AI-based audit techniques, the impact of AI technology on audit evidence, and auditors' perceptions of AI in improving audit quality. The results and discussion show that while there are great benefits from integrating automation and AI in auditing, including improved audit quality, enhanced efficiency, and the ability to perform continuous audits, there are also challenges that need to be overcome, such as the high cost of customizing AI for industry-specific audit processes. The use of AI in auditing requires auditors to adapt their competencies and workflows to use this technology effectively. However, with proper understanding and careful handling of these challenges, AI has great potential to improve overall audit practice.
Payroll systems must execute jurisdiction-specific calculations, disclosures, and compliance checks under strict regulatory and audit constraints. While Retrieval-Augmented Generation (RAG) improves factual grounding, conventional RAG pipelines lack formal guarantees of correctness, authorization control, and provenance—critical for payroll. We present the RAG-Augmented Payroll Intelligence Stack (RAPIS), a policy-verified architecture that fuses hybrid retrieval with tool-based computation, attribute-based access control (ABAC), formal verification via SMT (Z3), and W3C PROV lineage for auditability. RAPIS introduces PayLang, a declarative DSL for encoding payroll rules that compiles into both authorization policies (OPA/Cedar) and verifiable constraints. Evaluated on Payroll-Bench, a multi-jurisdiction synthetic benchmark based on IRS Pub. 15, HMRC PAYE, EPF/ESI, and EU transparency directives, RAPIS outperforms existing baselines in accuracy, explainability, prompt-injection robustness, and provenance coverage. The results demonstrate that verifiable, policy-bounded reasoning is essential for deploying AI in regulated, high-assurance financial domains.
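The core idea of compiling declarative payroll rules into checkable constraints can be illustrated in plain Python. This is a toy analogue of PayLang, not the paper's DSL: RAPIS discharges such constraints with an SMT solver (Z3), and the rule names and rates below are invented, not IRS or HMRC figures.

```python
# Invented rules standing in for compiled PayLang constraints; a real system
# would hand these to an SMT solver (e.g. Z3) rather than evaluate them directly.
RULES = [
    ("gross_is_sum_of_parts", lambda p: p["gross"] == p["base"] + p["overtime"]),
    ("net_never_exceeds_gross", lambda p: p["net"] <= p["gross"]),
    ("tax_within_bracket_bounds", lambda p: 0 <= p["tax"] <= 0.5 * p["gross"]),
]

def verify(payroll):
    """Return the names of all violated constraints; an empty list means verified."""
    return [name for name, pred in RULES if not pred(payroll)]

ok = {"base": 3000, "overtime": 450, "gross": 3450, "tax": 690, "net": 2760}
bad = {"base": 3000, "overtime": 450, "gross": 3450, "tax": 2000, "net": 2760}
```

The advantage of the SMT formulation over direct evaluation is that the solver can prove a rule holds for all inputs in a jurisdiction, not just for the records seen so far.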
Introduction: the rapid development of the digital economic space in Ukraine has made digitising the accounting system a paramount issue. Ukraine’s integration trend into the European economic sphere has also contributed to this situation. Objectives: this research aims to analyse the current role of digital transformation tools in optimising the accounting system. Method: the study employed general methods of scientific inquiry, including analysis and synthesis, induction and deduction, abstraction, concretisation, and formalisation. Results: the research established that digital accounting transformation is an obligatory optimisation stage in developing the modern business environment. The analysis considered the feasibility and potential of implementing innovative artificial intelligence capabilities in accounting while ensuring adequate security measures. It was concluded that modern digital tools offer opportunities to streamline the collection and aggregation of accounting information through specialised industry software products. The identified risks associated with implementing artificial intelligence technologies into information systems were discussed. Conclusions: the study demonstrated that intensifying the integration of digital technologies into accounting processes can increase managerial decisions’ accuracy and efficiency.
Fin-RAG (Financial Retrieval-Augmented Generation) is an AI-powered chatbot system designed to simplify and accelerate financial data retrieval. Built on Retrieval-Augmented Generation (RAG), it enables natural language querying of financial documents, delivering accurate and context-aware responses in real time. The system supports both text-based and image-based documents, utilizing advanced NLP and image recognition capabilities. Users can extract key insights from balance sheets, profit and loss statements, and scanned invoices effortlessly. Fin-RAG leverages domain-specific embeddings via Hugging Face’s Inference API for precise and relevant search results. Key features include real-time insights, automated reporting, semantic search, and multimodal document analysis. Scalable and compliant, Fin-RAG improves financial decision-making efficiency. It is ideal for auditing, corporate finance, and strategic analysis.
The rapid digitalization of the financial sector has increased the use of artificial intelligence (AI) in operations, compliance, and regulatory reporting. Retrieval-augmented generation (RAG) has emerged as a prominent AI-driven approach that combines retrieval-based and generative models to deliver far better accuracy and efficiency in processing financial documents. Traditional compliance reporting methods are manual, slow, and vulnerable to human error, inviting regulatory scrutiny and monetary penalties. With RAG, financial institutions can automate the extraction of relevant information, summarize large volumes of regulatory text, and maintain real-time compliance with evolving regulations such as IFRS, Basel III, and GDPR. RAG can also support forensic examination and anomalous-pattern detection for fraud, risk, and due diligence work. This paper investigates the role of RAG in automating compliance and reporting processes in financial document processing. It addresses the regulatory compliance challenge, the drawbacks of traditional document processing, and the merits of an AI-based automated approach. A qualitative study of case studies and industry applications supports the proposition that RAG enhances financial workflows through lower manual effort, higher data accuracy, and improved decision-making. The paper also discusses implementation strategies for financial institutions and provides insight into future developments in AI regulation. As the financial industry increasingly embraces AI-powered solutions, RAG offers a transformative opportunity to optimize compliance reporting, strengthen risk mitigation, and drive operational efficiency amid regulatory complexity.
Artificial Intelligence (AI) is rapidly reshaping the field of auditing and assurance by shifting the focus from traditional sampling and manual procedures toward continuous, data-driven examination. Emerging tools such as machine learning, natural language processing, and robotic process automation now enable auditors to evaluate entire datasets, uncover subtle anomalies in real time, and integrate both structured information (ledgers, transactions) and unstructured evidence (contracts, correspondence, digital records). For India, this transformation is particularly significant given the rapid expansion of e-invoicing, digital payment ecosystems, and enterprise resource planning platforms, which generate vast volumes of auditable data. While these advances create opportunities to enhance audit efficiency, fraud detection, and governance insights, they also introduce new challenges related to explainability, ethical use of algorithms, regulatory oversight, and disparities in technology readiness across firms. This paper examines the dual dimensions of opportunity and risk in adopting AI for auditing, develops a conceptual framework linking AI capability to audit quality, and proposes a risk–control matrix for designing “assurance-grade AI.” Policy recommendations highlight the need for strong governance structures, transparent documentation, regulatory clarity, and educational reforms to ensure that AI adoption in India strengthens—not undermines—audit quality and public trust.
This paper explores the transformative impact of artificial intelligence (AI) on modern auditing practices. As the audit profession grapples with increasing volumes of complex financial data, the integration of AI technologies has become essential for enhancing efficiency, improving accuracy, and strengthening fraud detection. The paper traces the evolution of auditing from manual processes to the current era of advanced AI-driven tools, highlighting core technologies such as machine learning algorithms, natural language processing, and robotic process automation. These innovations enable auditors to process and analyze large datasets rapidly, identify anomalies, and extract valuable insights from unstructured information. By automating repetitive tasks, AI allows auditors to concentrate on complex, value-added activities, thereby elevating the quality and reliability of financial reporting. Ultimately, the paper underscores how AI is reshaping the auditing landscape and discusses the implications for practitioners and stakeholders in the financial ecosystem.
This study explores how top auditing firms in Bangalore are using artificial intelligence (AI) to improve audit processes. With growing demands for faster and more accurate audits, AI tools are helping firms automate routine tasks, detect fraud, and assess risks more effectively. The purpose of this research is to understand the current level of AI adoption, the types of tools being used, and the benefits and challenges firms face. A qualitative approach was used, based on secondary data collected from company websites, industry reports, and regulatory documents. Findings show that leading firms like Deloitte, EY, and KPMG are actively investing in AI, using tools such as machine learning, natural language processing, and automation platforms. While AI improves audit quality and efficiency, challenges include high costs, lack of skilled professionals, and unclear regulations. This study provides useful insights for auditors, policymakers, and educators to support responsible and effective AI adoption in the auditing field.
The rapid integration of Artificial Intelligence (AI) technologies, including Machine Learning (ML), Natural Language Processing (NLP), and Robotic Process Automation (RPA), into auditing practices within the assurance services sector has yielded transformative outcomes. This paper presents a comprehensive analysis of this profound impact on auditing, drawing upon contemporary literature and insightful case studies. The research reveals that the incorporation of AI technologies into auditing processes has led to remarkable advancements in efficiency, accuracy, and cost-effectiveness. AI-driven algorithms and automation tools have expedited the detection of anomalies and fraud, improved risk assessment, and enhanced the overall quality of audits. These developments are reshaping the auditing landscape, enabling auditors to focus on high-value tasks, such as strategic advisory services, while leaving routine, repetitive tasks to AI systems. However, alongside these benefits, several challenges have emerged. Ethical considerations surrounding AI-driven decision-making, data privacy concerns, and the need for continuous adaptation to evolving AI technologies pose significant hurdles for auditing professionals. Balancing the advantages of AI with these ethical and practical issues is essential to harnessing its full potential in the auditing field. In conclusion, this research underscores the undeniable impact of AI on auditing practices and highlights the ongoing evolution of the field in the digital age. It emphasizes the need for auditors to adapt, not only in terms of technology adoption but also in their approach to ethical and privacy considerations. Future research and practice in auditing must navigate these challenges to continue reaping the benefits of AI-driven transformations.
Fund misuse poses a significant challenge in financial auditing, necessitating the development of precise and efficient detection strategies. In our study, we have framed the detection of fund misuse as a multi-class classification problem, focusing on a select subset of payment data. This approach, tailored to the specifics of the domain, lays the groundwork for creating robust data analysis techniques within auditing. Our methodology begins with the use of a Large Language Model (LLM) to embed textual data, followed by the introduction of a classification framework that utilizes ensembles of one-class classifiers. Specifically, we transform the multi-class classification problem into a series of one-vs-all binary tasks, employing one-class SVM as the primary classifier. We then propose two viable one-vs-all ensemble methods: the straightforward Maximum Confidence strategy and the more complex stacking technique. For the stacking approach, we construct a three-layer fully connected neural network to serve as a meta-model, integrating the outputs of multiple base classifiers. Our experiments, conducted on two distinct real-life datasets, have validated the effectiveness of our model in pinpointing fund misuse instances. Furthermore, we explore the application of our methodology in financial auditing and demonstrate its adaptability through successful transfer learning across various datasets.
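The Maximum Confidence strategy described above reduces to picking the class whose one-vs-all scorer is most confident. A minimal sketch, with invented scores standing in for the one-class SVM outputs:

```python
def max_confidence(scores):
    """Maximum Confidence ensemble: each class has a one-vs-all one-class
    scorer; predict the class whose scorer reports the highest confidence."""
    return max(scores, key=scores.get)

# Invented decision scores standing in for one-class SVM outputs on one record.
record_scores = {"normal": 0.12, "misuse_travel": 0.81, "misuse_procurement": 0.34}
pred = max_confidence(record_scores)  # "misuse_travel"
```

The stacking alternative replaces this argmax with a learned meta-model (in the paper, a three-layer fully connected network) over the same per-class scores.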
Generative AI has significantly reduced the entry barrier to the domain of AI owing to its ease of use and core capabilities of automation, translation, and intelligent actions in our day-to-day lives. Currently, large language models (LLMs) that power such chatbots are being utilized primarily for their automation capabilities on a limited scope. One major limitation of the currently evolving family of LLMs is hallucinations, wherein inaccurate responses are reported as factual. Hallucinations are primarily caused by biased training data, ambiguous prompts, and inaccurate LLM parameters, and they occur most often when combining mathematical facts with language-based context. In this work we present the three major stages in the journey of designing hallucination-minimized LLM-based solutions that are specialized for the decision makers of the financial domain, namely: prototyping, scaling, and LLM evolution using human feedback. These three stages and the novel data-to-answer generation modules presented in this work are necessary to ensure that Generative AI products are reliable and of high quality to aid key decision-making processes.
Large Language Models (LLMs) have been applied to build several automation and personalized question-answering prototypes so far. However, scaling such prototypes to robust products with minimized hallucinations or fake responses still remains an open challenge, especially in niche, data-table-heavy domains such as financial decision making. In this work, we present a novel Langchain-based framework that transforms data tables into hierarchical textual "data chunks" to enable a wide variety of actionable question answering. First, the user queries are classified by intention, followed by automated retrieval of the most relevant data chunks to generate customized LLM prompts per query. Next, the custom prompts and their responses undergo multi-metric scoring to assess for hallucinations and response confidence. The proposed system is optimized with user-query intention classification, advanced prompting, and data scaling capabilities, and it achieves over 90% confidence scores for a variety of user-query responses ranging over {What, Where, Why, How, predict, trend, anomalies, exceptions} that are crucial for financial decision-making applications. The proposed data-to-answers framework can be extended to other analytical domains such as sales and payroll to ensure optimal hallucination control guardrails.
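The chunk-retrieval-with-confidence pipeline can be sketched under strong simplifying assumptions: a keyword intent router and token-overlap retrieval stand in for the paper's Langchain-based classifier and retriever, and the chunks, intents, and query are invented for the example:

```python
from collections import Counter

# Hypothetical hierarchical "data chunks" derived from a revenue table.
CHUNKS = {
    "revenue_q1": "q1 revenue 120 up 5 percent versus prior quarter",
    "revenue_q2": "q2 revenue 130 up 8 percent versus prior quarter",
    "headcount":  "engineering headcount 450 flat versus prior quarter",
}

# Toy intent routing on the query's leading keyword.
INTENTS = {"what": "lookup", "where": "lookup", "why": "explain",
           "how": "explain", "predict": "forecast", "trend": "trend"}

def classify_intent(query):
    return INTENTS.get(query.lower().split()[0], "lookup")

def retrieve(query):
    """Rank chunks by token overlap and report a crude confidence:
    the best chunk's overlap relative to the query length."""
    q = Counter(query.lower().split())
    scored = []
    for name, text in CHUNKS.items():
        overlap = sum((q & Counter(text.split())).values())
        scored.append((overlap, name))
    scored.sort(reverse=True)
    best_overlap, best = scored[0]
    confidence = best_overlap / max(sum(q.values()), 1)
    return best, confidence

intent = classify_intent("What is q2 revenue")
chunk, conf = retrieve("What is q2 revenue")
print(intent, chunk, conf)
```

A production system would replace the overlap score with embedding similarity and gate low-confidence responses behind the multi-metric hallucination checks the abstract describes.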
Large language models (LLMs) have shown remarkable capabilities across various domains; however, the issue of hallucination poses a significant challenge, particularly in high-stakes areas like finance. This paper provides an empirical examination of hallucination exhibited by LLMs in financial tasks. This study investigates the ability of LLMs to accurately explain financial concepts, retrieve historical stock data, and explore methods for mitigating these hallucinations. The findings reveal that standard LLMs demonstrate substantial hallucination tendencies in financial contexts, highlighting the need for further research to improve their reliability.
Financial institutions of all sizes are increasingly adopting Large Language Models (LLMs) to enhance credit assessments, deliver personalized client advisory services, and automate various language-intensive processes. However, effectively deploying LLMs requires careful management of stringent data governance requirements, heightened demands for interpretability, ethical responsibilities, and rapidly evolving regulatory landscapes. To address these challenges, we introduce a structured six-decision framework specifically designed for the financial sector, guiding organizations systematically from initial feasibility assessments to final deployment strategies. The framework encourages institutions to: (1) evaluate whether an advanced LLM is necessary at all, (2) formalize robust data governance and privacy safeguards, (3) establish targeted risk management mechanisms, (4) integrate ethical considerations early in the development process, (5) justify the initiative's return on investment (ROI) and strategic value, and only then (6) choose the optimal implementation pathway -- open-source versus proprietary, or in-house versus vendor-supported -- aligned with regulatory requirements and operational realities. By linking strategic considerations with practical steps such as pilot testing, maintaining comprehensive audit trails, and conducting ongoing compliance evaluations, this decision framework offers a structured roadmap for responsibly leveraging LLMs. Rather than acting as a rigid, one-size-fits-all solution, it shows how advanced language models can be thoughtfully integrated into existing workflows -- balancing innovation with accountability to uphold stakeholder trust and regulatory integrity.
Financial institutions deploy Large Language Models (LLMs) for reconciliations, regulatory reporting, and client communications, but nondeterministic outputs (output drift) undermine auditability and trust. We quantify drift across five model architectures (7B-120B parameters) on regulated financial tasks, revealing a stark inverse relationship: smaller models (Granite-3-8B, Qwen2.5-7B) achieve 100% output consistency at T=0.0, while GPT-OSS-120B exhibits only 12.5% consistency (95% CI: 3.5-36.0%) regardless of configuration (p<0.0001, Fisher's exact test). This finding challenges conventional assumptions that larger models are universally superior for production deployment. Our contributions include: (i) a finance-calibrated deterministic test harness combining greedy decoding (T=0.0), fixed seeds, and SEC 10-K structure-aware retrieval ordering; (ii) task-specific invariant checking for RAG, JSON, and SQL outputs using finance-calibrated materiality thresholds (plus or minus 5%) and SEC citation validation; (iii) a three-tier model classification system enabling risk-appropriate deployment decisions; and (iv) an audit-ready attestation system with dual-provider validation. We evaluated five models (Qwen2.5-7B via Ollama, Granite-3-8B via IBM watsonx.ai, Llama-3.3-70B, Mistral-Medium-2505, and GPT-OSS-120B) across three regulated financial tasks. Across 480 runs (n=16 per condition), structured tasks (SQL) remain stable even at T=0.2, while RAG tasks show drift (25-75%), revealing task-dependent sensitivity. Cross-provider validation confirms deterministic behavior transfers between local and cloud deployments. We map our framework to Financial Stability Board (FSB), Bank for International Settlements (BIS), and Commodity Futures Trading Commission (CFTC) requirements, demonstrating practical pathways for compliance-ready AI deployments.
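The core drift metric above (the fraction of repeated runs that agree) is straightforward to compute. The sketch below is a generic consistency check, not the paper's harness, and the run outputs are invented:

```python
from collections import Counter

def output_consistency(outputs):
    """Fraction of runs matching the modal output: 1.0 means fully
    deterministic; lower values indicate output drift."""
    modal_count = Counter(outputs).most_common(1)[0][1]
    return modal_count / len(outputs)

# Illustrative: 16 repeated runs of one prompt at T=0.0 (n=16 per
# condition, as in the study).
stable_runs = ["net income rose 4.0%"] * 16

# A drifting model: 15 slightly different phrasings plus one repeat,
# so only 2 of 16 runs agree (12.5% consistency).
drifting_runs = [f"net income rose {4 + i / 10:.1f}%" for i in range(15)]
drifting_runs.append("net income rose 4.0%")

print(output_consistency(stable_runs))    # 1.0
print(output_consistency(drifting_runs))  # 0.125
```

An audit harness would additionally fix seeds, pin retrieval ordering, and apply task-specific invariants (JSON schema, SQL equivalence) rather than exact string match alone.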
No abstract available
For a financial analyst, the question and answer (Q&A) segment of a company financial report is a crucial source of information for analysis and investment decisions. However, extracting valuable insights from the Q&A section has posed considerable challenges: conventional methods such as detailed reading and note-taking lack scalability and are susceptible to human error, while Optical Character Recognition (OCR) and similar techniques struggle to accurately process unstructured transcript text, often missing the subtle linguistic nuances that drive investor decisions. Here, we demonstrate the use of Large Language Models (LLMs) to extract information from earnings report transcripts efficiently, rapidly, and with high accuracy, transforming the extraction process and reducing hallucination by combining a retrieval-augmented generation technique with metadata. We evaluate the outcomes of various LLMs with and without our proposed approach against objective metrics for evaluating Q&A systems, and empirically demonstrate the superiority of our method.
LLM agents struggle with regulatory audit replay: when asked to reproduce a flagged transaction decision with identical inputs, many deployments fail to return consistent results. We introduce the Determinism-Faithfulness Assurance Harness (DFAH), a framework for measuring trajectory determinism, decision determinism, and evidence-conditioned faithfulness in tool-using agents deployed in financial services. Across 4,700+ agentic runs (7 models, 4 providers, 3 financial benchmarks with 50 cases each at T=0.0), we find that decision determinism and task accuracy are not detectably correlated (r = -0.11, 95% CI [-0.49, 0.31], p = 0.63, n = 21 configurations): models can be deterministic without being accurate, and accurate without being deterministic. Because neither metric predicts the other in our sample, both must be measured independently, which is precisely what DFAH provides. Small models (7-20B) achieve near-perfect determinism through rigid pattern matching at the cost of accuracy (20-42%), while frontier models show moderate determinism (50-96%) with variable accuracy. No model achieves both perfect determinism and high accuracy, supporting DFAH's multi-dimensional measurement approach. We provide three financial benchmarks (compliance triage, portfolio constraints, and DataOps exceptions; 50 cases each) together with an open-source stress-test harness. Across these benchmarks and DFAH evaluation settings, Tier 1 models with schema-first architectures achieved determinism levels consistent with audit replay requirements.
The article is devoted to the analysis of modern approaches to forecasting financial trends using artificial intelligence (AI) in the conditions of 2024-2025, when the role of news factors, the speed of information dissemination and natural language processing methods has sharply increased. Key classes of models (machine learning on tabular features, neural network models of time series, transformers, ensembles) and practices for extracting signals from news streams are considered. Special attention is paid to the use of large language models (LLM) and Retrieval-Augmented Generation (RAG) approaches for structuring news, extracting events and building features, as well as the limitations of such solutions: data drift, retraining, interpretation difficulties, risks of LLM “hallucinations” and the need for compliance control. A step-by-step scheme for developing a software solution to support analytics and risk management is proposed: data collection and validation, feature formation, training in walk-forward modes, quality and stability assessment, implementation with MLOps monitoring and routine updates. It is concluded that the greatest applied effect is achieved when using AI as a decision support tool, when forecasts are accompanied by reliability metrics, explanations and scenario analysis.
The hallucination issue is recognized as a fundamental deficiency of large language models (LLMs), especially when applied to fields such as finance, education, and law. Despite the growing concerns, there has been a lack of empirical investigation. In this paper, we provide an empirical examination of LLMs' hallucination behaviors in financial tasks. First, we empirically investigate LLM models' ability to explain financial concepts and terminologies. Second, we assess LLM models' capacity to query historical stock prices. Third, to alleviate the hallucination issue, we evaluate the efficacy of four practical methods: few-shot learning, Decoding by Contrasting Layers (DoLa), Retrieval-Augmented Generation (RAG), and prompt-based tool learning, in which the model generates a query command for an external function. Finally, our major finding is that off-the-shelf LLMs experience serious hallucination behaviors in financial tasks. Therefore, there is an urgent need for research efforts aimed at mitigating LLMs' hallucination.
Financial tasks are pivotal to global economic stability; however, their execution faces challenges including labor-intensive processes, low error tolerance, data fragmentation, and tool limitations. Although large language models (LLMs) have succeeded in various natural language processing tasks and have shown potential in automating workflows through reasoning and contextual understanding, current benchmarks for evaluating LLMs in finance lack sufficient domain-specific data, have simplistic task designs, and use incomplete evaluation frameworks. To address these gaps, this article presents FinMaster, a comprehensive financial benchmark designed to systematically assess the capabilities of LLMs in financial literacy, accounting, auditing, and consulting. Specifically, FinMaster comprises three main modules: i) FinSim, which builds simulators that generate synthetic, privacy-compliant financial data for companies to replicate market dynamics; ii) FinSuite, which provides tasks in core financial domains, spanning 183 tasks of various types and difficulty levels; and iii) FinEval, which develops a unified interface for evaluation. Extensive experiments over state-of-the-art LLMs reveal critical capability gaps in financial reasoning, with accuracy dropping from over 90% on basic tasks to merely 40% on complex scenarios requiring multi-step reasoning. This degradation exhibits the propagation of computational errors: single-metric calculations that initially demonstrate 58% accuracy decrease to 37% in multi-metric scenarios. To the best of our knowledge, FinMaster is the first benchmark that covers full-pipeline financial workflows with challenging tasks. We hope that FinMaster can bridge the gap between research and industry practitioners, driving the adoption of LLMs in real-world financial practices to enhance efficiency and accuracy.
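The reported error propagation (58% single-metric accuracy falling to 37% in multi-metric scenarios) is roughly what a simple, admittedly idealized, independence model predicts. The sketch below is our illustration, not part of FinMaster:

```python
import math

def compounded_accuracy(p_step, n_steps):
    """Assumed independence model: per-step accuracy compounds
    multiplicatively across a multi-step financial calculation."""
    return p_step ** n_steps

def implied_steps(p_single, p_multi):
    """Invert the model: how many independent error-prone steps would
    explain the observed single- vs multi-metric accuracy gap?"""
    return math.log(p_multi) / math.log(p_single)

# Two chained 58%-accurate calculations land near the observed 37%.
print(round(compounded_accuracy(0.58, 2), 3))  # 0.336
print(round(implied_steps(0.58, 0.37), 2))     # ~1.83 implied steps
```

The closeness of the fit is only suggestive, since real calculation errors are correlated rather than independent, but it illustrates why multi-step financial workflows degrade so sharply.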
Large Language Models (LLMs) hold immense promise for revolutionizing financial analysis and decision-making, yet their direct application is often hampered by issues of data hallucination and lack of access to real-time, verifiable financial information. This paper introduces QuantMCP, a novel framework designed to rigorously ground LLMs in financial reality. By leveraging the Model Context Protocol (MCP) for standardized and secure tool invocation, QuantMCP enables LLMs to accurately interface with a diverse array of Python-accessible financial data APIs (e.g., Wind, yfinance). Users can interact via natural language to precisely retrieve up-to-date financial data, thereby overcoming LLM's inherent limitations in factual data recall. More critically, once furnished with this verified, structured data, the LLM's analytical capabilities are unlocked, empowering it to perform sophisticated data interpretation, generate insights, and ultimately support more informed financial decision-making processes. QuantMCP provides a robust, extensible, and secure bridge between conversational AI and the complex world of financial data, aiming to enhance both the reliability and the analytical depth of LLM applications in finance.
Large Language Models (LLMs) have proven their effectiveness in a variety of general Natural Language Processing (NLP) tasks. However, their performance in financial credit assessment tasks has yet to reach its full potential, partly because these tasks require specific financial credit expertise. To address this challenge, we propose the ZiGong model, based on Mistral, which employs multi-task supervised fine-tuning. Furthermore, to address the issue of model hallucination in financial scenarios, we propose a novel data pruning method. Specifically, we employ an agent model to assign scores to training samples, and then integrate the pruned samples with the original data for model training. This approach effectively mitigates hallucinations in large models by refining the training data, ensuring higher reliability in downstream applications. Experimental results demonstrate that our method significantly improves the model's robustness and accuracy in real-world financial scenarios.
In response to the challenges of multimodal data integration, real-time information retrieval, model hallucination, and lack of interpretability in financial stock analysis, this paper proposes an innovative financial analysis framework—FSframe. It aims to address multiple challenges in stock analysis within the financial sector. The framework integrates various technological modules to provide comprehensive and efficient solutions for stock trend prediction and financial question answering tasks. First, FSframe optimizes large language models (LLMs), enhancing their adaptability to financial tasks, and incorporates prompt engineering to mitigate potential hallucination issues during the generation process, thereby improving the accuracy and reliability of the analysis. Secondly, the framework introduces Retrieval-Augmented Generation (RAG) technology, creating a dynamically updated financial knowledge base that enables the model to retrieve and integrate the latest market data, providing real-time external knowledge support for tasks. Furthermore, FSframe adopts a sparse attention mechanism, optimizing the processing efficiency of time-series data by filtering irrelevant information and focusing on key points, while also achieving efficient integration of time-series and textual data. Finally, through its modular design, FSframe organically combines the aforementioned advanced technologies, forming an innovative solution that blends multimodal data processing with real-time analysis, offering strong technical support for intelligent analysis in the financial sector. Validation on large-scale financial datasets (including historical stock prices, financial news, and market announcements) shows that FSframe significantly improves prediction accuracy and real-time responsiveness in stock trend forecasting and financial question answering tasks. 
Experimental results indicate that FSframe offers significant advantages in multimodal data integration, real-time performance, and interpretability, demonstrating excellent task adaptability and addressing the shortcomings of traditional methods. The FSframe framework not only provides an innovative solution for stock analysis in the financial sector but also opens new pathways for the development of intelligent financial technologies.
The emergence of Large Language Models (LLMs) like ChatGPT, Claude, Grok, and Outlook has introduced a novel paradigm in financial forecasting and stock market analysis. This study presents a comparative evaluation of these transformer-based models in the domain of stock interpretation and prediction. Utilizing standardized prompts across five major Indian equities (TCS, Infosys, Wipro, Reliance, and Adani Enterprises), the models were assessed on six key dimensions: factual accuracy, technical validity, analytical coherence, forecast plausibility, language clarity, and hallucination rate. A semi-quantitative rubric was applied, and predictions were benchmarked against historical data from 2018 to 2024. ChatGPT demonstrated the highest consistency and accuracy, with Claude offering strong interpretative insights. Outlook displayed conservative but coherent performance, whereas Grok exhibited the highest frequency of hallucinated content. The study also incorporated Random Forest-based forecasting and technical indicators such as golden or death crosses to align model outputs with empirical trends. The findings highlight the potential and limitations of LLMs in financial contexts, advocating for cautious integration with traditional econometric tools to enhance reliability.
Customer service systems in the financial industry require accurate and efficient question-answering solutions. Traditional methods, such as rule-based chatbots, struggle with complex queries, while Large Language Models (LLMs) face challenges like hallucination and outdated knowledge. This study explores the effectiveness of Retrieval-Augmented Generation (RAG) in answering financial questions using various retrieval methods, including BM25, embedding models, and reranker models. Experimental results show that BM25 is the fastest but less accurate, while the Reranker Model achieves the highest accuracy (0.8933) at a high computational cost. The best balance is found by combining BM25, a Reranker Model, and Recursive Token Chunker, improving accuracy (0.92) while reducing execution time (2427 seconds). This approach enhances AI-driven financial services by providing reliable and up-to-date responses.
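BM25, the fastest retriever in the study, fits in a few lines. The sketch below is a generic BM25 scorer over a toy financial-FAQ corpus; the documents and query are invented, and the reranker stage is omitted:

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Plain Okapi BM25 over whitespace-tokenized docs: term frequency
    saturated by k1, document length normalized by b."""
    tokenized = [d.lower().split() for d in docs]
    N = len(docs)
    avgdl = sum(len(d) for d in tokenized) / N
    df = Counter()
    for d in tokenized:
        df.update(set(d))          # document frequency per term
    scores = []
    for d in tokenized:
        tf = Counter(d)
        s = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            idf = math.log(1 + (N - df[term] + 0.5) / (df[term] + 0.5))
            s += idf * tf[term] * (k1 + 1) / (
                tf[term] + k1 * (1 - b + b * len(d) / avgdl))
        scores.append(s)
    return scores

docs = [
    "how do i reset my online banking password",
    "wire transfer fees for international payments",
    "password reset link expired please request again",
]
scores = bm25_scores("reset password", docs)
best = max(range(len(docs)), key=scores.__getitem__)
print(best, [round(s, 3) for s in scores])
```

In the paper's best-performing configuration, the BM25 candidates would then be re-ordered by a cross-encoder reranker before being passed to the LLM.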
To improve stock trend predictions and support personalized investment decisions, this paper proposes FinArena, a novel Human-Agent collaboration framework. Inspired by the mixture of experts (MoE) approach, FinArena combines multimodal financial data analysis with user interaction. The human module features an interactive interface that captures individual risk preferences, allowing personalized investment strategies. The machine module utilizes a Large Language Model-based (LLM-based) multi-agent system to integrate diverse data sources, such as stock prices, news articles, and financial statements. To address hallucinations in LLMs, FinArena employs the adaptive Retrieval-Augmented Generative (RAG) method for processing unstructured news data. Finally, a universal expert agent makes investment decisions based on the features extracted from multimodal data and investors' individual risk preferences. Extensive experiments show that FinArena surpasses both traditional and state-of-the-art benchmarks in stock trend prediction and yields promising results in trading simulations across various risk profiles. These findings highlight FinArena's potential to enhance investment outcomes by aligning strategic insights with personalized risk considerations.
This research project addresses errors in financial numerical-reasoning question answering (QA) tasks caused by a lack of domain knowledge in finance. Despite recent advances in Large Language Models (LLMs), financial numerical questions remain challenging because they require specific domain knowledge in finance and complex multi-step numeric reasoning. We implement a multi-retriever Retrieval-Augmented Generation (RAG) system to retrieve both external domain knowledge and internal question contexts, and utilize the latest LLM to tackle these tasks. Through comprehensive ablation experiments and error analysis, we find that domain-specific training with the SecBERT encoder significantly contributes to our best neural-symbolic model surpassing the FinQA paper's top model, which serves as our baseline. This suggests the potential superior performance of domain-specific training. Furthermore, our best prompt-based LLM generator achieves state-of-the-art (SOTA) performance with a significant improvement (>7%), yet it still falls below human expert performance. This study highlights the trade-off between hallucination loss and external knowledge gains in smaller models and few-shot examples. For larger models, the gains from external facts typically outweigh the hallucination loss. Finally, our findings confirm the enhanced numerical reasoning capabilities of the latest LLM, optimized for few-shot learning.
Financial transactions are increasingly being handled by automated programs called smart contracts. However, one challenge in the adoption of smart contracts is the presence of vulnerabilities, which can cause significant monetary loss. In 2024, $247.88M was lost in 20 smart contract exploits. According to a recent study, accounting bugs (i.e., incorrect implementations of domain-specific financial models) are the most prevalent type of vulnerability, and are among the most difficult to find, requiring substantial human effort. While Large Language Models (LLMs) have shown promise in identifying these bugs, they often suffer from lack of generalization across vulnerability types, hallucinations, and problems with representing smart contracts in a limited token context space. This paper proposes a hybrid system combining LLMs and rule-based reasoning to detect accounting error vulnerabilities in smart contracts. In particular, it utilizes the understanding capabilities of LLMs to annotate the financial meaning of variables in smart contracts, and employs rule-based reasoning to propagate this information throughout a contract's logic and to validate potential vulnerabilities. To remedy hallucinations, we propose a feedback loop where validation is performed by providing the reasoning trace of vulnerabilities to the LLM for iterative self-reflection. We achieve 75.6% accuracy on the labelling of financial meanings against human annotations. Furthermore, we achieve a recall of 90.5% from running on 23 real-world smart contract projects containing 21 accounting error vulnerabilities. Finally, we apply the automated technique on 8 recent projects, finding 4 known and 2 unknown bugs.
Going beyond simple text processing, financial auditing requires detecting semantic, structural, and numerical inconsistencies across large-scale disclosures. As financial reports are filed in XBRL, a structured XML format governed by accounting standards, auditing becomes a structured information extraction and reasoning problem involving concept alignment, taxonomy-defined relations, and cross-document consistency. Although large language models (LLMs) show promise on isolated financial tasks, their capability in professional-grade auditing remains unclear. We introduce FinAuditing, a taxonomy-aligned, structure-aware benchmark built from real XBRL filings. It contains 1,102 annotated instances averaging over 33k tokens and defines three tasks: Financial Semantic Matching (FinSM), Financial Relationship Extraction (FinRE), and Financial Mathematical Reasoning (FinMR). Evaluations of 13 state-of-the-art LLMs reveal substantial gaps in concept retrieval, taxonomy-aware relation modeling, and consistent cross-document reasoning. These findings highlight the need for realistic, structure-aware benchmarks. We release the evaluation code at https://github.com/The-FinAI/FinAuditing and the dataset at https://huggingface.co/collections/TheFinAI/finauditing. The task currently serves as the official benchmark of an ongoing public evaluation contest at https://open-finance-lab.github.io/SecureFinAI_Contest_2026/.
This study offers an in-depth analysis of the next-generation intelligent audit concept—a financial control system that integrates artificial intelligence (AI), autonomous digital platforms, Explainable AI (XAI), and large language models (LLM). An innovative approach to implementing Audit-as-a-Service (AaaS), specifically tailored to the needs of small and medium-sized enterprises (SMEs), is described. The focus is placed on platform architecture, AI model transparency, personalized analytics, as well as automation of compliance processes and enhancing the strategic value of managerial decision-making.
Accurately extracting and representing the structure of tabular data from financial documents remains a critical challenge in document understanding, particularly for regulatory and analytical use cases. This study addresses the complexity of converting financial tables from Malaysian audited financial reports into Markdown format, a task complicated by rotated layouts, multi-level headers, and implicit structural cues. We propose a fine-tuned vision-language model (VLM), based on Qwen2.5-VL-7B, optimized for high-fidelity Markdown generation from document images. Our approach includes a curated dataset of 2,152 image-text pairs with augmentations and a supervised fine-tuning strategy using LoRA. To assess performance, we evaluated our model on 100 out-of-sample tables using a dual framework: a criteria-based LLM-as-a-judge for fine-grained accuracy and our novel Markdown Tree-Edit-Distance-based Similarity (TEDS) metric for holistic structural fidelity. Our model achieves a 92.20% overall accuracy on the criteria-based assessment and a 96.53% Markdown TEDS score. This performance significantly surpasses its Qwen2.5-VL-7B base model, larger-scale VLMs, and specialized reasoning-enabled models. Compared to these self-hosted alternatives, it also significantly reduces inference time. Furthermore, its accuracy exceeds that of widely used proprietary models such as OpenAI's GPT-4o and Gemini 2.5 Flash. These results demonstrate that domain-specific fine-tuning provides an effective and efficient method to bridge the gap between unstructured financial documents and downstream automation, rivalling much larger and more general models without their computational overhead.
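The paper's Markdown TEDS metric computes tree edit distance over parsed table structure. As a rough, much simpler proxy (our illustration, not the authors' metric), one can compare flattened (row, cell) sequences, so that structural moves and not just text edits lower the similarity:

```python
from difflib import SequenceMatcher

def markdown_cells(table):
    """Flatten a Markdown table into a (row_index, cell_text) sequence,
    skipping the header separator row."""
    cells = []
    for r, line in enumerate(table.strip().splitlines()):
        if set(line.replace("|", "").strip()) <= set("-: "):
            continue  # separator like |---|---|
        for cell in line.strip("|").split("|"):
            cells.append((r, cell.strip()))
    return cells

def table_similarity(pred, gold):
    """Crude stand-in for Markdown TEDS: sequence-similarity ratio over
    flattened (row, cell) pairs, in [0, 1]."""
    return SequenceMatcher(None, markdown_cells(pred),
                           markdown_cells(gold)).ratio()

gold = "| Item | 2023 |\n|---|---|\n| Revenue | 500 |"
pred = "| Item | 2023 |\n|---|---|\n| Revenue | 505 |"

print(table_similarity(gold, gold))  # identical tables score 1.0
print(table_similarity(pred, gold))  # one wrong cell lowers the score
```

A faithful TEDS implementation would instead build header/row trees and apply a true tree edit distance; this flat proxy only conveys the idea of scoring structure and content jointly.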
The traditional invoice issuance process within tax administration is labor-intensive and prone to errors, necessitating a shift towards digitalization. Despite the advent of digital invoicing systems that streamline invoice generation and automate rule-based audits, integration with existing financial accounting systems remains a challenge. Particularly in the hospitality and bookkeeping sectors, the adoption of these systems is hindered by the lack of standardized software, high costs, and the absence of technical expertise among small and micro enterprises. The integration of digital invoicing systems with diverse financial software presents significant barriers to uniform adaptation. Furthermore, the complexity of tax regulations and the dynamic nature of tax categories require advanced understanding beyond the capabilities of standard Large Language Models (LLMs). The need for a specialized system that can comprehend finance and tax contexts, securely handle sensitive information, and adapt to user interactions is paramount. This paper introduces an autonomous agent based on a finance and tax-specific Large Language Model (LLM) designed to address the aforementioned challenges. The system includes a Specialized Training Framework to enhance domain comprehension, a Hierarchical Memory Architecture for dynamic user interaction, and a Tax Domain Security Module to ensure compliance with tax regulations. The proposed agent aims to improve the efficiency and accuracy of the invoice issuance process, providing a robust solution for tax administration in the digital era.
Multi-step symbolic reasoning is essential for robust financial analysis; yet, current benchmarks largely overlook this capability. Existing datasets such as FinQA and ConvFinQA emphasize final numerical answers while neglecting the intermediate reasoning required for transparency and verification. To address this gap, we introduce FINCHAIN, the first benchmark specifically designed for verifiable Chain-of-Thought (CoT) evaluation in finance. FINCHAIN spans 58 topics across 12 financial domains, each represented by parameterized symbolic templates with executable Python traces that enable fully machine-verifiable reasoning and scalable, contamination-free data generation. To assess reasoning capacity, we propose CHAINEVAL, a dynamic alignment measure that jointly evaluates both the final-answer correctness and the step-level reasoning consistency. Our evaluation of 26 leading LLMs reveals that even frontier proprietary LLMs exhibit clear limitations in symbolic financial reasoning, while domain-adapted and math-enhanced fine-tuned models can substantially narrow this gap. Overall, FINCHAIN exposes persistent weaknesses in multi-step financial reasoning and provides a foundation for developing trustworthy, interpretable, and verifiable financial AI.
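A CHAINEVAL-style joint score can be sketched as a weighted blend of final-answer correctness and step-level agreement with the executable gold trace; the equal weighting and numeric tolerance below are our assumptions, not the paper's definition:

```python
def chain_score(pred_steps, gold_steps, pred_answer, gold_answer,
                alpha=0.5, tol=1e-6):
    """Blend final-answer correctness with the fraction of intermediate
    step values matching the executable gold trace (assumed weighting)."""
    matched = sum(1 for p, g in zip(pred_steps, gold_steps)
                  if abs(p - g) <= tol)
    step_consistency = matched / max(len(gold_steps), 1)
    final_correct = float(abs(pred_answer - gold_answer) <= tol)
    return alpha * final_correct + (1 - alpha) * step_consistency

# Illustrative compound-interest trace: 1000 * 1.05**2.
gold_steps = [1000 * 1.05, 1000 * 1.05 * 1.05]
pred_steps = [1050.0, 1102.0]  # second intermediate step is wrong
score = chain_score(pred_steps, gold_steps, 1102.5, 1102.5)
print(score)  # right answer, half the trace wrong -> penalized
```

The point of such a measure, as the abstract argues, is exactly this case: a model that lands on the right number through a flawed trace should score below one whose reasoning is verifiable end to end.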
Corporate credit rating (CCR) plays a crucial role in maintaining the safety and stability of financial markets. Existing methods mainly focus on financial data (FD) while overlooking the rich business information contained in non-financial data (NFD) (such as corporate annual reports). Additionally, when processing long NFD documents, there is often the challenge of losing contextual information. To address these issues, this paper proposes an innovative multi-source fusion CCR framework—GLCM, which effectively integrates structured information from FD and qualitative information from NFD. Experimental results on the C2R2 dataset show that our framework achieves an F1 of 0.840, outperforming the strongest baseline by +9.1 points.
This study investigates the application of large language models (LLMs) in sentiment analysis of financial news and their use in developing effective investment strategies. We conducted sentiment analysis on news articles related to the top 30 companies listed on Nasdaq using both discriminative models such as BERT and FinBERT, and generative models including Llama 3.1, Mistral, and Gemma 2. To enhance the robustness of the analysis, advanced prompting techniques—such as Chain of Thought (CoT), Super In-Context Learning (SuperICL), and Bootstrapping—were applied to generative LLMs. The results demonstrate that long strategies generally yield superior portfolio performance compared to short and long–short strategies. Notably, generative LLMs outperformed discriminative models in this context. We also found that the application of SuperICL to generative LLMs led to significant performance improvements, with further enhancements noted when both SuperICL and Bootstrapping were applied together. These findings highlight the profitability and stability of the proposed approach. Additionally, this study examines the explainability of LLMs by identifying critical data considerations and potential risks associated with their use. The research highlights the potential of integrating LLMs into financial strategy development to provide a data-driven foundation for informed decision-making in financial markets.
As Large Language Models (LLMs) become more pervasive, their capability to generate convincing financial news poses an escalating threat to investor decision-making and market stability. However, contemporary content moderation and AI-based verification systems exhibit notable vulnerabilities when confronted with the subtle linguistic manipulations introduced by advanced prompt engineering techniques and adversarial training. This study investigated the comparative credibility, influence, and detectability of AI-generated financial headlines produced via Zero-Shot, Few-Shot (8-Shot), and Chain-of-Thought (CoT) prompting, with CoT outputs further used to train a GAN for adversarially enhanced text generation. We compiled a combined dataset of NASDAQ-listed securities and web-scraped, human-authored news, generated additional AI-driven headlines under three prompting paradigms, and conducted a survey of randomly sampled headlines (n = 300) to assess credibility, market-perception impact, investment influence, and AI detectability. The analysis revealed that headlines generated through Chain-of-Thought prompting consistently scored higher in perceived authenticity, influenced investment sentiment more profoundly, and were harder for participants to classify as AI-written. The findings underscore the urgent need for adversarially robust content moderation and verification mechanisms capable of adapting to the rapidly evolving landscape of AI-generated financial misinformation, particularly when Chain-of-Thought reasoning is leveraged to enhance GAN-generated content.
Within the complex manifold of financial linguistics, robust sentiment analysis demands domain-specific language modeling and transparent reasoning mechanisms—yet conventional pre-trained architectures frequently demonstrate inadequate performance when confronted with the subtle complexities of market discourse. We present herein a parameter-efficient framework constructed upon a compact nanoLLaVA architecture to address fine-grained financial sentiment classification with mathematical precision. Our methodology incorporates chain-of-thought prompting to extract sequential logical reasoning and employs direct preference optimization to harmonize model outputs with expertly curated exemplars. Experimental evidence from the StockEmotions corpus demonstrates that this convergence of CoT prompting and DPO substantially enhances both accuracy metrics and logical coherence. Our approach yields significant performance improvements over existing models. These empirical findings confirm that domain-specialized alignment combined with explicit reasoning pathways constitutes a fundamental requirement for reliable and interpretable sentiment analysis within algorithmic trading systems.
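As a rough illustration of the chain-of-thought side of this setup, a CoT classification prompt can be assembled as below; the step wording and label set are invented for the example, not taken from the paper.

```python
def build_cot_prompt(text, labels=("bullish", "bearish", "neutral")):
    """Assemble a chain-of-thought prompt that asks the model to reason
    in explicit steps before committing to one sentiment label."""
    steps = [
        "1. Identify the financial entities and events mentioned.",
        "2. Judge how each event shifts investor expectations.",
        "3. Weigh conflicting signals and pick the dominant one.",
        f"4. Answer with exactly one label from {list(labels)}.",
    ]
    return ("You are a financial sentiment analyst.\n"
            f"Text: {text}\n"
            "Think step by step:\n" + "\n".join(steps))

prompt = build_cot_prompt("Q3 revenue beat estimates but guidance was cut.")
print(prompt)
```

The DPO stage would then prefer completions whose intermediate steps match expert-curated reasoning, which this sketch does not cover.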
Financial report question answering (FRQA) presents unique challenges due to the need for precise numerical reasoning, complex table structures, and multi-table associations. Existing approaches often overlook the domain-specific complexities of financial reports and struggle with accurate numerical computation, leading to suboptimal performance in real-world financial intelligence applications. In this study, we propose FinQA-PKD, a framework designed to mitigate these challenges through a novel integration of progressive knowledge distillation and numerical reasoning enhancement. Our method introduces a difficulty-aware curriculum learning strategy that organizes training into two progressive stages, facilitating more effective and stable model learning. To address the limitations of large language models in numerical reasoning, we develop a numerical reasoning enhancement module that automatically decomposes calculation chains, augments numerical tokens, and validates results using a financial formula library. Furthermore, we implement a domain-adaptive selective knowledge distillation strategy, which evaluates teacher model outputs based on numerical accuracy, calculation correctness, and terminology precision, and selectively distills knowledge from high-quality samples. Experimental results on benchmark datasets demonstrate that FinQA-PKD improves numerical and calculation accuracy, achieving competitive performance with reduced computational resources. This framework provides a robust and efficient solution for answering financial report questions in practical financial analysis scenarios.
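The result-validation idea, recomputing a claimed number from a formula library and flagging mismatches, can be sketched as follows; the two formulas and the tolerance are hypothetical stand-ins for the paper's financial formula library.

```python
# Hypothetical mini formula library: name -> (ordered inputs, function).
FORMULAS = {
    "gross_margin": (("revenue", "cogs"),
                     lambda revenue, cogs: (revenue - cogs) / revenue),
    "current_ratio": (("current_assets", "current_liabilities"),
                      lambda assets, liabilities: assets / liabilities),
}

def validate_answer(formula, inputs, claimed, tol=1e-4):
    """Recompute the claimed value from the formula library and report
    whether it matches within tolerance, plus the recomputed value."""
    names, fn = FORMULAS[formula]
    expected = fn(*(inputs[n] for n in names))
    return abs(expected - claimed) <= tol, expected

ok, expected = validate_answer(
    "gross_margin", {"revenue": 500.0, "cogs": 300.0}, claimed=0.4)
print(ok, expected)  # True 0.4
```

A validator of this shape can sit at the end of a decomposed calculation chain and reject answers whose final number does not reproduce.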
In this paper, we extend financial sentiment analysis (FSA) to the event level, since events usually serve as the subject of sentiment in financial text. Though extracting events from financial text may be conducive to accurate sentiment predictions, it poses specialized challenges due to the length and discontinuity of events in a financial text. To this end, we reconceptualize event extraction as a classification task by designing a categorization comprising coarse-grained and fine-grained event categories. Under this setting, we formulate the Event-Level Financial Sentiment Analysis (EFSA for short) task, which outputs quintuples consisting of (company, industry, coarse-grained event, fine-grained event, sentiment) from financial text. A large-scale Chinese dataset containing 12,160 news articles and 13,725 quintuples is released as a brand-new testbed for our task. A four-hop Chain-of-Thought LLM-based approach is devised for this task. Systematic investigations are conducted on our dataset; the empirical results provide benchmark scores for existing methods and show that our proposed method reaches the current state of the art. Our dataset and framework implementation are available at https://anonymous.4open.science/r/EFSA-645E
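A minimal parser for the task's quintuple output might look like this; the pipe-delimited serialization and the sample values are assumptions for illustration, not the dataset's actual format.

```python
import re

# One capture group per quintuple slot, pipe-delimited inside parentheses.
QUINTUPLE = re.compile(
    r"\(\s*([^|]+?)\s*\|\s*([^|]+?)\s*\|\s*([^|]+?)\s*\|"
    r"\s*([^|]+?)\s*\|\s*([^|)]+?)\s*\)")

def parse_quintuple(text):
    """Extract a (company, industry, coarse event, fine event, sentiment)
    tuple from a pipe-delimited model answer; None if malformed."""
    m = QUINTUPLE.match(text)
    return tuple(m.groups()) if m else None

q = parse_quintuple(
    "(ACME Corp | Semiconductors | Earnings | Profit warning | negative)")
print(q)
```

Returning None on malformed output lets an evaluation harness count unparseable generations explicitly instead of crashing.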
As financial institutions and professionals increasingly incorporate Large Language Models (LLMs) into their workflows, substantial barriers, including proprietary data and specialized knowledge, persist between the finance sector and the AI community. These challenges impede the AI community's ability to enhance financial tasks effectively. Acknowledging financial analysis's critical role, we aim to devise financial-specialized LLM-based toolchains and democratize access to them through open-source initiatives, promoting wider AI adoption in financial decision-making. In this paper, we introduce FinRobot, a novel open-source AI agent platform supporting multiple financially specialized AI agents, each powered by an LLM. Specifically, the platform consists of four major layers: 1) the Financial AI Agents layer, which formulates a Financial Chain-of-Thought (CoT) by breaking sophisticated financial problems down into logical sequences; 2) the Financial LLM Algorithms layer, which dynamically configures appropriate model application strategies for specific tasks; 3) the LLMOps and DataOps layer, which produces accurate models by applying training/fine-tuning techniques and using task-relevant data; and 4) the Multi-source LLM Foundation Models layer, which integrates various LLMs and enables the above layers to access them directly. Finally, FinRobot provides hands-on access for both professional-grade analysts and laypersons to utilize powerful AI techniques for advanced financial analysis. We open-source FinRobot at https://github.com/AI4Finance-Foundation/FinRobot.
Financial narratives from U.S. Securities and Exchange Commission (SEC) filings and quarterly earnings call transcripts (ECTs) are very important for investors, auditors, and regulators. However, their length, financial jargon, and nuanced language make fine-grained analysis difficult. Prior sentiment analysis in the financial domain required large, expensive labeled datasets, making sentence-level stance detection towards specific financial targets challenging. In this work, we introduce a sentence-level corpus for stance detection focused on three core financial metrics: debt, earnings per share (EPS), and sales. The sentences were extracted from Form 10-K annual reports and ECTs, and labeled for stance (positive, negative, neutral) using the advanced ChatGPT-o3-pro model under rigorous human validation. Using this corpus, we conduct a systematic evaluation of modern large language models (LLMs) using zero-shot, few-shot, and Chain-of-Thought (CoT) prompting strategies. Our results show that few-shot prompting with CoT performs best, surpassing supervised baselines, and that LLMs' performance varies across the SEC and ECT datasets. Our findings highlight the practical viability of leveraging LLMs for target-specific stance detection in the financial domain without requiring extensive labeled data.
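The few-shot-with-CoT setup the evaluation favors can be mocked up as a prompt builder; the exemplar sentences, labels, and phrasing below are invented, not drawn from the corpus.

```python
# Invented few-shot exemplars; the real corpus and its labels differ.
FEW_SHOT = [
    ("Sales grew 12% year over year.", "sales", "positive"),
    ("EPS declined due to one-time charges.", "EPS", "negative"),
]

def build_stance_prompt(sentence, target):
    """Few-shot + CoT prompt for target-specific stance detection
    (positive / negative / neutral) on a financial metric."""
    parts = ["Classify the stance toward the target financial metric."]
    for ex_sentence, ex_target, ex_label in FEW_SHOT:
        parts.append(f"Sentence: {ex_sentence}\nTarget: {ex_target}\n"
                     f"Reason step by step, then answer: {ex_label}")
    parts.append(f"Sentence: {sentence}\nTarget: {target}\n"
                 "Reason step by step, then answer:")
    return "\n\n".join(parts)

p = build_stance_prompt("Total debt was reduced by $2.1 billion.", "debt")
print(p)
```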
This paper introduces an open-source framework designed to facilitate the development and deployment of Large Language Model (LLM)-orchestrated agents for financial applications. The framework addresses challenges in integrating LLMs into finance by providing a layered architecture that supports the creation of specialized agents and incorporates a novel Financial Chain-of-Thought (CoT) prompting technique. The platform's design emphasizes modularity, multi-source LLM integration, and efficient data handling to enhance financial analysis workflows.
Large Language Models (LLMs) are increasingly deployed in sensitive domains such as healthcare, finance, and law, yet their integration raises pressing concerns around trust, accountability, and reliability. This paper explores adaptive trust metrics for multi-LLM ecosystems, proposing a framework for quantifying and improving model reliability under regulated constraints. By analyzing system behaviors, evaluating uncertainty across multiple LLMs, and implementing dynamic monitoring pipelines, the study demonstrates practical pathways for operational trustworthiness. Case studies from financial compliance and healthcare diagnostics illustrate the applicability of adaptive trust metrics in real-world settings. The findings position adaptive trust measurement as a foundational enabler for safe and scalable AI adoption in regulated industries.
Financial news is essential for accurate market prediction, but evolving narratives across macroeconomic regimes introduce semantic and causal drift that weakens model reliability. We present an evaluation framework to quantify robustness in financial NLP under regime shifts. The framework defines four metrics: (1) Financial Causal Attribution Score (FCAS) for alignment with causal cues, (2) Patent Cliff Sensitivity (PCS) for sensitivity to semantic perturbations, (3) Temporal Semantic Volatility (TSV) for drift in latent text representations, and (4) NLI-based Logical Consistency Score (NLICS) for entailment coherence. Applied to LSTM and Transformer models across four economic periods (pre-COVID, COVID, post-COVID, and rate hike), the metrics reveal performance degradation during crises. Semantic volatility and Jensen-Shannon divergence correlate with prediction error. Transformers are more affected by drift, while feature-enhanced variants improve generalisation. A GPT-4 case study confirms that alignment-aware models better preserve causal and logical consistency. The framework supports auditability, stress testing, and adaptive retraining in financial AI systems.
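The Jensen-Shannon divergence used as a drift signal can be computed directly; the two toy distributions below stand in for histograms of text representations from different regimes and are not the paper's data.

```python
import math

def js_divergence(p, q):
    """Jensen-Shannon divergence (base 2) between two discrete
    distributions; 0 for identical inputs, bounded above by 1."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    def kl(a, b):
        return sum(ai * math.log2(ai / bi) for ai, bi in zip(a, b) if ai > 0)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

pre_covid = [0.5, 0.3, 0.2]   # toy topic distribution, regime 1
covid     = [0.2, 0.3, 0.5]   # toy topic distribution, regime 2
drift = js_divergence(pre_covid, covid)
print(round(drift, 4))
```

Because the measure is symmetric and bounded, it is convenient for comparing drift magnitudes across regime pairs.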
Large language models excel in general tasks, yet assessing their reliability in logic-heavy, precision-critical domains like finance, law, and healthcare remains challenging. To address this, we introduce BizFinBench, the first benchmark specifically designed to evaluate LLMs in real-world financial applications. BizFinBench consists of 6,781 well-annotated queries in Chinese, spanning five dimensions: numerical calculation, reasoning, information extraction, prediction recognition, and knowledge-based question answering, grouped into nine fine-grained categories. The benchmark includes both objective and subjective metrics. We also introduce IteraJudge, a novel LLM evaluation method that reduces bias when LLMs serve as evaluators in objective metrics. We benchmark 25 models, including both proprietary and open-source systems. Extensive experiments show that no model dominates across all tasks. Our evaluation reveals distinct capability patterns: (1) In Numerical Calculation, Claude-3.5-Sonnet (63.18) and DeepSeek-R1 (64.04) lead, while smaller models like Qwen2.5-VL-3B (15.92) lag significantly; (2) In Reasoning, proprietary models dominate (ChatGPT-o3: 83.58, Gemini-2.0-Flash: 81.15), with open-source models trailing by up to 19.49 points; (3) In Information Extraction, the performance spread is the largest, with DeepSeek-R1 scoring 71.46, while Qwen3-1.7B scores 11.23; (4) In Prediction Recognition, performance variance is minimal, with top models scoring between 39.16 and 50.00. We find that while current LLMs handle routine finance queries competently, they struggle with complex scenarios requiring cross-concept reasoning. BizFinBench offers a rigorous, business-aligned benchmark for future research. The code and dataset are available at https://github.com/HiThink-Research/BizFinBench.
Financial question answering (QA) over long corporate filings requires evidence to satisfy strict constraints on entities, financial metrics, fiscal periods, and numeric values. However, existing LLM-based rerankers primarily optimize semantic relevance, leading to unstable rankings and opaque decisions on long documents. We propose FinCards, a structured reranking framework that reframes financial evidence selection as constraint satisfaction under a finance-aware schema. FinCards represents filing chunks and questions using aligned schema fields (entities, metrics, periods, and numeric spans), enabling deterministic field-level matching. Evidence is selected via a multi-stage tournament reranking with stability-aware aggregation, producing auditable decision traces. Across two corporate filing QA benchmarks, FinCards substantially improves early-rank retrieval over both lexical and LLM-based reranking baselines, while reducing ranking variance, without requiring model fine-tuning or unpredictable inference budgets. Our code is available at https://github.com/XanderZhou2022/FINCARDS.
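The deterministic field-level matching could look roughly like this; the field names, weights, and cards are invented for illustration and are not FinCards' actual schema or scoring.

```python
# Hypothetical field weights; FinCards' real schema and scoring differ.
WEIGHTS = {"entity": 2, "metric": 2, "period": 1}

def field_match_score(question_card, chunk_card, weights=WEIGHTS):
    """Sum the weights of schema fields on which the question card and
    a filing-chunk card agree exactly (deterministic, auditable)."""
    return sum(w for field, w in weights.items()
               if question_card.get(field) is not None
               and question_card.get(field) == chunk_card.get(field))

question = {"entity": "ACME", "metric": "net income", "period": "FY2023"}
chunks = [
    {"id": "c1", "entity": "ACME", "metric": "revenue", "period": "FY2023"},
    {"id": "c2", "entity": "ACME", "metric": "net income", "period": "FY2023"},
]
best = max(chunks, key=lambda c: field_match_score(question, c))
print(best["id"])  # c2
```

Because every score decomposes into named field matches, the ranking decision leaves an audit trail by construction.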
As a typical and practical application of Large Language Models (LLMs), Retrieval-Augmented Generation (RAG) techniques have gained extensive attention, particularly in vertical domains where LLMs may lack domain-specific knowledge. In this paper, we introduce an omnidirectional and automatic RAG benchmark, OmniEval, in the financial domain. Our benchmark is characterized by its multi-dimensional evaluation framework, including (1) a matrix-based RAG scenario evaluation system that categorizes queries into five task classes and 16 financial topics, leading to a structured assessment of diverse query scenarios; (2) a multi-dimensional evaluation data generation approach, which combines GPT-4-based automatic generation and human annotation, achieving an 87.47% acceptance ratio in human evaluations of generated instances; (3) a multi-stage evaluation system that evaluates both retrieval and generation performance, resulting in a comprehensive evaluation of the RAG pipeline; and (4) robust evaluation metrics, derived from both rule-based and LLM-based approaches, enhancing the reliability of assessments through manual annotations and supervised fine-tuning of an LLM evaluator. Our experiments demonstrate the comprehensiveness of OmniEval, which includes extensive test datasets and highlights the performance variations of RAG systems across diverse topics and tasks, revealing significant opportunities for RAG models to improve their capabilities in vertical domains. We open-source the code of our benchmark at https://github.com/RUC-NLPIR/OmniEval.
The financial domain poses unique challenges for knowledge graph (KG) construction at scale due to the complexity and regulatory nature of financial documents. Despite the critical importance of structured financial knowledge, the field lacks large-scale, open-source datasets capturing rich semantic relationships from corporate disclosures. We introduce an open-source, large-scale financial knowledge graph dataset built from the latest annual SEC 10-K filings of all S&P 100 companies, a comprehensive resource designed to catalyze research in financial AI. We propose a robust and generalizable KG construction framework that integrates intelligent document parsing, table-aware chunking, and schema-guided iterative extraction with a reflection-driven feedback loop. Our system incorporates a comprehensive evaluation pipeline, combining rule-based checks, statistical validation, and LLM-as-a-Judge assessments to holistically measure extraction quality. We support three extraction modes (single-pass, multi-pass, and reflection-agent-based), allowing flexible trade-offs between efficiency, accuracy, and reliability based on user requirements. Empirical evaluations demonstrate that the reflection-agent-based mode consistently achieves the best balance, attaining a 64.8% compliance score against all rule-based policies (CheckRules) and outperforming the baseline modes (single-pass and multi-pass) across key metrics such as precision, comprehensiveness, and relevance in LLM-guided evaluations. The utility of our KG pipeline is demonstrated through its flexible extraction modes, coupled with a multi-faceted evaluation methodology. By releasing a high-quality, thoroughly evaluated dataset along with a comprehensive KG construction and evaluation framework, we aim to advance transparency, reproducibility, and innovation in financial KG research. The dataset is publicly available at: https://anonymous.4open.science/r/KG-Financial-Datasets-SP-100-529B/README.md
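Rule-based checks of this kind reduce to simple predicates over extracted triples; the relation whitelist and rules below are invented examples, not the paper's CheckRules policies.

```python
# Invented relation whitelist; the paper's CheckRules policies differ.
ALLOWED_RELATIONS = {"subsidiary_of", "audited_by", "reports_segment"}

def check_triple(triple):
    """Return a list of rule violations for an extracted
    (head, relation, tail) triple; an empty list means it passes."""
    head, relation, tail = triple
    errors = []
    if relation not in ALLOWED_RELATIONS:
        errors.append(f"unknown relation: {relation}")
    if not head.strip() or not tail.strip():
        errors.append("empty entity")
    if head == tail:
        errors.append("self-loop")
    return errors

print(check_triple(("ACME Cloud LLC", "subsidiary_of", "ACME Corp")))  # []
```

Aggregating the pass rate of such checks over all extracted triples yields a compliance score analogous in spirit to the one reported above.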
Large Language Models (LLMs) have revolutionised the landscape of natural language processing (NLP), offering sophisticated conversational capabilities across various domains. This paper explores the adaptation of Meta’s LLaMA model for financial chatbot applications, emphasising domain-specific fine-tuning and performance evaluation. Fine-tuning LLaMA for finance requires specialised datasets, encompassing market trends, financial regulations, and investment strategies to enhance contextual understanding and response accuracy. Key aspects of this process include data curation, supervised fine-tuning, and reinforcement learning techniques, which aim to align model outputs with financial reasoning and industry standards. Furthermore, evaluation metrics such as perplexity, response coherence, and financial sentiment analysis are examined to gauge chatbot effectiveness. By integrating domain-specific knowledge, LLaMA-powered financial chatbots can provide users with more precise, context-aware insights, facilitating tasks such as portfolio management, risk assessment, and regulatory compliance. Advancements in retrieval-augmented generation (RAG) and model distillation further optimise performance, ensuring efficiency and reliability in financial applications. The paper also addresses ethical considerations, including bias mitigation and regulatory compliance, to promote the responsible deployment of AI in the financial services sector.
Long-context large language models (LC LLMs) promise to increase the reliability of LLMs in real-world tasks requiring processing and understanding of long input documents. However, the ability of LC LLMs to reliably utilize their growing context windows remains under investigation. In this work, we evaluate the performance of the state-of-the-art GPT-4 suite of LC LLMs in solving a series of progressively challenging tasks, as a function of factors such as context length, task difficulty, and position of key information, by creating a real-world financial news dataset. Our findings indicate that LC LLMs exhibit brittleness at longer context lengths even for simple tasks, with performance deteriorating sharply as task complexity increases. At longer context lengths, these state-of-the-art models experience catastrophic failures in instruction following, resulting in degenerate outputs. Our prompt ablations also reveal continued sensitivity to both the placement of the task instruction in the context window and minor markdown formatting. Finally, we advocate for more rigorous evaluation of LC LLMs by employing holistic metrics such as F1 (rather than recall) and reporting confidence intervals, thereby ensuring robust and conclusive findings.
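The closing recommendation, F1 with confidence intervals rather than bare recall, can be realized with a percentile bootstrap over per-example outcomes; the toy outcome counts below are illustrative.

```python
import random

def f1(tp, fp, fn):
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    return 2 * precision * recall / denom if denom else 0.0

def bootstrap_f1_ci(pairs, n_boot=2000, alpha=0.05, seed=0):
    """Point F1 plus a percentile-bootstrap (1 - alpha) confidence
    interval over binary (gold, prediction) pairs."""
    rng = random.Random(seed)
    def score(sample):
        tp = sum(1 for g, p in sample if g and p)
        fp = sum(1 for g, p in sample if not g and p)
        fn = sum(1 for g, p in sample if g and not p)
        return f1(tp, fp, fn)
    stats = sorted(score([rng.choice(pairs) for _ in pairs])
                   for _ in range(n_boot))
    lo = stats[int(alpha / 2 * n_boot)]
    hi = stats[int((1 - alpha / 2) * n_boot) - 1]
    return score(pairs), (lo, hi)

# Toy outcomes: 40 hits, 10 misses, 5 false alarms, 45 correct rejections.
pairs = [(1, 1)] * 40 + [(1, 0)] * 10 + [(0, 1)] * 5 + [(0, 0)] * 45
point, (lo, hi) = bootstrap_f1_ci(pairs)
print(round(point, 3), (round(lo, 3), round(hi, 3)))
```

Resampling whole examples (rather than resampling the metric) keeps the interval faithful to per-document variance.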
Stablecoins such as USDT and USDC aspire to peg stability by coupling issuance controls with reserve attestations. In practice, however, transparency is split across two disconnected worlds: verifiable on-chain traces and off-chain disclosures locked in unstructured text. We introduce a large language model (LLM)-based automated framework that bridges these two dimensions by aligning on-chain issuance data with off-chain disclosure statements. First, we propose an integrative framework using LLMs to capture and analyze on- and off-chain data through document parsing and semantic alignment, extracting key financial indicators from issuer attestations and mapping them to corresponding on-chain metrics. Second, we integrate multi-chain issuance records and disclosure documents within a model context protocol (MCP) framework that standardizes LLM access to both quantitative market data and qualitative disclosure narratives. This framework enables unified retrieval and contextual alignment across heterogeneous stablecoin information sources and facilitates consistent analysis. Third, we demonstrate the capability of LLMs to operate across heterogeneous data modalities in blockchain analytics, quantifying discrepancies between reported and observed circulation and examining their implications for cross-chain transparency and price dynamics. Our findings reveal systematic gaps between disclosed and verifiable data, showing that LLM-assisted analysis enhances cross-modal transparency and supports automated, data-driven auditing in decentralized finance (DeFi).
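Quantifying the disclosure gap reduces to a relative difference per chain; the supply figures and the 0.5% flagging threshold below are invented for illustration, not observed data.

```python
def circulation_gap(onchain, disclosed):
    """Relative gap between verifiable on-chain supply and the figure
    in the issuer's attestation (positive: on-chain exceeds disclosure)."""
    return (onchain - disclosed) / disclosed

# Invented per-chain figures: (on-chain supply, disclosed supply).
supplies = {
    "ethereum": (41_200_000_000, 41_000_000_000),
    "tron":     (60_500_000_000, 61_000_000_000),
}
gaps = {chain: circulation_gap(onchain, disclosed)
        for chain, (onchain, disclosed) in supplies.items()}
flagged = {chain for chain, gap in gaps.items() if abs(gap) > 0.005}
print(flagged)  # {'tron'}
```

In the paper's setting, the disclosed figure would be parsed by an LLM from the attestation text before this arithmetic is applied.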
Financial forecasting increasingly uses large neural network models, but their opacity raises challenges for trust and regulatory compliance. We present several approaches to explainable and reliable AI in finance. First, we describe how Time-LLM, a time-series foundation model, uses a prompt to avoid a wrong directional forecast. Second, we show that combining foundation models for time-series forecasting with a reliability estimator can filter out unreliable predictions. Third, we argue for symbolic reasoning that encodes domain rules for transparent justification. Together, these approaches shift the emphasis toward executing only forecasts that are both reliable and explainable. Experiments on equity and cryptocurrency data show that the architecture reduces false positives and supports selective execution. By integrating predictive performance with reliability estimation and rule-based reasoning, our framework advances transparent and auditable financial AI systems.
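Selective execution amounts to gating each forecast on its reliability estimate; the forecasts, scores, and 0.7 threshold below are hypothetical, not the paper's configuration.

```python
def selective_execute(forecasts, reliability, threshold=0.7):
    """Split forecasts into those reliable enough to act on and those
    to abstain from, given per-asset reliability estimates."""
    executed, abstained = [], []
    for asset, direction in forecasts.items():
        bucket = executed if reliability[asset] >= threshold else abstained
        bucket.append((asset, direction))
    return executed, abstained

forecasts = {"BTC": "up", "ETH": "down", "AAPL": "up"}   # hypothetical
reliability = {"BTC": 0.82, "ETH": 0.55, "AAPL": 0.74}   # hypothetical
executed, abstained = selective_execute(forecasts, reliability)
print(executed)  # [('BTC', 'up'), ('AAPL', 'up')]
```

Abstentions are kept rather than discarded so that an auditor can review what the system declined to trade on, and why.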
No abstract available
This report synthesizes research across the full landscape of LLMs in intelligent auditing and financial report analysis. The field is shifting from "tool-level substitution" to "systemic restructuring": on one hand, RAG, agent architectures, and cross-modal alignment techniques have markedly improved LLM efficiency in automated auditing, fraud detection, and complex financial report parsing; on the other, academia and practice are systematically addressing model hallucination, algorithmic bias, and the audit trust paradox through rigorous industry benchmarks (such as FinAuditing) and governance frameworks. The ultimate goal is an explainable, highly reliable financial intelligence ecosystem that meets regulatory requirements.