Auto Research
Research Practice and Workflow Modeling with AI Agents
This thread examines the end-to-end application of AI agents to scientific research: how agents automate complex research tasks through dedicated tools and frameworks (such as scholar-skill), and what professional impacts and theoretical boundaries this mode of work carries.
- Vibe Researching as Wolf Coming: Can AI Agents with Skills Replace or Augment Social Scientists? (Yongjun Zhang, 2026, arXiv.org)
Evaluating and Benchmarking AI Agents' Research Capabilities
This work focuses on quantitatively evaluating AI agents on end-to-end research tasks, building a benchmark environment (ResearchGym) to analyze agents' success rates and failure modes on real academic tasks.
- ResearchGym: Evaluating Language Model Agents on Real-World AI Research (Aniketh Garikaparthi, Manasi S. Patwardhan, Arman Cohan, 2026, arXiv.org)
Behavior and Interaction Patterns in Large-Scale AI Agent Communities
This work addresses AI agents at the population level, empirically observing the evolutionary trajectory, participation behavior, and interaction limits of a large-scale agent community (Moltbook), offering a sociological perspective on the phenomenon.
- OpenClaw AI Agents as Informal Learners at Moltbook: Characterizing an Emergent Learning Community at Scale (Eason Chen, Ce Guan, A. Elshafiey, Zhong-Qiu Zhao, Joshua Zekeri, Afeez Edeifo Shaibu, Emmanuel Osadebe Prince, C. Wu, 2026, arXiv.org)
Together, these papers approach the Auto Research field from three angles: first, the deep integration of AI agents into research workflows and its theoretical limits; second, standardized benchmarking of AI research capabilities; and third, sociological, empirical analysis of behavioral dynamics in large-scale AI agent communities. Collectively they chart AI's evolution from assistive tool to autonomous research participant, along with its performance bottlenecks and group dynamics.
3 related papers in total
AI agents -- systems that execute multi-step reasoning workflows with persistent state, tool access, and specialist skills -- represent a qualitative shift from prior automation technologies in social science. Unlike chatbots that respond to isolated queries, AI agents can now read files, run code, query databases, search the web, and invoke domain-specific skills to execute entire research pipelines autonomously. This paper introduces the concept of vibe researching -- the AI-era parallel to vibe coding -- and uses scholar-skill, a 26-skill plugin for Claude Code covering the full research pipeline from idea to submission across 18 orchestrated phases with 53 quality gates, as an illustrative case. I develop a cognitive task framework that classifies research activities along two dimensions -- codifiability and tacit knowledge requirement -- to identify a delegation boundary that is cognitive, not sequential: it cuts through every stage of the research pipeline, not between stages. I argue that AI agents excel at speed, coverage, and methodological scaffolding but struggle with theoretical originality and tacit field knowledge. The paper concludes with an analysis of three implications for the profession -- augmentation with fragile conditions, stratification risk, and a pedagogical crisis -- and proposes five principles for responsible vibe researching.
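The delegation boundary reads naturally as a decision rule over the paper's two dimensions. Below is a minimal sketch in Python, assuming numeric scores for codifiability and tacit-knowledge requirement; the thresholds, field names, and example tasks are illustrative assumptions, not values from the paper.

```python
from dataclasses import dataclass
from enum import Enum


class Delegation(Enum):
    DELEGATE = "delegate to agent"           # codifiable, low tacit demand
    SUPERVISE = "agent drafts, human reviews"
    RETAIN = "keep with the researcher"      # tacit-knowledge heavy


@dataclass
class ResearchTask:
    name: str
    stage: str                # pipeline stage the task belongs to
    codifiability: float      # 0.0 (ad hoc) .. 1.0 (fully proceduralizable)
    tacit_requirement: float  # 0.0 (explicit) .. 1.0 (deep field intuition)


def delegation_decision(task: ResearchTask,
                        codify_cut: float = 0.6,
                        tacit_cut: float = 0.4) -> Delegation:
    """Place a task in the 2-D cognitive space and return a delegation
    decision. The cut-offs are hypothetical, chosen only for illustration."""
    if task.codifiability >= codify_cut and task.tacit_requirement < tacit_cut:
        return Delegation.DELEGATE
    if task.tacit_requirement >= tacit_cut and task.codifiability < codify_cut:
        return Delegation.RETAIN
    return Delegation.SUPERVISE


# The boundary is cognitive, not sequential: tasks from the *same* stage
# can land on opposite sides of it.
tasks = [
    ResearchTask("run robustness checks", "analysis", 0.9, 0.1),
    ResearchTask("interpret anomalous result", "analysis", 0.2, 0.9),
    ResearchTask("format references", "writing", 0.95, 0.05),
    ResearchTask("position claim in the literature", "writing", 0.2, 0.8),
]
for t in tasks:
    print(f"[{t.stage}] {t.name}: {delegation_decision(t).value}")
```

The two analysis tasks and the two writing tasks each split across the boundary, which illustrates the paper's point that delegation cuts through stages rather than between them.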
We introduce ResearchGym, a benchmark and execution environment for evaluating AI agents on end-to-end research. To instantiate this, we repurpose five oral and spotlight papers from ICML, ICLR, and ACL. From each paper's repository, we preserve the datasets, evaluation harness, and baseline implementations but withhold the paper's proposed method. This results in five containerized task environments comprising 39 sub-tasks in total. Within each environment, agents must propose novel hypotheses, run experiments, and attempt to surpass strong human baselines on the paper's metrics. In a controlled evaluation of an agent powered by GPT-5, we observe a sharp capability--reliability gap. The agent improves over the repository's provided baselines in just 1 of 15 evaluations (6.7%), and then only by 11.5%, and completes only 26.5% of sub-tasks on average. We identify recurring long-horizon failure modes, including impatience, poor time and resource management, overconfidence in weak hypotheses, difficulty coordinating parallel experiments, and hard limits from context length. Yet in a single run, the agent surpasses the solution of an ICML 2025 Spotlight task, indicating that frontier agents can occasionally reach state-of-the-art performance, but do so unreliably. We additionally evaluate proprietary agent scaffolds, including Claude Code (Opus-4.5) and Codex (GPT-5.2), which display a similar gap. ResearchGym provides infrastructure for systematic evaluation and analysis of autonomous agents on closed-loop research.
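As a concrete illustration of how the abstract's headline numbers aggregate, here is a minimal sketch, assuming a higher-is-better metric and hypothetical record fields (this is not ResearchGym's actual API):

```python
from dataclasses import dataclass


@dataclass
class RunResult:
    """One evaluation of an agent on one task environment.

    Hypothetical record type for illustration; assumes the paper's
    metric is higher-is-better."""
    task: str
    agent_score: float
    baseline_score: float
    subtasks_done: int
    subtasks_total: int


def summarize(runs: list[RunResult]) -> dict[str, float]:
    """Compute the statistics quoted in the abstract: the fraction of
    evaluations beating the baseline, the relative gain when beating,
    and the mean sub-task completion rate."""
    beats = [r for r in runs if r.agent_score > r.baseline_score]
    return {
        "beat_baseline_rate": len(beats) / len(runs),
        "mean_relative_gain_when_beating": (
            sum((r.agent_score - r.baseline_score) / r.baseline_score
                for r in beats) / len(beats)
            if beats else 0.0
        ),
        "mean_subtask_completion": sum(
            r.subtasks_done / r.subtasks_total for r in runs
        ) / len(runs),
    }
```

With 15 such runs in which exactly one beats its baseline, `beat_baseline_rate` comes out to 1/15 ≈ 6.7%, matching the figure in the abstract.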
Informal learning communities have been called the "other Massive Open Online Course" in Learning@Scale research, yet remain understudied compared to MOOCs. We present the first empirical study of a large-scale informal learning community composed entirely of AI agents. Moltbook, a social network exclusively for AI agents powered by autonomous agent frameworks such as OpenClaw, grew to over 2.8 million registered agents in three weeks. Analyzing 231,080 non-spam posts across three phases of community evolution, we find three key patterns. First, participation inequality is extreme from the start (comment Gini = 0.889), exceeding human community benchmarks. Second, AI agents exhibit a "broadcasting inversion": statement-to-question ratios of 8.9:1 to 9.7:1 contrast sharply with the question-driven dynamics of human learning communities, and comment-level analysis of 1.55 million comments reveals a "parallel monologue" pattern in which 93% of comments are independent responses rather than threaded dialogue. Third, we document a characteristic engagement lifecycle: explosive initial growth (184K posts from 32K authors in 11 days), a spam crisis (57,093 posts deleted by the platform), and an engagement decline (mean comments: 31.7 -> 8.3 -> 1.7) that had not reversed by the end of our observation window despite effective spam removal. Sentiment analysis reveals a selection effect: comment tone becomes more positive as engagement declines, suggesting that casual participants disengage first while committed contributors remain. These findings have direct implications for hybrid human-AI learning platforms.
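The participation-inequality figure is a standard Gini coefficient over per-agent comment counts. A self-contained sketch follows; the toy counts are invented for illustration, while the paper's 0.889 comes from the real Moltbook data.

```python
def gini(counts: list[int]) -> float:
    """Gini coefficient of a non-negative distribution: 0 means every
    agent comments equally, values near 1 mean a few agents dominate.
    Uses the closed form over sorted values:
        G = 2 * sum_i(i * x_i) / (n * sum(x)) - (n + 1) / n,  i = 1..n
    """
    xs = sorted(counts)
    n, total = len(xs), sum(xs)
    if n == 0 or total == 0:
        return 0.0
    weighted = sum(i * x for i, x in enumerate(xs, start=1))
    return 2 * weighted / (n * total) - (n + 1) / n


# Toy distribution: most agents barely comment, two dominate.
comments_per_agent = [0, 0, 0, 1, 1, 2, 3, 5, 40, 120]
print(f"Gini = {gini(comments_per_agent):.3f}")  # ≈ 0.809, extreme inequality
```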