Welcome to this week's edition of Run the Reads, your curated recap of research papers, articles, and technical content, published every week as part of the "SO"cial series. Each edition dives into the latest developments shaping the world of AI and beyond.
This week's selections reveal a fascinating convergence around autonomous AI systems and their practical applications. From breakthrough agent frameworks that enable end-to-end problem solving to specialized medical AI achieving clinical-grade reliability, the research demonstrates how AI is rapidly evolving from computational tool to autonomous partner. The pathology and biomedical papers showcase remarkable progress in specialized domains, while the agent-focused research explores new paradigms for memory, reasoning, and scientific discovery. These developments collectively suggest we're witnessing a pivotal moment where AI systems gain increasing independence and domain expertise.
🧬 Medical AI & Pathology
This section highlights advances in AI applications for medical diagnostics and pathological analysis.
Towards Comprehensive Cellular Characterisation of H&E slides
Researchers developed HistoPLUS, a novel AI model for cell detection, segmentation, and classification in tumor microenvironment slides, trained on a curated pan-cancer dataset of 108,722 nuclei covering 13 cell types. In external validation, HistoPLUS outperformed current models by 5.2% in detection quality and improved overall F1 classification score by 23.7% while using 5x fewer parameters. The model enables study of seven previously understudied cell types and demonstrates robust transfer to two oncology indications not seen during training, with model weights and inference code made publicly available.
PRISM2: Unlocking Multi-Modal General Pathology AI with Clinical Dialogue
This research presents a multi-modal slide-level foundation model for pathology AI trained on nearly 700,000 specimens comprising 2.3 million whole slide images. The two-stage training process first aligns whole slide embeddings with textual clinical diagnosis using contrastive and captioning objectives, then unfreezes the language model to enable diagnostic conversation. PRISM2 outperforms prior slide-level models like PRISM and TITAN, introduces a zero-shot yes/no classification approach surpassing CLIP-style methods, and improves generalization on both data-rich and low-sample tasks while creating clinically useful AI diagnostic agents.
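To make the zero-shot yes/no idea concrete: instead of CLIP-style similarity between slide and label embeddings, the model scores its own likelihood of answering "yes" versus "no" to a diagnostic question. A minimal sketch, where `score_answer` is a hypothetical stand-in for a real model call:

```python
# Hedged sketch of zero-shot yes/no classification (not PRISM2's actual API):
# compare the model's log-likelihood of answering "yes" vs. "no".
def yes_no_classify(score_answer, question):
    """score_answer(question, answer) -> log-likelihood of that answer."""
    return score_answer(question, "yes") > score_answer(question, "no")
```

The appeal of this framing is that any question expressible in natural language becomes a zero-shot classifier, with no per-task embedding head.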
New Method Advances Reliability of AI with Applications in Medical Diagnostics
Johns Hopkins researchers developed MIGHT (Multidimensional Informed Generalized Hypothesis Testing), a method that trains on real data and validates accuracy across different data subsets using tens of thousands of decision trees. Applied to early cancer detection using circulating cell-free DNA, MIGHT achieved 72% cancer detection sensitivity at 98% specificity when tested on 1,000 individuals. The research discovered that ccfDNA fragmentation signatures also occur in autoimmune and vascular diseases, and introduced a companion algorithm, CoMIGHT, that combines multiple biological signals for more reliable medical diagnostics.
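The headline metric here, sensitivity at a fixed specificity, is worth unpacking. A minimal sketch of how it is computed from classifier scores (illustrative only, not the MIGHT implementation):

```python
# Hedged sketch: sensitivity at a fixed specificity, the metric behind
# "72% sensitivity at 98% specificity". Names are illustrative.
def sensitivity_at_specificity(scores_pos, scores_neg, specificity=0.98):
    # Pick the threshold so the desired fraction of negatives fall below it.
    neg_sorted = sorted(scores_neg)
    k = int(specificity * len(neg_sorted))
    threshold = neg_sorted[min(k, len(neg_sorted) - 1)]
    # Sensitivity = fraction of positives scoring above that threshold.
    detected = sum(s > threshold for s in scores_pos)
    return detected / len(scores_pos)
```

Fixing specificity first matters in screening settings: with healthy individuals vastly outnumbering cases, even a 2% false-positive rate dominates the follow-up burden.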
🔬 Scientific AI & Research Automation
Research exploring AI's role in automating and advancing scientific discovery processes.
SSRL: Self-Search Reinforcement Learning
This research investigates using large language models as efficient simulators for search tasks in reinforcement learning, introducing a "Self-Search" methodology that quantifies LLMs' intrinsic search capabilities through structured prompting. The Self-Search RL approach enhances LLMs' search capabilities through format-based and rule-based rewards, enabling models to iteratively refine their knowledge utilization internally while reducing dependence on external search engines. The study demonstrates that LLMs possess world knowledge that can be effectively utilized, with SSRL showing potential for reducing hallucination and enabling integration with external search engines for more scalable RL agent training.
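The "format-based and rule-based rewards" are the trainable core of this setup. As a rough sketch, a reward might check that the model emitted a structured self-search trace and that its final answer matches the reference; the tag names and weighting below are assumptions for illustration, not the paper's exact schema:

```python
import re

# Hedged sketch of SSRL-style rewards: a format reward for producing a
# structured self-search trace, plus a rule-based answer-correctness reward.
def ssrl_reward(output, gold_answer, format_weight=0.2):
    has_search = bool(re.search(r"<search>.*?</search>", output, re.S))
    answer = re.search(r"<answer>(.*?)</answer>", output, re.S)
    format_reward = 1.0 if (has_search and answer) else 0.0
    answer_reward = 1.0 if (answer and answer.group(1).strip() == gold_answer) else 0.0
    return format_weight * format_reward + (1 - format_weight) * answer_reward
```

Because both components are checkable without an external search engine, the RL loop can run entirely against the model's own knowledge.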
From AI for Science to Agentic Science: A Survey on Autonomous Scientific Discovery
This comprehensive survey explores how AI is transforming scientific research, positioning "Agentic Science" as a new paradigm where AI systems evolve from computational tools to autonomous research partners. The researchers trace AI's evolution in scientific discovery, identify five core capabilities underpinning scientific agency, and model scientific discovery as a dynamic four-stage workflow across life sciences, chemistry, materials science, and physics. The work highlights how AI, enabled by large language models and multimodal systems, can now perform tasks like hypothesis generation, experimental design, execution, analysis, and iterative refinement: capabilities previously considered uniquely human.
AI-Researcher: Autonomous Scientific Innovation
Researchers introduce AI-Researcher, a fully autonomous research system that transforms AI-driven scientific discovery by automating the entire research pipeline including literature review, hypothesis generation, algorithm implementation, and manuscript preparation. The framework includes "Scientist-Bench", a benchmark to assess autonomous research capabilities across AI research domains, demonstrating that AI-Researcher can achieve high implementation success rates and produce research papers approaching human-level quality. The system leverages powerful reasoning capabilities of Large Language Models in mathematics and coding to systematically explore solution spaces beyond human cognitive limitations with minimal human intervention.
Intern-S1: A Scientific Multimodal Foundation Model
This work presents a specialized multimodal AI model targeting scientific domains, featuring 28 billion activated parameters and 241 billion total parameters, pre-trained on 5T tokens including 2.5T scientific domain tokens. Using Mixture-of-Experts architecture and implementing "Mixture-of-Rewards" for reinforcement learning across 1000+ tasks, Intern-S1 aims to bridge the performance gap between open-source and closed-source models in scientific fields. The model demonstrates competitive performance on general reasoning tasks while significantly outperforming other open-source models in scientific domains, excelling in specialized tasks like molecular synthesis planning, reaction condition prediction, and predicting thermodynamic stabilities for crystals.
A Taxonomy of Transcendence
This theoretical paper by Natalie Abreu, Edwin Zhang, Eran Malach, and Naomi Saphra explores frameworks for understanding AI capabilities and limitations. Published under a Creative Commons BY 4.0 license, it contributes to ongoing discussions about AI development; a fuller summary of its methodology and findings awaits a closer read of the complete paper.
🧠 LLM Reasoning & Efficiency
Advances in improving large language model reasoning capabilities and computational efficiency.
Deep Think with Confidence
Researchers introduce "Deep Think with Confidence" (DeepConf), a method to improve reasoning efficiency in Large Language Models by leveraging model-internal confidence signals to dynamically filter out low-quality reasoning traces. The approach requires no additional model training or hyperparameter tuning and can be integrated into existing serving frameworks, achieving up to 99.9% accuracy on the AIME 2025 benchmark with DeepConf@512 while reducing generated tokens by up to 84.7% compared to full parallel thinking. This addresses limitations in current test-time scaling methods like self-consistency, which often result in diminishing returns in accuracy and high computational overhead.
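The core mechanism is simple enough to sketch: score each sampled reasoning trace by a confidence signal such as its mean token log-probability, discard the low-confidence traces, and majority-vote over what remains. A minimal illustration under those assumptions (not DeepConf's exact scoring rule):

```python
from collections import Counter

# Hedged sketch of confidence-filtered voting: keep only the most
# confident traces, then take the majority answer among them.
def confidence_filtered_vote(traces, keep_fraction=0.5):
    """traces: list of (answer, token_logprobs) pairs."""
    scored = [(sum(lps) / len(lps), ans) for ans, lps in traces]
    scored.sort(reverse=True)  # highest mean log-prob first
    kept = scored[: max(1, int(len(scored) * keep_fraction))]
    return Counter(ans for _, ans in kept).most_common(1)[0][0]
```

The token savings come from the online variant of this idea: a trace whose running confidence drops below threshold can be terminated early instead of generated to completion.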
🤖 AI Agents & Autonomous Systems
This section covers breakthrough research in autonomous AI agents and their frameworks for complex problem-solving.
Memp: Exploring Agent Procedural Memory
Researchers introduce Memp, a novel approach to enhancing procedural memory in LLM agents by distilling past trajectories into step-by-step instructions and script-like abstractions. The method employs a dynamic regimen that continuously updates, corrects, and deprecates memory contents, demonstrating that refined memory repositories lead to steadily higher success rates and greater efficiency on analogous tasks. Notably, procedural memory built from stronger models can be migrated to weaker models, yielding substantial performance gains, with empirical evaluation conducted on TravelPlanner and ALFWorld platforms.
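The "dynamic regimen" of updating, correcting, and deprecating memories can be pictured as a small lifecycle around a store of distilled step lists. A toy sketch, with all names and the word-overlap similarity being illustrative assumptions rather than Memp's actual design:

```python
# Hedged sketch of a procedural-memory lifecycle: store distilled step
# lists, retrieve by task similarity, and deprecate entries that stop
# paying off.
class ProceduralMemory:
    def __init__(self, min_success=0.3):
        self.entries = []  # each: {"task", "steps", "wins", "uses"}
        self.min_success = min_success

    def add(self, task, steps):
        self.entries.append({"task": task, "steps": steps, "wins": 0, "uses": 0})

    def retrieve(self, task):
        # Toy similarity: count of shared words between task descriptions.
        words = set(task.split())
        viable = [e for e in self.entries
                  if e["uses"] == 0 or e["wins"] / e["uses"] >= self.min_success]
        return max(viable, key=lambda e: len(words & set(e["task"].split())),
                   default=None)

    def feedback(self, entry, success):
        entry["uses"] += 1
        entry["wins"] += int(success)
        # Deprecate memories whose success rate has decayed.
        if entry["uses"] >= 3 and entry["wins"] / entry["uses"] < self.min_success:
            self.entries.remove(entry)
```

The migration result follows naturally from this design: because memories are plain text rather than weights, a repository built by a strong model can simply be handed to a weaker one.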
Chain-of-Agents: End-to-End Agent Foundation Models via Multi-Agent Distillation and Agentic RL
This research introduces the Chain-of-Agents paradigm for complex problem-solving in large language models, enabling native end-to-end solutions by dynamically activating different tool and role-playing agents. The approach uses multi-agent distillation to convert multi-agent systems into chain-of-agents trajectories, followed by agentic supervised fine-tuning and reinforcement learning to improve model capabilities. The resulting Agent Foundation Models establish new performance benchmarks in web and code agent settings, with the entire research being open-sourced including model weights, training code, and training data.
MCP-Universe: Benchmarking Large Language Models with Real-World Model Context Protocol Servers
Researchers developed MCP-Universe, a comprehensive benchmark evaluating LLMs through interaction with real-world Model Context Protocol servers across six domains including location navigation, repository management, and financial analysis. Using execution-based evaluators, the study revealed significant performance limitations in top models, with GPT-5 achieving 43.72%, Grok-4 reaching 33.33%, and Claude-4.0-Sonnet attaining 29.44%. The benchmark highlights challenges in long-context reasoning and unfamiliarity with tool usage, while providing an open-sourced extensible evaluation framework to foster innovation in the MCP ecosystem.
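"Execution-based evaluators" are the benchmark's key design choice: success is judged by inspecting the environment state the agent actually produced, not by string-matching its answer. A minimal sketch of that pattern, with the task and field names being illustrative:

```python
# Hedged sketch of an execution-based evaluator: run predicates over the
# final environment state rather than comparing answer text.
def evaluate(env_state, checks):
    """checks: list of predicates over the final environment state."""
    passed = sum(1 for check in checks if check(env_state))
    return passed / len(checks)
```

For a repository-management task, for example, the checks might verify that the right branch exists and the right files were committed, which no text comparison against a reference answer can reliably establish.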
Memory Decoder: A Pretrained, Plug-and-Play Memory for Large Language Models
This work introduces a Memory Decoder, a novel domain adaptation approach that enables efficient specialization without changing original model parameters. The small transformer decoder learns to imitate external non-parametric retriever behavior and can be seamlessly integrated with any pretrained language model sharing the same tokenizer. Demonstrated effectiveness includes adapting Qwen and Llama models to biomedicine, finance, and law domains, reducing perplexity by an average of 6.17 points while addressing limitations of existing domain adaptation methods like costly full-parameter training and high inference latency.
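Because the base model's parameters never change, the integration point is at inference: the small decoder's next-token distribution is mixed with the base model's. A minimal sketch of that interpolation, where the mixing weight `lam` is an assumed hyperparameter, not a value from the paper:

```python
import math

# Hedged sketch of plug-and-play interpolation: mix the base LM's and the
# memory decoder's next-token distributions; no base parameters change.
def interpolate(base_logits, memory_logits, lam=0.3):
    def softmax(xs):
        m = max(xs)
        exps = [math.exp(x - m) for x in xs]
        z = sum(exps)
        return [e / z for e in exps]
    p_base = softmax(base_logits)
    p_mem = softmax(memory_logits)
    return [(1 - lam) * b + lam * m for b, m in zip(p_base, p_mem)]
```

This mirrors how kNN-LM-style systems combine parametric and retrieved distributions, except the retriever has been distilled into a small decoder, which is where the inference-latency savings come from.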
AI Agentic Programming: A Survey of Techniques, Challenges, and Opportunities
This comprehensive survey examines how large language models can autonomously plan, execute, and interact with external tools for software development. The research introduces a taxonomy of agent behaviors and system architectures, covering core techniques including planning, memory and context management, tool integration, and execution monitoring. Key challenges identified include limitations in handling long context, lack of persistent memory across tasks, and concerns around safety and alignment with user intent, providing a foundation for developing next-generation intelligent and trustworthy AI coding agents.
Memento: Fine-tuning LLM Agents without Fine-tuning LLMs
Researchers present a novel learning paradigm that eliminates the need to fine-tune underlying LLMs, enabling low-cost continual adaptation via memory-based online reinforcement learning. Using a memory-augmented Markov Decision Process, the approach stores past experiences in episodic memory and continually updates its policy through memory rewriting and retrieval. The method achieved top-1 performance on GAIA validation with 87.88% Pass@3 and 79.40% on the test set, outperforming state-of-the-art training-based methods and generalizing to out-of-distribution tasks with gains of 4.7% to 9.6%.
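The gist is that the frozen model is steered by what gets written to and read from memory, so adaptation happens in the memory, not the weights. A toy sketch of such an episodic case store, with the word-overlap similarity and reward-weighted ranking being illustrative assumptions:

```python
# Hedged sketch of memory-based adaptation: past cases are written with
# their rewards, and similar high-reward cases are retrieved to condition
# the frozen policy on new tasks.
class CaseMemory:
    def __init__(self):
        self.cases = []  # (task_words, action, reward)

    def write(self, task, action, reward):
        self.cases.append((set(task.split()), action, reward))

    def read(self, task, k=3):
        words = set(task.split())
        # Rank by toy similarity (shared words) plus observed reward.
        ranked = sorted(self.cases,
                        key=lambda c: len(words & c[0]) + c[2],
                        reverse=True)
        return [c[1] for c in ranked[:k]]
```

"Memory rewriting" then amounts to editing or re-weighting these cases as new rewards arrive, which is why the approach remains cheap relative to gradient-based fine-tuning.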
This week's research landscape reveals a fascinating duality: while AI systems gain increasing autonomy in scientific discovery and medical diagnostics, they're simultaneously becoming more efficient and reliable through advanced reasoning frameworks. The convergence of specialized domain expertise with general-purpose agent capabilities suggests we're entering an era where AI transitions from tool to partner in complex problem-solving. What stood out to you this week? Share your thoughts in the comments, and join us next Wednesday for more insights. Stay curious!