Welcome to this week's edition of Run the Reads, your curated weekly recap of research papers, articles, and technical content as part of the "SO"cial series. Published every Wednesday, we dive into the latest developments shaping the world of AI and beyond.
This week's research landscape reveals a fascinating tension between AI's promise and its potential pitfalls in scientific progress. From breakthrough vision-language architectures to sobering analyses of how AI might actually slow scientific discovery, the papers span the spectrum of optimism and caution that defines our current moment. The papers and articles covered this week range from cost-effective multimodal systems and autonomous agent frameworks to critical examinations of AI's impact on scientific methodology. Together, they paint a complex picture of a field grappling with both unprecedented capabilities and fundamental questions about progress itself.
AI in Healthcare and Biomedical Research
These selections highlight AI's transformative potential in medical applications, from drug discovery to clinical decision-making, while addressing adaptation strategies and challenges.
From Promise to Practice: Leading Science in the Age of AI by Robert Plenge - Link
This article details AI's integration into drug discovery at Bristol Myers Squibb, emphasizing its role in target selection, modality matching, and clinical transitions through collaborative hybrid intelligence, ultimately improving decisions and patient outcomes.
Adapting Large Language Models for Medical AI: A Systems Engineering Perspective - Link
This perspective reviews strategies like continual pretraining, finetuning, and retrieval-augmented generation to adapt LLMs for medical tasks, showcasing use cases such as clinical note generation and patient-trial matching, while stressing multimodality, trustworthiness, and continuous evaluation.
Opportunities for Causal Machine Learning in Precision Oncology - Link (and alternate)
Exploring causal ML and LLMs in precision oncology, this article outlines opportunities in estimating treatment effects from multimodal data, biomarker identification, and drug repurposing, while noting challenges in validity, transportability, and data quality.
Mask-prior-guided denoising diffusion for inverse protein folding - Link
MapDiff frames inverse protein folding as discrete diffusion, using mask priors and equivariant networks to achieve high sequence recovery (61% on CATH) and foldability, outperforming baselines with applications in antibody design and engineering.
Pioneering an AI clinical copilot with Penda Health - Link
OpenAI's collaboration with Penda Health demonstrates an LLM-powered copilot that reduced diagnostic errors by 16% and treatment errors by 13% across more than 39,000 patient visits, operating as a safety net of alerts that improved clinician judgment without introducing harm.
Advances in Model Architectures and Training Methods
Focusing on innovative architectures, distillation techniques, and training paradigms that enhance efficiency and performance in AI models.
Vision-Language-Vision Auto-Encoder: Scalable Knowledge Distillation from Diffusion Models - Link
The VLV framework leverages pretrained components to build cost-effective VLMs with strong captioning, distilling knowledge from diffusion models to achieve high-quality reconstructions for under $1,000 in training cost and with minimal data.
Language Models Improve When Pretraining Data Matches Target Tasks by David Mizrahi - Link
Using benchmark-targeted ranking (BETR), this study shows aligning pretraining data to target tasks yields a 2.1x compute-efficiency gain, improving performance on 9 of 10 tasks, with scaling laws indicating that larger models benefit from less aggressive filtering.
INTELLECT-2: A Reasoning Model Trained Through Globally Decentralized Reinforcement Learning by Johannes Hagemann - Link
INTELLECT-2, a 32B parameter model, uses decentralized RL on permissionless compute with PRIME-RL for verification, outperforming prior reasoning models like QwQ-32B, with open-sourced code for distributed training.
Hierarchical Reasoning Model by Yuhao Sun - Link
Inspired by brain hierarchies, HRM uses interdependent modules for abstract planning and detailed computation in one pass; with 27M parameters and minimal data, it excels on Sudoku, mazes, and ARC benchmarks.
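The benchmark-targeted ranking (BETR) idea above can be illustrated with a toy sketch: embed a handful of benchmark examples, score each candidate pretraining document by its best similarity to any of them, and keep only the top fraction. This is a simplified, hypothetical illustration (bag-of-words cosine similarity standing in for the paper's learned embeddings and pipeline), not BETR's actual implementation.

```python
from collections import Counter
from math import sqrt

def bow_vector(text):
    # Bag-of-words term counts as a crude stand-in for a learned embedding.
    return Counter(text.lower().split())

def cosine(a, b):
    # Cosine similarity between two sparse count vectors.
    dot = sum(count * b[term] for term, count in a.items() if term in b)
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def rank_documents(docs, benchmark_examples, keep_fraction=0.5):
    """Score each doc by its best similarity to any benchmark example,
    then keep the top `keep_fraction` of docs for pretraining."""
    bench_vecs = [bow_vector(example) for example in benchmark_examples]
    scored = sorted(
        docs,
        key=lambda d: max(cosine(bow_vector(d), bv) for bv in bench_vecs),
        reverse=True,
    )
    keep = max(1, int(len(scored) * keep_fraction))
    return scored[:keep]

benchmark = ["solve the equation for x", "compute the integral of f"]
corpus = [
    "recipe for sourdough bread and baking tips",
    "how to solve a quadratic equation for x step by step",
    "celebrity gossip roundup of the week",
    "compute the derivative and integral of a polynomial f",
]
# Keeps the two math-flavored documents, drops the off-topic ones.
selected = rank_documents(corpus, benchmark, keep_fraction=0.5)
```

In the full method, the filtering threshold itself becomes a tunable knob, which is where the paper's scaling-law observation (larger models prefer less filtering) comes in.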
Reasoning, Optimization, and Retrieval-Augmented Systems
These works explore techniques to enhance reasoning in LLMs, including optimization frameworks and integrated retrieval-reasoning systems.
Direct Reasoning Optimization: LLMs Can Reward And Refine Their Own Reasoning for Open-Ended Tasks by Yifei Xu - Link
DRO introduces self-computed Reasoning Reflection Rewards for RL fine-tuning on long-form tasks, outperforming baselines on paragraph revision and math QA with dynamic filtering for cost reduction.
Towards Agentic RAG with Deep Reasoning: A Survey of RAG-Reasoning Systems in LLMs by Yangning Li - Link
This survey unifies RAG and reasoning, categorizing methods like reasoning-enhanced RAG and synergized frameworks, achieving SOTA on knowledge-intensive tasks while highlighting challenges in multimodality and trustworthiness.
Autonomous Agents and Context Engineering
Emphasizing frameworks for multi-agent systems and strategies to optimize context for agent performance.
Aime: Towards Fully-Autonomous Multi-Agent Framework by Yexuan Shi - Link
Aime features dynamic planning, on-demand agent instantiation, and centralized progress management, outperforming specialized agents on GAIA, SWE-bench, and WebVoyager for resilient collaboration.
A Survey of Context Engineering for Large Language Models - Link
Defining context engineering as optimizing LLM inputs, this survey taxonomizes retrieval, processing, and management, analyzing 1,300+ papers to reveal strengths in understanding but gaps in generating long-form outputs.
Context Engineering for AI Agents: Lessons from Building Manus by Yichao 'Peak' Ji - Link
Lessons from Manus include optimizing KV-cache, using file systems for persistent context, retaining errors for learning, and avoiding few-shot prompting to build efficient, adaptable agents.
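Two of the Manus lessons, keeping the prompt prefix stable for KV-cache reuse and retaining errors in context, can be sketched as an append-only context object. This is a hypothetical illustration of the principles, not the Manus implementation; the class and method names are invented for the example.

```python
import json

class AgentContext:
    """Append-only context assembly. Because earlier messages are never
    rewritten or reordered, every new prompt shares a token-identical
    prefix with the previous one, letting an inference server's KV-cache
    be reused across agent turns."""

    def __init__(self, system_prompt):
        # The system prompt is fixed at construction: editing early tokens
        # would invalidate the cached prefix for every subsequent turn.
        self._messages = [{"role": "system", "content": system_prompt}]

    def append(self, role, content):
        # Only append; never mutate history.
        self._messages.append({"role": role, "content": content})

    def append_error(self, tool_name, error):
        # Retain failures in context so the model can adapt,
        # rather than scrubbing them and repeating the same mistake.
        self.append("tool", json.dumps({"tool": tool_name, "error": error}))

    def render(self):
        # Deterministic serialization: identical history produces
        # identical prefix tokens, which produces a cache hit.
        return "\n".join(f"{m['role']}: {m['content']}" for m in self._messages)

ctx = AgentContext("You are a careful research agent.")
ctx.append("user", "Summarize this week's papers.")
ctx.append_error("web_search", "timeout after 10s")
prefix_before = ctx.render()
ctx.append("assistant", "Retrying the search with a longer timeout.")
# Each render is a strict extension of the previous one.
assert ctx.render().startswith(prefix_before)
```

The file-system lesson extends the same idea: bulky observations get written to files and referenced by path, keeping the in-prompt history short while preserving the stable prefix.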
Evaluation, Safety, and Benchmarks
Covering methods for assessing AI outputs, safety through monitoring, and rigorous benchmarks for reasoning depth.
LLM-as-a-Judge vs. Reward Models by Cameron R. Wolfe - Link (blog) and Link (X post)
This comparison finds that LLM-as-a-Judge (LaaJ) offers flexible evaluation and strong accuracy on pairwise-preference tasks such as RewardBench2, while reward models remain better suited to preference scoring for policy training in RLHF.
Chain of Thought Monitorability: A New and Fragile Opportunity for AI Safety by Tomek Korbak - Link
This paper discusses monitoring CoT in language-based AI to detect misbehavior intents, noting its promise for safety but imperfection and fragility, urging preservation in model design.FormulaOne: Measuring the Depth of Algorithmic Reasoning Beyond Competitive Programming by Gal Beniamini et al. - Link
FormulaOne benchmarks deep algorithmic reasoning on graph problems expressed in monadic second-order (MSO) logic, covering optimization tasks such as routing; frontier models like o3 solve under 1% of the problems, and a warm-up set is provided for research.
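The pairwise-preference setup from the LLM-as-a-Judge piece above boils down to a judge prompt plus a strict verdict parser. The sketch below is a generic illustration, not Wolfe's code; `call_llm` is a hypothetical stand-in for any chat-completion client that maps a prompt string to a text response.

```python
def build_judge_prompt(question, answer_a, answer_b):
    # Pairwise LLM-as-a-Judge prompt: request a single-letter verdict so
    # parsing stays trivial, and position bias can be audited by swapping
    # A and B and re-judging.
    return (
        "You are an impartial judge. Given the question and two answers,\n"
        "reply with exactly one letter: 'A' if answer A is better, or 'B'\n"
        "if answer B is better.\n\n"
        f"Question: {question}\n\n"
        f"Answer A: {answer_a}\n\n"
        f"Answer B: {answer_b}\n\n"
        "Verdict:"
    )

def parse_verdict(raw):
    # Accept only an unambiguous 'A' or 'B'; anything else is a formatting
    # failure that should be retried, never silently coerced to a winner.
    verdict = raw.strip().upper()
    if verdict in ("A", "B"):
        return verdict
    raise ValueError(f"unparseable judge verdict: {raw!r}")

def judge_pair(call_llm, question, answer_a, answer_b):
    # call_llm: hypothetical client, prompt string in, model text out.
    return parse_verdict(call_llm(build_judge_prompt(question, answer_a, answer_b)))

# Usage with a stubbed model that prefers answer B:
winner = judge_pair(lambda prompt: " b\n", "What is 2+2?", "5", "4")
```

A reward model replaces this whole round trip with a single scalar score per answer, which is cheaper per comparison but less flexible, matching the blog's division of labor between the two.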
Critical Perspectives on AI's Impact
Examining potential downsides, such as AI hindering scientific discovery through overproduction and flawed practices.
Could AI slow science? - Link
This essay argues that AI exacerbates the production-progress paradox: it boosts research output without generating genuine discovery, by adding noise and errors and favoring prediction over understanding, and it calls for incentive reforms and better tools.
The Bottom Line
AI sits at an inflection point. Technical capabilities advance faster than our understanding of their broader implications. We see remarkable progress in efficiency, autonomy, and practical utility. But fundamental questions about AI's role in human knowledge creation remain unresolved.
The path forward requires balancing AI's undeniable benefits with preserving the human insight and understanding that drive genuine scientific progress.
Share your thoughts in the comments, and join us next Wednesday for more insights. Stay curious!