July 29, 2025
Welcome to this week's edition of Run the Reads, your curated weekly recap of research papers, articles, and technical content as part of the "SO"cial series. Published every Wednesday, we dive into the latest developments shaping the world of AI and beyond.
This week's selections cover a wide range of topics, from the inner workings of large language models to their real-world applications in biomedical research. We see a continued focus on improving the reasoning and efficiency of these models, as well as a growing body of work on their use in specialized domains. The papers also highlight the increasing importance of data-centric approaches and the need for robust evaluation methods. This collection underscores the dual trends of building more capable general models while also tailoring them for specific, high-impact applications.
🧠 AI Reasoning & LLMs
This section explores the core mechanics of large language models, from multi-token prediction to the dynamics of in-context learning and the power of multi-agent systems.
Kimi K2: Open Agentic Intelligence by Kimi Team
This technical report introduces Kimi K2, a Mixture-of-Experts (MoE) model with 1 trillion total parameters, 32 billion of which are activated per token. The paper details the MuonClip optimizer, which was developed to improve training stability and token efficiency, allowing the model to be trained on 15.5 trillion tokens without loss spikes. Kimi K2's multi-stage post-training process, which includes a large-scale agentic data synthesis pipeline and a joint reinforcement learning stage, has resulted in state-of-the-art performance among open-source non-thinking models, particularly in agentic capabilities, coding, and reasoning.
Your LLM Knows the Future: Uncovering Its Multi-Token Prediction Potential by Mohammad Samragh et al.
This work addresses the sequential nature of autoregressive language models by proposing a framework for simultaneous multi-token prediction. The authors introduce a masked-input formulation, a gated LoRA for multi-token prediction, a learnable sampler module, and auxiliary training losses to enhance coherence. Their approach achieves significant speedups in code, math, and general chat tasks without a loss in quality, demonstrating the potential to overcome the limitations of one-token-at-a-time generation.
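To make the idea concrete, here is a toy sketch of generic draft-and-verify multi-token decoding: a cheap head proposes several future tokens at once, and the base model accepts the longest prefix it agrees with. This is a simplified stand-in, not the paper's gated-LoRA and sampler design; the "model" is a deterministic toy rule over integer tokens (next token = sum of context mod 10).

```python
def target_next(ctx):
    """One-token-at-a-time 'target model' (toy rule, not a real LM)."""
    return sum(ctx) % 10

def propose(ctx, k):
    """Multi-token head: guess k future tokens in one shot (same toy rule)."""
    out, c = [], list(ctx)
    for _ in range(k):
        t = sum(c) % 10
        out.append(t)
        c.append(t)
    return out

def decode_step(ctx, k=4):
    """Accept the longest draft prefix the target model agrees with."""
    draft, c, accepted = propose(ctx, k), list(ctx), []
    for t in draft:
        if target_next(c) != t:
            break
        accepted.append(t)
        c.append(t)
    if not accepted:                  # always emit at least one token
        accepted = [target_next(list(ctx))]
    return ctx + accepted

print(decode_step([1, 2]))  # [1, 2, 3, 6, 2, 4]: all 4 drafted tokens verified
```

When the proposer's guesses match the target model, one step yields several tokens for roughly one verification pass, which is where the speedup comes from.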
Open-Source LLMs Collaboration Beats Closed-Source LLMs: A Scalable Multi-Agent System by Shengji Tang et al.
This paper introduces SMACS, a scalable multi-agent collaboration system built on multiple open-source LLMs. The framework uses Retrieval-based Prior Selection (RPS) to choose the models best suited to a given question and Exploration-Exploitation-Driven Posterior Enhancement (EPE) to generate and select high-quality responses. By integrating fifteen open-source LLMs, SMACS outperforms leading closed-source models such as Claude-3.7-Sonnet and GPT-4.1 across several benchmarks, showing that coordinated open models can exceed what any single model achieves alone.
SciMaster: Towards General-Purpose Scientific AI Agents, Part I. X-Master as Foundation: Can We Lead on Humanity's Last Exam? by Jingyi Chai et al.
This paper introduces X-Master, a tool-augmented reasoning agent designed to emulate human researchers by interacting with external tools. The authors also propose X-Masters, a scattered-and-stacked agentic workflow that enhances the breadth and depth of reasoning. Their open-source solution sets a new state-of-the-art record on the Humanity's Last Exam (HLE) benchmark, surpassing results from OpenAI and Google and becoming the first to exceed the 30% threshold.
Promptomatix: An Automatic Prompt Optimization Framework for Large Language Models by Rithesh Murthy et al.
Promptomatix is a framework designed to automatically optimize natural language task descriptions into high-quality prompts for LLMs. The system analyzes user intent, generates synthetic training data, and refines prompts using cost-aware objectives. It supports both a lightweight meta-prompt-based optimizer and a DSPy-powered compiler, achieving competitive or superior performance compared to existing libraries while reducing prompt length and computational overhead.
Learning without training: The implicit dynamics of in-context learning by Benoit Dherin et al.
This work investigates the mechanisms behind in-context learning in LLMs. The authors argue that the combination of a self-attention layer and an MLP allows a transformer block to implicitly modify the MLP's weights based on the context. Through theory and experimentation, they show how a transformer block can transform a context into a low-rank weight-update of the MLP layer, providing a potential explanation for how LLMs can learn at inference time without additional training.
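The core algebraic observation can be checked numerically: if the context shifts the attention output by some vector, the downstream linear layer behaves exactly as if its weights had received a rank-1 update, with no context present. The sketch below is my own minimal illustration of that identity (variable names and the rank-1 form are assumptions, not the paper's full transformer-block analysis).

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8
W = rng.normal(size=(d, d))    # MLP weight matrix
a = rng.normal(size=d)         # attention output for the query alone
delta = rng.normal(size=d)     # shift the context adds to that output

# With context, the MLP sees a + delta:
with_context = W @ (a + delta)

# Equivalently: the context-free input a, but W patched by a rank-1 update.
dW = np.outer(W @ delta, a) / (a @ a)
patched = (W + dW) @ a

assert np.allclose(with_context, patched)
assert np.linalg.matrix_rank(dW) == 1
```

The update is built only from quantities available at inference time, which is the sense in which the context "trains" the MLP without any gradient step.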
A Survey of Context Engineering for Large Language Models by Lingrui Mei et al.
This survey introduces "Context Engineering" as a formal discipline for optimizing the information provided to LLMs during inference. The authors present a comprehensive taxonomy of the field, covering context retrieval, generation, processing, and management. Based on an analysis of over 1400 research papers, the survey establishes a technical roadmap and identifies a critical research gap: the asymmetry between models' ability to understand complex contexts and their limitations in generating equally sophisticated, long-form outputs.
Can One Domain Help Others? A Data-Centric Study on Multi-Domain Reasoning via Reinforcement Learning by Yu Li et al.
This paper presents a systematic investigation of multi-domain reasoning within the Reinforcement Learning with Verifiable Rewards (RLVR) framework. The study explores the interplay between mathematical reasoning, code generation, and logical puzzle solving, evaluating in-domain improvements and cross-domain generalization. The findings offer insights into the dynamics of domain interactions and provide guidance for optimizing RL methodologies to foster more comprehensive reasoning capabilities in LLMs.
🧬 Genomics & Biomed AI
This section focuses on the application of AI in genomics and biomedicine, from clinical decision support to automated gene-set analysis and the integration of multimodal data in cancer research.
Large Language Models for Clinical Decision Support in NEJM AI
This article explores the potential of large language models to support clinical decision-making. While acknowledging the capabilities of AI in this domain, the authors also highlight the current limitations, particularly in autonomous ethical reasoning. The piece serves as a balanced overview of the opportunities and challenges of integrating LLMs into clinical workflows, emphasizing the need for further research and development to ensure their safe and effective use.
GeneAgent: self-verification language agent for gene-set analysis using domain databases by Zhizheng Wang et al.
GeneAgent is an LLM-based AI agent designed for gene-set analysis that reduces hallucinations by autonomously interacting with biological databases to verify its own output. An evaluation of over 1,100 gene sets showed that GeneAgent is consistently more accurate than GPT-4. When applied to novel gene sets from mouse melanoma cell lines, it produced more relevant and comprehensive functional descriptions, demonstrating its potential to expedite knowledge discovery in genomics.
From Classical Machine Learning to Emerging Foundation Models: Review on Multimodal Data Integration for Cancer Research by Amgad Muneer et al.
This review provides a comprehensive overview of multimodal data integration strategies in cancer research, mapping the transition from classical machine learning to foundation models. The paper examines methodological frameworks, validation protocols, and open-source resources for tasks such as cancer subtype classification, biomarker discovery, and treatment guidance. The authors argue that current integrative methods are laying the groundwork for the next generation of large-scale, pre-trained models that will revolutionize oncology.
Artificial intelligence in radiology: 173 commercially available products and their scientific evidence by Noa Antonissen et al.
This study assesses the evolution of peer-reviewed evidence for commercially available radiological AI products between 2020 and 2023. The number of CE-certified products with peer-reviewed evidence increased from 36% to 66%, with a notable rise in multicenter studies. However, the focus remains on lower-efficacy studies, and there has been a decrease in vendor-independent and multinational studies, highlighting the persistent challenges in establishing unbiased, real-world evidence for these tools.
🛡️ AI Safety & Alignment
This section covers research on ensuring the safety and integrity of AI systems, with a focus on control flow attestation.
Efficient Control Flow Attestation by Speculating on Control Flow Path Representations by Liam Tyler et al.
This paper introduces RESPEC-CFA, an architectural extension for Control Flow Attestation (CFA) that speculates on the locality of control flow paths and compresses the resulting path representations with Huffman encoding. This approach reduces control flow log sizes by up to 90.1% on its own and by up to 99.7% when combined with prior methods. The work represents a significant step toward making CFA more practical for verifying the run-time software integrity of embedded systems.
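The intuition behind the Huffman step is that real control flow logs are highly skewed: loops make a handful of branch targets dominate, so variable-length codes beat fixed-width ones. The sketch below illustrates that effect with classic Huffman coding on a synthetic branch log; it is an assumption-laden illustration of the principle, not RESPEC-CFA's hardware design.

```python
import heapq
from collections import Counter

def huffman_codes(stream):
    """Build prefix-free codes from symbol frequencies (classic Huffman)."""
    heap = [(f, i, sym) for i, (sym, f) in enumerate(Counter(stream).items())]
    heapq.heapify(heap)
    i = len(heap)
    while len(heap) > 1:  # repeatedly merge the two least-frequent subtrees
        f1, _, left = heapq.heappop(heap)
        f2, _, right = heapq.heappop(heap)
        heapq.heappush(heap, (f1 + f2, i, (left, right)))
        i += 1
    codes = {}
    def walk(node, prefix):
        if isinstance(node, tuple):
            walk(node[0], prefix + "0")
            walk(node[1], prefix + "1")
        else:
            codes[node] = prefix or "0"
    walk(heap[0][2], "")
    return codes

# A hot loop makes one branch target dominate the log (addresses are made up):
log = ["0x4000"] * 90 + ["0x4010"] * 8 + ["0x4020", "0x4030"]
codes = huffman_codes(log)
huff_bits = sum(len(codes[t]) for t in log)   # 90*1 + 8*2 + 2*3 = 112
fixed_bits = len(log) * 2                     # 2 bits each for 4 distinct targets
print(huff_bits, fixed_bits)                  # 112 200
```

Even against an already-minimal 2-bit fixed-width code, the skewed distribution cuts the log size by roughly 44% here; real logs with deeper skew compress further.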
🔬 Research Methods
This section highlights new methodologies in AI research, including test-time diffusion and the need for better observability and optimization of agentic systems.
Deep Researcher with Test-Time Diffusion by Rujun Han et al.
The authors propose the Test-Time Diffusion Deep Researcher (TTD-DR), a framework that conceptualizes research report generation as a diffusion process. TTD-DR starts with a preliminary draft and iteratively refines it through a "denoising" process that is dynamically informed by a retrieval mechanism. This approach is enhanced by a self-evolutionary algorithm, leading to state-of-the-art results on benchmarks that require intensive search and multi-hop reasoning.
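The control loop behind this idea is simple to sketch: start from a rough draft, diagnose what is missing (the "noise"), retrieve evidence for the gaps, and revise. The toy below is my own stand-in with stub functions; the real system uses an LLM for drafting and critique and a search tool for retrieval.

```python
def critique(draft, corpus):
    """'Denoising' signal: relevant facts the draft still lacks (stub)."""
    return [fact for fact in corpus if fact not in draft]

def refine(question, corpus, steps=5):
    draft = corpus[:1]                 # preliminary draft (stub: one known fact)
    for _ in range(steps):
        gaps = critique(draft, corpus)
        if not gaps:                   # draft is fully "denoised"
            break
        evidence = gaps[:1]            # retrieval stub: fetch one item per pass
        draft = draft + evidence       # revision step incorporates the evidence
    return draft

print(refine("q", ["fact-A", "fact-B", "fact-C"]))
# ['fact-A', 'fact-B', 'fact-C']
```

The key design point is that retrieval is conditioned on the current draft rather than on the original query alone, so each iteration targets the report's remaining gaps.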
Beyond Black-Box Benchmarking: Observability, Analytics, and Optimization of Agentic Systems by Dany Moshkovich et al.
This paper addresses the challenges of observing, analyzing, and optimizing agentic AI systems. The authors explore issues such as natural language variability and unpredictable execution flows, which hinder predictability and control. They introduce taxonomies for expected analytics outcomes and propose a novel approach for benchmarking agent evaluation systems that uses runtime logs as input, moving beyond traditional "black box" performance evaluation.
🏭 Industry Applications
This section looks at the practical application of AI in industry, with a focus on a recent healthcare hackathon.
Google France AI Healthcare Hackathon
This blog post provides an overview of a recent AI in healthcare hackathon hosted by Google in France. The event brought together developers, researchers, and healthcare professionals to collaborate on innovative solutions to pressing challenges in the medical field. The post highlights the key themes and outcomes of the hackathon, showcasing the potential of AI to drive progress in healthcare through collaborative, hands-on problem-solving.
👁️ Computer Vision & Multimodal
This section covers the intersection of language and other data modalities, with a focus on wearable sensor data.
SensorLM: Learning the Language of Wearable Sensors by Yuwei Zhang et al.
SensorLM is a family of sensor-language foundation models designed to enable the understanding of wearable sensor data through natural language. The authors developed a hierarchical caption generation pipeline to create the largest sensor-language dataset to date, with over 59.7 million hours of data. SensorLM demonstrates superior performance in zero-shot recognition, few-shot learning, and cross-modal retrieval, as well as intriguing capabilities in sensor captioning and generalization to unseen tasks.
🎯 Closing Thoughts
This week's research highlights the rapid advancements in both the theoretical underpinnings and practical applications of AI. The dual focus on enhancing the core reasoning capabilities of LLMs while simultaneously deploying them in specialized fields like genomics and healthcare underscores the maturation of the field.
What stood out to you this week? Share your thoughts in the comments, and join us next Wednesday for more insights. Stay curious!
If you enjoyed this edition of Run the Reads, please share it with your network and subscribe for weekly updates on the latest in AI research and applications. Follow me at @bioinfo and on LinkedIn.