ArXiv AI: Weekly Top Picks

Coverage: 2026-01-18 → 2026-01-25

This week in AI papers

We keep an eye on new AI papers on arXiv, pick one or two that really matter each day, and share the key ideas — no hype, just clear explanations.

Unpacked by our trio: Alex the plain-language host, Marc the hands-on power user, and Jamie the senior ML engineer.

LLM Daily – Controlling Long-Horizon Behavior in Language Model Agents with Explicit State D

2026-01-25 · 10 min · 15.3 MB

Excerpt — Large language model (LLM) agents often exhibit abrupt shifts in tone and persona during extended interaction, reflecting the absence of explicit temporal structure governing agent-level state. While prior work…

📝 Article 📄 PDF

LLM Daily – Deja Vu in Plots: Leveraging Cross-Session Evidence with Retrieval-Augmented LLM

2026-01-25 · 10 min · 13.3 MB

Excerpt — The rise of live streaming has transformed online interaction, enabling massive real-time engagement but also exposing platforms to complex risks such as scams and coordinated malicious behaviors. Detecting these risks…

📝 Article 📄 PDF

LLM Daily – From Passive Metric to Active Signal: The Evolving Role of Uncertainty Quantific

2026-01-24 · 10 min · 15.7 MB

Excerpt — While Large Language Models (LLMs) show remarkable capabilities, their unreliability remains a critical barrier to deployment in high-stakes domains. This survey charts a functional evolution in addressing this…

📝 Article 📄 PDF

LLM Daily – AgentSM: Semantic Memory for Agentic Text-to-SQL

2026-01-24 · 10 min · 15.0 MB

Excerpt — Recent advances in LLM-based Text-to-SQL have achieved remarkable gains on public benchmarks such as BIRD and Spider. Yet, these systems struggle to scale in realistic enterprise settings with large, complex schemas,…

📝 Article 📄 PDF

LLM Daily – Agentic Confidence Calibration

2026-01-24 · 10 min · 14.2 MB

Excerpt — AI agents are rapidly advancing from passive language models to autonomous systems executing complex, multi-step tasks. Yet their overconfidence in failure remains a fundamental barrier to deployment in high-stakes…

📝 Article 📄 PDF

LLM Daily – Inference-Time Scaling of Verification: Self-Evolving Deep Research Agents via T

2026-01-24 · 10 min · 15.1 MB

Excerpt — Recent advances in Deep Research Agents (DRAs) are transforming automated knowledge discovery and problem-solving. While the majority of existing efforts focus on enhancing policy capabilities via post-training, we…

📝 Article 📄 PDF

LLM Daily – Introducing the Generative Application Firewall (GAF)

2026-01-24 · 10 min · 15.7 MB

Excerpt — This paper introduces the Generative Application Firewall (GAF), a new architectural layer for securing LLM applications. Existing defenses -- prompt filters, guardrails, and data-masking -- remain fragmented; GAF…

📝 Article 📄 PDF

LLM Daily – Emerging from Ground: Addressing Intent Deviation in Tool-Using Agents via Deriv

2026-01-22 · 10 min · 15.3 MB

Excerpt — LLMs have advanced tool-using agents for real-world applications, yet they often lead to unexpected behaviors or results. Beyond obvious failures, the subtle issue of "intent deviation" severely hinders reliable…

📝 Article 📄 PDF

LLM Daily – How to Build AI Agents by Augmenting LLMs with Codified Human Expert Domain Know

2026-01-22 · 10 min · 14.8 MB

Excerpt — Critical domain knowledge typically resides with few experts, creating organizational bottlenecks in scalability and decision-making. Non-experts struggle to create effective visualizations, leading to suboptimal…

📝 Article 📄 PDF

LLM Daily – When Agents Fail: A Comprehensive Study of Bugs in LLM Agents with Automated Lab

2026-01-22 · 10 min · 14.2 MB

Excerpt — Large Language Models (LLMs) have revolutionized intelligent application development. While standalone LLMs cannot perform any actions, LLM agents address the limitation by integrating tools. However, debugging LLM…

📝 Article 📄 PDF

LLM Daily – Taxonomy-Aligned Risk Extraction from 10-K Filings with Autonomous Improvement U

2026-01-22 · 10 min · 15.2 MB

Excerpt — We present a methodology for extracting structured risk factors from corporate 10-K filings while maintaining adherence to a predefined hierarchical taxonomy. Our three-stage pipeline combines LLM extraction with…

📝 Article 📄 PDF

LLM Daily – Privacy Collapse: Benign Fine-Tuning Can Break Contextual Privacy in Language Mo

2026-01-22 · 10 min · 17.4 MB

Excerpt — We identify a novel phenomenon in language models: benign fine-tuning of frontier models can lead to privacy collapse. We find that diverse, subtle patterns in training data can degrade contextual privacy, including…

📝 Article 📄 PDF

LLM Daily – Look-Ahead-Bench: a Standardized Benchmark of Look-ahead Bias in Point-in-Time L

2026-01-21 · 10 min · 12.5 MB

Excerpt — We introduce Look-Ahead-Bench, a standardized benchmark measuring look-ahead bias in Point-in-Time (PiT) Large Language Models (LLMs) within realistic and practical financial workflows. Unlike most existing approaches…

📝 Article 📄 PDF

LLM Daily – VulnResolver: A Hybrid Agent Framework for LLM-Based Automated Vulnerability Iss

2026-01-21 · 10 min · 12.8 MB

Excerpt — As software systems grow in complexity, security vulnerabilities have become increasingly prevalent, posing serious risks and economic costs. Although automated detection tools such as fuzzers have advanced…

📝 Article 📄 PDF

LLM Daily – HALT: Hallucination Assessment via Latent Testing

2026-01-21 · 10 min · 11.7 MB

Excerpt — Hallucination in large language models (LLMs) can be understood as a failure of faithful readout: although internal representations may encode uncertainty about a query, decoding pressures still yield a fluent answer.…

📝 Article 📄 PDF

LLM Daily – APEX-Agents

2026-01-21 · 10 min · 12.5 MB

Excerpt — We introduce the AI Productivity Index for Agents (APEX-Agents), a benchmark for assessing whether AI agents can execute long-horizon, cross-application tasks created by investment banking analysts, management…

📝 Article 📄 PDF

LLM Daily – A Systematic Analysis of Chunking Strategies for Reliable Question Answering

2026-01-21 · 10 min · 12.3 MB

Excerpt — We study how document chunking choices impact the reliability of Retrieval-Augmented Generation (RAG) systems in industry. While practice often relies on heuristics, our end-to-end evaluation on Natural Questions…

📝 Article 📄 PDF

LLM Daily – Do We Always Need Query-Level Workflows? Rethinking Agentic Workflow Generation

2026-01-19 · 10 min · 14.1 MB

Excerpt — Multi-Agent Systems (MAS) built on large language models typically solve complex tasks by coordinating multiple agents through workflows. Existing approaches generate workflows either at task level or query level, but…

📝 Article 📄 PDF

LLM Daily – Knowledge is Not Enough: Injecting RL Skills for Continual Adaptation

2026-01-19 · 10 min · 14.9 MB

Excerpt — Large Language Models (LLMs) face the "knowledge cutoff" challenge, where their frozen parametric memory prevents direct internalization of new information. While Supervised Fine-Tuning (SFT) is commonly used to update…

📝 Article 📄 PDF
