ArXiv AI: Weekly Top Picks

Coverage: 2026-01-25 → 2026-02-01

This week in AI papers

We keep an eye on new AI papers on arXiv, pick one or two that really matter each day, and share the key ideas — no hype, just clear explanations.

Unpacked by our trio: Alex the plain-language host, Marc the hands-on power user, and Jamie the senior ML engineer.

LLM Daily – CAR-bench: Evaluating the Consistency and Limit-Awareness of LLM Agents under Re

2026-02-01 · 10 min · 14.3 MB

Excerpt — Existing benchmarks for Large Language Model (LLM) agents focus on task completion under idealistic settings but overlook reliability in real-world, user-facing applications. In domains such as in-car voice assistants,…

📝 Article 📄 PDF

LLM Daily – Optimizing Agentic Workflows using Meta-tools

2026-02-01 · 10 min · 13.8 MB

Excerpt — Agentic AI enables LLMs to dynamically reason, plan, and interact with tools to solve complex tasks. However, agentic workflows often require many iterative reasoning steps and tool invocations, leading to significant…

📝 Article 📄 PDF

LLM Daily – $G^2$-Reader: Dual Evolving Graphs for Multimodal Document QA

2026-02-01 · 10 min · 13.4 MB

Excerpt — Retrieval-augmented generation is a practical paradigm for question answering over long documents, but it remains brittle for multimodal reading where text, tables, and figures are interleaved across many pages. First,…

📝 Article 📄 PDF

LLM Daily – World of Workflows: a Benchmark for Bringing World Models to Enterprise Systems

2026-01-30 · 10 min · 11.4 MB

Excerpt — Frontier large language models (LLMs) excel as autonomous agents in many domains, yet they remain untested in complex enterprise systems where hidden workflows create cascading effects across interconnected databases.…

📝 Article 📄 PDF

LLM Daily – RedSage: A Cybersecurity Generalist LLM

2026-01-30 · 10 min · 15.7 MB

Excerpt — Cybersecurity operations demand assistant LLMs that support diverse workflows without exposing sensitive data. Existing solutions either rely on proprietary APIs with privacy risks or on open models lacking domain…

📝 Article 📄 PDF

LLM Daily – StepShield: When, Not Whether to Intervene on Rogue Agents

2026-01-30 · 10 min · 15.5 MB

Excerpt — Existing agent safety benchmarks report binary accuracy, conflating early intervention with post-mortem analysis. A detector that flags a violation at step 8 enables intervention; one that reports it at step 48 provides…

📝 Article 📄 PDF

LLM Daily – BMAM: Brain-inspired Multi-Agent Memory Framework

2026-01-29 · 10 min · 14.4 MB

Excerpt — Language-model-based agents operating over extended interaction horizons face persistent challenges in preserving temporally grounded information and maintaining behavioral consistency across sessions, a failure mode we…

📝 Article 📄 PDF

LLM Daily – AgentIF-OneDay: A Task-level Instruction-Following Benchmark for General AI Agen

2026-01-29 · 10 min · 13.2 MB

Excerpt — The capacity of AI agents to effectively handle tasks of increasing duration and complexity continues to grow, demonstrating exceptional performance in coding, deep research, and complex problem-solving evaluations.…

📝 Article 📄 PDF

LLM Daily – AgentLongBench: A Controllable Long Benchmark For Long-Contexts Agents via Envir

2026-01-29 · 10 min · 14.8 MB

Excerpt — The evolution of Large Language Models (LLMs) into autonomous agents necessitates the management of extensive, dynamic contexts. Current benchmarks, however, remain largely static, relying on passive retrieval tasks…

📝 Article 📄 PDF

LLM Daily – AdaReasoner: Dynamic Tool Orchestration for Iterative Visual Reasoning

2026-01-27 · 10 min · 14.7 MB

Excerpt — When humans face problems beyond their immediate capabilities, they rely on tools, providing a promising paradigm for improving visual reasoning in multimodal large language models (MLLMs). Effective reasoning,…

📝 Article 📄 PDF

LLM Daily – TEA-Bench: A Systematic Benchmarking of Tool-enhanced Emotional Support Dialogue

2026-01-27 · 10 min · 12.6 MB

Excerpt — Emotional Support Conversation requires not only affective expression but also grounded instrumental support to provide trustworthy guidance. However, existing ESC systems and benchmarks largely focus on affective…

📝 Article 📄 PDF

LLM Daily – U-Fold: Dynamic Intent-Aware Context Folding for User-Centric Agents

2026-01-27 · 10 min · 16.2 MB

Excerpt — Large language model (LLM)-based agents have been successfully deployed in many tool-augmented settings, but their scalability is fundamentally constrained by context length. Existing context-folding methods mitigate…

📝 Article 📄 PDF

LLM Daily – AgentDoG: A Diagnostic Guardrail Framework for AI Agent Safety and Security

2026-01-27 · 10 min · 16.8 MB

Excerpt — The rise of AI agents introduces complex safety and security challenges arising from autonomous tool use and environmental interactions. Current guardrail models lack agentic risk awareness and transparency in risk…

📝 Article 📄 PDF

LLM Daily – FadeMem: Biologically-Inspired Forgetting for Efficient Agent Memory

2026-01-27 · 10 min · 13.3 MB

Excerpt — Large language models deployed as autonomous agents face critical memory limitations, lacking selective forgetting mechanisms that lead to either catastrophic forgetting at context boundaries or information overload…

📝 Article 📄 PDF

LLM Daily – Controlling Long-Horizon Behavior in Language Model Agents with Explicit State D

2026-01-25 · 10 min · 15.3 MB

Excerpt — Large language model (LLM) agents often exhibit abrupt shifts in tone and persona during extended interaction, reflecting the absence of explicit temporal structure governing agent-level state. While prior work…

📝 Article 📄 PDF

LLM Daily – Deja Vu in Plots: Leveraging Cross-Session Evidence with Retrieval-Augmented LLM

2026-01-25 · 10 min · 13.3 MB

Excerpt — The rise of live streaming has transformed online interaction, enabling massive real-time engagement but also exposing platforms to complex risks such as scams and coordinated malicious behaviors. Detecting these risks…

📝 Article 📄 PDF
