arXiv cs.AI papers June 6 2026
This feed collects 50 papers posted to arXiv's cs.AI section on June 6, 2026, spanning multi-agent communication efficiency, time series forecasting, AI evaluation benchmarks, program synthesis, medical literature summarization, and AI governance. Topics include LLM-based agents for long-horizon tasks, interpretability frameworks, quantization methods for efficient deployment, and technical verification of frontier AI training.
- LeanMarathon: multi-agent harness for research-level Lean autoformalization using evolving blueprint abstraction with contract-scoped agents for construction, auditing, proving, and repair
- SentinelBench: benchmark for evaluating long-running monitoring agents that must sustain attention rather than act continuously, measuring performance on tasks spanning minutes to hours
- SAGE-PTQ: ultra-low-bit quantization framework for LLMs that minimizes hidden scaling overhead by separating salient and unsalient weights using distributional statistics
- Agents' Last Exam (ALE): benchmark evaluating AI agents on long-horizon, economically valuable real-world tasks with verifiable outcomes, addressing the gap between benchmark performance and professional deployment
- Zero-knowledge proof framework proposed for verifying frontier AI training compute without self-reporting, enabling technical verification for international AI governance agreements
How Far Did They Go? The Persuasive Tactics of Covert LLM Agents in a Discontinued Field Experiment
What Should Agents Say? Action-state Communication for Efficient Multi-Agent Systems
I Know What You Meme, Even If it Emerged Today: Understanding Evolving Memes through Open-World Knowledge Acquisition
GITCO: Gated Inference-Time Context Optimization in TSFMs
Uncertainty Aware Functional Behavior Prediction and Material Fatigue Assessment for Circular Factory
SentinelBench: A Benchmark for Long-Running Monitoring Agents
An interpretable and trustworthy AI framework for large-scale longitudinal structure-pain association studies using data from the Osteoarthritis Initiative (OAI)
Synthetic Contrastive Reasoning for Multi-Table Q&A
Stability vs. Manipulability: Evaluating Robustness Under Post-Decision Interaction in LLM Judges
Residual Modeling for High-Fidelity Learned Compression of Scientific Data
LeanMarathon: Toward Reliable AI Co-Mathematicians through Long-Horizon Lean Autoformalization
Harnessing Generalist Agents for Contextualized Time Series
Agents' Last Exam
Mutation Without Variation: Convergence Dynamics in LLM-Driven Program Evolution
A Motivational Architecture for Conversational AGI
Assessing the Carbon Emissions and Energy Consumption of U.S. Hyperscale Data Centers
Minimizing the Hidden Cost of Scales: Graph-Guided Ultra-Low-Bit Quantization for Large Language Models
Zero knowledge verification for frontier AI training is possible
Ten Headache Specialists versus Artificial Intelligence for Clinical Literature Summarization: A Critical Evaluation and Comparison
Brick-Composer: Using MLLMs for Assembly with Diverse Bricks
Insurance of Agentic AI
Output Type Before Quality: A Standards-Derived XAI Admissibility Rubric for Autonomous-Driving Safety
PSEBench: A Controllable and Verifiable Benchmark for Evaluating LLMs in Patient Safety Event Triage
Step-by-Step Optimization-like Reasoning in LLMs over Expanding Search Spaces
Severity-Aware Curriculum Learning with Multi-Model Response Selection for Medical Text Generation
EpiEvolve: Self-Evolving Agents for Streaming Pandemic Forecasting under Regime Shifts
SciVisAgentSkills: Design and Evaluation of Agent Skills for Scientific Data Analysis and Visualization
When Should We Protect AI? A Precautionary Framework for Consciousness Uncertainty
Individual Gain, Collective Loss: Metacognitive Adaptation in AI-Assisted Creativity
SoCRATES: Towards Reliable Automated Evaluation of Proactive LLM Mediation across Domains and Socio-cognitive Variations
GuardNet: Ensemble Strategies of Shallow Neural Networks for Robust Prompt Injection and Jailbreak Detection
Fix the Mind, Not the Move: Interpretable AI Assistance via Knowledge-Gap Localization
Multilingual Fine-Tuning via Localized Gradient Conflict Resolution
Safety Paradox: How Enhanced Safety Awareness Leaves LLMs Vulnerable to Posterior Attack
Self-Commitment Latency: A Reward-Free Probe for Prompted Implicit Hacking
Evaluation of LLMs for Mathematical Formalization in Lean
Answer Presence Drives RAG Rewriting Gains
FIDES: Faithful Inference via Deep Evidence Signals for Retrieval-Memory Conflict in RAG
Coding with "Enemy": Can Human Developers Detect AI Agent Sabotage?
Continual Learning Bench: Evaluating Frontier AI Systems in Real-World Stateful Environments
Do More Agents Help? Controlled and Protocol-Aligned Evaluation of LLM Agent Workflows
Beyond Output Matching: Preserving Internal Geometry in NVFP4 LLM Distillation
AdaMEM: Test-Time Adaptive Memory for Language Agents
PerceptUI: LLM Agents as Human-Aligned Synthetic Users for UI/UX Evaluation
Seeing Time: Benchmarking Chronological Reasoning and Shortcut Biases in Vision-Language Models
Critic-Guided Heterogeneous Multi-Agent Reasoning for Reliable Mathematical Problem Solving
DiG-Plan: Mitigating Early Commitment for Tool-Graph Planning via Diffusion Guidance
When AI Says It Feels
Class-Specific Branch Attention for Mitigating Gradient Interference under Class Imbalance
SubtleMemory: A Benchmark for Fine-Grained Relational Memory Discrimination in Long-Horizon AI Agents