arXiv cs.AI papers June 12 2026
This feed collects 50 papers posted to arXiv's cs.AI section on June 12, 2026, spanning agent frameworks, formal reasoning, safety evaluation, and multimodal learning. Topics include tool-use optimization for LLM agents, tree-search cognition layers, clinical LLM deployment, AGI definitions, and unlearning benchmarks for multimodal models.
- ToolSense (arXiv:2606.12451) audits parametric tool retrieval in LLMs using virtual token encoding fine-tuned in two stages
- Arbor (arXiv:2606.12563) introduces tree search as shared working memory across multi-agent systems in stateful action spaces
- Pythagoras-Prover (arXiv:2606.12594) offers compute-efficient Lean theorem provers at 4B and 32B parameters, plus a diffusion-based variant
- SciAgentArena (arXiv:2606.12736) benchmarks AI agents on real-world scientific tasks with interactive evaluation support
- MLUBench (arXiv:2606.12809) provides a large-scale benchmark with 127 examples for lifelong unlearning in multimodal LLMs
ToolSense: A Diagnostic Framework for Auditing Parametric Tool Knowledge in LLMs
Arbor: Tree Search as a Cognition Layer for Autonomous Agents
Strategic Decision Support for AI Agents
Pythagoras-Prover: Advancing Efficient Formal Proving via Augmented Lean Formalisation
PersonaDrive: Human-Style Retrieval-Augmented VLA Agents for Closed-Loop Driving Simulation
"Did you lie?" Evaluating Lie Detectors across Model Scale and Belief-Verified Model Organisms
TrajGenAgent: A Hierarchical LLM Agent for Human Mobility Trajectory Generation
Evoflux: Inference-Time Evolution of Executable Tool Workflows for Compact Agents
From AGI to ASI
Deployment-Centered Evaluation: Predicting Query-Level Rejection Risk in a Clinical LLM System
Definitional alignment before capability alignment: a Design-Science framework for adjudicating claims about AGI
The Theory of Mind Utility: Formal Specification of a Mentalizing Mechanism
Rethinking Psychometric Evaluation of LLMs: When and Why Self-Reports Predict Behavior
Benchmarking AI Agents for Addressing Scientific Challenges Across Scales
Reducing the Complexity of Deep Learning Models for EEG Analysis on Wearable Devices
Prefill Awareness in Large Language Models
Constructing Evaluation Datasets for Procedural Reasoning: Balancing Naturalness, Grounding, and Multi-Hop Coverage
A Tutorial on World Models and Physical AI
The Containment Gap: How Deployed Agentic AI Frameworks Fail Public-Facing Safety Requirements
MLUBench: A Benchmark for Lifelong Unlearning Evaluation in MLLMs
Teach-and-Repeat: Accurately Extracting Operational Knowledge from Mobile Screen Demonstrations to Empower GUI Agents
GeoNatureAgent Benchmark: Benchmarking LLM Agents for Environmental Geospatial Analysis Across Frontier and Open-Weight Foundation Models
Topical Phase Transitions in Artificial Intelligence Research: Large-Scale Evidence and an Early-Warning Signature for Emerging Topics
Fantastic Scientific Agents and How to Build Them: AgentBuild for Rietveld Refinement
(Human) Attention Is (Still) All You Need: Human oversight makes AI-assisted social science reliable
WISE: A Long-Horizon Agent in Minecraft with Why-Which Reasoning
DailyReport: An Open-ended Benchmark for Evaluating Search Agents on Daily Search Tasks
HarnessBridge: Learnable Bidirectional Controller for LLM Agent Harness
The Hidden Power of Scaling Factor in LoRA Optimization
Zero-source LLM Hallucination Detection with Human-like Criteria Probing
MDForge: Agentic Molecular Dynamics Pipeline Design under Sparse Simulator Feedback
Iterating Toward Better Search: A Two-Agent Simulation Framework for Evaluating Agentic Search Architectures in E-Commerce
MARS: Margin-Adversarial Risk-controlled Stopping for Parallel LLM Test-time Scaling
PRISMR: Overcoming Parse Collapse in Multimodal Listwise Ranking via Parameterized Representation Internalization
Learning What to Remember: A Cognitively Grounded Multi-Factor Value Model for Agentic Memory
OpenMedQ: Broad Open Pretraining for Medical Vision-Language Models
Multi-Modal Agents for Power Distribution Defect Detection: An Evaluation of Foundation Models
A Mathematical Forum Platform for Collaborative Problem Solving and Dataset Generation for AI Reasoning
Structured Testbench Generation for LLM-Driven HDL Design and Verification-Oriented Data Curation
APCyc: Property-Informed Design of Cyclic Peptides via Automated Cyclization
The Illusion of Multi-Agent Advantage
Otters++: A Time-to-first-spike Based Energy Efficient Optical Spiking Transformer
SciR: A Controllable Benchmark for Scientific Reasoning in LLMs
Nous: An Attempt to Extract and Inject the Cognition Behind Prediction-Market Behavior
Augmentation techniques for video surveillance in the visible and thermal spectral range
AAbAAC: An Annotated Corpus for Autoimmunity Information Extraction
Rethinking RAG in Long Videos: What to Retrieve and How to Use It?
TerraBench: Can Agents Reason Over Heterogeneous Earth-System Data?
Mental-R1: Aligning LLM Reasoning for Mental Health Assessment
Reasoning for Mobile User Experience with Multimodal LLMs: Task, Benchmark, and Approach