LLM

RESEARCH · arXiv CS.AI · 22d ago

LinAlg-Bench: A Forensic Benchmark Revealing Structural Failure Modes in LLM Mathematical Reasoning

LinAlg-Bench is a new diagnostic benchmark that evaluates 10 frontier large language models (LLMs) on structured linear algebra computation. It measures performance across matrices of increasing dimension, classifies failures into ten primary error types, and identifies a behavioral threshold at 4x4 matrices.

RESEARCH · arXiv CS.LG · 4/23/2026

WorkflowGen: an adaptive workflow generation mechanism driven by trajectory experience

WorkflowGen addresses the high overhead and instability of LLM agents on complex tasks with an adaptive workflow generation framework driven by trajectory experience. It captures full execution trajectories to extract reusable knowledge and performs lightweight generation on variable nodes, significantly reducing token usage and improving efficiency.

RESEARCH · arXiv CS.CL · 28d ago

RETUYT-INCO at BEA 2026 Shared Task 2: Meta-prompting in Rubric-based Scoring for German

This paper describes RETUYT-INCO's participation in the BEA 2026 shared task on rubric-based short answer scoring for German. The team proposes a "Meta-prompting" method in which an LLM generates custom prompts for grading, and placed 6th in Track 1 and 4th in Track 3 using LLM-based and other approaches.

ARTICLE · DEV.to AI · 5d ago

Why LLM Agents Still Can't Query NoSQL Databases

LLMs query SQL databases well because SQL is precise and abundantly represented in training data, which makes it a natural interface. LLM agents struggle far more with NoSQL databases, a common production data store, primarily because NoSQL lacks the specificity and consistent syntax of SQL.
