
datasets

19 items

RESEARCH · arXiv CS.CL · 1d ago

HKJudge: A Legal Discourse-Annotated Corpus for Interpreting What Courts Find, How They Reason, and What They Rule

The HKJudge project introduces the first sentence-level, expert-annotated legal discourse corpus of Hong Kong criminal judgments, comprising approximately 290k sentences. It uses a two-tier discourse schema to identify what courts find, how they reason, and what they rule, and reports high inter-annotator agreement.

36
RESEARCH · arXiv CS.LG · 19d ago

MagBridge-Battery: A Synthetic Bridge Dataset for Li-ion Magnetometry and State-of-Health Diagnostics

This research introduces MagBridge-Battery v1.0, a new synthetic dataset comprising 6,760 magnetic-field signatures for diagnosing the health of Li-ion batteries. It bridges real magnetic data with state-of-health labels to overcome the lack of public datasets for magnetic sensing in battery degradation studies.

30
RESEARCH · arXiv CS.AI · 4d ago

Synthetic Contrastive Reasoning for Multi-Table Q&A

This paper introduces a synthetic contrastive reasoning-trace dataset for multi-table question answering (MMQA), addressing the lack of reasoning supervision in existing resources. Open-weight LLMs fine-tuned with Contrastive Preference Optimization (CPO) using this dataset achieved significant performance improvements, highlighting the benefits of heterogeneous trace generators.

28
RESEARCH · arXiv CS.CL · 5/8/2026

When2Speak: A Dataset for Temporal Participation and Turn-Taking in Multi-Party Conversations for Large Language Models

When2Speak is a new synthetic dataset and four-stage generation pipeline designed to teach Large Language Models (LLMs) appropriate intervention timing in multi-party conversations. It addresses the challenge of avoiding excessive interruptions and improving conversational coherence in group interactions.

27
ARTICLE · DEV.to AI · 22d ago

Medical AI Doesn’t Just Need Bigger Models. It Needs an ImageNet for State Transitions

This article proposes the creation of "Biomedical TransitionNet", a new type of dataset analogous to ImageNet, but focused on biological state transitions for the next generation of medical AI. It argues for the necessity of such infrastructure to build real-world models in biomedicine, moving beyond classification and prediction.

27
RESEARCH · arXiv CS.CL · 4/20/2026

"Excuse me, may I say something..." CoLabScience, A Proactive AI Assistant for Biomedical Discovery and LLM-Expert Collaborations

CoLabScience is a proactive LLM assistant that aims to accelerate biomedical discovery by facilitating collaboration between AI and human experts. It features PULI, a novel reinforcement learning framework for timely interventions in scientific discussions, and presents BSDD, a new benchmark dataset of simulated research dialogue.

27
RESEARCH · arXiv CS.CL · 5/1/2026

BatteryPass-12K: The First Dataset for the Novel Digital Battery Passport Conformance Task

This paper introduces BatteryPass-12K, the first public dataset for the novel task of digital battery passport (DBP) conformance classification, addressing a critical need ahead of new EU regulations. It benchmarks 22 language models, finding that "thinking" models such as GPT-5.4 achieve the best performance and that few-shot examples significantly improve results on this challenging task.

27
RESEARCH · arXiv CS.CL · 5/8/2026

Generating Query-Focused Summarization Datasets from Query-Free Summarization Datasets

This paper proposes an evidence-based model to generate queries from query-free summarization datasets, addressing the challenge of finding suitable datasets for Query-Focused Summarization (QFS). Experimental results indicate that summaries generated using these evidence-based queries achieve competitive ROUGE scores, supporting their effectiveness for the QFS task.

27
RESEARCH · arXiv CS.CL · 5/4/2026

Cultural Benchmarking of LLMs in Standard and Dialectal Arabic Dialogues

This paper addresses the gap in evaluating cultural reasoning in LLMs, introducing ArabCulture-Dialogue, a culturally grounded conversational dataset covering 13 Arabic-speaking countries. Experiments indicate that models perform worse on cultural reasoning, translation, and generation tasks in dialectal setups than in Modern Standard Arabic.

27