datasets

19 items

ARTICLE·↑ trending·Reddit r/MachineLearning·4/20/2026

SGOCR: A Spatially-Grounded OCR-focused Pipeline & V1 Dataset [P]

An independent researcher created SGOCR, an open-source dataset pipeline for spatially-grounded, OCR-focused VQA, to fill a gap in visual datasets for text grounding in imagery. This pipeline generates VQA tuples with rich metadata, supporting diverse VLM training strategies.

Open Source Vision-Language Models datasets OCR

RESEARCH·arXiv CS.CL·1d ago

HKJudge: A Legal Discourse-Annotated Corpus for Interpreting What Courts Find, How They Reason, and What They Rule

The HKJudge project introduces the first sentence-level, expert-annotated legal discourse corpus of Hong Kong criminal judgments, comprising approximately 290k sentences. It utilizes a two-tier discourse schema to identify what courts find, how they reason, and what they rule, with high inter-annotator agreement.

Natural Language Processing datasets linguistics legal tech

RESEARCH·arXiv CS.AI·1d ago

CrowdMath: A Dataset of Crowdsourced Mathematical Research Discussions

This paper introduces CrowdMath, a dataset of 164 expert-annotated progress chains from the MIT PRIMES–Art of Problem Solving CrowdMath program. It aims to evaluate large language models on collaborative open-problem solving in mathematical research, diverging from benchmarks focused on final answers or complete proofs.

mathematical reasoning LLMs datasets Benchmarks

RESEARCH·arXiv CS.LG·19d ago

MagBridge-Battery: A Synthetic Bridge Dataset for Li-ion Magnetometry and State-of-Health Diagnostics

This research introduces MagBridge-Battery v1.0, a new synthetic dataset comprising 6,760 magnetic-field signatures for diagnosing the health of Li-ion batteries. It bridges real magnetic data with state-of-health labels to overcome the lack of public datasets for magnetic sensing in battery degradation studies.

Battery Diagnostics State-of-Health Magnetometry Li-ion Batteries

RESEARCH·arXiv CS.AI·4d ago

Synthetic Contrastive Reasoning for Multi-Table Q&A

This paper introduces a synthetic contrastive reasoning-trace dataset for multi-table question answering (MMQA), addressing the lack of reasoning supervision in existing resources. Open-weight LLMs fine-tuned with Contrastive Preference Optimization (CPO) using this dataset achieved significant performance improvements, highlighting the benefits of heterogeneous trace generators.

Question Answering machine learning NLP datasets

RESEARCH·Hugging Face Blog·5d ago

EVA-Bench Data 2.0: 3 Domains, 121 Tools, 213 Scenarios

EVA-Bench Data 2.0 introduces an updated benchmark featuring 3 domains, 121 tools, and 213 scenarios. This dataset is designed for evaluating AI systems and tools.

AI benchmarking datasets AI tools AI evaluation

RESEARCH·arXiv CS.CL·5/8/2026

When2Speak: A Dataset for Temporal Participation and Turn-Taking in Multi-Party Conversations for Large Language Models

When2Speak is a new synthetic dataset and four-stage generation pipeline designed to teach Large Language Models (LLMs) appropriate intervention timing in multi-party conversations. It addresses the challenge of avoiding excessive interruptions and improving conversational coherence in group interactions.

LLMs machine learning datasets Conversational AI

ARTICLE·DEV.to AI·22d ago

Medical AI Doesn’t Just Need Bigger Models. It Needs an ImageNet for State Transitions

This article proposes the creation of "Biomedical TransitionNet", a new type of dataset analogous to ImageNet, but focused on biological state transitions for the next generation of medical AI. It argues for the necessity of such infrastructure to build real-world models in biomedicine, moving beyond classification and prediction.

Biomedical TransitionNet datasets AI infrastructure healthcare AI

RESEARCH·arXiv CS.CL·4/20/2026

"Excuse me, may I say something..." CoLabScience, A Proactive AI Assistant for Biomedical Discovery and LLM-Expert Collaborations

CoLabScience is introduced as a proactive LLM assistant aimed at accelerating biomedical discovery by facilitating collaborations between AI and human experts. It features PULI, a novel reinforcement learning framework for timely interventions in scientific discussions, and also presents BSDD, a new benchmark dataset of simulated research dialogue.

LLMs AI collaboration reinforcement learning datasets

RESEARCH·DEV.to AI·5/10/2026

GQA: A New Dataset for Real-World Visual Reasoning and Compositional Question Answering

GQA is a new dataset designed to challenge and evaluate AI systems in visual reasoning and compositional question answering. It aims to advance scene understanding and multimodal interaction in real-world scenarios.

Question Answering visual reasoning computer vision datasets

RESEARCH·DEV.to AI·4/25/2026

JSUT corpus: free large-scale Japanese speech corpus for end-to-end speech synthesis

The JSUT corpus is a free, large-scale Japanese speech dataset designed for end-to-end speech synthesis research. It provides valuable resources for developing advanced AI models in speech technology for the Japanese language.

japanese language speech synthesis machine learning Natural Language Processing

DOC·Hugging Face (YouTube)·7d ago

How to Create an LLM Dataset | FineWeb Overview

This video is a guide to creating datasets for Large Language Models (LLMs). It includes an overview of FineWeb, a resource relevant to this process.

learning datasets AI development FineWeb

RESEARCH·arXiv CS.CL·5/1/2026

BatteryPass-12K: The First Dataset for the Novel Digital Battery Passport Conformance Task

This paper introduces BatteryPass-12K, the first public dataset for the novel task of digital battery passport (DBP) conformance classification, addressing a critical need before new EU regulations. It benchmarks 22 language models, finding that "Thinking models" like GPT-5.4 achieve the best performance, and few-shot examples significantly enhance results on this challenging task.

evaluation Benchmarking Natural Language Processing datasets

RESEARCH·arXiv CS.CL·5/8/2026

Generating Query-Focused Summarization Datasets from Query-Free Summarization Datasets

This paper proposes an evidence-based model to generate queries from query-free summarization datasets, addressing the challenge of finding suitable datasets for Query-Focused Summarization (QFS). Experimental results indicate that summaries generated using these evidence-based queries achieve competitive ROUGE scores, supporting their effectiveness for the QFS task.

query generation Natural Language Processing datasets summarization

RESEARCH·arXiv CS.CL·5/4/2026

Cultural Benchmarking of LLMs in Standard and Dialectal Arabic Dialogues

New research addresses the gap in evaluating cultural reasoning in LLMs, introducing ArabCulture-Dialogue, a culturally grounded conversational dataset covering 13 Arabic-speaking countries. Experiments indicate that models perform worse on cultural reasoning, translation, and generation tasks in dialectal setups compared to Modern Standard Arabic.

LLMs Arabic dialects cultural reasoning Benchmarking

RESEARCH·arXiv CS.CL·6d ago

IdiomX: A Multilingual Benchmark for Idiom Understanding, Retrieval, and Interpretation

IdiomX is a large-scale multilingual benchmark introduced to address the challenges of idiomatic expressions in natural language processing. It contains over 190K contextualized examples spanning 12K+ idioms with aligned semantic representations in English, Arabic, and French.

language models Natural Language Processing datasets Benchmarks

RESEARCH·arXiv CS.LG·8d ago

QASM-Eval: A Dataset to Train and Evaluate LLMs on OpenQASM-3 Beyond Quantum Circuits

QASM-Eval is a new comprehensive dataset designed to train and evaluate Large Language Models (LLMs) on OpenQASM-3 programs that involve advanced hardware-oriented features. It addresses a gap in LLM capability to handle quantum computing programming beyond gate-sequence circuit specification.

Quantum Computing LLMs datasets OpenQASM-3

RESEARCH·arXiv CS.LG·14d ago

A Large-Scale Dataset and Benchmark: Do Protein-Ligand Models Learn Binding Sites or Just Binding Likelihood?

The paper introduces InteractBind, a large-scale dataset of approximately 100k protein-ligand pairs, and a benchmark for fine-grained evaluation. It aims to assess whether models can localize binding sites and identify non-covalent interactions, addressing a gap in existing evaluations.

molecular modeling Benchmarking drug discovery datasets

RESEARCH·arXiv CS.CL·6d ago

Translating Classical Poetry into Modern Prose

Padyam2Gadyam is a new dataset for poem-to-prose translation, rendering 13th- to 17th-century classical Telugu poetry into contemporary Telugu and English prose. An evaluation of five Large Language Models on this dataset indicated that their overall performance leaves significant room for improvement.

poetry LLMs Translation Natural Language Processing