Dataset

12 items

ARTICLE · DEV.to AI · 4/18/2026

Part 2: The Data — Building the First Public Coffee Roasting Audio Dataset with Warp/Oz

This article describes the creation of the first public audio dataset for detecting first crack in coffee roasting, a task for which no public data previously existed. The dataset comprises 973 annotated 10-second segments and was built from scratch. A model trained on it reached 100% precision, which the article attributes to careful data splitting and loss weighting.

Dataset audio processing data engineering machine learning

RESEARCH · arXiv CS.CL · 4/10/2026

Contextual Earnings-22: A Speech Recognition Benchmark with Custom Vocabulary in the Wild

Although accuracy on academic speech-to-text benchmarks has plateaued, industrial applications demand better recognition of rare and contextual vocabulary. This paper introduces Contextual Earnings-22, a new dataset and benchmark intended to promote research and surface advances in contextual speech recognition with custom vocabulary.

Dataset custom vocabulary Speech-to-Text benchmark

RESEARCH · arXiv CS.CL · 7d ago

DraDDP: A Multimodal Multi-Party Dialogue Discourse Parsing Dataset

This paper introduces DraDDP, the first publicly available English multimodal dataset for multi-party dialogue discourse parsing, based on American TV dramas. It contains 495 dialogue segments and demonstrates the value of multimodal information in capturing dialogue structures and relation types.

Dataset Dialogue Parsing multimodal AI natural language processing

RESEARCH · DEV.to AI · 4/13/2026

FairFace: Face Attribute Dataset for Balanced Race, Gender, and Age

FairFace is a face attribute dataset designed to mitigate bias in AI models by providing balanced representation across race, gender, and age. Its goal is to make computer vision systems fairer and more robust, with more equitable performance across demographic groups.

FairFace Dataset Bias Mitigation computer vision

ARTICLE · DEV.to AI · 5/5/2026

We Built Sign Language AI for a Language With Almost No Dataset. Here's What That Actually Looks Like.

This article details the development of OmniSign, a real-time Lebanese Sign Language (LSL) translator, addressing the challenges of building AI for a language with an almost non-existent dataset. The author emphasizes that the hardest problems encountered were not technical but human. The inspiration came from witnessing communication difficulties between a deaf man and a barista in Beirut.

Dataset Low-Resource Language machine learning Sign Language AI

RESEARCH · arXiv CS.CL · 4/10/2026

TR-EduVSum: A Turkish-Focused Dataset and Consensus Framework for Educational Video Summarization

This study presents the TR-EduVSum dataset, focused on Turkish educational videos, and proposes the AutoMUP method. The method generates gold-standard summaries automatically and reproducibly from multiple human summaries, using meaning-unit clustering and statistical consensus modeling.

Dataset consensus framework educational video summarization machine learning

RESEARCH · arXiv CS.CL · 4/30/2026

MATH-PT: A Math Reasoning Benchmark for European and Brazilian Portuguese

This paper introduces MATH-PT, a novel dataset of 1,729 mathematical problems in European and Brazilian Portuguese, to address the linguistic bias in evaluations of LLM mathematical reasoning. The benchmark shows that frontier reasoning models perform strongly on multiple-choice questions but less well on open-ended questions.

Dataset mathematical reasoning LLMs Benchmarking

RESEARCH · arXiv CS.CL · 5/4/2026

ViLegalNLI: Natural Language Inference for Vietnamese Legal Texts

This article introduces ViLegalNLI, the first large-scale Vietnamese Natural Language Inference (NLI) dataset specifically constructed for the legal domain. It consists of 42,012 premise-hypothesis pairs derived from official statutory documents, developed using a semi-automatic framework that integrates large language models for hypothesis generation and quality validation.

Dataset Legal AI Natural Language Inference Vietnamese NLI

RESEARCH · arXiv CS.CL · 4/21/2026

CFMS: Towards Explainable and Fine-Grained Chinese Multimodal Sarcasm Detection Benchmark

This paper introduces CFMS, the first fine-grained Chinese multimodal sarcasm detection benchmark, comprising 2,796 image-text pairs with triple-level annotations. The dataset aims to improve AI's fine-grained semantic understanding and metaphoric reasoning, addressing limitations in existing benchmarks.

Dataset multimodal AI natural language processing benchmark

RESEARCH · arXiv CS.CL · 8d ago

When English Rewrites Local Knowledge: Global Narrative Dominance in Large Language Models

This research paper investigates global narrative dominance in Large Language Models (LLMs), where local cultural knowledge is often overshadowed by global narratives. It introduces the CulturalNB dataset for Bengali cultural contexts and demonstrates that questions asked in English tend to increase global substitution and institutional framing, reducing local perspective coverage.

Dataset Cross-lingual Cultural Bias natural language processing

RESEARCH · arXiv CS.AI · 4/23/2026

ThermoQA: A Three-Tier Benchmark for Evaluating Thermodynamic Reasoning in Large Language Models

ThermoQA is a new three-tier benchmark of 293 open-ended engineering thermodynamics problems for evaluating thermodynamic reasoning in LLMs. Leading LLMs such as Claude Opus 4.6 and GPT-5.4 achieve high scores, but their performance degrades across tiers, confirming that property memorization does not imply thermodynamic reasoning. The dataset and code are open-source.

Dataset Benchmarking large language models AI evaluation

RESEARCH · DEV.to AI · 4/9/2026

Charades-Ego: A Large-Scale Dataset of Paired Third and First Person Videos

Charades-Ego is a large-scale dataset featuring paired third and first-person videos. This resource is valuable for research in computer vision and video analysis.

Dataset First-person vision Third-person vision computer vision