synthetic data

20 items

RESEARCH·arXiv CS.CL·1d ago

Re-Centering Humans in LLM Personalization

This paper investigates the discrepancy in LLM personalization performance between synthetic and human data. It finds that human data reveals significant system limitations in attribute extraction, attribute relevance, and generating genuinely personalized responses.

user data synthetic data LLM personalization AI evaluation

ARTICLE·DEV.to AI·4/14/2026

Stop Generating Synthetic Datasets. Start Generating Synthetic Systems.

The article argues that most synthetic data platforms fail by generating isolated datasets instead of interconnected systems, leading to AI model failures and QA issues in sensitive sectors like BFSI and healthtech. It stresses that AI products rely on complex databases, requiring synthetic data to reflect actual user behavior across multiple tables to be effective.

synthetic data insurance Healthtech BFSI

ARTICLE·↑ trending·Reddit r/MachineLearning·4/23/2026

OpenSimula — open implementation of Simula-style mechanism design for synthetic data (in AfterImage) [P]

OpenSimula is an experimental Python implementation of Simula mechanism design, added to the AfterImage open-source dataset tool. It addresses the need for controlled diversity in LLM SFT/eval setups by generating varied synthetic data through LLM-built taxonomies, weighted sampling, and critic loops.

synthetic data mechanism-design open-source-tool LLM evaluation

ARTICLE·DEV.to AI·4/23/2026

Stop Shipping AI on Toy Datasets: How to Treat Synthetic Data as Infrastructure

The article argues that using "toy datasets" for AI testing breaks an unwritten contract, leading to deployment failures. It proposes treating synthetic data as robust infrastructure—standardized, versioned, and monitored—rather than mere glue code, exemplified by SyntheholDB.

synthetic data MLOps Data Infrastructure

RESEARCH·Hugging Face Blog·5d ago

Task-Seeded Synthetic Q&A Generation for Nemotron Pretraining

The post describes generating synthetic question-and-answer pairs for pretraining AI models, specifically Nemotron. The technique aims to improve model performance with artificial training data.

synthetic data AI models pretraining Q&A generation

RESEARCH·arXiv CS.CL·4d ago

Epidemiology of Model Collapse: Modeling Synthetic Data Contamination via Bilayer SIR Dynamics

The paper proposes a bilayer SIR/SIRS framework to model synthetic data contamination and model collapse within the AI ecosystem. This phenomenological mean-field model treats data corpora and AI models as interacting populations, deriving a basic reproduction number to analyze cross-contamination.

synthetic data AI models data contamination model collapse

ARTICLE·DEV.to AI·4/9/2026

The model looked great on validation until one real invoice broke four assumptions

The article recounts the experience of fine-tuning a Gemma model to parse Indian invoices. Although the metrics on synthetic training data were excellent, a single real document exposed critical failures and the "domain gap" problem, underscoring the importance of real data.

synthetic data machine learning AI

RESEARCH·arXiv CS.CL·4/14/2026

Generating High Quality Synthetic Data for Dutch Medical Conversations

This paper presents a pipeline for generating synthetic Dutch medical dialogues using a fine-tuned Large Language Model to address the scarcity of clinical data due to privacy constraints. Evaluations showed strong lexical variety but a scripted conversation flow and issues in domain specificity during qualitative review.

synthetic data Clinical Communication Dutch Language Medical NLP

RESEARCH·arXiv CS.LG·4/16/2026

Synthetic Tabular Generators Fail to Preserve Behavioral Fraud Patterns: A Benchmark on Temporal, Velocity, and Multi-Account Signals

This research introduces "behavioral fidelity" as a new evaluation dimension for synthetic tabular data, measuring whether generated data preserves temporal and structural behavioral patterns critical for fraud detection. It proves that dominant row-independent generators are inherently incapable of reproducing complex multi-account fraud graph motifs.

synthetic data fraud detection behavioral patterns

RESEARCH·arXiv CS.AI·12d ago

On the Origin of Synthetic Information by Means of Steganographic Inheritance

This research paper posits the origin of synthetic information as a core mystery in information science, drawing an analogy to the origin of species. It introduces a steganographic inheritance mechanism to help trace the evolutionary lineage of AI-generated synthetic information, acknowledging the moral implications and technical challenges.

information theory synthetic data steganography AI ethics

RESEARCH·arXiv CS.AI·4/20/2026

LACE: Lattice Attention for Cross-thread Exploration

LACE is a novel framework that lets Large Language Models (LLMs) coordinate and share insights across multiple parallel reasoning paths through cross-thread attention. It uses a synthetic data pipeline to teach collaborative error correction, yielding an improvement of more than 7 points in reasoning accuracy.

synthetic data LLMs attention mechanisms AI Reasoning

RESEARCH·arXiv CS.CL·4/13/2026

SynDocDis: A Metadata-Driven Framework for Generating Synthetic Physician Discussions Using Large Language Models

SynDocDis is a novel framework that utilizes Large Language Models and de-identified case metadata to generate clinically accurate synthetic physician-to-physician dialogues. This approach addresses the scarcity of real discussion data due to privacy concerns, aiming to enrich AI agents with valuable clinical knowledge.

synthetic data Medical Dialogue Generation privacy healthcare AI

ARTICLE·DEV.to AI·6d ago

What Makes a Good SFT Sample (And Why Most Synthetic Datasets Get It Wrong)

Many fine-tuned language models perform worse after training because of poor-quality synthetic data. The problem lies not in the training setup but in the lack of mechanisms to filter out errors during synthetic data generation.

synthetic data LLMs model training Fine-tuning

DOC·DEV.to AI·4/27/2026

BlenderProc

BlenderProc is a procedural renderer based on Blender, used to generate synthetic datasets for computer vision research. It facilitates the creation of diverse and realistic data for training AI models.

synthetic data computer vision 3d-rendering AI tools

ARTICLE·Hugging Face Blog·4/17/2026

Building a Fast Multilingual OCR Model with Synthetic Data

The post describes building a fast, multilingual Optical Character Recognition (OCR) model. The approach uses synthetic data for model training and optimization.

synthetic data Multilingual AI machine learning OCR

RESEARCH·arXiv CS.LG·5/1/2026

Fidelity, Diversity, and Privacy: A Multi-Dimensional LLM Evaluation for Clinical Data Augmentation

This research proposes using LLMs (DeepSeek-R1, OpenBioLLM-Llama3, Qwen 3.5) for synthetic mental health data augmentation to address data scarcity and privacy regulations. A comprehensive evaluation framework is introduced, assessing semantic fidelity, lexical diversity, and privacy/plagiarism to mitigate risks like mode collapse or memorization.

synthetic data LLMs security Data Augmentation

RESEARCH·arXiv CS.CL·4/17/2026

SeaAlert: Critical Information Extraction From Maritime Distress Communications with Large Language Models

SeaAlert is an LLM-based framework designed for the robust analysis of maritime distress communications, which are challenging due to noise, deviations from format, and ASR errors. To overcome the lack of real-world labeled data, the framework utilizes an LLM-powered synthetic data generation pipeline.

synthetic data Information Extraction NLP Speech Recognition

RESEARCH·arXiv CS.CL·12d ago

Bridging the Stability-Expressivity Gap: Synthetic Data Scaling and Preference Alignment for Low-Resource Spoken Language Models

This research addresses the Stability-Expressivity Gap in Spoken Language Models (SLMs) for low-resource languages, caused by the extensive use of synthetic data. While synthetic data improves phonetic accuracy, it degrades prosodic expressivity, a phenomenon termed Synthetic Erosion. The paper introduces self-alignment frameworks to recover expressivity.

synthetic data speech synthesis spoken language models Low-resource languages

DOC·Hugging Face Blog·4/21/2026

How to Ground a Korean AI Agent in Real Demographics with Synthetic Personas

The post describes how to ground a Korean AI agent in real demographics. It explores using synthetic personas to produce culturally relevant and accurate AI responses.

synthetic data localization Demographics AI agents

RESEARCH·arXiv CS.AI·4/6/2026

ESL-Bench: An Event-Driven Synthetic Longitudinal Benchmark for Health Agents

ESL-Bench is a synthetic, event-driven longitudinal benchmark. It was developed for evaluating health agents, likely involving artificial intelligence.

synthetic data Health Agents AI in Healthcare Healthcare