synthetic data

20 items

ARTICLE · DEV.to AI · 4/14/2026

Stop Generating Synthetic Datasets. Start Generating Synthetic Systems.

The article argues that most synthetic data platforms fail by generating isolated datasets instead of interconnected systems, leading to AI model failures and QA issues in sensitive sectors like BFSI and healthtech. It stresses that AI products rely on complex databases, requiring synthetic data to reflect actual user behavior across multiple tables to be effective.

43
ARTICLE · ↑ trending · Reddit r/MachineLearning · 4/23/2026

OpenSimula — open implementation of Simula-style mechanism design for synthetic data (in AfterImage) [P]

OpenSimula is an experimental Python implementation of Simula mechanism design, added to the AfterImage open-source dataset tool. It addresses the need for controlled diversity in LLM SFT/eval setups by generating varied synthetic data through LLM-built taxonomies, weighted sampling, and critic loops.

43
RESEARCH · arXiv CS.LG · 4/16/2026

Synthetic Tabular Generators Fail to Preserve Behavioral Fraud Patterns: A Benchmark on Temporal, Velocity, and Multi-Account Signals

This research introduces "behavioral fidelity" as a new evaluation dimension for synthetic tabular data, measuring whether generated data preserves temporal and structural behavioral patterns critical for fraud detection. It proves that dominant row-independent generators are inherently incapable of reproducing complex multi-account fraud graph motifs.

28
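The claim in the fraud-benchmark summary above, that row-independent generators lose velocity signals, can be illustrated with a toy sketch. This is not the paper's benchmark: `max_velocity`, the sample transactions, and the 60-second window are all hypothetical, and a deterministic rotation stands in for independent sampling of each column.

```python
from collections import defaultdict


def max_velocity(transactions, window_seconds):
    """Per account, the largest number of transactions that fall inside
    any window of `window_seconds` (a simple velocity signal)."""
    by_account = defaultdict(list)
    for account, timestamp in transactions:
        by_account[account].append(timestamp)

    result = {}
    for account, times in by_account.items():
        times.sort()
        best, start = 0, 0
        for end in range(len(times)):
            # Advance the left edge until the window spans <= window_seconds.
            while times[end] - times[start] > window_seconds:
                start += 1
            best = max(best, end - start + 1)
        result[account] = best
    return result


# Hypothetical "real" data: account A bursts 4 transactions in 30 seconds,
# while B, C, and D each transact once, an hour apart.
real = [("A", 0), ("A", 10), ("A", 20), ("A", 30),
        ("B", 3600), ("C", 7200), ("D", 10800)]

# Stand-in for a row-independent generator: the account column and the
# timestamp column keep their marginal values, but the pairing is broken.
accounts = [account for account, _ in real]
timestamps = [timestamp for _, timestamp in real]
rotated = timestamps[3:] + timestamps[:3]
independent = list(zip(accounts, rotated))

real_peak = max(max_velocity(real, 60).values())
independent_peak = max(max_velocity(independent, 60).values())
print(real_peak, independent_peak)
```

Both datasets have identical per-column distributions, yet the burst on account A survives only in the first one, which is the kind of gap a per-column fidelity check would not catch.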
RESEARCH · arXiv CS.AI · 12d ago

On the Origin of Synthetic Information by Means of Steganographic Inheritance

This research paper posits the origin of synthetic information as a core mystery in information science, drawing an analogy to the origin of species. It introduces a steganographic inheritance mechanism to help trace the evolutionary lineage of AI-generated synthetic information, acknowledging the moral implications and technical challenges.

28
RESEARCH · arXiv CS.CL · 4/13/2026

SynDocDis: A Metadata-Driven Framework for Generating Synthetic Physician Discussions Using Large Language Models

SynDocDis is a novel framework that utilizes Large Language Models and de-identified case metadata to generate clinically accurate synthetic physician-to-physician dialogues. This approach addresses the scarcity of real discussion data due to privacy concerns, aiming to enrich AI agents with valuable clinical knowledge.

27
RESEARCH · arXiv CS.LG · 5/1/2026

Fidelity, Diversity, and Privacy: A Multi-Dimensional LLM Evaluation for Clinical Data Augmentation

This research proposes using LLMs (DeepSeek-R1, OpenBioLLM-Llama3, Qwen 3.5) for synthetic mental health data augmentation to address data scarcity and privacy regulations. A comprehensive evaluation framework is introduced, assessing semantic fidelity, lexical diversity, and privacy/plagiarism to mitigate risks like mode collapse or memorization.

27
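Two of the evaluation axes named in the clinical-augmentation summary above, lexical diversity and memorization, can be approximated with simple n-gram statistics. The sketch below is an assumption about what such checks might look like, not the paper's actual metrics: `distinct_n`, `copied_fraction`, and the sample sentences are hypothetical.

```python
def ngrams(text, n):
    """Lowercased whitespace tokens grouped into n-gram tuples."""
    tokens = text.lower().split()
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def distinct_n(texts, n=2):
    """Unique n-grams divided by total n-grams across all texts.
    Low values suggest repetitive output (possible mode collapse)."""
    all_ngrams = [gram for text in texts for gram in ngrams(text, n)]
    if not all_ngrams:
        return 0.0
    return len(set(all_ngrams)) / len(all_ngrams)


def copied_fraction(synthetic, sources, n=3):
    """Fraction of the synthetic text's n-grams that appear verbatim in
    any source text. High values suggest memorization."""
    source_ngrams = {gram for source in sources for gram in ngrams(source, n)}
    synthetic_ngrams = ngrams(synthetic, n)
    if not synthetic_ngrams:
        return 0.0
    hits = sum(gram in source_ngrams for gram in synthetic_ngrams)
    return hits / len(synthetic_ngrams)


collapsed = ["i feel sad today", "i feel sad today"]
varied = ["i feel sad today", "work has been exhausting lately"]

print(distinct_n(collapsed))  # repeated sample halves the ratio
print(distinct_n(varied))     # no repeated bigrams
print(copied_fraction("i feel sad today", ["i feel sad today and tired"]))
```

Surface-level n-gram checks like these say nothing about semantic fidelity, which would need a separate measure such as embedding similarity.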
RESEARCH · arXiv CS.CL · 12d ago

Bridging the Stability-Expressivity Gap: Synthetic Data Scaling and Preference Alignment for Low-Resource Spoken Language Models

This research addresses the Stability-Expressivity Gap in Spoken Language Models (SLMs) for low-resource languages, caused by the extensive use of synthetic data. While synthetic data improves phonetic accuracy, it degrades prosodic expressivity, a phenomenon termed Synthetic Erosion. The paper introduces self-alignment frameworks to recover expressivity.

27