Speech Recognition

18 items

NEWS · Microsoft Research (YouTube) · 1d ago

Introducing MAI-Transcribe-1.5 | Microsoft AI Models

Microsoft introduces MAI-Transcribe-1.5, a new AI model focused on transcription. This release is part of Microsoft's collection of AI models.

transcription AI models Product Launch Microsoft AI

RESEARCH · ↑ trending · Reddit r/MachineLearning · 18d ago

Live Human Detector on Outbound Phone Calls [R]

This post outlines the goals and requirements for a "Live Human Detector" tool for call centers. The tool identifies when an outbound call has connected to a live person rather than an automated system announcement, so that customers are not kept waiting unnecessarily.

audio analysis customer service AI human detection call center automation

RESEARCH · arXiv CS.AI · 4/16/2026

Listening Alone, Understanding Together: Collaborative Context Recovery for Privacy-Aware AI

CONCORD is a privacy-aware A2A framework for speech-based AI assistants that ensures owner-only speech capture via real-time speaker verification. It recovers missing context through spatio-temporal resolution and minimal A2A queries, achieving 91.4% recall.

privacy AI Assistants Speech Recognition

RESEARCH · arXiv CS.CL · 4/10/2026

Contextual Earnings-22: A Speech Recognition Benchmark with Custom Vocabulary in the Wild

Although accuracy on academic speech-to-text benchmarks has plateaued, industrial applications demand better recognition of rare and contextual vocabulary. This paper introduces Contextual Earnings-22, a new dataset and benchmark intended to foster research and surface advances in contextual speech recognition with custom vocabulary.

Dataset custom vocabulary Speech-to-Text benchmark

RESEARCH · arXiv CS.CL · 5/1/2026

Selective Augmentation: Improving Universal Automatic Phonetic Transcription via G2P Bootstrapping

This research proposes Selective Augmentation, a bootstrapping method to improve universal automatic phonetic transcription (APT) by selectively transferring linguistic distinctions to address limited high-quality training data. Exemplified with the MultIPA model, the approach enhanced plosive voicing accuracy by 17.6% and introduced aspiration recognition using data augmented from a helper language like Hindi.

machine learning phonetics Data Augmentation Speech Recognition

ARTICLE · DEV.to AI · 4/12/2026

Talk to Your Terminal: Building a Voice AI Agent in Python

This article details the design and implementation of a voice-controlled AI agent in Python, operating locally. It utilizes OpenAI Whisper for transcription, an LLM for intent classification, and performs file system operations, aiming for personalized automation.

Local AI Python Speech Recognition LLM

RESEARCH · arXiv CS.CL · 5/6/2026

The TTS-STT Flywheel: Synthetic Entity-Dense Audio Closes the Indic ASR Gap Where Commercial and Open-Source Systems Fail

This paper introduces a self-contained TTS-STT flywheel to close the gap in niche-domain Indic ASR where commercial and open-source systems fail. It synthesizes entity-dense audio to significantly improve the Entity-Hit-Rate on challenging datasets for languages like Telugu.

Indic languages machine learning TTS ASR

ARTICLE · DEV.to AI · 5/7/2026

Voice AI for jobsite estimating: a developer perspective

The main challenge in developing voice AI for jobsite estimating is not the technology itself, but rather the user experience in blue-collar environments. This article details the technical and UX decisions made by a company to optimize voice interfaces for blue-collar workers, aiming to prevent common mistakes.

UX/UI developer guide Speech Recognition voice AI

RESEARCH · DEV.to AI · 4/26/2026

Transformer-Transducer: End-to-End Speech Recognition with Self-Attention

This article describes the Transformer-Transducer model, an architecture for end-to-end speech recognition that uses the self-attention mechanism of Transformers. It focuses on improving the accuracy and efficiency of transcribing spoken language directly into text.

deep learning Transformer Speech Recognition

DOC · DEV.to AI · 4/18/2026

Transcription Glossary: 25+ Terms You Need to Know

This glossary defines over 25 essential terms in transcription and speech recognition, such as WER and diarization. It aims to demystify technical jargon from speech science, machine learning, and audio engineering for AI tool users.

glossary audio-engineering machine learning ASR

ARTICLE · DEV.to AI · 4/15/2026

Local Voice Controlled AI Agent

This article describes a self-built, local voice-controlled AI agent that acts directly on the user's machine rather than only conversing. It can create files, generate code, open applications, and browse websites, narrowing the gap between a spoken request and its execution on the computer.

AI agent Local AI voice control Desktop automation

RESEARCH · arXiv CS.CL · 4/17/2026

SeaAlert: Critical Information Extraction From Maritime Distress Communications with Large Language Models

SeaAlert is an LLM-based framework designed for the robust analysis of maritime distress communications, which are challenging due to noise, deviations from format, and ASR errors. To overcome the lack of real-world labeled data, the framework utilizes an LLM-powered synthetic data generation pipeline.

synthetic data Information Extraction NLP Speech Recognition

RESEARCH · arXiv CS.CL · 14d ago

Raon-Speech Technical Report

Raon-Speech is a top-performing 9B-parameter speech language model (SpeechLM) for English and Korean speech understanding, answering, and generation, achieving strong overall results across 42 benchmarks. It successfully transforms a pre-trained LLM into a SpeechLM while preserving strong text capabilities through specific training stages.

multimodal AI Benchmarking natural language processing large language models

RESEARCH · arXiv CS.CL · 14d ago

Direct Preference Optimization for English-Mandarin Code-Switching Speech Recognition in Audio LLMs

This paper investigates failures in Audio LLMs when transcribing English-Mandarin code-switching speech, identifying issues like language omission and translation. Applying Direct Preference Optimization (DPO) aligns models to preserve mixed-language content, leading to significant reductions in Mixed Error Rate (MER).

Multilingual AI Audio LLMs Code-Switching Direct Preference Optimization

RESEARCH · arXiv CS.CL · 8d ago

Your Multimodal Speech Model Says I Have a Face for Radio

This paper proposes the first bias evaluation of multimodal speech recognition, revealing significant quality-of-service differences across mWhisper-Flamingo and Gemini models based on self-declared gender and ethnicity. These findings highlight a priority for developers to evaluate, fix, and communicate such biases.

multimodal AI AI bias ethnicity bias gender bias

RESEARCH · Hugging Face Blog · 5/6/2026

Adding Benchmaxxer Repellant to the Open ASR Leaderboard

This post announces the integration of Benchmaxxer Repellant into the Open ASR Leaderboard. The addition aims to make evaluations of automatic speech recognition systems more robust and fair.

AI models evaluation Benchmarking ASR

ARTICLE · DEV.to AI · 4/14/2026

OpenClaw Voice Assistant: Voice Wake and Talk Mode Setup

OpenClaw Voice Assistant combines Voice Wake and Talk Mode to act as a controllable voice assistant, similar to Siri or Alexa. It detects the wake word on-device, can be powered by AI models such as Claude, GPT, or Gemini, and connects to OpenClaw integrations.

OpenClaw Voice Assistant AI Wake Word

ARTICLE · DEV.to AI · 4/14/2026

Whisper Hallucination on Silence: Why Your Transcript Loops the Same Phrase

This article examines hallucination in the Whisper model and explains why a transcript can loop the same phrase when the model processes periods of silence.

hallucination audio processing Whisper Model AI