Introducing MAI-Transcribe-1.5 | Microsoft AI Models
Microsoft introduces MAI-Transcribe-1.5, a new AI model focused on transcription. This release is part of Microsoft's collection of AI models.

This content outlines the goal and requirements for a "Live Human Detector" tool designed for call centers. Its primary function is to identify when a call has successfully connected to a live person, differentiating them from automated system announcements, to prevent customers from waiting unnecessarily.
CONCORD is a privacy-aware A2A framework for speech-based AI assistants that ensures owner-only speech capture via real-time speaker verification. It recovers missing context through spatio-temporal resolution and minimal A2A queries, achieving 91.4% recall.
Although accuracy on academic speech-to-text benchmarks has plateaued, industrial applications demand better recognition of rare and contextual vocabulary. This article introduces Contextual Earnings-22, a new dataset and benchmark intended to foster research and reveal advances in contextual speech recognition with custom vocabulary.
This research proposes Selective Augmentation, a bootstrapping method to improve universal automatic phonetic transcription (APT) by selectively transferring linguistic distinctions to address limited high-quality training data. Exemplified with the MultIPA model, the approach enhanced plosive voicing accuracy by 17.6% and introduced aspiration recognition using data augmented from a helper language like Hindi.
This article details the design and implementation of a voice-controlled AI agent written in Python that runs locally. It uses OpenAI Whisper for transcription and an LLM for intent classification, and it performs file system operations, with the aim of personalized automation.
This paper introduces a self-contained TTS-STT flywheel to close the gap in niche-domain Indic ASR where commercial and open-source systems fail. It synthesizes entity-dense audio to significantly improve the Entity-Hit-Rate on challenging datasets for languages like Telugu.
The main challenge in developing voice AI for jobsite estimating is not the technology itself, but rather the user experience in blue-collar environments. This article details the technical and UX decisions made by a company to optimize voice interfaces for blue-collar workers, aiming to prevent common mistakes.
This content describes the Transformer-Transducer model, a novel architecture for end-to-end speech recognition that leverages the self-attention mechanism of Transformers. It focuses on improving the accuracy and efficiency of transcribing spoken language directly into text.
This glossary defines over 25 essential terms in transcription and speech recognition, such as WER and diarization. It aims to demystify technical jargon from speech science, machine learning, and audio engineering for AI tool users.
This content describes a self-built, local, voice-controlled AI agent that acts directly on your machine rather than just conversing. It can create files, generate code, open applications, and browse websites, narrowing the gap between thought and execution on the computer.
SeaAlert is an LLM-based framework designed for the robust analysis of maritime distress communications, which are challenging due to noise, deviations from format, and ASR errors. To overcome the lack of real-world labeled data, the framework utilizes an LLM-powered synthetic data generation pipeline.
Raon-Speech is a top-performing 9B-parameter speech language model (SpeechLM) for English and Korean speech understanding, answering, and generation, achieving strong overall results across 42 benchmarks. It successfully transforms a pre-trained LLM into a SpeechLM while preserving strong text capabilities through specific training stages.
This paper investigates failures in Audio LLMs when transcribing English-Mandarin code-switching speech, identifying issues like language omission and translation. Applying Direct Preference Optimization (DPO) aligns models to preserve mixed-language content, leading to significant reductions in Mixed Error Rate (MER).
This paper proposes the first bias evaluation of multimodal speech recognition, revealing significant quality-of-service differences across mWhisper-Flamingo and Gemini models based on self-declared gender and ethnicity. These findings highlight a priority for developers to evaluate, fix, and communicate such biases.
This content announces the integration of Benchmaxxer Repellant into the Open ASR Leaderboard. This new addition aims to enhance the robustness and fairness of automatic speech recognition system evaluations.
OpenClaw Voice Assistant combines Voice Wake and Talk Mode into a controllable voice assistant, similar to Siri or Alexa. Its wake word is processed on-device, and it can be powered by AI models such as Claude, GPT, or Gemini while connecting to OpenClaw integrations.
This content explores the phenomenon of hallucination in the Whisper model, explaining why transcripts might loop the same phrase. It details the causes behind this behavior when the model processes periods of silence.