speech synthesis
PulseAugur coverage of speech synthesis — every cluster mentioning speech synthesis across labs, papers, and developer communities, ranked by signal.
-
New Japanese TTS system tackles kanji polyphony with massive data scaling
Researchers have developed Sarashina2.2-TTS, a novel text-to-speech system specifically designed for Japanese, addressing the challenge of kanji polyphony. The system utilizes a massive dataset of approximately 361,000 …
-
New benchmark evaluates Chinese news TTS pronunciation accuracy
Researchers have introduced the CN-NewsTTS Bench, a new benchmark designed to evaluate the pronunciation accuracy of Chinese news Text-to-Speech (TTS) systems. This benchmark specifically targets complex written forms l…
-
Wan-Streamer v0.1: Unified model for real-time audio-visual interaction
Researchers have introduced Wan-Streamer v0.1, a novel end-to-end multimodal foundation model designed for real-time, low-latency audio-visual interaction. Unlike traditional cascaded systems, Wan-Streamer integrates la…
-
LLMs benchmarked for Japanese Grapheme-to-Phoneme conversion
A new study benchmarks over 30 large language models (LLMs) for Japanese grapheme-to-phoneme (G2P) conversion, a crucial step for text-to-speech systems. Researchers compared LLM performance against traditional morpholo…
-
Gemini API introduces streaming TTS for faster AI voice apps
Google's Gemini API now offers streaming Text-to-Speech (TTS) capabilities, enabling developers to create AI voice applications that feel more responsive. This feature is crucial for reducing perceived latency, as users…
-
New Hebrew G2P systems improve text-to-speech accuracy
Researchers have developed new methods for Hebrew grapheme-to-phoneme (G2P) conversion, crucial for improving text-to-speech (TTS) applications. The ReNikud system utilizes audio supervision from unlabeled Hebrew audio …
-
New research explores advanced speech quality assessment methods beyond MOS
Researchers are exploring new methods for assessing speech quality beyond traditional Mean Opinion Scores (MOS). One paper introduces PrefSQA, which uses pairwise preference prediction to reduce rater variability and im…
-
Neural audio codecs achieve smooth degradation down to 1.6 Hz
Researchers have investigated the degradation mechanisms in neural audio codecs operating at low frame rates, which are beneficial for autoregressive speech synthesis. Their study identified that a previously observed q…
-
AI Speech Technologies: A Resource Compilation
This Mastodon post compiles resources on AI speech technologies, covering Text-to-Speech (TTS), Speech-to-Text (STT), voice synthesis, and voice cloning. The collection aims to provide notes and links for those interest…
-
New TTS research explores discrete flow matching for efficiency
Two new research papers explore advancements in zero-shot text-to-speech (TTS) technology, focusing on discrete flow matching techniques. The first paper introduces DiFlow-TTS, a framework that uses a discrete flow matc…
-
New TTS Benchmark Uses Blind Voting for Objective Model Ratings
A new benchmark for Text-to-Speech (TTS) models has been launched, incorporating objective standards and blind voting to create an Elo rating system. This revamped benchmark aims to simplify the process of choosing the …
-
New models unify speech and singing voice generation
Researchers have developed new unified models for generating human vocal audio, capable of producing both speech and singing. UniVoice uses a conditional flow matching approach, separating content, melody, and timbre to…
-
New TTS framework GLASS enables independent acoustic style control
Researchers have developed GLASS, a novel framework for controlling acoustic style in zero-shot text-to-speech (TTS) systems. Unlike previous methods that entangle speaker identity with prosody, GLASS treats attributes …
-
Sparse autoencoders enable interpretable emotion control in TTS
Researchers have developed a new method for controlling emotions in text-to-speech (TTS) systems by utilizing sparse autoencoders (SAEs) to identify and manipulate latent features within large language models. This appr…
-
xAI launches Custom Voices for voice cloning and management
xAI has launched Custom Voices, a new feature allowing users to clone their own voice from a short audio recording for use in various applications. This technology enables personalized narration for videos, podcasts, an…
-
AI advances in 3D simulation, Bengali TTS, and Google Cloud Next trends
A researcher named Jousef Murad has introduced a new AI framework called Rigid-Deformation Decomposition for simulating 3D vehicle crash dynamics. Separately, a user named Himu is urging Google developers to integrate n…
-
LLM preference optimization advances TTS accuracy and user personalization
Researchers have developed new methods for aligning large language models (LLMs) with user preferences. One approach, TKTO, focuses on text-to-speech systems, enabling data-efficient, token-level optimization to improve…
-
New benchmarks and platforms advance voice agent evaluation and development
New research introduces EVA-Bench, a comprehensive framework for evaluating voice agents, addressing challenges in simulating realistic conversations and measuring performance across various failure modes. Simultaneousl…