Synopsis

According to Sarvam cofounder Pratyush Kumar, Bulbul V3 is the fifth of 14 planned launches. In a blind listening study, Bulbul V3 topped the charts for 8 kHz audio, setting what Kumar called a new benchmark for speech synthesis for voice agents. Listeners also tagged real failure cases to test stability, and the model recorded the lowest average error rates.

Artificial intelligence (AI) startup Sarvam AI on Thursday launched Bulbul V3, a new text-to-speech model built to create natural, expressive, and production-ready voices for Indian languages.

The release has drawn strong praise from across the AI community. Notably, Deedy Das, partner at Menlo Ventures, which backs companies such as Anthropic, walked back his earlier criticism of Sarvam. He said he was “wrong” about the startup and added that Sarvam now offers the best text-to-speech, speech-to-text, and optical character recognition (OCR) models for Indic languages, calling the work “really valuable.”

Bulbul V3: About the model


Introducing the model on X, Sarvam cofounder Pratyush Kumar described Bulbul V3 as the fifth of 14 planned launches. “In an independent third-party human listening study, Bulbul V3 delivers the highest listener preference and low error rates across use cases and languages,” he said.


In the thread that followed, Kumar explained that the model was tested in a blind listening study conducted by independent research partner Josh Talks AI. Listeners compared Bulbul V3 with ElevenLabs (v3 alpha and v2.5 flash) and Cartesia Sonic-3.

The study collected over 20,000 votes, and Bulbul V3 topped the charts for 8 kHz audio, setting what he called a “new benchmark for speech synthesis for voice agents.”

What sets Bulbul V3 apart

In a blog post, Sarvam said Bulbul V3 raises the bar across three areas that matter most for real-world speech systems:

  • Naturalness: Achieves high listener preference at 48 kHz and ranks as the most preferred model for 8 kHz telephony, outperforming competitors.
  • Robustness: Shows low character error rates on difficult inputs such as code-mixing and numerics.
  • Stability: Records the fewest word skips and mispronunciations, even in long-form and high-volume usage.
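
For readers unfamiliar with the metric, character error rate (CER) is conventionally defined as the character-level edit distance between a reference text and a transcript, divided by the length of the reference. The sketch below illustrates that standard definition only. Sarvam's post does not describe how its error rates were computed, and the function names and sample strings here are hypothetical.

```python
def edit_distance(reference: str, hypothesis: str) -> int:
    """Levenshtein distance: minimum insertions, deletions and substitutions."""
    # prev[j] holds the distance between the reference prefix processed so far
    # and the first j characters of the hypothesis.
    prev = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        curr = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            substitution_cost = 0 if ref_char == hyp_char else 1
            curr.append(min(
                prev[j] + 1,                      # deletion
                curr[j - 1] + 1,                  # insertion
                prev[j - 1] + substitution_cost,  # match or substitution
            ))
        prev = curr
    return prev[-1]


def character_error_rate(reference: str, hypothesis: str) -> float:
    """CER = edit distance / number of characters in the reference."""
    if not reference:
        raise ValueError("reference text must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


# Hypothetical example: the input text versus a transcript of the spoken audio.
reference_text = "call 1800 now"
transcript = "call 1300 now"
print(f"CER: {character_error_rate(reference_text, transcript):.3f}")
```

A lower CER means fewer character-level mistakes. Numerics and code-mixed text are hard cases because a single misread digit or word changes the meaning of the whole utterance.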

The company said the study covered two test conditions — general full-band audio and 8 kHz telephony-grade audio — to reflect both studio-quality and real-world use. Each language had 50 to 70 annotators, producing around 2,000 votes per language, with more than 500 annotators taking part overall.

In the post, Kumar added that listeners also tagged real failure cases to measure stability. “Bulbul V3 comes out on top, with the lowest average error rates,” he said.

“We also evaluated for the long tail of language challenges, such as speaking numerics, technical content, and named entities. Bulbul V3 consistently has the lowest error rates across languages,” he added.

New voice library

Alongside the model, Sarvam unveiled a new voice library with over 30 professional-quality voices across 11 Indian languages, all recorded by trained voice artists. According to the company, this gives voices greater depth, clarity, and emotional range, especially for long-form audio.

Sarvam said support will soon expand to 22 Indian languages.

The model also supports voice cloning, allowing custom voices to be created while retaining natural quality. This, the company said, “enables brand-specific voices, consistent character identities, and personalised experiences at scale.”
