
Custom Provider

Overview

Shiny Speech uses a pluggable cloud provider architecture. You can add your own STT and/or TTS provider by implementing the ISpeechToTextProvider and ITextToSpeechProvider interfaces from Shiny.Speech.Cloud.

Cloud providers replace the platform-native ISpeechToTextService / ITextToSpeechService registrations. The required platform-native audio services (IAudioSource and IAudioPlayer) are automatically registered by the cloud extension methods.

Custom Speech-to-Text Provider

Implement ISpeechToTextProvider to receive raw PCM audio (16 kHz, 16-bit, mono) as a stream and yield recognition results as they become available:

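using System.Runtime.CompilerServices; // [EnumeratorCancellation]
using Shiny.Speech;
using Shiny.Speech.Cloud;

public class MyCloudSttProvider : ISpeechToTextProvider
{
    public event EventHandler<SpeechRecognitionError>? Error;

    public async IAsyncEnumerable<SpeechRecognitionResult> RecognizeAsync(
        Stream audioStream,
        SpeechRecognitionOptions? options = null,
        [EnumeratorCancellation] CancellationToken cancellationToken = default)
    {
        // 3,200 bytes = 100 ms of 16 kHz, 16-bit, mono PCM
        var buffer = new byte[3200];
        int read;

        // Read PCM audio from audioStream until it ends
        while ((read = await audioStream.ReadAsync(buffer, cancellationToken)) > 0)
        {
            SpeechRecognitionResult result;
            try
            {
                // Send buffer[..read] to your cloud API and map its response.
                // A placeholder result is used here.
                result = new SpeechRecognitionResult(
                    "recognized text",
                    IsFinal: true,
                    Confidence: 0.95f
                );
            }
            catch (HttpRequestException ex)
            {
                // Non-fatal: raise Error and keep the session running.
                // Throwing here would terminate the IAsyncEnumerable and end the session.
                Error?.Invoke(this, new SpeechRecognitionError(ex.Message, ex));
                continue;
            }

            // Yield outside the try block: C# does not allow yield return
            // inside a try block that has a catch clause (CS1626).
            yield return result;
        }
    }
}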

The Error event lets continuous providers surface transient failures (e.g. a network blip between chunked requests) without tearing down the RecognizeAsync enumerator. CloudSpeechToText subscribes to it internally and forwards each error to the service-level ISpeechToTextService.Error event, so app code only needs to wire one handler.
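The forwarding described above can be pictured with a minimal, self-contained sketch. It does not use the real Shiny types: SketchError, SketchProvider, and SketchBridge are hypothetical stand-ins for SpeechRecognitionError, a provider, and CloudSpeechToText, and the actual internals may differ. It only illustrates the pattern of a bridge re-raising provider errors so that app code attaches a single handler.

```csharp
using System;

// Hypothetical stand-in for SpeechRecognitionError; the real type may differ.
public sealed record SketchError(string Message, Exception? Exception);

// Hypothetical stand-in for a provider that reports non-fatal errors.
public sealed class SketchProvider
{
    public event EventHandler<SketchError>? Error;

    // Simulates a transient failure, e.g. a dropped request between chunks.
    public void SimulateFailure(string message) =>
        Error?.Invoke(this, new SketchError(message, null));
}

// Hypothetical stand-in for the bridge: it subscribes to the provider's
// Error event and re-raises each error on its own Error event.
public sealed class SketchBridge : IDisposable
{
    private readonly SketchProvider provider;

    public event EventHandler<SketchError>? Error;

    public SketchBridge(SketchProvider provider)
    {
        this.provider = provider;
        this.provider.Error += OnProviderError;
    }

    private void OnProviderError(object? sender, SketchError e) =>
        Error?.Invoke(this, e);

    // Unsubscribe so the provider does not keep the bridge alive.
    public void Dispose() => provider.Error -= OnProviderError;
}

public static class ErrorForwardingDemo
{
    // Returns how many errors reached the single app-level handler.
    public static int Run()
    {
        var provider = new SketchProvider();
        using var bridge = new SketchBridge(provider);

        var received = 0;
        bridge.Error += (_, e) =>
        {
            received++;
            Console.WriteLine($"Recognition error: {e.Message}");
        };

        provider.SimulateFailure("network blip");
        provider.SimulateFailure("timeout");
        return received;
    }
}
```

In an app, the equivalent of the handler in Run is one subscription to ISpeechToTextService.Error; the provider's own event does not need to be wired by app code.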

builder.Services.AddCloudSpeechToText<MyCloudSttProvider>();
// IAudioSource is automatically registered
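Because the provider receives raw 16 kHz, 16-bit, mono PCM, the data rate is 16,000 samples × 2 bytes × 1 channel = 32,000 bytes per second, so 100 ms of audio is 3,200 bytes. The helper below is a sketch of one way to cut the incoming stream into fixed-duration chunks before sending them to a cloud API. PcmChunker and its members are hypothetical names, not part of Shiny Speech, and the right chunk size depends on the API you call.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Runtime.CompilerServices;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical helper; not part of Shiny Speech.
public static class PcmChunker
{
    public const int SampleRate = 16_000;   // samples per second
    public const int BytesPerSample = 2;    // 16-bit
    public const int Channels = 1;          // mono
    public const int BytesPerSecond = SampleRate * BytesPerSample * Channels;

    // Number of PCM bytes covering the given duration, rounded down
    // to a whole sample so a chunk never splits a 16-bit value.
    public static int BytesFor(TimeSpan duration)
    {
        var bytes = (int)((long)BytesPerSecond * duration.Ticks / TimeSpan.TicksPerSecond);
        return bytes - (bytes % (BytesPerSample * Channels));
    }

    // Reads the stream and yields chunks of the requested duration.
    // The last chunk may be shorter if the stream ends mid-chunk.
    public static async IAsyncEnumerable<byte[]> ReadChunksAsync(
        Stream audio,
        TimeSpan chunkDuration,
        [EnumeratorCancellation] CancellationToken cancellationToken = default)
    {
        var size = BytesFor(chunkDuration);
        if (size <= 0)
            throw new ArgumentOutOfRangeException(nameof(chunkDuration));

        var buffer = new byte[size];
        var filled = 0;
        int read;

        // A single ReadAsync may return fewer bytes than requested,
        // so keep filling the buffer until a full chunk is available.
        while ((read = await audio.ReadAsync(
                   buffer.AsMemory(filled, size - filled), cancellationToken)) > 0)
        {
            filled += read;
            if (filled == size)
            {
                yield return buffer[..filled]; // range on an array returns a copy
                filled = 0;
            }
        }

        if (filled > 0)
            yield return buffer[..filled]; // trailing partial chunk
    }
}
```

Inside RecognizeAsync, a provider could consume it with `await foreach (var chunk in PcmChunker.ReadChunksAsync(audioStream, TimeSpan.FromMilliseconds(100), cancellationToken))`.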

Custom Text-to-Speech Provider

Implement ITextToSpeechProvider to synthesize text into an audio stream:

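using System.Globalization;
using Shiny.Speech;
using Shiny.Speech.Cloud;

public class MyCloudTtsProvider : ITextToSpeechProvider
{
    // Hypothetical endpoint for illustration only; replace with your cloud API.
    const string SynthesisEndpoint = "https://tts.example.com/synthesize";
    static readonly HttpClient Http = new();

    public Task<IReadOnlyList<VoiceInfo>> GetVoicesAsync(
        CultureInfo? culture = null,
        CancellationToken cancellationToken = default)
    {
        // Return available voices from your cloud provider
        IReadOnlyList<VoiceInfo> voices = new[]
        {
            new VoiceInfo("voice-1", "Default Voice", CultureInfo.GetCultureInfo("en-US"))
        };
        return Task.FromResult(voices);
    }

    public async Task<Stream> SynthesizeAsync(
        string text,
        TextToSpeechOptions? options = null,
        CancellationToken cancellationToken = default)
    {
        // Send text to your cloud API (the request shape here is an assumption)
        var response = await Http.PostAsync(
            SynthesisEndpoint, new StringContent(text), cancellationToken);
        response.EnsureSuccessStatusCode();

        // Return the audio stream (MP3 format).
        // The response is not disposed here because the caller reads its stream.
        return await response.Content.ReadAsStreamAsync(cancellationToken);
    }
}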

builder.Services.AddCloudTextToSpeech<MyCloudTtsProvider>();
// IAudioPlayer is automatically registered

Provider Interfaces

ISpeechToTextProvider

public interface ISpeechToTextProvider
{
    IAsyncEnumerable<SpeechRecognitionResult> RecognizeAsync(
        Stream audioStream,
        SpeechRecognitionOptions? options = null,
        CancellationToken cancellationToken = default
    );

    // Raised for non-fatal errors during continuous recognition.
    // CloudSpeechToText forwards this to ISpeechToTextService.Error.
    event EventHandler<SpeechRecognitionError>? Error;
}

ITextToSpeechProvider

public interface ITextToSpeechProvider
{
    Task<IReadOnlyList<VoiceInfo>> GetVoicesAsync(
        CultureInfo? culture = null,
        CancellationToken cancellationToken = default
    );

    Task<Stream> SynthesizeAsync(
        string text,
        TextToSpeechOptions? options = null,
        CancellationToken cancellationToken = default
    );
}

How It Works

When you register a cloud provider, Shiny Speech creates a bridge class (CloudSpeechToText or CloudTextToSpeech) that:

1. Captures audio from IAudioSource (for STT) or receives synthesized audio (for TTS)
2. Delegates recognition/synthesis to your provider implementation
3. Exposes the standard ISpeechToTextService / ITextToSpeechService interface

This means your app code doesn't change: you can swap between platform-native, Azure, ElevenLabs, or your own provider by changing only the DI registration.
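The bridge idea can be sketched in a few lines. Everything below is a hypothetical stand-in: ISketchAudioSource, ISketchSttProvider, ISketchSttService, and their members are invented for illustration and do not reflect the real IAudioSource, ISpeechToTextProvider, or ISpeechToTextService signatures. The point is only that the consuming code depends on the service interface, so the provider behind the bridge can be replaced without touching it.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Threading.Tasks;

// Hypothetical stand-ins; the real Shiny interfaces have different members.
public interface ISketchAudioSource { Stream Open(); }
public interface ISketchSttProvider { IAsyncEnumerable<string> RecognizeAsync(Stream audio); }
public interface ISketchSttService { IAsyncEnumerable<string> ListenAsync(); }

// The bridge: captures audio, delegates to the provider,
// and exposes the service interface the app consumes.
public sealed class SketchCloudBridge : ISketchSttService
{
    private readonly ISketchAudioSource audioSource;
    private readonly ISketchSttProvider provider;

    public SketchCloudBridge(ISketchAudioSource audioSource, ISketchSttProvider provider)
    {
        this.audioSource = audioSource;
        this.provider = provider;
    }

    public IAsyncEnumerable<string> ListenAsync() =>
        provider.RecognizeAsync(audioSource.Open());
}

// Fake audio source that returns a fixed in-memory buffer.
public sealed class FakeAudioSource : ISketchAudioSource
{
    public Stream Open() => new MemoryStream(new byte[10]);
}

// Fake provider that "recognizes" by reporting how many bytes it read.
public sealed class FakeProvider : ISketchSttProvider
{
    public async IAsyncEnumerable<string> RecognizeAsync(Stream audio)
    {
        var buffer = new byte[4];
        int read;
        while ((read = await audio.ReadAsync(buffer)) > 0)
            yield return $"{read} bytes";
    }
}

public static class BridgeDemo
{
    // App code: depends only on the service interface.
    public static async Task<List<string>> CollectAsync(ISketchSttService service)
    {
        var results = new List<string>();
        await foreach (var text in service.ListenAsync())
            results.Add(text);
        return results;
    }

    public static Task<List<string>> RunAsync() =>
        CollectAsync(new SketchCloudBridge(new FakeAudioSource(), new FakeProvider()));
}
```

CollectAsync never references FakeProvider, which mirrors how app code written against ISpeechToTextService is unaffected when the DI registration switches providers.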