# Speech-02-Turbo

> Text-to-Audio (T2A) model offering voice synthesis, emotional expression, and multilingual capabilities. Designed for real-time applications with low latency.

- **Provider**: replicate
- **Model ID**: minimax/speech-02-turbo
- **Category**: tts_voice
- **Credits**: 108 per request
- **Pricing Type**: token_based

## API Endpoint

Base URL: https://api.core.today/v1

### Create Prediction
POST /predictions

### Get Status
GET /predictions/{job_id}

### Cancel
DELETE /predictions/{job_id}

## Authentication

Header: `X-API-Key: YOUR_API_KEY`

## Input Parameters

- `voice_id` (string, optional): Voice to synthesize. Pick any MiniMax system voice or a voice_id returned by https://replicate.com/minimax/voice-cloning.
- `channel` (string, optional): mono for 1 channel (default), stereo for 2 channels.
- `audio_format` (string, optional): File format for the generated audio. Choose mp3 for general use, wav/flac for lossless, or pcm for raw bytes.
- `english_normalization` (boolean, optional): Improve number/date reading for English text (adds a small amount of latency).
- `bitrate` (integer, optional): MP3 bitrate in bits per second. Only used when audio_format is mp3.
- `speed` (number, optional): Speech speed multiplier. Lower is slower, higher is faster. (Range: min: 0.5, max: 2)
- `language_boost` (string, optional): Optional language hint. Choose Automatic to let MiniMax detect the language, or pick a specific locale.
- `subtitle_enable` (boolean, optional): Return MiniMax subtitle metadata with sentence timestamps (non-streaming only).
- `volume` (number, optional): Relative loudness. 1.0 is the default MiniMax gain. (Range: min: 0, max: 10)
- `emotion` (string, optional): Desired delivery style. Use auto to let MiniMax choose, or pick a specific emotion.
- `sample_rate` (integer, optional): Audio sample rate in Hz.
- `text` (string, **required**): Text to narrate (max 10,000 characters). Use markers like `<#0.5#>` to insert pauses, where the number is the pause length in seconds.
- `pitch` (integer, optional): Semitone offset applied to the voice. (Range: min: -12, max: 12)
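
The documented limits can be checked on the client before a request is sent. The following is a minimal sketch, not part of the API: `validate_input` and `pause` are hypothetical helper names, and the sketch assumes that pause markers count toward the 10,000-character limit, which the parameter list does not state.

```python
# Limits copied from the parameter list above; the helpers are illustrative.
MAX_TEXT_CHARS = 10_000
RANGES = {"speed": (0.5, 2.0), "volume": (0, 10), "pitch": (-12, 12)}
CHOICES = {
    "channel": {"mono", "stereo"},
    "audio_format": {"mp3", "wav", "flac", "pcm"},
}


def validate_input(params: dict) -> list[str]:
    """Return a list of problems found in an `input` object (empty if none)."""
    problems = []

    text = params.get("text")
    if not isinstance(text, str) or not text:
        problems.append("text is required")
    elif len(text) > MAX_TEXT_CHARS:
        # ASSUMPTION: every character, including pause markers, counts.
        problems.append(f"text exceeds {MAX_TEXT_CHARS} characters")

    for name, (low, high) in RANGES.items():
        if name in params:
            value = params[name]
            if not isinstance(value, (int, float)) or not (low <= value <= high):
                problems.append(f"{name} must be between {low} and {high}")

    if "pitch" in params and not isinstance(params["pitch"], int):
        problems.append("pitch must be an integer")

    for name, allowed in CHOICES.items():
        if name in params and params[name] not in allowed:
            problems.append(f"{name} must be one of {sorted(allowed)}")

    return problems


def pause(seconds: float) -> str:
    """Format a pause marker such as <#0.5#> for embedding in `text`."""
    return f"<#{seconds}#>"
```

For example, `"Hello." + pause(0.5) + "Goodbye."` produces text with a half-second pause between the two sentences.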

## Example Request

```json
{
  "model": "minimax/speech-02-turbo",
  "input": {
    "volume": 1,
    "emotion": "angry",
    "sample_rate": 32000,
    "voice_id": "Deep_Voice_Man",
    "channel": "mono",
    "english_normalization": true,
    "bitrate": 128000,
    "text": "Speech-02-series is a Text-to-Audio and voice cloning technology that offers voice synthesis, emotional expression, and multilingual capabilities.\n\nThe HD version is optimized for high-fidelity applications like voiceovers and audiobooks. While the turbo one is designed for real-time applications with low latency.\n\nWhen using this model on Replicate, each character represents 1 token.",
    "pitch": 0,
    "speed": 1,
    "language_boost": "English"
  }
}
```
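
A request body like the one above could be sent with Python's standard library roughly as follows. This is a sketch under assumptions: `build_create_request` is a hypothetical helper name, and the `Content-Type: application/json` header is assumed rather than documented here. The function only builds the request; nothing is sent until `urlopen` is called.

```python
import json
import urllib.request

BASE_URL = "https://api.core.today/v1"


def build_create_request(api_key: str, text: str, **options) -> urllib.request.Request:
    """Build (but do not send) a POST /predictions request."""
    payload = {
        "model": "minimax/speech-02-turbo",
        "input": {"text": text, **options},
    }
    return urllib.request.Request(
        f"{BASE_URL}/predictions",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "X-API-Key": api_key,
            # ASSUMPTION: the API expects a JSON content type.
            "Content-Type": "application/json",
        },
        method="POST",
    )


req = build_create_request(
    "YOUR_API_KEY",
    "Hello there.",
    voice_id="Deep_Voice_Man",
    emotion="auto",
)
# To send it: urllib.request.urlopen(req), then json.load() the response.
```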

## Response Format

```json
{
  "job_id": "abc123",
  "status": "pending",
  "provider": "replicate",
  "model": "minimax/speech-02-turbo",
  "created_at": "2026-01-01T00:00:00Z",
  "result": null,
  "error": null
}
```

Status values: `pending`, `processing`, `completed`, `failed`, `cancelled`

## Usage Flow

1. POST /predictions with model and input → receive job_id
2. Poll GET /predictions/{job_id} until status is `completed` or `failed`
3. Result contains output URL(s) or data
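
The polling step in the flow above might be sketched like this. The names `fetch_job` and `wait_for_job`, the one-second interval, and the two-minute timeout are illustrative choices, not part of the API. The sketch also treats `cancelled` as a terminal state alongside `completed` and `failed`, which is an assumption based on the status list.

```python
import json
import time
import urllib.request
from typing import Callable

BASE_URL = "https://api.core.today/v1"
# ASSUMPTION: `cancelled` ends polling just like `completed` and `failed`.
TERMINAL_STATUSES = {"completed", "failed", "cancelled"}


def fetch_job(api_key: str, job_id: str) -> dict:
    """GET /predictions/{job_id} and return the decoded JSON body."""
    req = urllib.request.Request(
        f"{BASE_URL}/predictions/{job_id}",
        headers={"X-API-Key": api_key},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


def wait_for_job(
    fetch_status: Callable[[], dict],
    interval: float = 1.0,
    timeout: float = 120.0,
    sleep: Callable[[float], None] = time.sleep,
) -> dict:
    """Call fetch_status until the job reaches a terminal status or times out."""
    deadline = time.monotonic() + timeout
    while True:
        job = fetch_status()
        if job["status"] in TERMINAL_STATUSES:
            return job
        if time.monotonic() >= deadline:
            raise TimeoutError(f"job still {job['status']} after {timeout}s")
        sleep(interval)
```

Typical use would be `wait_for_job(lambda: fetch_job(api_key, job_id))`, followed by a check that `status` is `completed` before reading `result`. Passing the fetch function as an argument keeps the loop testable without network access.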

## Output Type

url

## Tags

text-to-speech, tts, voice-synthesis, voice-cloning, multilingual, emotion-control, real-time, low-latency, minimax

## Documentation

https://replicate.com/minimax/speech-02-turbo


## Token Pricing

- Input: 0.108 credits/token
- Output: 0.108 credits/token
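
As a rough illustration of the input rate, the sketch below estimates the input-side cost of a piece of text. It assumes one token per character, as the example request text says of this model on Replicate. How output tokens are counted is not stated here, so they are left out. `estimate_input_credits` is a hypothetical helper name.

```python
from decimal import Decimal

# Input rate from the list above.
INPUT_CREDITS_PER_TOKEN = Decimal("0.108")


def estimate_input_credits(text: str) -> Decimal:
    """Estimate input credits for `text`.

    ASSUMPTION: one token per character. Output-token charges are not included.
    """
    return INPUT_CREDITS_PER_TOKEN * len(text)
```

Under that assumption, 1,000 characters of text come to 108 input credits.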