CartesiaPP

C++ wrapper for the Cartesia.ai audio processing API. Supports text-to-speech synthesis and speech-to-text transcription via REST and WebSocket connections.

Features

Text-to-Speech: Generate audio from text with various voice options and output formats
Speech-to-Text: Transcribe audio files or byte streams to text with word-level timing
Voice Management: List and retrieve available voices with filtering options
Multiple Output Formats: WAV, MP3, PCM with configurable sample rates and encodings
Streaming Support: Real-time audio processing via WebSocket connections

Build

Requires CMake 3.16+ and vcpkg for dependencies.

Building from Source

git clone https://github.com/fatehmtd/cartesiapp.git
cd cartesiapp
cmake -B build -S . -DCMAKE_TOOLCHAIN_FILE=[vcpkg-root]/scripts/buildsystems/vcpkg.cmake
cmake --build build

Using as a Dependency

CMake FetchContent

Add CartesiaPP as a dependency in your CMake project:

include(FetchContent)

FetchContent_Declare(
    cartesiapp
    GIT_REPOSITORY https://github.com/fatehmtd/cartesiapp.git
    GIT_TAG        main  # or specify a specific tag/commit
)

FetchContent_MakeAvailable(cartesiapp)

# Link to your target
target_link_libraries(your_target PRIVATE cartesiapp)

Git Submodule

Add CartesiaPP as a git submodule:

# Add as submodule
git submodule add https://github.com/fatehmtd/cartesiapp.git deps/cartesiapp
git submodule update --init --recursive

# In your CMakeLists.txt
add_subdirectory(deps/cartesiapp)
target_link_libraries(your_target PRIVATE cartesiapp)

vcpkg (if available)

vcpkg install cartesiapp

Then in your CMakeLists.txt:

find_package(cartesiapp CONFIG REQUIRED)
target_link_libraries(your_target PRIVATE cartesiapp::cartesiapp)

Usage

Set your API key as an environment variable:

export CARTESIA_API_KEY=your_api_key_here
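
The examples below use an `apiKey` variable without showing where it comes from. One way to obtain it is to read the environment variable at startup, sketched here with only the standard library. `readApiKey` is a hypothetical helper for illustration and is not part of CartesiaPP.

```cpp
#include <cstdlib>
#include <string>

// Hypothetical helper (not part of CartesiaPP): returns the value of the
// given environment variable, or an empty string when it is not set.
std::string readApiKey(const char* name = "CARTESIA_API_KEY") {
    const char* value = std::getenv(name);  // nullptr when the variable is unset
    return value != nullptr ? std::string(value) : std::string();
}
```

A caller can check `readApiKey().empty()` and stop with an error message before constructing a client, which avoids building a `std::string` from a null pointer.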

Basic Text-to-Speech (Bytes)

```cpp
#include <cartesiapp/cartesiapp.hpp>

#include <cstdlib>
#include <fstream>
#include <string>

// Read the API key from the environment
const char* envKey = std::getenv("CARTESIA_API_KEY");
std::string apiKey = envKey ? envKey : "";

cartesiapp::Cartesia client(apiKey);

// Get available voices
cartesiapp::request::VoiceListRequest voiceRequest;
voiceRequest.gender = cartesiapp::request::voice_gender::FEMININE;
auto voices = client.getVoiceList(voiceRequest);

// Configure TTS request (uses the first returned voice; check that
// voices.voices is not empty before indexing in real code)
cartesiapp::request::TTSBytesRequest ttsRequest;
ttsRequest.transcript = "Hello world, this is a test message.";
ttsRequest.voice.id = voices.voices[0].id;
ttsRequest.output_format.container = cartesiapp::request::container::WAV;
ttsRequest.output_format.encoding = cartesiapp::request::tts_encoding::PCM_S16LE;
ttsRequest.output_format.sample_rate = cartesiapp::request::sample_rate::SR_44100;

// Generate audio
std::string audioData = client.ttsBytes(ttsRequest);

// Save to file
std::ofstream outFile("output.wav", std::ios::binary);
outFile.write(audioData.data(), static_cast<std::streamsize>(audioData.size()));
```
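
Before writing the returned bytes to disk, it can be useful to confirm that they look like a WAV file. The sketch below assumes the returned string holds the complete file contents when the WAV container is requested, as the example above suggests. `looksLikeWav` is a hypothetical helper, not a CartesiaPP function.

```cpp
#include <string>

// Returns true when `bytes` starts with a RIFF/WAVE header: the ASCII tag
// "RIFF" at offset 0 and "WAVE" at offset 8 (offsets 4-7 hold the chunk size).
bool looksLikeWav(const std::string& bytes) {
    return bytes.size() >= 12 &&
           bytes.compare(0, 4, "RIFF") == 0 &&
           bytes.compare(8, 4, "WAVE") == 0;
}
```

A failed check usually means an error body was returned instead of audio, so logging the first few bytes is a reasonable next step.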

Streaming Text-to-Speech

#include <cartesiapp/streaming_tts.hpp>

#include <memory>
#include <spdlog/spdlog.h>

class MyTTSListener : public cartesiapp::TTSResponseListener {
public:
    void onAudioChunkReceived(const cartesiapp::response::tts::AudioChunkResponse& response) override {
        // Process audio chunks in real time.
        // processAudioChunk is a placeholder for your own handler.
        processAudioChunk(response.data);
    }

    void onDoneReceived(const cartesiapp::response::tts::DoneResponse& response) override {
        spdlog::info("TTS complete");
    }
};

// apiKey and apiVersion are values you supply
auto listener = std::make_shared<MyTTSListener>();
auto websocketClient = std::make_unique<cartesiapp::TTSWebsocketClient>(apiKey, apiVersion);
websocketClient->registerTTSListener(listener);
websocketClient->connectAndStart();

cartesiapp::request::tts::GenerationRequest request;
request.transcript = "Your text here";
request.voice.id = "voice-id";
websocketClient->requestTTS(request);
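
The listener above hands each chunk to `processAudioChunk`, which the example leaves undefined. One possible shape for such a handler is a small thread-safe buffer, sketched below. It assumes the chunk payload can be treated as a `std::string` of raw bytes and that callbacks may run on a different thread from the consumer; neither is confirmed by this README. `AudioAccumulator` is a hypothetical name.

```cpp
#include <cstddef>
#include <mutex>
#include <string>
#include <utility>

// Hypothetical thread-safe sink for streamed audio (not part of CartesiaPP).
class AudioAccumulator {
public:
    // Intended to be called from the listener callback for each chunk.
    void append(const std::string& chunk) {
        std::lock_guard<std::mutex> lock(mutex_);
        buffer_ += chunk;
    }

    // Moves out everything received so far and leaves the buffer empty.
    std::string take() {
        std::lock_guard<std::mutex> lock(mutex_);
        std::string out = std::move(buffer_);
        buffer_.clear();  // a moved-from string is valid but unspecified
        return out;
    }

    std::size_t size() const {
        std::lock_guard<std::mutex> lock(mutex_);
        return buffer_.size();
    }

private:
    mutable std::mutex mutex_;
    std::string buffer_;
};
```

A playback or file-writing thread can call `take()` periodically, so the callback itself stays short.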

Speech-to-Text (File)

#include <iostream>

// `client` is the cartesiapp::Cartesia instance from the first example.
cartesiapp::request::stt::BatchRequest sttRequest;
auto response = client.sttWithFile("audio.mp3", sttRequest);

std::cout << "Transcribed: " << response.text << std::endl;
std::cout << "Duration: " << response.duration << " seconds" << std::endl;

// Word-level timestamps (in seconds, like duration)
for (const auto& word : response.words) {
    std::cout << word.word << " [" << word.start << "-" << word.end << "s]" << std::endl;
}
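
Word-level timestamps can also be used to break a transcript into phrases at pauses. The sketch below is independent of the library: `TimedWord` is a hypothetical stand-in for one entry of `response.words`, and `maxGap` uses whatever time unit the timestamps use.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in for one entry of `response.words`.
struct TimedWord {
    std::string word;
    double start;  // same unit as the API response
    double end;
};

// Groups consecutive words into phrases, starting a new phrase whenever the
// pause between one word's end and the next word's start exceeds maxGap.
std::vector<std::string> splitIntoPhrases(const std::vector<TimedWord>& words,
                                          double maxGap) {
    std::vector<std::string> phrases;
    for (std::size_t i = 0; i < words.size(); ++i) {
        const bool startNew =
            i == 0 || (words[i].start - words[i - 1].end) > maxGap;
        if (startNew) {
            phrases.push_back(words[i].word);
        } else {
            phrases.back() += " " + words[i].word;
        }
    }
    return phrases;
}
```

The same loop can be adapted to emit subtitle cues by also recording the start of the first word and the end of the last word in each phrase.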

Streaming Speech-to-Text

#include <cartesiapp/streaming_stt.hpp>

#include <iostream>
#include <memory>

class MySTTListener : public cartesiapp::STTResponseListener {
public:
    void onTranscriptionReceived(const cartesiapp::response::stt::TranscriptionResponse& response) override {
        // Label each message as either a partial or a final result
        if (response.is_final) {
            std::cout << "Final: " << response.text << std::endl;
        } else {
            std::cout << "Partial: " << response.text << std::endl;
        }
    }
};

// model, encoding, and sampleRate are values you supply;
// encoding and sampleRate must describe the audio you stream
cartesiapp::STTWebsocketClient sttClient(apiKey, model, "en", encoding, sampleRate);
auto listener = std::make_shared<MySTTListener>();
sttClient.registerSTTListener(listener);
sttClient.connectAndStart();

// Stream audio data
sttClient.writeAudioBytes(audioBuffer.data(), audioBuffer.size());
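
For live transcription, audio is normally sent in small pieces rather than as one large buffer. The sketch below shows one way to slice a buffer into fixed-duration chunks and pass each to a sink. It assumes mono 16-bit PCM and a `std::vector<char>` buffer; the 100 ms chunk length in the note below is an arbitrary choice, not a library requirement. `bytesForDuration` and `sendInChunks` are hypothetical helpers.

```cpp
#include <algorithm>
#include <cstddef>
#include <functional>
#include <vector>

// Bytes needed for `millis` milliseconds of mono 16-bit PCM at `sampleRate` Hz.
std::size_t bytesForDuration(std::size_t sampleRate, std::size_t millis) {
    const std::size_t bytesPerSample = 2;  // 16-bit samples, one channel
    return sampleRate * bytesPerSample * millis / 1000;
}

// Passes `audio` to `sink` in pieces of at most `chunkSize` bytes, in order.
void sendInChunks(const std::vector<char>& audio, std::size_t chunkSize,
                  const std::function<void(const char*, std::size_t)>& sink) {
    if (chunkSize == 0) {
        return;  // nothing sensible to do; avoids an infinite loop
    }
    for (std::size_t offset = 0; offset < audio.size(); offset += chunkSize) {
        const std::size_t n = std::min(chunkSize, audio.size() - offset);
        sink(audio.data() + offset, n);
    }
}
```

With a connected client, the sink could be a lambda that forwards to `sttClient.writeAudioBytes(data, size)`, using `bytesForDuration(sampleRate, 100)` as the chunk size.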

Sample Applications

The samples/ directory contains working examples demonstrating all library features:

Text-to-Speech Samples

sample-tts-bytes.cpp - Basic TTS with byte output
- Demonstrates voice listing and selection
- Configurable audio formats (WAV, MP3, PCM)
- Emotion and speed control
- Saves generated audio to file

sample-tts-streaming.cpp - Real-time TTS streaming
- WebSocket-based streaming audio generation
- Custom response listener implementation
- Real-time audio chunk processing
- Performance metrics (time to first byte)
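
Time to first byte can be measured with a monotonic clock: record when the request is sent, then record the arrival of the first chunk only. The sample's own measurement code is not reproduced here; the class below is a hypothetical sketch of the idea using `std::chrono::steady_clock`.

```cpp
#include <chrono>

// Hypothetical helper: records when a request was sent and when the first
// audio chunk arrived, and reports the difference in milliseconds.
class FirstByteTimer {
public:
    using Clock = std::chrono::steady_clock;

    void markRequestSent() {
        start_ = Clock::now();
        gotFirst_ = false;
    }

    // Call on every chunk; only the first call after markRequestSent counts.
    void markChunkReceived() {
        if (!gotFirst_) {
            first_ = Clock::now();
            gotFirst_ = true;
        }
    }

    bool hasResult() const { return gotFirst_; }

    // Milliseconds from request to first chunk; 0 if no chunk has arrived yet.
    double millis() const {
        if (!gotFirst_) {
            return 0.0;
        }
        return std::chrono::duration<double, std::milli>(first_ - start_).count();
    }

private:
    Clock::time_point start_{};
    Clock::time_point first_{};
    bool gotFirst_ = false;
};
```

In a streaming listener, `markRequestSent()` would go just before the TTS request is issued and `markChunkReceived()` inside the audio chunk callback.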

Speech-to-Text Samples

sample-stt-file.cpp - Batch STT from audio files
- Supports MP3, WAV, and other audio formats
- Word-level timestamp extraction
- Language detection and configuration

sample-stt-streaming.cpp - Real-time STT streaming
- WebSocket-based streaming transcription
- Live audio processing with chunked data
- Partial and final transcription results
- Custom audio encoding support

Running the Samples

cd build/samples/Debug
./CartesiaPP_Sample_TTS_Bytes.exe
./CartesiaPP_Sample_TTS_Streaming.exe
./CartesiaPP_Sample_STT_File.exe
./CartesiaPP_Sample_STT_Streaming.exe

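The directory and the .exe suffix above correspond to a Debug build on Windows; on other platforms or build configurations the output directory and executable names may differ. Each sample includes detailed logging and error handling to help you understand the API workflow.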

API Feature Coverage

CartesiaPP currently implements a subset of the full Cartesia API. Here's what's supported:

✅ Implemented Features

Core API

✅ API status and version checking
✅ Authentication via API keys
✅ Multiple API version support (2024-06-10, 2024-11-13, 2025-04-16)

Text-to-Speech (TTS)

✅ Byte-based synthesis (REST API)
✅ Real-time streaming synthesis (WebSocket)
✅ Voice selection and management
✅ Multiple output formats (WAV, MP3, PCM)
✅ Audio encoding options (PCM_S16LE, PCM_F32LE, PCM_MULAW, PCM_ALAW)
✅ Sample rate configuration (8 kHz to 48 kHz)
✅ Emotion control (50+ emotions supported)
✅ Speed and volume adjustment
✅ Context management for streaming
✅ Word and phoneme timestamp support
✅ Multiple TTS models (Sonic-3, Sonic-2)

Speech-to-Text (STT)

✅ Batch transcription from files
✅ Batch transcription from byte arrays
✅ Real-time streaming transcription (WebSocket)
✅ Multiple audio formats support
✅ Word-level timestamps
✅ Language specification
✅ Audio encoding configuration
✅ INK-Whisper model support

Voice Management

✅ List available voices with filtering
✅ Voice details retrieval
✅ Gender-based filtering
✅ Owner/starred voice filtering
✅ Pagination support

❌ Missing Features (TODO)

Voice Cloning & Management

❌ Voice cloning from audio samples
❌ Custom voice creation and upload
❌ Voice deletion
❌ Voice sharing and privacy controls
❌ Voice embedding management

Advanced TTS Features

❌ Server-Sent Events (SSE) streaming
❌ Pronunciation dictionary support
❌ Voice embedding mode
❌ Advanced generation controls (pitch, emphasis)
❌ Multi-speaker synthesis

Advanced STT Features

❌ Language auto-detection
❌ Custom vocabulary/models
❌ Confidence scores
❌ Speaker diarization
❌ Advanced audio preprocessing

Conversation Agents

❌ Agent creation and management
❌ Agent deployment
❌ Phone number integration
❌ Webhook configuration
❌ Conversation metrics and analytics
❌ Git repository integration

Enterprise Features

❌ Usage analytics and metrics
❌ Rate limiting information
❌ Billing and quota management
❌ Team/organization management
❌ Audit logging

Roadmap

Phase 1: Voice Management

Implement voice cloning API
Add voice deletion functionality
Support voice privacy controls

Phase 2: Advanced TTS

Add Server-Sent Events (SSE) support
Implement pronunciation dictionary features
Support voice embedding mode

Phase 3: Enterprise Features

Add usage metrics and analytics
Implement rate limiting support
Add billing/quota information APIs

Phase 4: Conversation Agents

Basic agent management
Agent deployment support
Webhook integration

Contributions are welcome! See the missing features above for areas where help is needed.

License

See LICENSE file for details.
