C++ wrapper for the Cartesia.ai audio processing API. Supports text-to-speech synthesis and speech-to-text transcription via REST and WebSocket connections.
- Text-to-Speech: Generate audio from text with various voice options and output formats
- Speech-to-Text: Transcribe audio files or byte streams to text with word-level timing
- Voice Management: List and retrieve available voices with filtering options
- Multiple Output Formats: WAV, MP3, PCM with configurable sample rates and encodings
- Streaming Support: Real-time audio processing via WebSocket connections

Requires CMake 3.16+ and vcpkg for dependencies.

    git clone https://github.com/fatehmtd/cartesiapp.git
    cd cartesiapp
    cmake -B build -S . -DCMAKE_TOOLCHAIN_FILE=[vcpkg-root]/scripts/buildsystems/vcpkg.cmake
    cmake --build build

Add CartesiaPP as a dependency in your CMake project:

    include(FetchContent)
    FetchContent_Declare(
      cartesiapp
      GIT_REPOSITORY https://github.com/fatehmtd/cartesiapp.git
      GIT_TAG main # or pin a specific tag/commit
    )
    FetchContent_MakeAvailable(cartesiapp)

    # Link to your target
    target_link_libraries(your_target PRIVATE cartesiapp)

Add CartesiaPP as a git submodule:

    # Add as submodule
    git submodule add https://github.com/fatehmtd/cartesiapp.git deps/cartesiapp
    git submodule update --init --recursive

    # In your CMakeLists.txt
    add_subdirectory(deps/cartesiapp)
    target_link_libraries(your_target PRIVATE cartesiapp)

Or install it with vcpkg:

    vcpkg install cartesiapp

Then in your CMakeLists.txt:

    find_package(cartesiapp CONFIG REQUIRED)
    target_link_libraries(your_target PRIVATE cartesiapp::cartesiapp)

Set your API key as an environment variable:

    export CARTESIA_API_KEY=your_api_key_here

### Basic Text-to-Speech (Bytes)
```cpp
#include <cartesiapp/cartesiapp.hpp>
#include <fstream>
#include <string>

// apiKey: your Cartesia API key, e.g. the value of CARTESIA_API_KEY
cartesiapp::Cartesia client(apiKey);

// Get available voices
cartesiapp::request::VoiceListRequest voiceRequest;
voiceRequest.gender = cartesiapp::request::voice_gender::FEMININE;
auto voices = client.getVoiceList(voiceRequest);

// Configure TTS request
// (uses the first voice returned; check that voices.voices is not empty first)
cartesiapp::request::TTSBytesRequest ttsRequest;
ttsRequest.transcript = "Hello world, this is a test message.";
ttsRequest.voice.id = voices.voices[0].id;
ttsRequest.output_format.container = cartesiapp::request::container::WAV;
ttsRequest.output_format.encoding = cartesiapp::request::tts_encoding::PCM_S16LE;
ttsRequest.output_format.sample_rate = cartesiapp::request::sample_rate::SR_44100;

// Generate audio
std::string audioData = client.ttsBytes(ttsRequest);

// Save to file
std::ofstream outFile("output.wav", std::ios::binary);
outFile.write(audioData.data(), static_cast<std::streamsize>(audioData.size()));
```

### Streaming Text-to-Speech (WebSocket)

```cpp
#include <cartesiapp/streaming_tts.hpp>
#include <spdlog/spdlog.h>
#include <memory>

class MyTTSListener : public cartesiapp::TTSResponseListener {
public:
    void onAudioChunkReceived(const cartesiapp::response::tts::AudioChunkResponse& response) override {
        // Process audio chunks in real time (processAudioChunk is your own function)
        processAudioChunk(response.data);
    }

    void onDoneReceived(const cartesiapp::response::tts::DoneResponse& response) override {
        spdlog::info("TTS complete");
    }
};

auto listener = std::make_shared<MyTTSListener>();
// apiVersion selects the Cartesia API version to use
auto websocketClient = std::make_unique<cartesiapp::TTSWebsocketClient>(apiKey, apiVersion);
websocketClient->registerTTSListener(listener);
websocketClient->connectAndStart();

cartesiapp::request::tts::GenerationRequest request;
request.transcript = "Your text here";
request.voice.id = "voice-id";
websocketClient->requestTTS(request);
```

### Speech-to-Text from a File

```cpp
#include <iostream>

// client is the cartesiapp::Cartesia instance created in the first example
cartesiapp::request::stt::BatchRequest sttRequest;
auto response = client.sttWithFile("audio.mp3", sttRequest);

std::cout << "Transcribed: " << response.text << std::endl;
std::cout << "Duration: " << response.duration << " seconds" << std::endl;

// Word-level timestamps
for (const auto& word : response.words) {
    std::cout << word.word << " [" << word.start << "-" << word.end << "ms]" << std::endl;
}
```

### Streaming Speech-to-Text (WebSocket)

```cpp
#include <cartesiapp/streaming_stt.hpp>
#include <iostream>
#include <memory>

class MySTTListener : public cartesiapp::STTResponseListener {
public:
    void onTranscriptionReceived(const cartesiapp::response::stt::TranscriptionResponse& response) override {
        // Label each result once, as either final or partial
        if (response.is_final) {
            std::cout << "Final: " << response.text << std::endl;
        } else {
            std::cout << "Partial: " << response.text << std::endl;
        }
    }
};

// model, encoding, and sampleRate must describe the audio you are about to send
cartesiapp::STTWebsocketClient sttClient(apiKey, model, "en", encoding, sampleRate);
auto listener = std::make_shared<MySTTListener>();
sttClient.registerSTTListener(listener);
sttClient.connectAndStart();

// Stream audio data (audioBuffer is your own buffer of encoded audio bytes)
sttClient.writeAudioBytes(audioBuffer.data(), audioBuffer.size());
```

The `samples/` directory contains working examples demonstrating all library features:

- `sample-tts-bytes.cpp` - Basic TTS with byte output
  - Demonstrates voice listing and selection
  - Configurable audio formats (WAV, MP3, PCM)
  - Emotion and speed control
  - Saves generated audio to file
- `sample-tts-streaming.cpp` - Real-time TTS streaming
  - WebSocket-based streaming audio generation
  - Custom response listener implementation
  - Real-time audio chunk processing
  - Performance metrics (time to first byte)
- `sample-stt-file.cpp` - Batch STT from audio files
  - Supports MP3, WAV, and other audio formats
  - Word-level timestamp extraction
  - Language detection and configuration
- `sample-stt-streaming.cpp` - Real-time STT streaming
  - WebSocket-based streaming transcription
  - Live audio processing with chunked data
  - Partial and final transcription results
  - Custom audio encoding support
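
The streaming TTS sample reports time to first byte. As a rough illustration of how such a metric can be measured, here is a minimal sketch using `std::chrono::steady_clock`. The `FirstChunkTimer` class and its method names are hypothetical and are not part of CartesiaPP; the sample itself may measure this differently.

```cpp
#include <chrono>
#include <optional>

// Hypothetical helper (not part of CartesiaPP): measures the delay between
// sending a request and receiving the first audio chunk.
class FirstChunkTimer {
public:
    using Clock = std::chrono::steady_clock;

    // Call immediately before sending the TTS request.
    void markRequestSent() {
        start_ = Clock::now();
        started_ = true;
        latency_.reset();
    }

    // Call from every audio-chunk callback. Only the first call after
    // markRequestSent() records a latency; later calls are ignored.
    void markChunkReceived() {
        if (started_ && !latency_) {
            latency_ = std::chrono::duration_cast<std::chrono::milliseconds>(
                Clock::now() - start_);
        }
    }

    // Empty until the first chunk has been seen.
    std::optional<std::chrono::milliseconds> latency() const {
        return latency_;
    }

private:
    Clock::time_point start_{};
    bool started_ = false;
    std::optional<std::chrono::milliseconds> latency_;
};
```

In a listener like the ones shown in this README, one would presumably call `markRequestSent()` right before `requestTTS(...)` and `markChunkReceived()` inside `onAudioChunkReceived(...)`. A steady clock is used so that system clock adjustments do not affect the measurement.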

Run the samples from the build output directory (a Debug build is shown):

    cd build/samples/Debug
    ./CartesiaPP_Sample_TTS_Bytes.exe
    ./CartesiaPP_Sample_TTS_Streaming.exe
    ./CartesiaPP_Sample_STT_File.exe
    ./CartesiaPP_Sample_STT_Streaming.exe

Each sample includes detailed logging and error handling to help you understand the API workflow.
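
The usage snippets in this README pass an `apiKey` value to the clients, and the setup step exports `CARTESIA_API_KEY`. A minimal sketch of connecting the two is shown below. The `readEnv` helper is hypothetical (it is not part of CartesiaPP), and the sketch assumes the client constructors accept a `std::string`.

```cpp
#include <cstdlib>
#include <optional>
#include <string>

// Hypothetical helper (not part of CartesiaPP): returns the value of an
// environment variable, or std::nullopt when it is unset or empty.
std::optional<std::string> readEnv(const char* name) {
    const char* value = std::getenv(name);
    if (value == nullptr || *value == '\0') {
        return std::nullopt;
    }
    return std::string(value);
}
```

A caller could then write `auto apiKey = readEnv("CARTESIA_API_KEY");`, report an error if the result is empty, and otherwise pass `*apiKey` to the client. Returning an optional keeps the "missing key" case explicit instead of constructing a `std::string` from a null pointer, which is undefined behavior.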

CartesiaPP currently implements a subset of the full Cartesia API. Here's what's supported:

Core API:

- ✅ API status and version checking
- ✅ Authentication via API keys
- ✅ Multiple API version support (2024-06-10, 2024-11-13, 2025-04-16)

Text-to-Speech:

- ✅ Byte-based synthesis (REST API)
- ✅ Real-time streaming synthesis (WebSocket)
- ✅ Voice selection and management
- ✅ Multiple output formats (WAV, MP3, PCM)
- ✅ Audio encoding options (PCM_S16LE, PCM_F32LE, PCM_MULAW, PCM_ALAW)
- ✅ Sample rate configuration (8kHz to 48kHz)
- ✅ Emotion control (50+ emotions supported)
- ✅ Speed and volume adjustment
- ✅ Context management for streaming
- ✅ Word and phoneme timestamp support
- ✅ Multiple TTS models (Sonic-3, Sonic-2)

Speech-to-Text:

- ✅ Batch transcription from files
- ✅ Batch transcription from byte arrays
- ✅ Real-time streaming transcription (WebSocket)
- ✅ Multiple audio formats support
- ✅ Word-level timestamps
- ✅ Language specification
- ✅ Audio encoding configuration
- ✅ INK-Whisper model support

Voice management:

- ✅ List available voices with filtering
- ✅ Voice details retrieval
- ✅ Gender-based filtering
- ✅ Owner/starred voice filtering
- ✅ Pagination support

The following features are not yet implemented:

Voice management:

- ❌ Voice cloning from audio samples
- ❌ Custom voice creation and upload
- ❌ Voice deletion
- ❌ Voice sharing and privacy controls
- ❌ Voice embedding management

Text-to-Speech:

- ❌ Server-Sent Events (SSE) streaming
- ❌ Pronunciation dictionary support
- ❌ Voice embedding mode
- ❌ Advanced generation controls (pitch, emphasis)
- ❌ Multi-speaker synthesis

Speech-to-Text:

- ❌ Language auto-detection
- ❌ Custom vocabulary/models
- ❌ Confidence scores
- ❌ Speaker diarization
- ❌ Advanced audio preprocessing

Agents:

- ❌ Agent creation and management
- ❌ Agent deployment
- ❌ Phone number integration
- ❌ Webhook configuration
- ❌ Conversation metrics and analytics
- ❌ Git repository integration

Account and administration:

- ❌ Usage analytics and metrics
- ❌ Rate limiting information
- ❌ Billing and quota management
- ❌ Team/organization management
- ❌ Audit logging

Planned work:

- Implement voice cloning API
- Add voice deletion functionality
- Support voice privacy controls
- Add Server-Sent Events (SSE) support
- Implement pronunciation dictionary features
- Support voice embedding mode
- Add usage metrics and analytics
- Implement rate limiting support
- Add billing/quota information APIs
- Basic agent management
- Agent deployment support
- Webhook integration

Contributions are welcome! See the missing features above for areas where help is needed.

See LICENSE file for details.
