Cerebrium - Pipecat

Cerebrium is a serverless infrastructure platform that makes it easy for companies to build, deploy, and scale AI applications. It offers both CPUs and GPUs (H100s, A100s, etc.) with low cold start times, which makes it possible to build performant applications cost-efficiently. This guide walks through deploying a bot to Cerebrium’s serverless platform. Architecturally, it sits in the managed-runtime family: Cerebrium owns the container lifecycle and scaling, though it gives closer access to the underlying compute (and GPU support) than typical agent runtimes offer.

Install the Cerebrium CLI

To get started, run the following commands:

Run uv tool install cerebrium to install the Cerebrium CLI globally.
Run cerebrium login to authenticate yourself.

If you don’t have a Cerebrium account, you can create one and get started with $30 in free credits.

Create a Cerebrium project

Create a new Cerebrium project:

cerebrium init pipecat-agent

This will create two key files:

main.py - Your application entrypoint
cerebrium.toml - Configuration for build and environment settings

Update your cerebrium.toml with the necessary configuration:

[cerebrium.hardware]
region = "us-east-1"
provider = "aws"
compute = "CPU"
cpu = 4
memory = 18.0

[cerebrium.dependencies.pip]
torch = ">=2.0.0"
"pipecat-ai[silero, daily, openai, cartesia, deepgram]" = "latest"
aiohttp = "latest"
torchaudio = "latest"

The application needs API keys from several providers. Navigate to the Secrets section in your Cerebrium dashboard and store the following keys:

OPENAI_API_KEY - We use OpenAI for the LLM. Create a key in your OpenAI account.
DAILY_API_KEY - For WebRTC transport (used to create rooms and tokens). Create a key in your Daily account.
DEEPGRAM_API_KEY - For speech-to-text. Create a key in your Deepgram account.
CARTESIA_API_KEY - For text-to-speech. Create a key in your Cartesia account.

Cerebrium exposes these secrets to your code as ordinary environment variables. You can swap in any LLM, STT, or TTS service you wish to use.
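Because the secrets arrive as environment variables, a missing key usually surfaces only when the first service call fails. The sketch below checks for them up front. The helper names `missing_secrets` and `require_secrets` are hypothetical and not part of Cerebrium or Pipecat; the secret names are the ones listed above.

```python
import os

# Secret names taken from the list above; adjust if you swap providers.
REQUIRED_SECRETS = (
    "OPENAI_API_KEY",
    "DAILY_API_KEY",
    "DEEPGRAM_API_KEY",
    "CARTESIA_API_KEY",
)


def missing_secrets(env=None):
    """Return the required secret names that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_SECRETS if not env.get(name)]


def require_secrets(env=None):
    """Raise early with a readable message if any secret is missing."""
    missing = missing_secrets(env)
    if missing:
        raise RuntimeError(f"Missing secrets: {', '.join(missing)}")
```

Calling `require_secrets()` at the top of your entrypoint is one way to fail fast with a clear error instead of a provider-specific authentication failure mid-call.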

Agent setup

We create a basic pipeline in main.py that combines STT, the LLM, TTS, and the Daily WebRTC transport layer.

import os
import sys

from loguru import logger

from pipecat.audio.vad.silero import SileroVADAnalyzer
from pipecat.frames.frames import EndFrame, LLMRunFrame
from pipecat.pipeline.pipeline import Pipeline
from pipecat.pipeline.runner import PipelineRunner
from pipecat.pipeline.task import PipelineParams, PipelineTask
from pipecat.processors.aggregators.llm_context import LLMContext
from pipecat.processors.aggregators.llm_response_universal import (
    LLMContextAggregatorPair,
    LLMUserAggregatorParams,
)
from pipecat.services.cartesia.tts import CartesiaTTSService
from pipecat.services.deepgram.stt import DeepgramSTTService
from pipecat.services.openai.llm import OpenAILLMService
from pipecat.transports.daily.transport import DailyParams, DailyTransport

logger.remove(0)
logger.add(sys.stderr, level="DEBUG")


async def main(room_url: str, token: str):
    transport = DailyTransport(
        room_url,
        token,
        "Friendly bot",
        DailyParams(
            audio_in_enabled=True,
            audio_out_enabled=True,
        ),
    )

    stt = DeepgramSTTService(api_key=os.getenv("DEEPGRAM_API_KEY"))

    llm = OpenAILLMService(
        api_key=os.environ.get("OPENAI_API_KEY"),
        model="gpt-4o",
        settings=OpenAILLMService.Settings(
            system_instruction=(
                "You are a helpful AI assistant in a voice conversation. "
                "Respond naturally and keep your answers conversational."
            ),
        ),
    )

    tts = CartesiaTTSService(
        api_key=os.getenv("CARTESIA_API_KEY"),
        voice_id="79a125e8-cd45-4c13-8a67-188112f4dd22",  # British Lady
    )

    context = LLMContext()
    user_aggregator, assistant_aggregator = LLMContextAggregatorPair(
        context,
        user_params=LLMUserAggregatorParams(
            vad_analyzer=SileroVADAnalyzer(),
        ),
    )

    pipeline = Pipeline(
        [
            transport.input(),
            stt,
            user_aggregator,
            llm,
            tts,
            transport.output(),
            assistant_aggregator,
        ]
    )

    task = PipelineTask(
        pipeline,
        params=PipelineParams(
            enable_metrics=True,
            enable_usage_metrics=True,
        ),
    )

    @transport.event_handler("on_first_participant_joined")
    async def on_first_participant_joined(transport, participant):
        context.add_message({"role": "user", "content": "Introduce yourself."})
        await task.queue_frames([LLMRunFrame()])

    @transport.event_handler("on_participant_left")
    async def on_participant_left(transport, participant, reason):
        await task.queue_frame(EndFrame())

    @transport.event_handler("on_call_state_updated")
    async def on_call_state_updated(transport, state):
        if state == "left":
            await task.queue_frame(EndFrame())

    runner = PipelineRunner()
    await runner.run(task)

First, the main function initializes the Daily transport to send and receive audio for the Daily room we’ll connect to. We pass in the room_url we want to join and a token that authenticates the bot.

Next, we wire up the AI services: Deepgram for speech-to-text, OpenAI for the LLM, and Cartesia for text-to-speech. The user aggregator is configured with a Silero VAD analyzer to detect when the user has finished speaking, which drives turn-taking.

The pipeline chains these together: audio comes in from the transport, gets transcribed by Deepgram, accumulates into the LLM context via the user aggregator, gets a response from OpenAI, is spoken by Cartesia, and goes back out through the transport. The assistant aggregator captures the spoken response back into the context so the LLM has memory of the conversation.

Finally, three event handlers manage the session lifecycle: when the first participant joins, the bot kicks off the conversation by introducing itself; when the participant leaves or the call ends, the bot terminates cleanly via an EndFrame.
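To make the role of the two aggregators concrete, here is a toy model of how the shared context grows across turns. It is a simplified illustration, not Pipecat's implementation: `ToyContext`, `run_turn`, and `fake_llm` are hypothetical names, and the real aggregators work on streamed frames rather than whole strings.

```python
# Toy model of the shared context; this is NOT Pipecat's implementation.
class ToyContext:
    def __init__(self):
        self.messages = []

    def add_message(self, message):
        self.messages.append(message)


def run_turn(context, user_text, llm):
    """One turn: record the transcript, call the LLM, record the reply."""
    # Role of the user aggregator: append the transcribed user speech.
    context.add_message({"role": "user", "content": user_text})
    # The LLM is called with the full history, not just the latest message.
    reply = llm(context.messages)
    # Role of the assistant aggregator: append what the bot said.
    context.add_message({"role": "assistant", "content": reply})
    return reply


def fake_llm(messages):
    # Hypothetical stand-in for the LLM: reports how much history it received.
    return f"I have seen {len(messages)} message(s)."


context = ToyContext()
first = run_turn(context, "Hello", fake_llm)
second = run_turn(context, "What did I just say?", fake_llm)
```

On the second turn the stand-in LLM receives three messages (user, assistant, user), which is the "memory" the assistant aggregator provides. Without that step, each turn would see only user messages.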

Deploy bot

Deploy your application to Cerebrium:

cerebrium deploy

After deployment, an endpoint for your bot is created at POST <BASE_URL>/main, which you can call with your room_url and token. Let us test it.
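A call to the deployed endpoint could look roughly like the sketch below. It only builds the request, so you can inspect it before sending. Several details are assumptions rather than confirmed behavior: that the JSON body keys match the parameter names of main (room_url and token), that the endpoint expects a bearer token in the Authorization header, and the helper name `build_bot_request` itself. Use the exact URL and credentials shown in your Cerebrium dashboard.

```python
import json
import urllib.request


def build_bot_request(base_url, api_key, room_url, token):
    """Build (but do not send) the POST request that starts the bot."""
    payload = json.dumps({"room_url": room_url, "token": token}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/main",
        data=payload,
        method="POST",
        headers={
            "Content-Type": "application/json",
            # Assumption: the endpoint authenticates with a bearer token.
            "Authorization": f"Bearer {api_key}",
        },
    )


# Usage (placeholders, not real values):
# request = build_bot_request("<BASE_URL>", "<YOUR_API_KEY>", room_url, token)
# with urllib.request.urlopen(request) as response:
#     print(response.status, response.read().decode())
```

Separating request construction from sending keeps the network call out of the way while you confirm the URL and payload shape.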

Test it out

import asyncio
import os
import time

import requests
from loguru import logger


def create_room():
    url = "https://api.daily.co/v1/rooms/"
    headers = {
        "Content-Type": "application/json",
        "Authorization": f"Bearer {os.environ.get('DAILY_API_KEY')}",
    }
    data = {
        "properties": {
            "exp": int(time.time()) + 60 * 5,  # 5 minutes
            "eject_at_room_exp": True,
        }
    }

    response = requests.post(url, headers=headers, json=data)
    if response.status_code == 200:
        room_info = response.json()
        token = create_token(room_info["name"])
        if token and "token" in token:
            room_info["token"] = token["token"]
        else:
            logger.error("Failed to create token")
            return {
                "message": "There was an error creating your room",
                "status_code": 500,
            }
        return room_info
    else:
        logger.error(f"Failed to create room: {response.status_code}")
        return {"message": "There was an error creating your room", "status_code": 500}


def create_token(room_name: str):
    url = "https://api.daily.co/v1/meeting-tokens"
    headers = {
        "Content-Type": "application/json",
        "Authorization": f"Bearer {os.environ.get('DAILY_API_KEY')}",
    }
    data = {"properties": {"room_name": room_name, "is_owner": True}}

    response = requests.post(url, headers=headers, json=data)
    if response.status_code == 200:
        return response.json()
    else:
        logger.error(f"Failed to create token: {response.status_code}")
        return None


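if __name__ == "__main__":
    # Assumes this script sits next to main.py. Drop this import if you
    # paste the script into main.py itself.
    from main import main

    room_info = create_room()
    # create_room() returns an error dict without a "url" key on failure.
    if "url" not in room_info:
        raise SystemExit(room_info.get("message", "Failed to create room"))
    room_url = room_info["url"]
    print(f"Join room: {room_url}")
    asyncio.run(main(room_url, room_info["token"]))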

Future Considerations

Since Cerebrium supports both CPU and GPU workloads, you can lower your application's latency by downloading model weights and running the models locally instead of calling hosted providers. You can do this for:

LLM: Run any open-source model using a framework such as vLLM
TTS: Deepgram offers TTS models that can be run locally
STT: Deepgram offers an STT model that can be run locally

If you run all three models locally, latency should drop substantially. We have been able to get ~300ms voice-to-voice responses.
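For the LLM part, vLLM can serve models behind an OpenAI-compatible HTTP API, so a locally hosted model can be addressed much like the hosted one. The sketch below builds a streaming chat-completions request against such a server. The base URL assumes a vLLM server on its default port on the same machine, the model name is a placeholder, and `build_chat_request` is a hypothetical helper.

```python
import json
import urllib.request

# Assumptions: a vLLM server is already running locally with its
# OpenAI-compatible API, and the model name below is a placeholder.
LOCAL_BASE_URL = "http://localhost:8000/v1"
MODEL_NAME = "<your-model-name>"


def build_chat_request(messages, base_url=LOCAL_BASE_URL, model=MODEL_NAME):
    """Build (but do not send) a streaming chat-completions request."""
    body = {"model": model, "messages": messages, "stream": True}
    return urllib.request.Request(
        url=f"{base_url}/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        method="POST",
        headers={"Content-Type": "application/json"},
    )


request = build_chat_request([{"role": "user", "content": "Introduce yourself."}])
```

Inside the pipeline itself you would more likely point the existing LLM service at the local server than issue requests by hand. Check the Pipecat documentation for whether your version of OpenAILLMService accepts a custom base URL.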

Examples

Fastest voice agent: Local only implementation
RAG voice agent: Create a voice agent that can do RAG using Cerebrium + OpenAI + Pinecone
Twilio voice agent: Create a voice agent that can receive phone calls via Twilio
OpenAI Realtime API implementation: Create a voice agent built on the OpenAI Realtime API
