How to build a FREE realtime voice assistant with Gemini
Google's new Gemini 2.0 Flash (experimental) model features an API for building a realtime, bidirectional voice assistant. Since experimental models are free (and Google will not use your data for training if you use a paid account), this is an extremely slept-on opportunity to build a realtime voice assistant for free.
Gemini Multimodal Live
The realtime API is built on WebSockets and is documented here. It's designed to be used with the new Gemini 2.0 Flash model, a multimodal model that can handle text, audio, images and video.
Google haven't yet released a TypeScript library, but they have released a Next.js example app with screen sharing and bidirectional voice chat.
If you want to build a production realtime assistant for the web, you probably don't want to use the WebSocket API directly. To achieve low latency you'll need to host a WebRTC server - I recommend daily.co, who provide a platform for building voice assistants and partnered with Google for the Gemini Multimodal Live launch.
In this tutorial I'm going to show you how to build a realtime voice assistant with the WebSocket API. We're going to build a command-line agent that runs locally; using the WebSocket API directly is perfect for this use case since we don't need a WebRTC server in the middle.
Code Examples
I've published all the code examples for this tutorial in a GitHub repository. You need sox (specifically the play and record CLI tools). You can use degit to clone just the code examples:
npx degit autocode2/gemini-realtime/apps/gemini-realtime-getting-started
npm install
To run any of the examples, you'll need to set the GOOGLE_API_KEY environment variable to your Google API key.
export GOOGLE_API_KEY=your-api-key
npm run example-1
Available Messages
The client can send the following messages to the server:
- setup - configures the model and generation settings (temperature, voice persona etc).
- clientContent - sends messages from the user to the model. These are regular Gemini chat messages and can include system, user, assistant messages and multimodal content.
- realtimeInput - stream data to the model. This can be any mimetype supported by Gemini, including audio, images and video.
- toolResponse - respond to a tool call. Gemini supports tool calling, though the docs note that tool-calling performance is degraded by also having to handle bidirectional audio.
The server can send the following messages to the client (rough TypeScript shapes for both directions are sketched after this list):
- setupComplete - sent in response to the setup message.
- serverContent - messages from the model to the user.
- toolCall - a tool call by the model.
- toolCallCancellation - a request to cancel an in-progress tool call.
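Here are those shapes as rough TypeScript types. These are simplified, hand-written sketches based only on the fields used in this tutorial, not the official definitions; check the API docs for the full message formats.

// Simplified sketches of the wire messages (not the official types).
type ClientMessage =
  | { setup: { model: string; generation_config?: Record<string, unknown>; tools?: unknown[] } }
  | { clientContent: { turns: { role: string; parts: { text?: string }[] }[]; turnComplete?: boolean } }
  | { realtimeInput: { mediaChunks: { mimeType: string; data: string }[] } }
  | { toolResponse: { functionResponses: { id: string; name: string; response: unknown }[] } };

type ServerMessage =
  | { setupComplete: unknown }
  | { serverContent: { modelTurn?: { parts: unknown[] }; turnComplete?: boolean; interrupted?: boolean; groundingMetadata?: unknown } }
  | { toolCall: { functionCalls: { id: string; name: string; args: Record<string, unknown> }[] } }
  | { toolCallCancellation: { ids: string[] } };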
Connection and Configuration
First, let's establish a WebSocket connection to Gemini. We'll need to construct the correct URL and send an initial configuration:
const HOST = `generativelanguage.googleapis.com`;
const MODEL = "models/gemini-2.0-flash-exp";
function websocketUrl({
apiKey,
host = HOST,
}: {
apiKey: string;
host?: string;
}) {
const path = "google.ai.generativelanguage.v1alpha.GenerativeService.BidiGenerateContent";
return `wss://${host}/ws/${path}?key=${apiKey}`;
}
// Using the ws package for the WebSocket client
import WebSocket from "ws";

const ws = new WebSocket(websocketUrl({ apiKey }));

// Send the initial configuration once the socket is open
ws.on("open", () => {
  ws.send(JSON.stringify({
    setup: {
      model: MODEL,
      generation_config: {
        temperature: 0.9,
        candidate_count: 1,
      },
    },
  }));
});
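To confirm the setup was accepted, attach a message handler and log what comes back. A minimal sketch (the example in the repo may structure its logging differently):

ws.on("message", (raw) => {
  const message = JSON.parse(raw.toString());
  if ("setupComplete" in message) {
    // The server acknowledges the setup message before anything else
    console.log("Setup complete");
  } else {
    console.log("Received:", JSON.stringify(message));
  }
});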
You can now run Example 1 and you should see the setup complete message in the console.
Sending Messages
Once connected, we can send messages to Gemini. Messages are structured with roles and parts. We can use the turnComplete flag to signal to the model that the turn is complete and it should generate a response.
ws.send(JSON.stringify({
clientContent: {
turns: [{
role: "user",
parts: [{ text: "Hello! How are you?" }]
}],
turnComplete: true
}
}));
You can now run Example 2 and you should see the model respond to your message (the response arrives as audio).
Receiving Messages
Responses from Gemini are sent as serverContent messages with the following properties:
- modelTurn - the model's response
- turnComplete - if the model has finished generating the response
- interrupted - indicates that a client message has interrupted the current model generation. If the client is playing the content out in real time, this is a good signal to stop and empty the current playback queue.
- groundingMetadata - used alongside Google's search grounding tool.
To play the audio response from Gemini we'll use sox (brew install sox, or your platform's equivalent). The Multimodal Live API supports the following audio formats:
- Input audio format: raw 16-bit PCM audio at 16kHz, little-endian
- Output audio format: raw 16-bit PCM audio at 24kHz, little-endian
import { spawn, ChildProcess } from "child_process";

class AudioPlayer {
  private sox: ChildProcess;

  constructor() {
    // Pipe raw 16-bit, 24kHz, mono PCM from stdin to the speakers
    this.sox = spawn("play", [
      "-t", "raw",
      "-r", "24k",
      "-e", "signed-integer",
      "-b", "16",
      "-c", "1",
      "-",
    ]);
  }

  play(data: Buffer) {
    this.sox.stdin?.write(data);
  }
}
Incoming audio is sent as base64-encoded audio data in an inlineData part with a MIME type of audio/pcm;rate=24000. To play the audio:
if ("serverContent" in message) {
const parts = message.serverContent?.modelTurn?.parts || [];
for (const part of parts) {
if ("inlineData" in part && part.inlineData.mimeType?.startsWith("audio")) {
const audioData = Buffer.from(part.inlineData.data, "base64");
audioPlayer.play(audioData);
}
}
}
You can now run Example 3 and you should be able to hear the model's response.
Streaming Audio
To stream audio from the microphone we will again use a simple sox wrapper.
class AudioRecorder {
private sox: ChildProcess;
constructor() {
this.sox = spawn("rec", [
"-t",
"raw",
"-r",
"16k",
"-e",
"signed-integer",
"-b",
"16",
"-c",
"1",
"-",
]);
}
stop() {
this.sox.kill();
}
get stdout() {
return this.sox.stdout;
}
}
Then we can simply connect the stdout of the sox process to send realtimeInput messages to Gemini:
audioRecorder.stdout?.on("data", (data: Buffer) => {
ws.send(JSON.stringify({
realtimeInput: {
mediaChunks: [{
mimeType: "audio/pcm;rate=16000",
data: data.toString("base64")
}]
}
}));
});
You can now run Example 4 and voila! You have a realtime voice assistant.
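Now that audio is flowing in both directions, the interrupted flag from serverContent becomes relevant: when you talk over the model, you'll want to discard any audio still queued for playback. Here's a rough sketch of one way to do it; the restart() method is a hypothetical extension of the AudioPlayer above (not part of the example repo) that kills the play process and spawns a fresh one:

import { spawn, ChildProcess } from "child_process";

// Hypothetical variant of AudioPlayer: restart() drops any buffered audio by
// killing the current `play` process and starting a new one.
class InterruptibleAudioPlayer {
  private sox: ChildProcess;

  constructor() {
    this.sox = this.spawnPlay();
  }

  private spawnPlay() {
    return spawn("play", [
      "-t", "raw",
      "-r", "24k",
      "-e", "signed-integer",
      "-b", "16",
      "-c", "1",
      "-",
    ]);
  }

  play(data: Buffer) {
    this.sox.stdin?.write(data);
  }

  restart() {
    this.sox.kill();
    this.sox = this.spawnPlay();
  }
}

// In the message handler:
if ("serverContent" in message && message.serverContent.interrupted) {
  audioPlayer.restart();
}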
Screen Sharing
The realtimeInput message can also be used to share any other media type supported by Gemini. For example, we can share screen content with Gemini in real-time.
To get the screen content we'll use a simple wrapper around the node-screenshots package to take an image of the main monitor every second, then stream it to Gemini as image/jpeg (a sketch of such a wrapper follows the snippet below):
screenshotter.screenshotInterval(1000, (image) => {
ws.send(JSON.stringify({
realtimeInput: {
mediaChunks: [{
mimeType: "image/jpeg",
data: image.toJpegSync().toString("base64")
}]
}
}));
});
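For reference, here's a minimal sketch of what such a wrapper could look like. It assumes node-screenshots exposes Monitor.all(), captureImageSync() and an Image class; the helper in the example repo may look different.

import { Monitor, Image } from "node-screenshots";

// Hypothetical wrapper: capture one monitor on a fixed interval and hand each
// frame to a callback.
class Screenshotter {
  private timer?: NodeJS.Timeout;

  screenshotInterval(ms: number, callback: (image: Image) => void) {
    const monitor = Monitor.all()[0]; // assumed to be the main monitor
    this.timer = setInterval(() => callback(monitor.captureImageSync()), ms);
  }

  stop() {
    if (this.timer) clearInterval(this.timer);
  }
}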
Now run Example 5 and you can ask the assistant questions about what's on your screen.
Tool Calling
Gemini also supports tool calling. The support for this is quite similar to tool calling with any other LLM. However, since the API is realtime, the tool calling allows for asynchronous responses and tool call cancellation. For example, the model might decide to call a tool, and will continue to generate audio and accept further input while waiting for the tool response. The LLM may then decide it no longer needs the tool response and can cancel the tool call.
To use tool calling, we need to declare the tool definitions in the setup message.
const config = {
model: MODEL,
tools: [{
functionDeclarations: [{
name: "lookup_weather",
description: "Get the current weather for a location",
parameters: {
type: "object",
properties: {
location: {
type: "string",
description: "The location to get weather for"
}
},
required: ["location"]
}
}]
}]
};
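This config then goes into the setup message in place of the bare model and generation settings we used earlier (a sketch; the example repo may merge its config slightly differently):

// Include the tool declarations in the initial setup message
ws.send(JSON.stringify({ setup: config }));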
Then we have to update our message handler to handle toolCall messages and send toolResponse messages in response:
if ("toolCall" in message) {
const functionCalls = message.toolCall.functionCalls || [];
for (const call of functionCalls) {
if (call.name === "lookup_weather") {
const weather = getWeather(call.args.location);
ws.send(JSON.stringify({
toolResponse: {
functionResponses: [{
id: call.id,
name: "lookup_weather",
response: weather,
}],
},
}));
}
}
}
Now run Example 6 and you can ask the assistant about the weather.
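If your tools do real asynchronous work, it's also worth handling toolCallCancellation so you can abandon work the model no longer wants. A minimal sketch, assuming the cancellation message carries the ids of the calls to drop and using some hypothetical bookkeeping of in-flight calls:

// Hypothetical bookkeeping: map tool call ids to AbortControllers when we
// start working on them, so a cancellation can stop the work.
const inFlightCalls = new Map<string, AbortController>();

if ("toolCallCancellation" in message) {
  for (const id of message.toolCallCancellation.ids ?? []) {
    inFlightCalls.get(id)?.abort(); // stop any pending work for this call
    inFlightCalls.delete(id);
  }
}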
@autocode2/gemini-realtime
I've extracted the boilerplate code for connecting to Gemini and sending and receiving messages into a simple npm package called @autocode2/gemini-realtime. You can install it with npm install @autocode2/gemini-realtime and use it like this:
import WebSocket from "ws";
import {
AudioPlayer,
AudioRecorder,
Screenshotter,
} from "@autocode2/media-utils";
import { GeminiRealtime, websocketUrl } from "@autocode2/gemini-realtime";
const MODEL = "models/gemini-2.0-flash-exp";
const apiKey = process.env.GOOGLE_API_KEY;
if (!apiKey) {
console.error("Please set GOOGLE_API_KEY environment variable");
process.exit(1);
}
class CLI {
private audioPlayer = new AudioPlayer();
private audioRecorder = new AudioRecorder();
private screenshotter = new Screenshotter();
private gemini: GeminiRealtime;
constructor({ apiKey }: { apiKey: string }) {
const ws = new WebSocket(websocketUrl({ apiKey }));
this.gemini = new GeminiRealtime(ws, {
model: MODEL,
});
this.gemini.on("setupComplete", () => this.onSetupComplete());
this.gemini.on("audioPart", (data) => this.onAudioMessage(data));
}
onSetupComplete() {
console.log("Setup complete");
this.gemini.sendMessages({
messages: [
{
role: "user",
parts: [
{
text: "Hello",
},
],
},
],
turnComplete: true,
});
this.audioRecorder.stdout?.on("data", (data) =>
this.gemini.streamAudio(data),
);
this.screenshotter.screenshotInterval(1000, async (image) => {
this.gemini.streamChunk("image/jpeg", image.toJpegSync());
});
}
onAudioMessage(data: Buffer) {
this.audioPlayer.play(data);
}
}
new CLI({ apiKey });
Conclusion
The API is really easy to use, and you can build a useful assistant with just a good prompt and a few tools. Next steps (possibly a future blog post) could include:
- adding additional context at the beginning of the conversation to improve the model's performance - in particular, a glossary of frequently used terms helps prevent transcription errors (a sketch of this follows the list).
- extending the conversation length by transcribing the audio and sending it as text at the beginning of each session.
- using the transcription with a supervisor agent to enhance the agent's abilities (Flash is a small model and we're already asking a lot of it).
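For the first of these, one option is to prime the conversation with a context turn before any audio flows, using the same clientContent message from earlier. A sketch with hypothetical glossary content; I haven't measured how much this helps:

// Prime the conversation with context and a glossary before any audio flows
// (hypothetical glossary content).
ws.send(JSON.stringify({
  clientContent: {
    turns: [{
      role: "user",
      parts: [{
        text: "Context: you are a voice assistant for the autocode2 project. " +
              "Terms you may hear: 'degit', 'sox', 'WebRTC', 'Gemini Multimodal Live'.",
      }],
    }],
    turnComplete: false, // add context without asking for a response yet
  },
}));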
Enjoy.