gemini_v2v_python

An extension that integrates Gemini's next-generation multimodal AI into your application, providing configurable AI-driven features such as conversational agents, task automation, and tool integration.

Features

Gemini Multimodal Integration: Use Gemini multimodal models for voice-to-voice and text processing.
Configurable: Easily customize API keys, model settings, prompts, temperature, etc.
Async Queue Processing: Supports real-time message processing with task cancellation and prioritization.
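
The async queue behavior described above can be illustrated with a minimal sketch built on the standard library's `asyncio.PriorityQueue`. This is not the extension's actual implementation: the `PriorityMessageQueue` class, the `worker` and `demo` functions, and the priority numbers are all hypothetical names chosen for illustration.

```python
import asyncio
import itertools


class PriorityMessageQueue:
    """Hypothetical wrapper: a lower priority number is served first."""

    def __init__(self) -> None:
        self._queue: asyncio.PriorityQueue = asyncio.PriorityQueue()
        # Monotonic counter keeps FIFO order among equal priorities.
        self._counter = itertools.count()

    async def put(self, message: str, priority: int = 1) -> None:
        await self._queue.put((priority, next(self._counter), message))

    async def get(self) -> str:
        _, _, message = await self._queue.get()
        return message

    def flush(self) -> int:
        """Drop every pending message and return how many were dropped."""
        dropped = 0
        while not self._queue.empty():
            self._queue.get_nowait()
            dropped += 1
        return dropped


async def worker(queue: PriorityMessageQueue, handled: list) -> None:
    # Runs until cancelled; each message is "processed" by recording it.
    while True:
        handled.append(await queue.get())


async def demo() -> list:
    queue = PriorityMessageQueue()
    handled: list = []
    await queue.put("low: background log", priority=5)
    await queue.put("high: user interrupt", priority=0)
    await queue.put("normal: chat text", priority=1)

    task = asyncio.create_task(worker(queue, handled))
    await asyncio.sleep(0.01)  # give the worker time to drain the queue
    task.cancel()              # task cancellation stops the worker
    try:
        await task
    except asyncio.CancelledError:
        pass
    return handled


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

Messages come out in priority order regardless of insertion order, and `flush()` shows one way pending work could be discarded when the user interrupts.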

API

Refer to the API definition in manifest.json and the default values in property.json.

Property	Type	Description
`api_key`	`string`	API key for authenticating with Gemini
`temperature`	`float32`	Sampling temperature, higher values mean more randomness
`model`	`string`	Gemini model identifier (e.g., `gemini-2.0-flash-exp`)
`max_tokens`	`int32`	Maximum number of tokens to generate
`system_message`	`string`	Default system message to send to the model
`voice`	`string`	Voice that Gemini model uses, such as `Puck`, `Charon`, `Kore`, etc.
`server_vad`	`bool`	Flag to enable or disable server-side voice activity detection (VAD) for Gemini
`language`	`string`	Language that Gemini model responds in, such as `en-US`, `zh-CN`, etc.
`dump`	`bool`	Flag to enable or disable audio dump for debugging purposes
`base_uri`	`string`	Base URI for connecting to the Gemini service
`audio_out`	`bool`	Flag to enable or disable audio output
`input_transcript`	`bool`	Flag to enable or disable input transcript processing
`sample_rate`	`int32`	Sample rate for audio processing
`stream_id`	`int32`	Stream ID for identifying audio streams
`greeting`	`string`	Greeting message for initial interaction
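
As a rough illustration of how the properties above might be overlaid on defaults, here is a small Python sketch. The `GeminiConfig` dataclass and `load_config` helper are hypothetical and not part of the extension; the field names and types follow the table, but every default value is a placeholder rather than the real content of property.json.

```python
import json
from dataclasses import dataclass, fields


@dataclass
class GeminiConfig:
    """Field names follow the property table; all defaults are placeholders."""
    api_key: str = ""
    temperature: float = 0.5
    model: str = ""
    max_tokens: int = 1024
    system_message: str = ""
    voice: str = ""
    server_vad: bool = True
    language: str = "en-US"
    dump: bool = False
    base_uri: str = ""
    audio_out: bool = True
    input_transcript: bool = True
    sample_rate: int = 24000
    stream_id: int = 0
    greeting: str = ""


def load_config(raw_json: str) -> GeminiConfig:
    """Overlay JSON properties on the defaults, rejecting unknown keys."""
    overrides = json.loads(raw_json)
    known = {f.name for f in fields(GeminiConfig)}
    unknown = sorted(set(overrides) - known)
    if unknown:
        raise ValueError(f"unknown properties: {unknown}")
    return GeminiConfig(**overrides)


# Only the overridden keys change; everything else keeps its placeholder default.
config = load_config(
    '{"api_key": "YOUR_API_KEY", "temperature": 0.7, "language": "zh-CN"}'
)
print(config.temperature, config.language, config.sample_rate)
```

Rejecting unknown keys up front catches typos such as `sampl_rate` before they silently fall back to a default.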

Data Out

Name	Property	Type	Description
`text_data`	`text`	`string`	Outgoing text data
`append`	`text`	`string`	Additional text appended to the output

Command Out

Name	Description
`flush`	Response after flushing the current state
`tool_call`	Invokes a tool with specific arguments

Audio Frame In

Name	Description
`pcm_frame`	Audio frame input for voice processing
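
To make the relationship between `sample_rate` and PCM frame size concrete, the sketch below splits a raw PCM buffer into fixed-duration frames. It assumes 16-bit mono audio and 10 ms frames, neither of which is stated in this README, and `split_pcm` is a hypothetical helper rather than an extension API.

```python
from typing import List


def split_pcm(
    pcm: bytes,
    sample_rate: int,
    frame_ms: int = 10,         # assumed frame duration
    bytes_per_sample: int = 2,  # assumed 16-bit samples
    channels: int = 1,          # assumed mono
) -> List[bytes]:
    """Split raw PCM bytes into frames of frame_ms milliseconds each."""
    samples_per_frame = sample_rate * frame_ms // 1000
    frame_bytes = samples_per_frame * bytes_per_sample * channels
    if frame_bytes <= 0:
        raise ValueError("sample_rate and frame_ms must yield a positive frame size")
    # The last frame may be shorter if the buffer is not a whole multiple.
    return [pcm[i:i + frame_bytes] for i in range(0, len(pcm), frame_bytes)]


# One second of 16 kHz, 16-bit mono silence is 32000 bytes.
one_second = bytes(16000 * 2)
frames = split_pcm(one_second, sample_rate=16000)
print(len(frames), len(frames[0]))
```

Under these assumptions each 10 ms frame at 16 kHz holds 160 samples, or 320 bytes.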

Video Frame In

Name	Description
`video_frame`	Video frame input for processing

Audio Frame Out

Name	Description
`pcm_frame`	Audio frame output after voice processing