TrueFoundry AI Gateway supports multimodal inputs, letting you send images, audio, and video to models alongside text.
Images
You can send images as part of your chat completion requests, either as an image URL or as a base64 encoded image.
Using Image URLs
Send an image URL to the model:
from openai import OpenAI
client = OpenAI(
api_key="your_truefoundry_api_key",
base_url="<truefoundry-base-url>/api/llm/api/inference/openai"
)
response = client.chat.completions.create(
model="openai-main/gpt-4o",
messages=[
{
"role": "user",
"content": [
{"type": "text", "text": "What's in this image?"},
{
"type": "image_url",
"image_url": {
"url": "https://images.rawpixel.com/image_800/cHJpdmF0ZS9sci9pbWFnZXMvd2Vic2l0ZS8yMDIyLTA1L25zODIzMC1pbWFnZS5qcGc.jpg"
}
},
],
}
],
)
print(response.choices[0].message.content)
Using Base64 Encoded Images
Send a base64 encoded image to the model:
import base64
from openai import OpenAI
def encode_image(image_path):
    with open(image_path, "rb") as image_file:
        return base64.b64encode(image_file.read()).decode('utf-8')
client = OpenAI(
api_key="your_truefoundry_api_key",
base_url="<truefoundry-base-url>/api/llm/api/inference/openai"
)
response = client.chat.completions.create(
model="openai-main/gpt-4o",
messages=[
{
"role": "user",
"content": [
{"type": "text", "text": "What's in this image?"},
{
"type": "image_url",
"image_url": {
"url": f"data:image/jpeg;base64,{encode_image('dogs.jpeg')}"
}
}
]
}
]
)
print(response.choices[0].message.content)
Audio
You can send audio files in supported formats (MP3, WAV, etc.) as part of your chat completion requests. Audio inputs in chat completions are currently supported only for Google Gemini models; if a model does not support audio inputs, the request will fail.
Using Audio URLs
response = client.chat.completions.create(
model="internal-google/gemini-2-0-flash",
messages=[
{
"role": "user",
"content": [
{"type": "text", "text": "Transcribe this audio"},
{
"type": "image_url",
"image_url": {
"url": "https://raw.githubusercontent.com/prof3ssorSt3v3/media-sample-files/refs/heads/master/hal-9000.wav",
"mime_type": "audio/mp3" # this field is only required for gemini models
}
}
]
}
]
)
print(response.choices[0].message.content)
Using Base64 Encoded Audio
import base64
def encode_audio(audio_path):
    with open(audio_path, "rb") as audio_file:
        return base64.b64encode(audio_file.read()).decode('utf-8')
response = client.chat.completions.create(
model="internal-google/gemini-2-0-flash",
messages=[
{
"role": "user",
"content": [
{"type": "text", "text": "Transcribe this audio"},
{
"type": "image_url",
"image_url": {
"url": f"data:audio/wav;base64,{encode_audio('/path-to-audio-file.wav')}"
}
}
]
}
]
)
print(response.choices[0].message.content)
Video
Video processing is natively supported for Google Gemini models. For other models, you can extract frames from the video and send them as images (see the frame-sampling sketch at the end of this section).
Using Video URLs
Send a video URL to the model:
response = client.chat.completions.create(
model="internal-google/gemini-2-0-flash",
messages=[
{
"role": "user",
"content": [
{"type": "text", "text": "Describe what's happening in this video"},
{
"type": "image_url",
"image_url": {
"url": "https://www.youtube.com/watch?v=fxqE27gIZcc",
"mime_type": "video/mp4" # this field is only required for gemini models
}
}
]
}
]
)
print(response.choices[0].message.content)
Using Base64 Encoded Video
Send a base64 encoded video to the model (make sure the video size is within the provider's limits):
import base64
def encode_video(video_path):
    with open(video_path, "rb") as video_file:
        return base64.b64encode(video_file.read()).decode('utf-8')
response = client.chat.completions.create(
model="internal-google/gemini-2-0-flash",
messages=[
{
"role": "user",
"content": [
{"type": "text", "text": "Describe what's happening in this video"},
{
"type": "image_url",
"image_url": {
"url": f"data:image/jpeg;base64,{encode_video('path/to/video.mp4')}",
"mime_type": "video/mp4" # this field is only required for gemini models
}
}
]
}
]
)
print(response.choices[0].message.content)
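For models without native video support, you can sample frames from the video and send them as a series of images. The snippet below is a minimal sketch of this approach, not a gateway-specific API; it assumes opencv-python is installed, reuses the client and the GPT-4o model from the image examples above, and uses an illustrative local file path:
import base64
import cv2  # pip install opencv-python

def sample_frames(video_path, every_n_frames=30):
    """Return base64-encoded JPEG frames sampled from the video."""
    frames = []
    capture = cv2.VideoCapture(video_path)
    index = 0
    while True:
        success, frame = capture.read()
        if not success:
            break
        if index % every_n_frames == 0:
            ok, buffer = cv2.imencode(".jpg", frame)
            if ok:
                frames.append(base64.b64encode(buffer.tobytes()).decode("utf-8"))
        index += 1
    capture.release()
    return frames

content = [{"type": "text", "text": "Describe what's happening in this video"}]
for frame in sample_frames("path/to/video.mp4")[:10]:  # cap frame count to stay within token limits
    content.append({
        "type": "image_url",
        "image_url": {"url": f"data:image/jpeg;base64,{frame}"}
    })
response = client.chat.completions.create(
    model="openai-main/gpt-4o",
    messages=[{"role": "user", "content": content}],
)
print(response.choices[0].message.content)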
Supported Models
- Images: Most vision-capable models including GPT-4o, GPT-4 Vision, Claude 3, Gemini Pro Vision
- Audio: Google Gemini models (Gemini 2.0 Flash, etc.)
- Video: Google Gemini models (Gemini 2.0 Flash, etc.)
Make sure to check model capabilities before sending multimodal inputs to avoid errors.
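If you are unsure whether a model supports a given modality, you can also handle the rejection at request time. The following is a minimal sketch, assuming the openai Python SDK v1.x error classes; the helper name is illustrative:
import openai
from openai import OpenAI

client = OpenAI(
    api_key="your_truefoundry_api_key",
    base_url="<truefoundry-base-url>/api/llm/api/inference/openai"
)

def send_multimodal(model, messages):
    """Send a chat completion and surface provider-side rejections instead of crashing."""
    try:
        response = client.chat.completions.create(model=model, messages=messages)
        return response.choices[0].message.content
    except openai.BadRequestError as error:
        # Raised when the provider rejects the payload,
        # e.g. audio sent to a model without audio support
        return f"Request rejected: {error}"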