Workshop

Multimodal AI: Working with Text, Images & Audio

Text was just the beginning. The AI frontier is multimodal.

About this workshop

Put the multimodal capabilities of GPT-4o, Gemini 1.5, and Claude 3 to work. Build pipelines that process images, transcribe audio, and combine modalities, with real-world demos in document analysis and media processing.

What you will learn

  • Build image understanding pipelines using GPT-4o and Claude 3 vision APIs
  • Transcribe, summarise, and extract structured data from audio using Whisper and Gemini
  • Combine text, image, and audio inputs in a single unified processing pipeline
  • Apply multimodal AI to real-world use cases in document analysis and content workflows

Who this is for

  • Developers who want to go beyond text and build with vision and audio AI APIs
  • Engineers building document processing, media, or healthcare AI applications
  • Anyone who has used GPT-4o's image input and wants to build production pipelines around it

By the end

Before: Limited to text-only AI applications
After: Multimodal pipelines combining vision, audio, and text in a single workflow

Before: Manual processing of documents, images, and media at scale
After: Automated extraction and analysis across all modalities

Before: Missing the full power of GPT-4o and Claude 3
After: Production pipelines built on every capability frontier models offer

About Nina

Nina Kovač

Research Scientist, Multimodal AI Lab

Vetted by Maram

Nina is a research scientist specialising in multimodal systems, with published work at ICLR and ACL. She has built multimodal pipelines for medical imaging, legal document analysis, and media production, and contributes to the open-source multimodal evaluation framework MMTE.
