Workshop
Multimodal AI: Working with Text, Images & Audio
Text was just the beginning. The AI frontier is multimodal.
About this workshop
Unlock the full power of GPT-4o, Gemini 1.5, and Claude 3's multimodal capabilities. Build pipelines that process images, transcribe audio, and combine modalities, with real-world demos in document analysis and media processing.
What you will learn
- Build image understanding pipelines using GPT-4o and Claude 3 vision APIs
- Transcribe, summarise, and extract structured data from audio using Whisper and Gemini
- Combine text, image, and audio inputs in a single unified processing pipeline
- Apply multimodal AI to real-world use cases in document analysis and content workflows
Who this is for
- Developers who want to go beyond text and build with vision and audio AI APIs
- Engineers building document processing, media, or healthcare AI applications
- Anyone who has used GPT-4o's image input and wants to build production pipelines around it
By the end
Before
Limited to text-only AI applications
After
Multimodal pipelines combining vision, audio, and text in a single workflow
Before
Manual processing of documents, images, and media at scale
After
Automated extraction and analysis across all modalities
Before
Missing the full power of GPT-4o and Claude 3
After
Production pipelines built on every capability frontier models offer
About Nina
Nina Kovač
Research Scientist, Multimodal AI Lab
Vetted by Maram
Nina is a research scientist specialising in multimodal systems, with published work at ICLR and ACL. She has built multimodal pipelines for medical imaging, legal document analysis, and media production, and contributes to the open-source multimodal evaluation framework MMTE.
View full profile →What learners say
Reviews appear here once 3 learners have completed this session.
