Founding AI Engineer — Multimodal Emotional AI

Upwork

Remote

4 hours ago

About

I’m building a system that can sense emotional signals in a live conversation (from audio, video, and speech) and return a synchronized emotional stream for a weekly podcast. I need one engineer who can build a real-time multimodal pipeline from scratch. The role is hands-on: prototype fast, ship weekly improvements, and make it work end-to-end. This is inference only, not model training.

The System (High-Level)

The pipeline will:

- Capture 2 video feeds from cameras: extract facial/body emotional signals and timestamp frames
- Capture audio input from a dual-mic receiver: run an emotion model, track tone/tension/stress cues, and timestamp the stream
- Run Whisper (or similar) in real time: speech-to-text with confidence scores and timestamped text segments
- Synchronize all streams: align video/audio/text and output structured JSON
- Send the JSON to a conversational AI (ChatGPT or another LLM) via a local API or device connection, in real time or near real time
- Display an emotional timeline: a simple UI (web or local) with a clean visual for podcast use

The goal is not a polished product; it’s a working emotional layer for a conversation.

What You’ll Build

- Real-time capture layer
- Video emotion inference
- Audio emotion inference
- Whisper transcription loop
- Timestamp + sync logic
- JSON output schema
- Lightweight visualization

You don’t need to design the UI; just make something clear and usable.
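To make the sync step concrete, here is a minimal sketch of how the timestamp/alignment logic and JSON output could look: each transcript segment is paired with the nearest video and audio emotion readings by timestamp. The field names (`start`, `end`, `t`, `emotion`, `confidence`) are illustrative assumptions, not the final schema, and the toy data stands in for real model output.

```python
import json
from bisect import bisect_left

def nearest(events, t):
    """Return the event whose timestamp "t" is closest to time t.

    `events` is a list of dicts sorted by their "t" key (seconds).
    """
    times = [e["t"] for e in events]
    i = bisect_left(times, t)
    # Only the neighbors around the insertion point can be closest.
    candidates = events[max(i - 1, 0):i + 1]
    return min(candidates, key=lambda e: abs(e["t"] - t))

def sync_segments(text_segments, video_events, audio_events):
    """Attach the nearest video/audio emotion reading to each transcript segment."""
    out = []
    for seg in text_segments:
        mid = (seg["start"] + seg["end"]) / 2  # align on the segment midpoint
        out.append({
            "start": seg["start"],
            "end": seg["end"],
            "text": seg["text"],
            "asr_confidence": seg["confidence"],
            "video_emotion": nearest(video_events, mid)["emotion"],
            "audio_emotion": nearest(audio_events, mid)["emotion"],
        })
    return out

# Toy data (timestamps in seconds from stream start).
segments = [{"start": 0.0, "end": 2.0, "text": "hello there", "confidence": 0.92}]
video = [{"t": 0.5, "emotion": "neutral"}, {"t": 1.5, "emotion": "smiling"}]
audio = [{"t": 1.0, "emotion": "calm"}]

records = sync_segments(segments, video, audio)
print(json.dumps(records, indent=2))
```

In a live pipeline the same alignment would run on a sliding window of recent events rather than full lists, but the core idea (one merged JSON record per transcript segment) stays the same.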