This AI assistant combines real-time speech recognition (Whisper ASR) with live visual capture (OpenCV for the camera, mss for the screen) to interpret and respond to the user's context with high accuracy. It uses LangChain together with GPT-4o to generate context-aware responses, and a Text-to-Speech (TTS) system delivers natural voice output. A multimodal fusion framework synchronizes the audio and visual data streams, giving the model a richer picture of user intent. The assistant also keeps conversational memory, so multi-turn interactions stay coherent and feel natural. The architecture is designed for real-time operation and can be extended to applications such as workflow automation, context-driven task execution, and adaptive user assistance.
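The fusion and memory ideas above can be illustrated with a minimal sketch. It is not the project's actual code: the names `Utterance`, `Frame`, `nearest_frame`, `ConversationMemory`, and `build_prompt` are hypothetical, and the sketch assumes that fusion means pairing each transcribed utterance with the captured frame closest to it in time, and that memory is a bounded window of recent turns. Only the standard library is used, so the Whisper, OpenCV, mss, and LangChain calls are deliberately left out.

```python
from bisect import bisect_left
from collections import deque
from dataclasses import dataclass


@dataclass
class Utterance:
    t: float    # capture time in seconds
    text: str   # transcript, e.g. produced by the speech recognizer


@dataclass
class Frame:
    t: float    # capture time in seconds
    label: str  # stand-in for image data or a description of it


def nearest_frame(frames, t, max_gap=1.0):
    """Return the frame closest in time to t, or None if none is within max_gap.

    Assumes frames is sorted by timestamp (hypothetical helper).
    """
    if not frames:
        return None
    times = [f.t for f in frames]
    i = bisect_left(times, t)
    # Only the neighbours on either side of the insertion point can be closest.
    candidates = frames[max(i - 1, 0): i + 1]
    best = min(candidates, key=lambda f: abs(f.t - t))
    return best if abs(best.t - t) <= max_gap else None


class ConversationMemory:
    """Keeps the last max_turns (user, assistant) pairs; older turns are dropped."""

    def __init__(self, max_turns=5):
        self.turns = deque(maxlen=max_turns)

    def add(self, user, assistant):
        self.turns.append((user, assistant))

    def as_prompt(self):
        return "\n".join(f"User: {u}\nAssistant: {a}" for u, a in self.turns)


def build_prompt(memory, utterance, frame):
    """Combine history, visual context, and the new utterance into one prompt string."""
    visual = frame.label if frame else "no recent frame"
    history = memory.as_prompt()
    parts = [history] if history else []
    parts.append(f"[Screen/camera: {visual}]")
    parts.append(f"User: {utterance.text}")
    return "\n".join(parts)
```

In a loop like the one described, each new transcript would be matched with `nearest_frame`, turned into a prompt with `build_prompt`, sent to the language model, and the resulting exchange stored with `memory.add(...)`. The `max_gap` and `max_turns` defaults are arbitrary illustration values.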
To run: create and activate a virtual environment, install the project's dependencies, then run python assistant.py