# AI-Multimodal-Automation

This AI assistant integrates real-time speech recognition using Whisper ASR with dynamic visual input captured via OpenCV and mss to interpret and respond to user context. It uses LangChain together with GPT-4o to generate intelligent, context-aware responses, while a Text-to-Speech (TTS) system delivers natural voice output. A multimodal fusion step synchronizes the audio and visual data streams, enabling a richer understanding of user intent, and conversational memory sustains coherent multi-turn interactions for fluid, natural human-computer communication. The architecture is designed for real-time operation, with the flexibility to scale into applications such as workflow automation, context-driven task execution, and adaptive user assistance.
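The memory-plus-fusion step described above can be sketched in plain Python. This is an illustrative sketch only, not the repository's actual code: the names `ConversationMemory`, `Turn`, and `fuse`, and the prompt layout, are assumptions; in the real assistant the transcript would come from Whisper and the frame caption from the OpenCV/mss capture path.

```python
from collections import deque
from dataclasses import dataclass

# Hypothetical sketch of conversation memory + multimodal fusion.
# All names here are illustrative, not taken from assistant.py.

@dataclass
class Turn:
    role: str      # "user" or "assistant"
    content: str


class ConversationMemory:
    """Keeps the last `max_turns` exchanges so replies stay coherent
    across a multi-turn conversation."""

    def __init__(self, max_turns: int = 10):
        self.turns: deque = deque(maxlen=max_turns)

    def add(self, role: str, content: str) -> None:
        self.turns.append(Turn(role, content))

    def as_prompt(self) -> str:
        return "\n".join(f"{t.role}: {t.content}" for t in self.turns)


def fuse(transcript: str, frame_caption: str, memory: ConversationMemory) -> str:
    """Merge the audio transcript with a description of the current
    screen/camera frame into a single prompt for the LLM."""
    return (
        f"{memory.as_prompt()}\n"
        f"[screen] {frame_caption}\n"
        f"user: {transcript}"
    )


memory = ConversationMemory(max_turns=4)
memory.add("user", "What app is open?")
memory.add("assistant", "A code editor.")
prompt = fuse("Summarize what I'm working on.",
              "VS Code showing assistant.py", memory)
print(prompt)
```

The bounded `deque` is one simple way to keep memory from growing without limit while still giving the model recent context alongside the fused visual caption.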

To run: create a virtual environment, install the dependencies, then run `python assistant.py`.
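Expanded, those steps might look like the following on a Unix-like shell. The README does not ship a dependency list, so the package names in the comment are guesses from the description, not a confirmed requirements file:

```shell
# Sketch of the setup steps; assumes Python 3 is on PATH.
# Install whatever assistant.py imports before launching it --
# likely candidates per the description: openai-whisper, opencv-python,
# mss, langchain (names are assumptions, check the script's imports).
python3 -m venv .venv        # create an isolated environment
. .venv/bin/activate         # activate it (Windows: .venv\Scripts\activate)
```

After activating the environment and installing the dependencies, start the assistant with `python assistant.py`.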
