AI-powered object detection system with real-time enhancement and accessibility features
- Lead System Developer: Minwoo Kim
- Azure AI Developer: Duhyeon Nam
- AI Code Engineer: Nakyung Cho
- Azure System Developer: Seungwoo Hong
- AI Code Engineer: Hyunjun Kim
- Azure AI Developer: Sumin Hyun
An AI system inspired by Cyberpunk 2077 that combines several advanced features:
- Zero-shot object detection with detailed captioning
- Real-time object enhancement and upscaling
- Speech-to-text (STT) functionality for accessibility
- Overlay display system styled after the Cyberpunk 2077 Scanner UI
- Extra information via the Bing Search API (default: disabled)
- On-screen LLM responses to speech (STT) prompts
- Clone the repository
git clone https://github.com/MinwooKim1990/Scouter_PJ.git
cd Scouter_PJ
- Install dependencies
pip install -r requirements.txt
- Prepare your data
mkdir data
Place your video files in the data folder
- Run the application
python system.py --video path/to/your/video.mp4 --bing [YOUR-API-KEY] --llm-provider [google|openai|groq] --llm-api-key [YOUR-API-KEY] --llm-model [model name]
(Pass an integer such as 0 to --video to use a webcam; default: 0)
- For checking arguments help
python system.py --help
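The flags above could be wired up with `argparse` roughly as follows. This is a minimal sketch of the documented interface, not the actual option handling in `system.py`, which may differ:

```python
import argparse

def build_parser():
    # Sketch of the CLI described above; names mirror the documented flags.
    parser = argparse.ArgumentParser(description="Scouter object detection system")
    parser.add_argument("--video", default="0",
                        help="Path to a video file, or an integer (0) for webcam")
    parser.add_argument("--bing", default=None,
                        help="Bing Search API key (optional; search is off by default)")
    parser.add_argument("--llm-provider", choices=["google", "openai", "groq"],
                        default="google", help="LLM backend to use")
    parser.add_argument("--llm-api-key", default=None,
                        help="API key for the chosen LLM provider")
    parser.add_argument("--llm-model", default=None,
                        help="Model name offered by the chosen provider")
    return parser

args = build_parser().parse_args(["--video", "0", "--llm-provider", "google"])
# A purely numeric --video value selects a webcam device index.
source = int(args.video) if args.video.isdigit() else args.video
```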
- Run the Tkinter application (after completing the setup steps above)
python tkinterapp.py
- Left Click:
- Click anywhere on the video to activate zero-shot object detection
- Click on an object to create a bounding box and start tracking
- Right Click:
- Releases the current bounding box and stops tracking
- F Key:
- Toggles real-time upscaling of the tracked object
- Default: Disabled
- T Key:
- First Press: Starts voice recording for subtitle generation
- Second Press: Stops recording and processes the subtitle
- S Key:
- Toggles Bing image search for the detected object
- Default: Disabled
- A Key:
- First Press: Starts prompt recording for LLM
- Second Press: Stops prompt recording and processes the LLM output
Provider | Model |
---|---|
Google | gemini-2.0-flash-exp |
 | gemini-1.5-flash |
 | gemini-1.5-flash-8b |
 | gemini-1.5-pro |
OpenAI | gpt-4o-2024-08-06 |
 | gpt-4o-mini-2024-07-18 |
 | o1-2024-12-17 |
 | gpt-3.5-turbo-0125 |
GROQ | llama-3.3-70b-versatile |
 | llama-3.2-90b-text-preview |
 | llama-3.2-11b-text-preview |
 | gemma2-9b-it |
 | mixtral-8x7b-32768 |
Anthropic | claude-3-5-sonnet-20241022 |
 | claude-3-opus-20240229 |
 | claude-3-5-haiku-20241022 |
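A provider/model pairing from the table can be validated before dispatching to any SDK. The mapping below copies the model names from the table; the helper function itself is illustrative, not the project's actual code:

```python
# Illustrative mapping of --llm-provider values to the models listed above.
SUPPORTED_MODELS = {
    "google": ["gemini-2.0-flash-exp", "gemini-1.5-flash",
               "gemini-1.5-flash-8b", "gemini-1.5-pro"],
    "openai": ["gpt-4o-2024-08-06", "gpt-4o-mini-2024-07-18",
               "o1-2024-12-17", "gpt-3.5-turbo-0125"],
    "groq": ["llama-3.3-70b-versatile", "llama-3.2-90b-text-preview",
             "llama-3.2-11b-text-preview", "gemma2-9b-it", "mixtral-8x7b-32768"],
    "anthropic": ["claude-3-5-sonnet-20241022", "claude-3-opus-20240229",
                  "claude-3-5-haiku-20241022"],
}

def validate_model(provider: str, model: str) -> str:
    """Return the model name if it is listed for the provider, else raise."""
    models = SUPPORTED_MODELS.get(provider)
    if models is None:
        raise ValueError(f"Unknown provider: {provider!r}")
    if model not in models:
        raise ValueError(f"{model!r} is not a listed model for {provider!r}")
    return model
```

Failing fast here gives a clearer error than letting an unknown model name reach the provider's API.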
- Space Key:
- First Press: Pauses video playback so object detection can be performed
- Second Press: Resumes video playback
- Tab Key:
- First Press: Shows the full on-screen instructions (default: hidden)
- Second Press: Hides the instructions
Action | Key/Button | Function |
---|---|---|
Select Object | Left Click | Activates detection & tracking |
Release Tracking | Right Click | Stops tracking current object |
Toggle Upscaling | F | Toggles real-time upscaling |
Voice Recording | T | Start/Stop subtitle recording |
Image Search | S | Activate Bing image search |
STT to LLM | A | Start/Stop prompt recording to LLM |
Play/Stop | Space | Play/Stop video |
Show Instructions | Tab | Show/Hide instructions |
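The toggle and press-twice behaviors in the table above are typically dispatched from a key-event loop (in OpenCV, from `cv2.waitKey`). The library-free sketch below models only that state logic; the names are illustrative, not the project's actual code:

```python
class ControlState:
    """Tracks the toggle/press-twice states described in the control table."""
    def __init__(self):
        self.upscaling = False      # F key (default: disabled)
        self.image_search = False   # S key (default: disabled)
        self.recording_stt = False  # T key: press once to start, again to stop
        self.recording_llm = False  # A key: press once to start, again to stop
        self.paused = False         # Space: pause/resume playback
        self.show_help = False      # Tab (default: hidden)

    def handle_key(self, key: str):
        # Each binding simply flips its flag; the main loop reacts to the flags.
        if key == "f":
            self.upscaling = not self.upscaling
        elif key == "s":
            self.image_search = not self.image_search
        elif key == "t":
            self.recording_stt = not self.recording_stt
        elif key == "a":
            self.recording_llm = not self.recording_llm
        elif key == " ":
            self.paused = not self.paused
        elif key == "\t":
            self.show_help = not self.show_help

state = ControlState()
state.handle_key("f")   # enable upscaling
state.handle_key("t")   # start subtitle recording
state.handle_key("t")   # stop recording and process the subtitle
```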
- Zero-shot object detection
- Real-time object tracking
- AI-powered image upscaling
- Speech-to-text subtitle generation
- Speech-to-LLM output generation
- Cyberpunk-style overlay display
- GPU: NVIDIA GPU with CUDA support
- Minimum VRAM: 8GB (For realtime Upscaling: 24GB)
- Tested on: NVIDIA RTX 4090
- Memory: 2GB
- Storage: 2.5GB
- Tested with 720p videos
- Processing speed without real-time upscaling: 30-40 FPS
- Processing speed under the heaviest workload: 10-20 FPS
- Current version is optimized for VRAM efficiency
- Peak VRAM usage: ~7GB during operation (~21GB with real-time upscaling)
- Further optimization may be available in future updates
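FPS figures like the ones above are usually taken as a rolling average over recent frame timestamps. A minimal, dependency-free sketch of such a meter (not the project's actual instrumentation):

```python
import time
from collections import deque

class FPSMeter:
    """Rolling-average FPS over the last `window` frame timestamps."""
    def __init__(self, window=30):
        self.times = deque(maxlen=window)

    def tick(self):
        # Call once per processed frame.
        self.times.append(time.perf_counter())

    def fps(self):
        if len(self.times) < 2:
            return 0.0
        elapsed = self.times[-1] - self.times[0]
        # n timestamps span n-1 frame intervals.
        return (len(self.times) - 1) / elapsed if elapsed > 0 else 0.0

meter = FPSMeter()
for _ in range(10):
    meter.tick()
    time.sleep(0.01)  # simulate roughly 10 ms of per-frame work
```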
This project utilizes several open-source models:
- MobileSAM: Zero-shot object detection - Apache-2.0 License
- Source: Ultralytics MobileSAM
- Fast-SRGAN: Real-time image upscaling - MIT License
- Source: Fast-SRGAN
- OpenAI Whisper: Speech-to-text processing - MIT License
- Source: OpenAI Whisper
- Florence 2: Image captioning - MIT License
- Source: Microsoft Florence
- Bing Search API: Image Search
- Source: Microsoft Azure
- Google Gemini API: LLM Response
- Source: Google AI
- OpenAI API: LLM Response
- Source: OpenAI
- GROQ API: LLM Response
- Source: GROQ
This project is licensed under the MIT License since all major components use either MIT or Apache-2.0 licenses. The MIT License is compatible with both and maintains the open-source nature of the utilized models.
This project uses the following APIs. Please ensure compliance with their respective terms of use:
- Bing Search API:
- Documentation: Bing Web Search API
- Pricing: Bing Search API Pricing
- Licensing: Usage of the Bing Search API is subject to Microsoft's Terms of Use and Privacy Statement.
- OpenAI API:
- Documentation: OpenAI API
- Terms of Use: OpenAI Terms of Use
- Licensing: OpenAI's API usage is governed by their Terms of Use and Usage Policies.
- Google GenAI API:
- Documentation: Google GenAI API
- Terms of Service: Google API Terms
- Licensing: Google's APIs are subject to the Google APIs Terms of Service.
- Groq API:
- Documentation: Groq API Documentation
- Licensing: Usage of the Groq API must comply with Groq's API policies. Specific licensing details can be found in their API documentation.
- Anthropic API:
- Documentation: Anthropic API Documentation
- Terms of Service: Anthropic Terms of Service
- Usage Guidelines: Anthropic Usage Policies
- Licensing: Use of the Anthropic API is subject to their Terms of Service and Acceptable Use Policy
To ensure the security of API keys, store them securely using environment variables or secret management solutions. Do not expose sensitive information in public repositories.
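For example, keys can be read from environment variables instead of being hard-coded. A minimal sketch (the variable name and helper are illustrative, not the project's actual configuration):

```python
import os

def load_api_key(env_var: str) -> str:
    """Read an API key from the environment, failing loudly if it is missing."""
    key = os.environ.get(env_var)
    if not key:
        raise RuntimeError(
            f"Set {env_var} in your environment; never commit keys to the repository."
        )
    return key

# Illustrative usage: in practice, export SCOUTER_BING_API_KEY before launching.
os.environ.setdefault("SCOUTER_BING_API_KEY", "dummy-key-for-demo")
key = load_api_key("SCOUTER_BING_API_KEY")
```

A `.env` file loaded at startup (and listed in `.gitignore`) achieves the same goal with less shell setup.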
See the LICENSE file for details.