AI-powered object detection system with real-time enhancement and accessibility features
- Lead System Developer: Minwoo Kim
- Azure AI Developer: Duhyeon Nam
- AI Code Engineer: Nakyung Cho
- Azure System Developer: Seungwoo Hong
- AI Code Engineer: Hyunjun Kim
- Azure AI Developer: Sumin Hyun
An AI system inspired by Cyberpunk 2077 that combines several advanced features:
- Zero-shot object detection with detailed captioning
- Real-time object enhancement and upscaling
- Speech-to-text (STT) functionality for accessibility
- Overlay display system styled after the Cyberpunk 2077 Scanner UI
- Extra information via the Bing Search API (default: disabled)
- On-screen LLM responses to speech (STT) prompts
- Clone the repository
git clone https://github.com/MinwooKim1990/Scouter_PJ.git
cd Scouter_PJ
- Install dependencies
pip install -r requirements.txt
- Prepare your data
mkdir data
Place your video files in the data folder
- Run the application
python system.py --video path/to/your/video.mp4 --bing [YOUR-API-KEY] --llm-provider [google|openai|groq] --llm-api-key [YOUR-API-KEY] --llm-model [model name]
(Pass an integer such as 0 to --video to use a webcam; default: 0)
- For checking arguments help
python system.py --help
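The flags above could be wired up with `argparse` roughly as follows. This is a minimal sketch of the documented interface, not the actual option handling in `system.py`, which may differ:

```python
import argparse

def build_parser():
    # Sketch of the CLI described above; names mirror the documented flags.
    parser = argparse.ArgumentParser(description="Scouter object detection system")
    parser.add_argument("--video", default="0",
                        help="Path to a video file, or an integer (0) for webcam")
    parser.add_argument("--bing", default=None,
                        help="Bing Search API key (optional; search is off by default)")
    parser.add_argument("--llm-provider", choices=["google", "openai", "groq"],
                        default="google", help="LLM backend to use")
    parser.add_argument("--llm-api-key", default=None,
                        help="API key for the chosen LLM provider")
    parser.add_argument("--llm-model", default=None,
                        help="Model name offered by the chosen provider")
    return parser

args = build_parser().parse_args(["--video", "0", "--llm-provider", "google"])
# A purely numeric --video value selects a webcam device index.
source = int(args.video) if args.video.isdigit() else args.video
```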
- Run the Tkinter application (after completing the setup steps above)
python tkinterapp.py
- Left Click:
- Click anywhere on the video to activate zero-shot object detection
- Click on an object to create a bounding box and start tracking
- Right Click:
- Releases the current bounding box and stops tracking
- F Key:
- Toggles real-time upscaling of the tracked object
- Default: Disabled
- T Key:
- First Press: Starts voice recording for subtitle generation
- Second Press: Stops recording and processes the subtitle
- S Key:
- Toggles Bing image search for the detected object
- Default: Disabled
- A Key:
- First Press: Starts prompt recording for LLM
- Second Press: Stops prompt recording and processes the LLM output
Provider | Model |
---|---|
Google | gemini-2.0-flash-exp |
 | gemini-1.5-flash |
 | gemini-1.5-flash-8b |
 | gemini-1.5-pro |
OpenAI | gpt-4o-2024-08-06 |
 | gpt-4o-mini-2024-07-18 |
 | o1-2024-12-17 |
 | gpt-3.5-turbo-0125 |
GROQ | llama-3.3-70b-versatile |
 | llama-3.2-90b-text-preview |
 | llama-3.2-11b-text-preview |
 | gemma2-9b-it |
 | mixtral-8x7b-32768 |
Anthropic | claude-3-5-sonnet-20241022 |
 | claude-3-opus-20240229 |
 | claude-3-5-haiku-20241022 |
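A provider/model pairing from the table can be validated before dispatching to any SDK. The mapping below copies the model names from the table; the helper function itself is illustrative, not the project's actual code:

```python
# Illustrative mapping of --llm-provider values to the models listed above.
SUPPORTED_MODELS = {
    "google": ["gemini-2.0-flash-exp", "gemini-1.5-flash",
               "gemini-1.5-flash-8b", "gemini-1.5-pro"],
    "openai": ["gpt-4o-2024-08-06", "gpt-4o-mini-2024-07-18",
               "o1-2024-12-17", "gpt-3.5-turbo-0125"],
    "groq": ["llama-3.3-70b-versatile", "llama-3.2-90b-text-preview",
             "llama-3.2-11b-text-preview", "gemma2-9b-it", "mixtral-8x7b-32768"],
    "anthropic": ["claude-3-5-sonnet-20241022", "claude-3-opus-20240229",
                  "claude-3-5-haiku-20241022"],
}

def validate_model(provider: str, model: str) -> str:
    """Return the model name if it is listed for the provider, else raise."""
    models = SUPPORTED_MODELS.get(provider)
    if models is None:
        raise ValueError(f"Unknown provider: {provider!r}")
    if model not in models:
        raise ValueError(f"{model!r} is not a listed model for {provider!r}")
    return model
```

Failing fast here gives a clearer error than letting an unknown model name reach the provider's API.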
- Space Key:
- First Press: Pauses video playback so object detection can be performed
- Second Press: Resumes video playback
- Tab Key:
- First Press: Shows the full on-screen instructions (default: hidden)
- Second Press: Hides the instructions
Action | Key/Button | Function |
---|---|---|
Select Object | Left Click | Activates detection & tracking |
Release Tracking | Right Click | Stops tracking current object |
Toggle Upscaling | F | Toggles real-time upscaling |
Voice Recording | T | Start/Stop subtitle recording |
Image Search | S | Activate Bing image search |
STT to LLM | A | Start/Stop prompt recording to LLM |
Play/Stop | Space | Play/Stop video |
Show Instructions | Tab | Show/Hide instructions |
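The toggle and press-twice behaviors in the table above are typically dispatched from a key-event loop (in OpenCV, from `cv2.waitKey`). The library-free sketch below models only that state logic; the names are illustrative, not the project's actual code:

```python
class ControlState:
    """Tracks the toggle/press-twice states described in the control table."""
    def __init__(self):
        self.upscaling = False      # F key (default: disabled)
        self.image_search = False   # S key (default: disabled)
        self.recording_stt = False  # T key: press once to start, again to stop
        self.recording_llm = False  # A key: press once to start, again to stop
        self.paused = False         # Space: pause/resume playback
        self.show_help = False      # Tab (default: hidden)

    def handle_key(self, key: str):
        # Each binding simply flips its flag; the main loop reacts to the flags.
        if key == "f":
            self.upscaling = not self.upscaling
        elif key == "s":
            self.image_search = not self.image_search
        elif key == "t":
            self.recording_stt = not self.recording_stt
        elif key == "a":
            self.recording_llm = not self.recording_llm
        elif key == " ":
            self.paused = not self.paused
        elif key == "\t":
            self.show_help = not self.show_help

state = ControlState()
state.handle_key("f")   # enable upscaling
state.handle_key("t")   # start subtitle recording
state.handle_key("t")   # stop recording and process the subtitle
```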
- Zero-shot object detection
- Real-time object tracking
- AI-powered image upscaling
- Speech-to-text subtitle generation
- Speech-to-LLM output generation
- Cyberpunk-style overlay display
- GPU: NVIDIA GPU with CUDA support
- Minimum VRAM: 8GB (For realtime Upscaling: 24GB)
- Tested on: NVIDIA RTX 4090
- Memory: 2GB
- Storage: 2.5GB
- Tested with 720p videos
- Processing speed without real-time upscaling: 30-40 FPS
- Processing speed under the heaviest workload: 10-20 FPS
- Current version is optimized for VRAM efficiency
- Peak VRAM usage: ~7GB during operation (~21GB with real-time upscaling)
- Further optimization may be available in future updates
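FPS figures like the ones above are usually taken as a rolling average over recent frame timestamps. A minimal, dependency-free sketch of such a meter (not the project's actual instrumentation):

```python
import time
from collections import deque

class FPSMeter:
    """Rolling-average FPS over the last `window` frame timestamps."""
    def __init__(self, window=30):
        self.times = deque(maxlen=window)

    def tick(self):
        # Call once per processed frame.
        self.times.append(time.perf_counter())

    def fps(self):
        if len(self.times) < 2:
            return 0.0
        elapsed = self.times[-1] - self.times[0]
        # n timestamps span n-1 frame intervals.
        return (len(self.times) - 1) / elapsed if elapsed > 0 else 0.0

meter = FPSMeter()
for _ in range(10):
    meter.tick()
    time.sleep(0.01)  # simulate roughly 10 ms of per-frame work
```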
This project utilizes several open-source models:
- MobileSAM: Zero-shot object detection - Apache-2.0 License
- Source: Ultralytics MobileSAM
- Fast-SRGAN: Real-time image upscaling - MIT License
- Source: Fast-SRGAN
- OpenAI Whisper: Speech-to-text processing - MIT License
- Source: OpenAI Whisper
- Florence 2: Image captioning - MIT License
- Source: Microsoft Florence
- Bing Search API: Image Search
- Source: Microsoft Azure
- Google Gemini API: LLM Response
- Source: Google AI
- OpenAI API: LLM Response
- Source: OpenAI
- GROQ API: LLM Response
- Source: GROQ
This project is licensed under the MIT License since all major components use either MIT or Apache-2.0 licenses. The MIT License is compatible with both and maintains the open-source nature of the utilized models.
This project uses the following APIs. Please ensure compliance with their respective terms of use:
- Bing Search API:
- Documentation: Bing Web Search API
- Pricing: Bing Search API Pricing
- Licensing: Usage of the Bing Search API is subject to Microsoft's Terms of Use and Privacy Statement.
- OpenAI API:
- Documentation: OpenAI API
- Terms of Use: OpenAI Terms of Use
- Licensing: OpenAI's API usage is governed by their Terms of Use and Usage Policies.
- Google GenAI API:
- Documentation: Google GenAI API
- Terms of Service: Google API Terms
- Licensing: Google's APIs are subject to the Google APIs Terms of Service.
- Groq API:
- Documentation: Groq API Documentation
- Licensing: Usage of the Groq API must comply with Groq's API policies. Specific licensing details can be found in their API documentation.
- Anthropic API:
- Documentation: Anthropic API Documentation
- Terms of Service: Anthropic Terms of Service
- Usage Guidelines: Anthropic Usage Policies
- Licensing: Use of the Anthropic API is subject to their Terms of Service and Acceptable Use Policy
To ensure the security of API keys, store them securely using environment variables or secret management solutions. Do not expose sensitive information in public repositories.
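For example, keys can be read from environment variables instead of being hard-coded. A minimal sketch (the variable name and helper are illustrative, not the project's actual configuration):

```python
import os

def load_api_key(env_var: str) -> str:
    """Read an API key from the environment, failing loudly if it is missing."""
    key = os.environ.get(env_var)
    if not key:
        raise RuntimeError(
            f"Set {env_var} in your environment; never commit keys to the repository."
        )
    return key

# Illustrative usage: in practice, export SCOUTER_BING_API_KEY before launching.
os.environ.setdefault("SCOUTER_BING_API_KEY", "dummy-key-for-demo")
key = load_api_key("SCOUTER_BING_API_KEY")
```

A `.env` file loaded at startup (and listed in `.gitignore`) achieves the same goal with less shell setup.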
See the LICENSE file for details.