Skip to content

Latest commit

 

History

History
85 lines (53 loc) · 3.29 KB

README.md

File metadata and controls

85 lines (53 loc) · 3.29 KB

🎞 VLog: Video as a Long Document

Open in Spaces Tweet

Given a long video, we turn it into a doc containing visual + audio info. By sending this doc to ChatGPT, we can chat over the video!

vlog

News

  • 23/April/2023: We release Huggingface gradio demo!
  • 20/April/2023: We release our project on github and local gradio demo!

To Do List

Done

  • LLM Reasoner: ChatGPT (multilingual) + LangChain
  • Vision Captioner: BLIP2 + GRIT
  • ASR Translator: Whisper (multilingual)
  • Video Segmenter: KTS
  • Huggingface Space

Doing

  • Optimize the codebase efficiency
  • Improve Vision Models: MiniGPT-4 / LLaVA, Family of Segment-anything
  • Improve ASR Translator for better alignment
  • Introduce Temporal dependency
  • Replace ChatGPT with own trained LLM

🧸 Examples

[ News - GPT4 launch event ]GPT4 launch event
[ TV series - 征服之华强买瓜 ]华强买瓜
[ TV series - The Big Bang Theory ]The Big Bang Theory
[ Travel video - Travel in Rome ]Travel in Rome
[ Vlog - Basketball training ]Basketball training

🔨 Preparation

Please find installation instructions in install.md.

🌟 Start here

Run in cmd

python main.py --video_path examples/buy_watermelon.mp4 --openai_api_key xxxxx

The generated video document will be generated and saved in examples/buy_watermelon.log

Run in Gradio

python main_gradio.py --openai_api_key xxxxx

🙋 Suggestion

Stay tuned for our project 🔥

If you have more suggestions or functions need to be implemented in this codebase, feel free to drop us an email [email protected], [email protected] or open an issue.

😊 Acknowledgment

This work is based on ChatGPT, BLIP2, GRIT, KTS, Whisper, LangChain, Image2Paragraph.