We tackle the challenging task of generating recipes from videos using only pre-trained models. We divide recipe generation into a pipeline of modules: event generation, frame extraction, frame featurization, removal of redundant frames, frame enhancement, frame captioning, and summarization with an LLM. Each stage uses a pre-trained model suited to its task. The pipeline exploits the temporal nature of videos, uses image embeddings, and relies on LLMs to extract meaningful content and generate recipes efficiently. We evaluate the quality of the generated recipes using several metrics.
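As an illustration of the redundancy-removal stage, the following is a minimal sketch, not the project's actual implementation. It assumes each frame has already been turned into an embedding vector, and it drops a frame when its cosine similarity to the last kept frame reaches a threshold. The function names, the choice of cosine similarity, and the 0.95 threshold are all assumptions made for this example; the report describes the method actually used.

```python
import math
from typing import List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def drop_redundant_frames(
    embeddings: Sequence[Sequence[float]],
    threshold: float = 0.95,  # hypothetical value, not taken from the project
) -> List[int]:
    """Return the indices of frames to keep, in temporal order.

    A frame is kept only if it is sufficiently different from the most
    recently kept frame, which relies on frames being in video order.
    """
    kept: List[int] = []
    for index, embedding in enumerate(embeddings):
        if not kept or cosine_similarity(embeddings[kept[-1]], embedding) < threshold:
            kept.append(index)
    return kept


# Toy 2-D "embeddings": frames 1 and 3 are near-duplicates of frames 0 and 2.
toy_embeddings = [[1.0, 0.0], [0.99, 0.01], [0.0, 1.0], [0.01, 0.99], [1.0, 0.0]]
kept_indices = drop_redundant_frames(toy_embeddings)
print(kept_indices)  # [0, 2, 4]
```

Comparing each frame only with the last kept frame, rather than with every earlier frame, keeps the pass linear in the number of frames and lets a scene that recurs later in the video be kept again.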
For a detailed explanation, refer to the report and the video presentation, which contains demos: https://docs.google.com/presentation/d/1R0FjAj_QXoLjxR3NsZRVTFgYu-KnKj2BpN4EcOvyuGI/edit#slide=id.g21eccad0113_0_38