ChaofanTao/Autoregressive-Models-in-Vision-Survey
If you like our project, please give us a star ⭐ on GitHub to follow the latest updates.


Autoregressive models have made significant progress in generating high-quality content by modeling dependencies sequentially. This repo is a curated list of papers on the latest advances in autoregressive models in vision. It is actively updated, so please stay tuned!
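Concretely, these models factorize the distribution over a visual sequence $x = (x_1, \dots, x_T)$, whose elements may be pixels, discrete tokens, or scales, into next-element conditionals, and generate by sampling one element at a time:

$$p(x) = \prod_{t=1}^{T} p(x_t \mid x_1, \dots, x_{t-1})$$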

Paper: Autoregressive Models in Vision: A Survey | [Chinese explainer]

Authors: Jing Xiong¹,†, Gongye Liu²,†, Lun Huang³, Chengyue Wu¹, Taiqiang Wu¹, Yao Mu¹, Yuan Yao⁴, Hui Shen⁵, Zhongwei Wan⁵, Jinfa Huang⁴, Chaofan Tao¹,‡, Shen Yan⁶, Huaxiu Yao⁷, Lingpeng Kong¹, Hongxia Yang⁹, Mi Zhang⁵, Guillermo Sapiro⁸,¹⁰, Jiebo Luo⁴, Ping Luo¹, Ngai Wong¹

¹The University of Hong Kong, ²Tsinghua University, ³Duke University, ⁴University of Rochester, ⁵The Ohio State University, ⁶Bytedance, ⁷The University of North Carolina at Chapel Hill, ⁸Apple, ⁹The Hong Kong Polytechnic University, ¹⁰Princeton University

†Core Contributors, ‡Corresponding Author

📣 Update News

[2024-11-11] We have released the survey: Autoregressive Models in Vision: A Survey.

[2024-10-13] We initialized the repository.

⚡ Contributing

We welcome feedback, suggestions, and contributions that can help improve this survey and repository and make them valuable resources for the entire community. We will actively maintain this repository by incorporating new research as it emerges. Please let us know if you have suggestions about our taxonomy, notice any missing papers, or spot an arXiv preprint that has since been accepted at a venue.

If you would like to add your work or model to this list, please do not hesitate to open a pull request, using the following Markdown format:

* [**Name of Conference or Journal + Year**] Paper Name. [[Paper]](link) [[Code]](link)
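For example, an entry for the VQGAN paper already in this list would read (with `link` replaced by the actual URLs):

* [**CVPR, 2021**] Taming Transformers for High-Resolution Image Synthesis. [[Paper]](link) [[Code]](link)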

📖 Table of Contents


Image Generation

Unconditional/Class-Conditioned Image Generation

  • Pixel-wise Generation
    • [ICML, 2020] ImageGPT: Generative Pretraining from Pixels Paper
    • [ICML, 2018] Image Transformer Paper Code
    • [ICML, 2018] PixelSNAIL: An Improved Autoregressive Generative Model Paper Code
    • [ICML, 2017] Parallel Multiscale Autoregressive Density Estimation Paper
    • [ICLR workshop, 2017] Gated PixelCNN: Generating Interpretable Images with Controllable Structure Paper
    • [ICLR, 2017] PixelCNN++: Improving the PixelCNN with Discretized Logistic Mixture Likelihood and Other Modifications Paper Code
    • [NeurIPS, 2016] PixelCNN Conditional Image Generation with PixelCNN Decoders Paper Code
    • [ICML, 2016] PixelRNN Pixel Recurrent Neural Networks Paper Code
  • Token-wise Generation
    Tokenizer
    • [Arxiv, 2024.12] Next Patch Prediction for Autoregressive Visual Generation Paper Code
    • [Arxiv, 2024.12] XQ-GAN: An Open-source Image Tokenization Framework for Autoregressive Generation Paper Code
    • [Arxiv, 2024.12] RandAR: Decoder-only Autoregressive Visual Generation in Random Orders. Paper Code Project
    • [Arxiv, 2024.11] Randomized Autoregressive Visual Generation. Paper Code Project
    • [Arxiv, 2024.09] Open-MAGVIT2: Democratizing Autoregressive Visual Generation Paper Code
    • [Arxiv, 2024.06] OmniTokenizer: A Joint Image-Video Tokenizer for Visual Generation Paper Code
    • [Arxiv, 2024.06] Scaling the Codebook Size of VQGAN to 100,000 with a Utilization Rate of 99% Paper Code
    • [Arxiv, 2024.06] Titok An Image is Worth 32 Tokens for Reconstruction and Generation Paper Code
    • [Arxiv, 2024.06] Wavelets Are All You Need for Autoregressive Image Generation Paper
    • [Arxiv, 2024.06] LlamaGen Autoregressive Model Beats Diffusion: Llama for Scalable Image Generation Paper Code
    • [ICLR, 2024] MAGVIT-v2 Language Model Beats Diffusion -- Tokenizer is Key to Visual Generation Paper
    • [ICLR, 2024] FSQ Finite scalar quantization: Vq-vae made simple Paper Code
    • [ICCV, 2023] Efficient-VQGAN: Towards High-Resolution Image Generation with Efficient Vision Transformers Paper
    • [CVPR, 2023] Towards Accurate Image Coding: Improved Autoregressive Image Generation with Dynamic Vector Quantization Paper Code
    • [CVPR, 2023, Highlight] MAGVIT: Masked Generative Video Transformer Paper
    • [NeurIPS, 2023] MoVQ: Modulating Quantized Vectors for High-Fidelity Image Generation Paper
    • [BMVC, 2022] Unconditional image-text pair generation with multimodal cross quantizer Paper Code
    • [CVPR, 2022] RQ-VAE Autoregressive Image Generation Using Residual Quantization Paper Code
    • [ICLR, 2022] ViT-VQGAN Vector-quantized Image Modeling with Improved VQGAN Paper
    • [PMLR, 2021] Generating images with sparse representations Paper
    • [CVPR, 2021] VQGAN Taming Transformers for High-Resolution Image Synthesis Paper Code
    • [NeurIPS, 2019] Generating Diverse High-Fidelity Images with VQ-VAE-2 Paper Code
    • [NeurIPS, 2017] VQ-VAE Neural Discrete Representation Learning Paper
    Autoregressive Modeling
    • [Arxiv, 2024.12] E-CAR: Efficient Continuous Autoregressive Image Generation via Multistage Modeling Paper
    • [Arxiv, 2024.12] Infinity: Scaling Bitwise AutoRegressive Modeling for High-Resolution Image Synthesis Paper Code
    • [Arxiv, 2024.12] Taming Scalable Visual Tokenizer for Autoregressive Image Generation Paper Code
    • [Arxiv, 2024.11] Sample- and Parameter-Efficient Auto-Regressive Image Models Paper Code
    • [Arxiv, 2024.10] ImageFolder: Autoregressive Image Generation with Folded Tokens Paper Code
    • [Arxiv, 2024.10] SAR Customize Your Visual Autoregressive Recipe with Set Autoregressive Modeling Paper Code
    • [Arxiv, 2024.08] AiM Scalable Autoregressive Image Generation with Mamba Paper Code
    • [Arxiv, 2024.06] ARM Autoregressive Pretraining with Mamba in Vision Paper Code
    • [Arxiv, 2024.06] MAR Autoregressive Image Generation without Vector Quantization Paper Code
    • [Arxiv, 2024.06] LlamaGen Autoregressive Model Beats Diffusion: Llama for Scalable Image Generation Paper Code
    • [Arxiv, 2024.01] Scalable Pre-training of Large Autoregressive Image Models Paper Code
    • [ICML, 2024] DARL: Denoising Autoregressive Representation Learning Paper
    • [ICML, 2024] DisCo-Diff: Enhancing Continuous Diffusion Models with Discrete Latents Paper Code
    • [ICML, 2024] DeLVM: Data-efficient Large Vision Models through Sequential Autoregression Paper Code
    • [AAAI, 2023] SAIM Exploring Stochastic Autoregressive Image Modeling for Visual Representation Paper Code
    • [NeurIPS, 2021] ImageBART: Context with Multinomial Diffusion for Autoregressive Image Synthesis Paper Code
    • [CVPR, 2021] VQGAN Taming Transformers for High-Resolution Image Synthesis Paper Code
    • [ECCV, 2020] RAL: Incorporating Reinforced Adversarial Learning in Autoregressive Image Generation Paper
    • [NeurIPS, 2019] Generating Diverse High-Fidelity Images with VQ-VAE-2 Paper Code
    • [NeurIPS, 2017] VQ-VAE Neural Discrete Representation Learning Paper
  • Scale-wise Generation
    • [Arxiv, 2024.12] SWITTI: Designing Scale-Wise Transformers for Text-to-Image Synthesis Paper Code Page
    • [Arxiv, 2024.11] M-VAR: Decoupled Scale-wise Autoregressive Modeling for High-Quality Image Generation Paper Code
    • [Arxiv, 2024.04] Visual Autoregressive Modeling: Scalable Image Generation via Next-Scale Prediction Paper Code
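The three groups above differ mainly in what counts as the "next element": a pixel, a discrete VQ token, or an entire resolution scale. As a rough illustration of the dominant token-wise recipe (a VQ-VAE/VQGAN-style tokenizer plus a LlamaGen-style causal prior), here is a minimal, self-contained PyTorch sketch; all class and variable names are hypothetical, and real systems train both stages on image data rather than the random features used here.

```python
# Minimal two-stage sketch of token-wise autoregressive image generation.
# Illustrative only: names are hypothetical, not from any specific paper.
import torch
import torch.nn as nn
import torch.nn.functional as F

class VectorQuantizer(nn.Module):
    """Stage 1 (tokenizer): map continuous features to discrete codebook ids
    via nearest-neighbor lookup, as in VQ-VAE/VQGAN."""
    def __init__(self, num_codes=512, dim=64):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, dim)

    def forward(self, z):                                   # z: (B, N, dim)
        d = (z.unsqueeze(2) - self.codebook.weight).pow(2).sum(-1)  # (B, N, K)
        idx = d.argmin(-1)                                  # token ids, (B, N)
        z_q = self.codebook(idx)
        z_q = z + (z_q - z).detach()                        # straight-through gradient
        return z_q, idx

class TinyARPrior(nn.Module):
    """Stage 2 (prior): a causal transformer predicting the next token id."""
    def __init__(self, num_codes=512, dim=64, n_layers=2):
        super().__init__()
        self.embed = nn.Embedding(num_codes, dim)
        layer = nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(dim, num_codes)

    def forward(self, idx):                                 # idx: (B, T) token ids
        T = idx.size(1)
        causal = torch.full((T, T), float("-inf")).triu(1)  # mask future positions
        h = self.blocks(self.embed(idx), mask=causal)
        return self.head(h)                                 # (B, T, num_codes) logits

@torch.no_grad()
def sample(prior, num_codes=512, seq_len=16):
    """Next-token sampling: draw one token at a time, conditioned on the prefix."""
    idx = torch.randint(num_codes, (1, 1))                  # arbitrary start token
    for _ in range(seq_len - 1):
        probs = F.softmax(prior(idx)[:, -1], dim=-1)
        idx = torch.cat([idx, torch.multinomial(probs, 1)], dim=1)
    return idx                          # a VQ decoder would map ids back to pixels

vq = VectorQuantizer()
_, tokens = vq(torch.randn(1, 16, 64))                      # 16 features -> 16 token ids
print(sample(TinyARPrior()).shape)                          # torch.Size([1, 16])
```

The straight-through estimator in the tokenizer is what lets gradients flow through the non-differentiable nearest-neighbor lookup when the two stages are trained; scale-wise methods such as VAR replace the flat token sequence with a coarse-to-fine sequence of token maps.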

Text-to-Image Generation

  • Token-wise Generation
    • [Arxiv, 2024.11] High-Resolution Image Synthesis via Next-Token Prediction Paper Code
    • [Arxiv, 2024.10] Fluid: Scaling Autoregressive Text-to-image Generative Models with Continuous Tokens Paper
    • [Arxiv, 2024.10] HART: Efficient Visual Generation with Hybrid Autoregressive Transformer Paper Code
    • [Arxiv, 2024.10] DART: Denoising Autoregressive Transformer for Scalable Text-to-Image Generation Paper Code
    • [Arxiv, 2024.10] DnD-Transformer: A Spark of Vision-Language Intelligence: 2-Dimensional Autoregressive Transformer for Efficient Fine-grained Image Generation Paper Code
    • [Arxiv, 2024.08] Lumina-mGPT: Illuminate Flexible Photorealistic Text-to-Image Generation with Multimodal Generative Pretraining Paper Code
    • [Arxiv, 2024.07] MARS: Mixture of Auto-Regressive Models for Fine-grained Text-to-image Synthesis Paper Code
    • [Arxiv, 2024.06] LLM4GEN: Leveraging Semantic Representation of LLMs for Text-to-Image Generation Paper Code
    • [Arxiv, 2024.06] STAR: Scale-wise Text-to-image generation via Auto-Regressive representations Paper Code
    • [Arxiv, 2024.05] Kaleido Diffusion: Improving Conditional Diffusion Models with Autoregressive Latent Modeling Paper
    • [CVPR, 2024] Beyond Text: Frozen Large Language Models in Visual Signal Comprehension Paper Code
    • [TOG, 2023] IconShop: Text-Guided Vector Icon Synthesis with Autoregressive Transformers (SVG image) Paper Code
    • [NeurIPS, 2023] LQAE Language Quantized AutoEncoders: Towards Unsupervised Text-Image Alignment Paper Code
    • [TMLR, 2022.06] Parti: Scaling Autoregressive Models for Content-Rich Text-to-Image Generation Paper Code
    • [NeurIPS, 2022] CogView2: Faster and Better Text-to-Image Generation via Hierarchical Transformers Paper Code
    • [ECCV, 2022] Make-A-Scene: Scene-Based Text-to-Image Generation with Human Priors Paper
    • [CVPR, 2022] VQ-Diffusion: Vector Quantized Diffusion Model for Text-to-Image Synthesis Paper Code
    • [CVPR, 2022] Make-A-Story: Visual Memory Conditioned Consistent Story Generation (storytelling) Paper
    • [NeurIPS, 2021] CogView: Mastering Text-to-Image Generation via Transformers Paper Code
    • [Arxiv, 2021.02] DALL-E 1: Zero-Shot Text-to-Image Generation Paper
  • Scale-wise Generation
    • [Arxiv, 2024.08] VAR-CLIP: Text-to-Image Generator with Visual Auto-Regressive Modeling Paper Code
    • [Arxiv, 2024.06] STAR: Scale-wise Text-to-image generation via Auto-Regressive representations Paper Code

Image-to-Image Translation

  • [ICML Workshop, 2024] MIS Many-to-many Image Generation with Auto-regressive Diffusion Models Paper
  • [Arxiv, 2024.03] SceneScript: Reconstructing Scenes With An Autoregressive Structured Language Model Paper Project
  • [CVPR, 2024] Sequential modeling enables scalable learning for large vision models Paper
  • [ECCV, 2022] QueryOTR: Outpainting by Queries Paper Code
  • [NeurIPS, 2022] Visual prompting via image inpainting Paper
  • [MM, 2021] Diverse image inpainting with bidirectional and autoregressive transformers Paper

Image Editing

  • [Arxiv, 2024.06] CAR: Controllable Autoregressive Modeling for Visual Generation Paper Code
  • [Arxiv, 2024.06] ControlAR: Controllable Image Generation with Autoregressive Models Paper Code
  • [Arxiv, 2024.06] ControlVAR: Exploring Controllable Visual Autoregressive Modeling Paper Code
  • [Arxiv, 2024.06] Medical Vision Generalist: Unifying Medical Imaging Tasks in Context Paper
  • [Arxiv, 2024.04] M2M Many-to-many Image Generation with Auto-regressive Diffusion Models Paper
  • [ECCV, 2022] VQGAN-CLIP: Open Domain Image Generation and Editing with Natural Language Guidance Paper
  • [ECCV, 2022] Make-A-Scene: Scene-Based Text-to-Image Generation with Human Priors Paper
  • [ICIP, 2021] MSGNet: Generating annotated high-fidelity images containing multiple coherent objects Paper

Video Generation

Unconditional Video Generation

  • [Arxiv, 2024.10] LARP: Tokenizing Videos with a Learned Autoregressive Generative Prior Paper
  • [ECCV, 2024] ST-LLM: Large Language Models Are Effective Temporal Learners Paper Code
  • [ICLR, 2024] MAGVIT-v2 Language Model Beats Diffusion -- Tokenizer is Key to Visual Generation Paper
  • [CVPR, 2023] PVDM Video Probabilistic Diffusion Models in Projected Latent Space Paper
  • [ECCV, 2022] Long Video Generation with Time-Agnostic VQGAN and Time-Sensitive Transformer Paper Code
  • [Arxiv, 2021.04] VideoGPT: Video generation using VQ-VAE and transformers Paper
  • [Arxiv, 2020.06] Latent Video Transformer Paper Code
  • [ICLR, 2020] Scaling Autoregressive Video Models Paper
  • [CVPR, 2018] MoCoGAN: Decomposing Motion and Content for Video Generation Paper Code
  • [ICML, 2017] Video Pixel Networks Paper

Conditional Video Generation

  • Text-to-Video Generation
    • [Arxiv, 2024.12] DiCoDe: Diffusion-Compressed Deep Tokens for Autoregressive Video Generation with Language Models Paper Page
    • [Arxiv, 2024.11] Ca2-VDM: Efficient Autoregressive Video Diffusion Model with Causal Generation and Cache Sharing Paper Code
    • [Arxiv, 2024.10] ARLON: Boosting Diffusion Transformers with Autoregressive Models for Long Video Generation Paper Code
    • [Arxiv, 2024.10] Progressive Autoregressive Video Diffusion Models Paper Code
    • [Arxiv, 2024.10] Pyramid Flow: Pyramidal Flow Matching for Efficient Video Generative Modeling Paper Code
    • [Arxiv, 2024.10] Loong: Generating Minute-level Long Videos with Autoregressive Language Models Paper
    • [Arxiv, 2024.06] Pandora: Towards General World Model with Natural Language Actions and Video States Paper Code
    • [Arxiv, 2024.06] iVideoGPT: Interactive VideoGPTs are Scalable World Models Paper Code
    • [Arxiv, 2024.06] ViD-GPT: Introducing GPT-style Autoregressive Generation in Video Diffusion Models Paper Code
    • [Arxiv, 2024.02] LWM World Model on Million-Length Video And Language With Blockwise RingAttention Paper Code
    • [CVPR, 2024] ART-V: Auto-Regressive Text-to-Video Generation with Diffusion Models Paper
    • [NeurIPS, 2022] NUWA-Infinity: Autoregressive over Autoregressive Generation for Infinite Visual Synthesis Paper Code
    • [ECCV, 2022] NÜWA: Visual Synthesis Pre-training for Neural visUal World creAtion Paper Code
    • [Arxiv, 2022.05] CogVideo: Large-scale Pretraining for Text-to-Video Generation via Transformers Paper Code
    • [Arxiv, 2022.05] GODIVA: Generating Open-DomaIn Videos from nAtural Descriptions Paper
    • [IJCAI, 2021] IRC-GAN: Introspective Recurrent Convolutional GAN for Text-to-video Generation. Paper
  • Visual Conditional Video Generation
    • [Arxiv, 2024.10] MarDini: Masked Autoregressive Diffusion for Video Generation at Scale Paper
    • [CVPR, 2024] LVM Sequential Modeling Enables Scalable Learning for Large Vision Models Paper Code
    • [ICIP, 2022] HARP: Autoregressive Latent Video Prediction with High-Fidelity Image Generator Paper
    • [Arxiv, 2021.03] Predicting Video with VQVAE Paper
    • [CVPR, 2021] Stochastic Image-to-Video Synthesis using cINNs Paper Code
    • [ICLR, 2019] Eidetic 3d lstm: A model for video prediction and beyond Paper
    • [ICLR, 2018] Stochastic variational video prediction Paper
    • [NeurIPS, 2017] Predrnn: Recurrent neural networks for predictive learning using spatiotemporal lstms Paper
    • [NeurIPS, 2015] Convolutional LSTM network: A machine learning approach for precipitation nowcasting Paper
  • Multimodal Conditional Video Generation
    • [Arxiv, 2024.12] Autoregressive Video Generation without Vector Quantization Paper Code
    • [ICML, 2024] Video-LaVIT: Unified Video-Language Pre-training with Decoupled Visual-Motional Tokenization Paper Code
    • [ICML, 2024] VideoPoet: A Large Language Model for Zero-Shot Video Generation Paper
    • [CVPR, 2023] MAGVIT: Masked Generative Video Transformer Paper
    • [CVPR, 2022] Make it move: controllable image-to-video generation with text descriptions Paper Code

Embodied AI

  • [Arxiv, 2024.12] Diffusion-VLA: Scaling Robot Foundation Models via Unified Diffusion and Autoregression Paper Page
  • [Arxiv, 2024.10] Gr-2: A generative video-language-action model with web-scale knowledge for robot manipulation Paper
  • [Arxiv, 2024.05] iVideoGPT: Interactive VideoGPTs are Scalable World Models Paper
  • [ICML, 2024] Genie: Generative interactive environments Paper
  • [ICLR, 2024] GR-1 Unleashing Large-Scale Video Generative Pre-training for Visual Robot Manipulation Paper
  • [ICLR, 2023] IRIS Transformers are sample-efficient world models Paper

3D Generation

Motion Generation

  • [Arxiv, 2024] ScaMo: Exploring the Scaling Law in Autoregressive Motion Generation Model Paper Code
  • [AAAI, 2024] AMD: Autoregressive Motion Diffusion Paper Code
  • [ECCV, 2024] BAMM: Bidirectional Autoregressive Motion Model Paper Code
  • [CVPR, 2023] T2M-GPT: Generating Human Motion from Textual Descriptions with Discrete Representations Paper
  • [Arxiv, 2022] HiT-DVAE: Human Motion Generation via Hierarchical Transformer Dynamical VAE Paper
  • [ICCV, 2021 oral] HuMoR: 3D Human Motion Model for Robust Pose Estimation Paper Code

Point Cloud Generation

  • [Arxiv, 2024.02] Pushing Auto-regressive Models for 3D Shape Generation at Capacity and Scalability Paper
  • [CVPR workshop, 2023] Octree transformer: Autoregressive 3d shape generation on hierarchically structured sequences Paper
  • [ECCV, 2022] Autoregressive 3D Shape Generation via Canonical Mapping Paper

3D Medical Generation

  • [Arxiv, 2024] Autoregressive Sequence Modeling for 3D Medical Image Representation Paper
  • [Arxiv, 2024] Medical Vision Generalist: Unifying Medical Imaging Tasks in Context Paper Code
  • [MIDL, 2024] Conditional Generation of 3D Brain Tumor ROIs via VQGAN and Temporal-Agnostic Masked Transformer Paper
  • [NMI, 2024] Realistic morphology-preserving generative modelling of the brain Paper Code
  • [Arxiv, 2023] Generating 3D Brain Tumor Regions in MRI using Vector-Quantization Generative Adversarial Networks Paper (medical image-to-image translation)
  • [ICCV, 2023] Unaligned 2D to 3D Translation with Conditional Vector-Quantized Code Diffusion using Transformers Paper Code
  • [MICCAI, 2022] Morphology-preserving Autoregressive 3D Generative Modelling of the Brain Paper Code (medical image generation)

Multimodal Generation

Unified Understanding and Generation Multi-Modal LLMs

  • [Arxiv, 2024.12] LlamaFusion: Adapting Pretrained Language Models for Multimodal Generation Paper
  • [Arxiv, 2024.12] MetaMorph: Multimodal Understanding and Generation via Instruction Tuning Paper Page
  • [Arxiv, 2024.12] Orthus: Autoregressive Interleaved Image-Text Generation with Modality-Specific Heads Paper
  • [Arxiv, 2024.12] Multimodal Latent Language Modeling with Next-Token Diffusion. Paper
  • [Arxiv, 2024.12] ILLUME: Illuminating Your LLMs to See, Draw, and Self-Enhance. Paper
  • [Arxiv, 2024.11] JanusFlow: Harmonizing Autoregression and Rectified Flow for Unified Multimodal Understanding and Generation. Paper Project
  • [Arxiv, 2024.11] Unified Generative and Discriminative Training for Multi-modal Large Language Models. Paper Project
  • [Arxiv, 2024.10] D-JEPA: Denoising with a Joint-Embedding Predictive Architecture Paper Project
  • [Arxiv, 2024.10] Janus: Decoupling Visual Encoding for Unified Multimodal Understanding and Generation Paper Code
  • [Arxiv, 2024.10] MMAR: Towards Lossless Multi-Modal Auto-Regressive Probabilistic Modeling Paper Code
  • [Arxiv, 2024.10] ACDC: Autoregressive Coherent Multimodal Generation using Diffusion Correction Paper Code
  • [Arxiv, 2024.09] Emu3: Next-Token Prediction is All You Need. Paper Code Project
  • [Arxiv, 2024.09] VILA-U: a Unified Foundation Model Integrating Visual Understanding and Generation Paper Code
  • [Arxiv, 2024.09] MIO: A Foundation Model on Multimodal Tokens Paper
  • [Arxiv, 2024.08] Show-o: One Single Transformer to Unify Multimodal Understanding and Generation Paper Code
  • [Arxiv, 2024.08] Transfusion: Predict the Next Token and Diffuse Images with One Multi-Modal Model Paper Code
  • [Arxiv, 2024.07] SEED-Story: Multimodal Long Story Generation with Large Language Model Paper Code
  • [Arxiv, 2024.05] Chameleon: Mixed-Modal Early-Fusion Foundation Models Paper Code
  • [Arxiv, 2024.04] SEED-X Multimodal Models with UnifiedMulti-granularity Comprehension and Generation Paper Code
  • [ICML, 2024] Libra: Building Decoupled Vision System on Large Language Models Paper Code
  • [CVPR, 2024] Unified-IO 2: Scaling Autoregressive Multimodal Models with Vision, Language, Audio, and Action Paper Code
  • [CVPR, 2024] Anole: An Open, Autoregressive and Native Multimodal Models for Interleaved Image-Text Generation Paper Code
  • [Arxiv, 2023.11] InstructSeq: Unifying Vision Tasks with Instruction-conditioned Multi-modal Sequence Generation Paper Code
  • [ICLR, 2024] Kosmos-G: Generating Images in Context with Multimodal Large Language Models Paper Code
  • [ICLR, 2024] LaVIT: Unified Language-Vision Pretraining in LLM with Dynamic Discrete Visual Tokenization Paper Code
  • [ICLR, 2024] SEED-LLaMA Making LLaMA SEE and Draw with SEED Tokenizer Paper Code
  • [ICLR, 2024] EMU Generative Pretraining in Multimodality Paper Code
  • [Arxiv, 2023.09] CM3Leon: Scaling Autoregressive Multi-Modal Models: Pretraining and Instruction Tuning Paper Code
  • [Arxiv, 2023.07] SEED Planting a SEED of Vision in Large Language Model Paper Code
  • [NeurIPS, 2023] SPAE: Semantic Pyramid AutoEncoder for Multimodal Generation with Frozen LLMs Paper
  • [ICLR, 2023] Unified-IO: A unified model for vision, language, and multi-modal tasks Paper Code
  • [ICML, 2023] Grounding Language Models to Images for Multimodal Inputs and Outputs Paper Code
  • [NeurIPS, 2022] Flamingo: a Visual Language Model for Few-Shot Learning Paper
  • [Arxiv, 2021.12] ERNIE-ViLG: Unified Generative Pre-training for Bidirectional Vision-Language Generation Paper
  • [KDD, 2021] M6: A Chinese Multimodal Pretrainer Paper

Other Generation

  • [TII, 2025] VarAD: Lightweight High-Resolution Image Anomaly Detection via Visual Autoregressive Modeling Paper Code
  • [Arxiv, 2024.12] DriveGPT: Scaling Autoregressive Behavior Models for Driving Paper
  • [Arxiv, 2024.12] DrivingGPT: Unifying Driving World Modeling and Planning with Multi-modal Autoregressive Transformers Paper Page
  • [Arxiv, 2024.12] Advancing Auto-Regressive Continuation for Video Frames Paper
  • [Arxiv, 2024.12] It Takes Two: Real-time Co-Speech Two-person’s Interaction Generation via Reactive Auto-regressive Diffusion Model Paper
  • [Arxiv, 2024.12] X-Prompt: Towards Universal In-Context Image Generation in Auto-Regressive Vision Language Foundation Models Paper Code
  • [Arxiv, 2024.12] 3D-WAG: Hierarchical Wavelet-Guided Autoregressive Generation for High-Fidelity 3D Shapes Paper Code
  • [Arxiv, 2024.11] SAR3D: Autoregressive 3D Object Generation and Understanding via Multi-scale 3D VQVAE Paper Code
  • [Arxiv, 2024.11] Scalable Autoregressive Monocular Depth Estimation Paper
  • [Arxiv, 2024.11] LLaMA-Mesh: Unifying 3D Mesh Generation with Language Models Paper Code
  • [Arxiv, 2024.10] DART: A Diffusion-Based Autoregressive Motion Model for Real-Time Text-Driven Motion Control Paper
  • [Arxiv, 2024.10] Autoregressive Action Sequence Learning for Robotic Manipulation Paper Code
  • [Arxiv, 2024.09] BAD: Bidirectional Auto-regressive Diffusion for Text-to-Motion Generation Paper Code
  • [Arxiv, 2024.07] Video In-context Learning Paper
  • [CVPR, 2024] Sequential Modeling Enables Scalable Learning for Large Vision Models Paper Code
  • [AAAI, 2024] Autoregressive Omni-Aware Outpainting for Open-Vocabulary 360-Degree Image Generation Paper Code
  • [Arxiv, 2024] LM4LV: A Frozen Large Language Model for Low-level Vision Tasks Paper Code
  • [CVPR, 2024] ARTrackV2: Prompting Autoregressive Tracker Where to Look and How to Describe Paper Code
  • [CVPR, 2023 Highlight] Autoregressive Visual Tracking Paper Code
  • [CVPR, 2023] Visual Chain of Thought: Bridging Logical Gaps with Multimodal Infillings Paper
  • [NeurIPS, 2022] Visual Prompting via Image Inpainting Paper Code
  • [EMNLP, 2022] MAGMA – Multimodal Augmentation of Generative Models through Adapter-based Finetuning Paper
  • [NeurIPS, 2021] Multimodal Few-Shot Learning with Frozen Language Models Paper
  • [ECCV, 2020] Autoregressive Unsupervised Image Segmentation Paper

Accelerating & Stability & Analysis & Scaling

  • [Arxiv, 2024.12] Parallelized Autoregressive Visual Generation Paper Code
  • [Arxiv, 2024.12] Distilled Decoding 1: One-step Sampling of Image Auto-regressive Models with Flow Matching Paper Code
  • [Arxiv, 2024.12] 3D representation in 512-Byte: Variational tokenizer is the key for autoregressive 3D generation Paper Page
  • [Arxiv, 2024.12] JetFormer: An autoregressive generative model of raw images and text Paper
  • [Arxiv, 2024.11] Collaborative Decoding Makes Visual Auto-Regressive Modeling Efficient Paper Code
  • [Arxiv, 2024.11] Continuous Speculative Decoding for Autoregressive Image Generation Paper Code
  • [Arxiv, 2024.10] Diffusion Beats Autoregressive: An Evaluation of Compositional Generation in Text-to-Image Models Paper
  • [Arxiv, 2024.10] Elucidating the Design Space of Language Models for Image Generation Paper Code
  • [NeurIPS, 2024] Stabilize the Latent Space for Image Autoregressive Modeling: A Unified Perspective Paper Code
  • [Arxiv, 2024.10] Accelerating Auto-regressive Text-to-Image Generation with Training-free Speculative Jacobi Decoding Paper
  • [Arxiv, 2024.09] Pre-trained Language Models Do Not Help Auto-regressive Text-to-Image Generation Paper
  • [ECCV, 2024] An Image is Worth 1/2 Tokens After Layer 2: Plug-and-Play Inference Acceleration for Large Vision-Language Models Paper Code
  • [Arxiv, 2020] Scaling Laws for Autoregressive Generative Modeling Paper
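For context on the final entry above, that scaling-law study fits test loss with a power law plus an irreducible term, roughly of the form

$$L(x) = L_\infty + \left(\frac{x_0}{x}\right)^{\alpha_x}$$

where $x$ is model size, data, or compute and $L_\infty$, $x_0$, $\alpha_x$ are fitted constants; the constants are domain-specific, so treat this as the schematic form rather than a quantitative claim.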

Tutorial

Evaluation Metrics

| Metric | Analysis Type | Reference |
| --- | --- | --- |
| Inception Score (IS) ↑ | Quantitative | Salimans et al., 2016 |
| Fréchet Inception Distance (FID) ↓ | Quantitative | Heusel et al., 2017 |
| Kernel Inception Distance (KID) ↓ | Quantitative | Binkowski et al., 2018 |
| Precision and Recall ↑ | Quantitative | Powers, 2020 |
| CLIP Maximum Mean Discrepancy ↓ | Quantitative | Jayasumana et al., 2023 |
| CLIP Score ↑ | Quantitative | Hessel et al., 2021 |
| R-precision ↑ | Quantitative | Craswell et al., 2009 |
| Perceptual Path Length ↓ | Quantitative | Karras et al., 2019 |
| Fréchet Video Distance (FVD) ↓ | Quantitative | Unterthiner et al., 2019 |
| Aesthetic (Expert Evaluation) ↑ | Qualitative | Based on domain expertise |
| Turing Test | Qualitative | Turing, 1950 |
| User Studies (ratings, satisfaction) ↑ | Qualitative | Varies with study methodology |
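As one concrete example from the table, FID compares Gaussian fits to encoder features of real and generated images. Below is a minimal NumPy/SciPy sketch of the closed-form distance; it assumes you already have two feature matrices (e.g., Inception activations), so feature extraction is omitted, and the array names are hypothetical.

```python
# Minimal FID sketch: Frechet distance between Gaussian fits of two feature sets.
import numpy as np
from scipy import linalg

def frechet_distance(real_feats, fake_feats):
    """real_feats, fake_feats: (N, D) arrays of encoder activations."""
    mu1, mu2 = real_feats.mean(0), fake_feats.mean(0)
    s1 = np.cov(real_feats, rowvar=False)
    s2 = np.cov(fake_feats, rowvar=False)
    covmean = linalg.sqrtm(s1 @ s2).real        # matrix sqrt; drop tiny imaginary parts
    return float(((mu1 - mu2) ** 2).sum() + np.trace(s1 + s2 - 2 * covmean))

rng = np.random.default_rng(0)
real = rng.normal(size=(256, 64))               # stand-ins for real-image features
fake = rng.normal(loc=0.1, size=(256, 64))      # stand-ins for generated-image features
print(frechet_distance(real, fake))             # lower = distributions more similar
```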

Star History

Star History Chart

♥️ Contributors

📑 Citation

Please consider citing 📑 our paper if this repository is helpful to your work. Thank you sincerely!

@misc{xiong2024autoregressive,
    title={Autoregressive Models in Vision: A Survey},
    author={Jing Xiong and Gongye Liu and Lun Huang and Chengyue Wu and Taiqiang Wu and Yao Mu and Yuan Yao and Hui Shen and Zhongwei Wan and Jinfa Huang and Chaofan Tao and Shen Yan and Huaxiu Yao and Lingpeng Kong and Hongxia Yang and Mi Zhang and Guillermo Sapiro and Jiebo Luo and Ping Luo and Ngai Wong},
    year={2024},
    eprint={2411.05902},
    archivePrefix={arXiv},
    primaryClass={cs.CV}
}
