A list of papers, mostly but not only about speech synthesis 😀.
- TTS Frontend
- Acoustic Model
- Vocoder
- TTS towards Stylization
- Voice Conversion
- Singing
- Speech Processing Related
- Natural Language Processing
- VAE & GAN
- Others
## TTS Frontend

- Pre-trained Text Representations for Improving Front-End Text Processing in Mandarin Text-to-Speech Synthesis (Interspeech 2019)
- A unified sequence-to-sequence front-end model for Mandarin text-to-speech synthesis (ICASSP 2020)
- A hybrid text normalization system using multi-head self-attention for mandarin (ICASSP 2020)
- Unified Mandarin TTS Front-end Based on Distilled BERT Model (2021-01)
## Acoustic Model

- Tacotron V1: Tacotron: Towards End-to-End Speech Synthesis (Interspeech 2017)
- Tacotron V2: Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions (ICASSP 2018)
- Deep Voice V1: Deep Voice: Real-time Neural Text-to-Speech (ICML 2017)
- Deep Voice V2: Deep Voice 2: Multi-Speaker Neural Text-to-Speech (NeurIPS 2017)
- Deep Voice V3: Deep Voice 3: Scaling Text-to-Speech with Convolutional Sequence Learning (ICLR 2018)
- Transformer-TTS: Neural Speech Synthesis with Transformer Network (AAAI 2019)
- DurIAN: DurIAN: Duration Informed Attention Network For Multimodal Synthesis (2019)
- Location-Relative Attention Mechanisms For Robust Long-Form Speech Synthesis (ICASSP 2020)
- Flowtron (flow based): Flowtron: an Autoregressive Flow-based Generative Network for Text-to-Speech Synthesis (2020)
- Non-Attentive Tacotron: Robust and Controllable Neural TTS Synthesis Including Unsupervised Duration Modeling (under review ICLR 2021)
- RobuTrans (towards robust): RobuTrans: A Robust Transformer-Based Text-to-Speech Model (AAAI 2020)
- DeviceTTS: DeviceTTS: A Small-Footprint, Fast, Stable Network for On-Device Text-to-Speech (2020-10)
- ParaNet: Non-Autoregressive Neural Text-to-Speech (ICML 2020)
- FastSpeech: FastSpeech: Fast, Robust and Controllable Text to Speech (NeurIPS 2019)
- JDI-T: JDI-T: Jointly trained Duration Informed Transformer for Text-To-Speech without Explicit Alignment (2020)
- EATS: End-to-End Adversarial Text-to-Speech (2020)
- FastSpeech 2: FastSpeech 2: Fast and High-Quality End-to-End Text to Speech (2020)
- FastPitch: FastPitch: Parallel Text-to-speech with Pitch Prediction (2020)
- Glow-TTS (flow based, Monotonic Attention): Glow-TTS: A Generative Flow for Text-to-Speech via Monotonic Alignment Search (NeurIPS 2020)
- Flow-TTS (flow based): Flow-TTS: A Non-Autoregressive Network for Text to Speech Based on Flow (ICASSP 2020)
- SpeedySpeech: SpeedySpeech: Efficient Neural Speech Synthesis (Interspeech 2020)
- Parallel Tacotron: Parallel Tacotron: Non-Autoregressive and Controllable TTS (2020)
- Wave-Tacotron: Wave-Tacotron: Spectrogram-free end-to-end text-to-speech synthesis (2020-11)
- Monotonic Attention: Online and Linear-Time Attention by Enforcing Monotonic Alignments (ICML 2017)
- Monotonic Chunkwise Attention: Monotonic Chunkwise Attention (ICLR 2018)
- Forward Attention in Sequence-to-sequence Acoustic Modelling for Speech Synthesis (ICASSP 2018)
- RNN-T for TTS: Initial investigation of an encoder-decoder end-to-end TTS framework using marginalization of monotonic hard latent alignments (2019)
- Location-Relative Attention Mechanisms For Robust Long-Form Speech Synthesis (ICASSP 2020)
- Non-Attentive Tacotron: Robust and Controllable Neural TTS Synthesis Including Unsupervised Duration Modeling (under review ICLR 2021)
- EfficientTTS: EfficientTTS: An Efficient and High-Quality Text-to-Speech Architecture (2020-12)
- Semi-Supervised Training for Improving Data Efficiency in End-to-End Speech Synthesis (2018)
- Almost Unsupervised Text to Speech and Automatic Speech Recognition (ICML 2019)
- Unsupervised Learning For Sequence-to-sequence Text-to-speech For Low-resource Languages (Interspeech 2020)
- Multilingual Speech Synthesis: One Model, Many Languages: Meta-learning for Multilingual Text-to-Speech (Interspeech 2020)
- Low-resource expressive text-to-speech using data augmentation (2020-11)
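Several of the alignment-study papers above (Glow-TTS in particular) replace soft attention with a dynamic-programming search for the single best monotonic alignment between mel frames and text tokens. A minimal pure-Python sketch of such a monotonic alignment search; the function name and the assumption of a precomputed per-frame log-likelihood matrix are illustrative, not taken verbatim from any one paper:

```python
import math

def monotonic_alignment_search(log_p):
    """Find the monotonic alignment of mel frames to text tokens that
    maximizes total log-likelihood, via dynamic programming.

    log_p[t][s]: log-likelihood of mel frame t under text token s.
    Each frame attends to exactly one token; the token index never
    decreases and advances by at most one per frame (as in Glow-TTS's
    monotonic alignment search). Returns align with align[t] = token index.
    """
    T, S = len(log_p), len(log_p[0])
    NEG = -math.inf
    # Q[t][s]: best cumulative score of any valid path ending at (t, s)
    Q = [[NEG] * S for _ in range(T)]
    Q[0][0] = log_p[0][0]
    for t in range(1, T):
        for s in range(S):
            stay = Q[t - 1][s]                      # keep the same token
            move = Q[t - 1][s - 1] if s > 0 else NEG  # advance one token
            best = max(stay, move)
            if best > NEG:
                Q[t][s] = best + log_p[t][s]
    # Backtrack from (T-1, S-1): the path must consume all tokens.
    align = [0] * T
    s = S - 1
    align[T - 1] = s
    for t in range(T - 1, 0, -1):
        if s > 0 and Q[t - 1][s - 1] >= Q[t - 1][s]:
            s -= 1
        align[t - 1] = s
    return align
```

At training time Glow-TTS alternates this hard-alignment search with likelihood maximization; at inference the alignment is produced by a duration predictor instead.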
## Vocoder

- WaveNet: WaveNet: A Generative Model for Raw Audio (2016)
- WaveRNN: Efficient Neural Audio Synthesis (ICML 2018)
- WaveGAN: Adversarial Audio Synthesis (ICLR 2019)
- LPCNet: LPCNet: Improving Neural Speech Synthesis Through Linear Prediction (ICASSP 2019)
- Towards achieving robust universal neural vocoding (Interspeech 2019)
- GAN-TTS: High Fidelity Speech Synthesis with Adversarial Networks (2019)
- MultiBand-WaveRNN: DurIAN: Duration Informed Attention Network For Multimodal Synthesis (2019)
- Parallel-WaveNet: Parallel WaveNet: Fast High-Fidelity Speech Synthesis (2017)
- WaveGlow: WaveGlow: A Flow-based Generative Network for Speech Synthesis (2018)
- Parallel-WaveGAN: Parallel WaveGAN: A fast waveform generation model based on generative adversarial networks with multi-resolution spectrogram (2019)
- MelGAN: MelGAN: Generative Adversarial Networks for Conditional Waveform Synthesis (NeurIPS 2019)
- MultiBand-MelGAN: Multi-band MelGAN: Faster Waveform Generation for High-Quality Text-to-Speech (2020)
- VocGAN: VocGAN: A High-Fidelity Real-time Vocoder with a Hierarchically-nested Adversarial Network (Interspeech 2020)
- WaveGrad: WaveGrad: Estimating Gradients for Waveform Generation (2020)
- DiffWave: DiffWave: A Versatile Diffusion Model for Audio Synthesis (2020)
- HiFi-GAN: HiFi-GAN: Generative Adversarial Networks for Efficient and High Fidelity Speech Synthesis (NeurIPS 2020)
- Parallel-WaveGAN (New): Parallel waveform synthesis based on generative adversarial networks with voicing-aware conditional discriminators (2020-10)
- Improved parallel WaveGAN vocoder with perceptually weighted spectrogram loss (SLT 2021)
- Universal Vocoder Based on Parallel WaveNet: Universal Neural Vocoding with Parallel WaveNet (ICASSP 2021)
- LightSpeech: LightSpeech: Lightweight and Fast Text to Speech with Neural Architecture Search (ICASSP 2021)
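The diffusion vocoders listed above (WaveGrad, DiffWave) are trained to invert a fixed forward process that gradually noises the waveform. The closed-form marginal of that process is easy to state; a small sketch, with `alpha_bar` standing for the cumulative product of the noise schedule at the chosen step and the noise `eps` passed in explicitly for determinism:

```python
import math

def diffuse(x0, alpha_bar, eps):
    """Sample x_t ~ q(x_t | x_0) for the forward diffusion process used by
    diffusion vocoders such as WaveGrad and DiffWave:

        x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * eps

    x0:        clean waveform samples (list of floats)
    alpha_bar: cumulative product of the noise schedule at step t, in [0, 1]
    eps:       standard-normal noise, same length as x0
    """
    a = math.sqrt(alpha_bar)
    b = math.sqrt(1.0 - alpha_bar)
    return [a * x + b * e for x, e in zip(x0, eps)]
```

The network is then trained to predict `eps` from `x_t` (and the mel conditioning), which is what makes sampling a sequence of learned denoising steps.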
## TTS towards Stylization

- ReferenceEncoder-Tacotron: Towards End-to-End Prosody Transfer for Expressive Speech Synthesis with Tacotron (ICML 2018)
- GST-Tacotron: Style Tokens: Unsupervised Style Modeling, Control and Transfer in End-to-End Speech Synthesis (ICML 2018)
- Predicting Expressive Speaking Style From Text In End-To-End Speech Synthesis (2018)
- GMVAE-Tacotron2: Hierarchical Generative Modeling for Controllable Speech Synthesis (ICLR 2019)
- BERT-TTS: Towards Transfer Learning for End-to-End Speech Synthesis from Deep Pre-Trained Language Models (2019)
- (Multi-style Decouple): Multi-Reference Neural TTS Stylization with Adversarial Cycle Consistency (2019)
- (Multi-style Decouple): Multi-reference Tacotron by Intercross Training for Style Disentangling, Transfer and Control in Speech Synthesis (Interspeech 2019)
- Mellotron: Mellotron: Multispeaker expressive voice synthesis by conditioning on rhythm, pitch and global style tokens (2019)
- Flowtron (flow based): Flowtron: an Autoregressive Flow-based Generative Network for Text-to-Speech Synthesis (2020)
- (local style): Fully-hierarchical fine-grained prosody modeling for interpretable speech synthesis (ICASSP 2020)
- Controllable Neural Prosody Synthesis (Interspeech 2020)
- GraphSpeech: GraphSpeech: Syntax-Aware Graph Attention Network For Neural Speech Synthesis (2020-10)
- BERT-TTS: Improving Prosody Modelling with Cross-Utterance BERT Embeddings for End-to-end Speech Synthesis (2020-11)
- (Global Emotion Style Control): Controllable Emotion Transfer For End-to-End Speech Synthesis (2020-11)
- (Phone Level Style Control): Fine-grained Emotion Strength Transfer, Control and Prediction for Emotional Speech Synthesis (2020-11)
- (Phone Level Prosody Modelling): Mixture Density Network for Phone-Level Prosody Modelling in Speech Synthesis (ICASSP 2021)
- Meta-Learning for TTS: Sample Efficient Adaptive Text-to-Speech (ICLR 2019)
- SV-Tacotron: Transfer Learning from Speaker Verification to Multispeaker Text-To-Speech Synthesis (NeurIPS 2018)
- Deep Voice V3: Deep Voice 3: Scaling Text-to-Speech with Convolutional Sequence Learning (ICLR 2018)
- Zero-Shot Multi-Speaker Text-To-Speech with State-of-the-art Neural Speaker Embeddings (ICASSP 2020)
- MultiSpeech: MultiSpeech: Multi-Speaker Text to Speech with Transformer (2020)
- SC-WaveRNN: Speaker Conditional WaveRNN: Towards Universal Neural Vocoder for Unseen Speaker and Recording Conditions (Interspeech 2020)
- MultiSpeaker Dataset: AISHELL-3: A Multi-speaker Mandarin TTS Corpus and the Baselines (2020)
## Voice Conversion

- (introduce PPG into voice conversion): Phonetic posteriorgrams for many-to-one voice conversion without parallel data training (2016)
- A Vocoder-free WaveNet Voice Conversion with Non-Parallel Data (2019)
- TTS-Skins: TTS Skins: Speaker Conversion via ASR (2019)
- One-shot Voice Conversion by Separating Speaker and Content Representations with Instance Normalization (InterSpeech 2019)
- Cotatron (combine text information with voice conversion system): Cotatron: Transcription-Guided Speech Encoder for Any-to-Many Voice Conversion without Parallel Data (Interspeech 2020)
- (TTS & ASR): Voice Conversion by Cascading Automatic Speech Recognition and Text-to-Speech Synthesis with Prosody Transfer (Interspeech 2020)
- FragmentVC (wav to vec): FragmentVC: Any-to-Any Voice Conversion by End-to-End Extracting and Fusing Fine-Grained Voice Fragments With Attention (2020)
- Towards Natural and Controllable Cross-Lingual Voice Conversion Based on Neural TTS Model and Phonetic Posteriorgram (ICASSP 2021)
- VAE-VC (VAE based): Voice Conversion from Non-parallel Corpora Using Variational Auto-encoder (2016)
- (Speech representation learning by VQ-VAE): Unsupervised speech representation learning using WaveNet autoencoders (2019)
- Blow (Flow based): Blow: a single-scale hyperconditioned flow for non-parallel raw-audio voice conversion (NeurIPS 2019)
- AutoVC: AUTOVC: Zero-Shot Voice Style Transfer with Only Autoencoder Loss (2019)
- F0-AutoVC: F0-consistent many-to-many non-parallel voice conversion via conditional autoencoder (ICASSP 2020)
- One-Shot Voice Conversion by Vector Quantization (ICASSP 2020)
- SpeechFlow (auto-encoder): Unsupervised Speech Decomposition via Triple Information Bottleneck (ICML 2020)
- CycleGAN-VC V1: Parallel-Data-Free Voice Conversion Using Cycle-Consistent Adversarial Networks (2017)
- StarGAN-VC: StarGAN-VC: Non-parallel Many-to-Many Voice Conversion with Star Generative Adversarial Networks (2018)
- CycleGAN-VC V2: CycleGAN-VC2: Improved CycleGAN-based Non-parallel Voice Conversion (2019)
- CycleGAN-VC V3: CycleGAN-VC3: Examining and Improving CycleGAN-VCs for Mel-spectrogram Conversion (2020)
## Singing

- XiaoIce Band: XiaoIce Band: A Melody and Arrangement Generation Framework for Pop Music (KDD 2018)
- Mellotron: Mellotron: Multispeaker expressive voice synthesis by conditioning on rhythm, pitch and global style tokens (2019)
- ByteSing: ByteSing: A Chinese Singing Voice Synthesis System Using Duration Allocated Encoder-Decoder Acoustic Models and WaveRNN Vocoders (2020)
- JukeBox: Jukebox: A Generative Model for Music (2020)
- XiaoIce Sing: XiaoiceSing: A High-Quality and Integrated Singing Voice Synthesis System (2020)
- HiFiSinger: HiFiSinger: Towards High-Fidelity Neural Singing Voice Synthesis (2020)
- Sequence-to-sequence Singing Voice Synthesis with Perceptual Entropy Loss (2020)
- Learn2Sing: Learn2Sing: Target Speaker Singing Voice Synthesis by learning from a Singing Teacher (2020-11)
- A Universal Music Translation Network (2018)
- Unsupervised Singing Voice Conversion (Interspeech 2019)
- PitchNet: PitchNet: Unsupervised Singing Voice Conversion with Pitch Adversarial Network (ICASSP 2020)
- DurIAN-SC: DurIAN-SC: Duration Informed Attention Network based Singing Voice Conversion System (Interspeech 2020)
- Speech-to-Singing Conversion based on Boundary Equilibrium GAN (Interspeech 2020)
- PPG-based singing voice conversion with adversarial representation learning (2020)
## Speech Processing Related

- Audio-Word2Vec: Audio Word2Vec: Unsupervised Learning of Audio Segment Representations using Sequence-to-sequence Autoencoder (2016)
- SpeechBERT: SpeechBERT: An Audio-and-text Jointly Learned Language Model for End-to-end Spoken Question Answering (2019)
- Improving Transformer-based Speech Recognition Using Unsupervised Pre-training (2019)
- TasNet: TasNet: time-domain audio separation network for real-time, single-channel speech separation (ICASSP 2018)
- Conv-TasNet: Conv-TasNet: Surpassing Ideal Time-Frequency Magnitude Masking for Speech Separation (2019)
- DeepSpeaker: Deep Speaker: an End-to-End Neural Speaker Embedding System (2017)
- GE2E Loss: Generalized End-to-End Loss for Speaker Verification (ICASSP 2018)
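The GE2E loss listed above scores every utterance embedding against each speaker centroid (excluding the utterance itself from its own centroid) and applies a softmax over speakers. A minimal pure-Python sketch under that reading; in the paper `w` and `b` are learnable scale/bias parameters (initialized to 10 and -5), fixed here as constants for illustration:

```python
import math

def _cos(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def ge2e_softmax_loss(emb, w=10.0, b=-5.0):
    """GE2E softmax loss, averaged over utterances.

    emb[j][i]: embedding vector of utterance i of speaker j
    (N speakers, M utterances per speaker, M >= 2).
    """
    N, M, dim = len(emb), len(emb[0]), len(emb[0][0])
    # Full per-speaker centroids.
    cent = [[sum(emb[j][i][d] for i in range(M)) / M for d in range(dim)]
            for j in range(N)]
    total = 0.0
    for j in range(N):
        for i in range(M):
            sims = []
            for k in range(N):
                if k == j:
                    # Exclude the utterance itself from its own centroid.
                    c = [(cent[j][d] * M - emb[j][i][d]) / (M - 1)
                         for d in range(dim)]
                else:
                    c = cent[k]
                sims.append(w * _cos(emb[j][i], c) + b)
            # Softmax cross-entropy against the true speaker j.
            total += -sims[j] + math.log(sum(math.exp(s) for s in sims))
    return total / (N * M)
```

Well-separated speaker clusters drive the loss toward zero, which is what makes the resulting embeddings usable for speaker verification and for speaker conditioning in multi-speaker TTS (e.g. SV-Tacotron above).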
## Natural Language Processing

- LSTM: Long Short-term Memory (1997)
- GRU: Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation (EMNLP 2014)
- TCN: An Empirical Evaluation of Generic Convolutional and Recurrent Networks for Sequence Modeling (2018)
- Transformer: Attention Is All You Need (NIPS 2017)
- Transformer-XL: Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context (ACL 2019)
- Reformer: Reformer: The Efficient Transformer (ICLR 2020)
- Awesome Repositories: transformers
- BERT: BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding (NAACL 2019)
- XLNET: XLNet: Generalized Autoregressive Pretraining for Language Understanding (NeurIPS 2019)
- ALBERT: ALBERT: A Lite BERT for Self-supervised Learning of Language Representations (ICLR 2020)
- A Study of Non-autoregressive Model for Sequence Generation (ACL 2020)
- Deterministic Non-Autoregressive Neural Sequence Modeling by Iterative Refinement (EMNLP 2018)
- Non-Autoregressive Neural Machine Translation (ICLR 2018)
- Non-Autoregressive Machine Translation with Auxiliary Regularization (AAAI 2019)
- Mask-Predict: Parallel Decoding of Conditional Masked Language Models (EMNLP 2019)
- Awesome Paper List: awesome-speech-translation
- Direct speech-to-speech translation with a sequence-to-sequence model (Interspeech 2019)
- NeurST: NeurST: Neural Speech Translation Toolkit (2020-12)
- Review 2019: Neural Machine Reading Comprehension: Methods and Trends (2019)
- Review 2020: A Survey on Machine Reading Comprehension: Tasks, Evaluation Metrics, and Benchmark Datasets (2020)
- NMRC first: Teaching Machines to Read and Comprehend (NIPS 2015)
- RACE dataset: RACE: Large-scale ReAding Comprehension Dataset From Examinations (EMNLP 2017)
- Cloze test: Large-scale Cloze Test Dataset Created by Teachers (EMNLP 2018)
- HuggingFace: HuggingFace's Transformers: State-of-the-art Natural Language Processing (2019)
## VAE & GAN

- VAE: Auto-Encoding Variational Bayes (ICLR 2014)
- GM-VAE: Deep Unsupervised Clustering with Gaussian Mixture Variational Autoencoders (ICLR 2017)
- VQ-VAE: Neural Discrete Representation Learning (NIPS 2017)
- VQ-VAE 2: Generating Diverse High-Fidelity Images with VQ-VAE-2 (NeurIPS 2019)
- GAN: Generative Adversarial Networks (NIPS 2014)
- Condition-GAN: Conditional Generative Adversarial Nets (2014)
- Info-GAN: InfoGAN: Interpretable Representation Learning by Information Maximizing Generative Adversarial Nets (2016)
- SeqGAN: SeqGAN: Sequence Generative Adversarial Nets with Policy Gradient (AAAI 2017)
- Cycle-GAN: Unpaired Image-to-Image Translation using Cycle-Consistent Adversarial Networks (ICCV 2017)
- Star-GAN: StarGAN: Unified Generative Adversarial Networks for Multi-Domain Image-to-Image Translation (CVPR 2018)
- BigGAN: Large Scale GAN Training for High Fidelity Natural Image Synthesis (ICLR 2019)
- Style-GAN: A Style-Based Generator Architecture for Generative Adversarial Networks (CVPR 2019)
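The VAE papers above all train by maximizing the ELBO, whose regularizer is a closed-form KL divergence between the diagonal-Gaussian posterior and a standard normal. A small sketch of that term (the function name is illustrative):

```python
import math

def gaussian_kl(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ) in closed form, the
    regularizer in the VAE's ELBO:

        KL = -0.5 * sum(1 + log_var - mu^2 - exp(log_var))

    mu, log_var: per-dimension posterior mean and log-variance.
    """
    return -0.5 * sum(1.0 + lv - m * m - math.exp(lv)
                      for m, lv in zip(mu, log_var))
```

The term vanishes exactly when the posterior equals the prior and grows as the posterior drifts away, which is the pressure that shapes the latent spaces exploited by GMVAE-Tacotron, VQ-VAE-style models, and the VAE-based voice-conversion work listed above.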
## Others

- (Forgetting learning): An Empirical Study of Example Forgetting during Deep Neural Network Learning (ICLR 2019)
- ScaNN (search accelerating): Accelerating Large-Scale Inference with Anisotropic Vector Quantization (ICML 2020)
- (memory management): Efficient Memory Management for Deep Neural Net Inference (2020)