A list of papers, mostly but not only about speech synthesis 😀.
- TTS Frontend
- Acoustic Model
- Vocoder
- TTS towards Stylization
- Voice Conversion
- Singing
- Speech Processing Related
- Natural Language Processing
- VAE & GAN
- Others
## TTS Frontend

- Pre-trained Text Representations for Improving Front-End Text Processing in Mandarin Text-to-Speech Synthesis (Interspeech 2019)
- A unified sequence-to-sequence front-end model for Mandarin text-to-speech synthesis (ICASSP 2020)
- A hybrid text normalization system using multi-head self-attention for mandarin (ICASSP 2020)
- Unified Mandarin TTS Front-end Based on Distilled BERT Model (2021-01)
## Acoustic Model

- Tacotron V1: Tacotron: Towards End-to-End Speech Synthesis (Interspeech 2017)
- Tacotron V2: Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions (ICASSP 2018)
- Deep Voice V1: Deep Voice: Real-time Neural Text-to-Speech (ICML 2017)
- Deep Voice V2: Deep Voice 2: Multi-Speaker Neural Text-to-Speech (NeurIPS 2017)
- Deep Voice V3: Deep Voice 3: Scaling Text-to-Speech with Convolutional Sequence Learning (ICLR 2018)
- Transformer-TTS: Neural Speech Synthesis with Transformer Network (AAAI 2019)
- DurIAN: DurIAN: Duration Informed Attention Network For Multimodal Synthesis (2019)
- Location-Relative Attention Mechanisms For Robust Long-Form Speech Synthesis (ICASSP 2020)
- Flowtron (flow based): Flowtron: an Autoregressive Flow-based Generative Network for Text-to-Speech Synthesis (2020)
- Non-Attentive Tacotron: Robust and Controllable Neural TTS Synthesis Including Unsupervised Duration Modeling (under review ICLR 2021)
- RobuTrans (towards robust): RobuTrans: A Robust Transformer-Based Text-to-Speech Model (AAAI 2020)
- DeviceTTS: DeviceTTS: A Small-Footprint, Fast, Stable Network for On-Device Text-to-Speech (2020-10)
- ParaNet: Non-Autoregressive Neural Text-to-Speech (ICML 2020)
- FastSpeech: FastSpeech: Fast, Robust and Controllable Text to Speech (NeurIPS 2019)
- JDI-T: JDI-T: Jointly trained Duration Informed Transformer for Text-To-Speech without Explicit Alignment (2020)
- EATS: End-to-End Adversarial Text-to-Speech (2020)
- FastSpeech 2: FastSpeech 2: Fast and High-Quality End-to-End Text to Speech (2020)
- FastPitch: FastPitch: Parallel Text-to-speech with Pitch Prediction (2020)
- Glow-TTS (flow based, Monotonic Attention): Glow-TTS: A Generative Flow for Text-to-Speech via Monotonic Alignment Search (NeurIPS 2020)
- Flow-TTS (flow based): Flow-TTS: A Non-Autoregressive Network for Text to Speech Based on Flow (ICASSP 2020)
- SpeedySpeech: SpeedySpeech: Efficient Neural Speech Synthesis (Interspeech 2020)
- Parallel Tacotron: Parallel Tacotron: Non-Autoregressive and Controllable TTS (2020)
- Wave-Tacotron: Wave-Tacotron: Spectrogram-free end-to-end text-to-speech synthesis (2020-11)
- Monotonic Attention: Online and Linear-Time Attention by Enforcing Monotonic Alignments (ICML 2017)
- Monotonic Chunkwise Attention: Monotonic Chunkwise Attention (ICLR 2018)
- Forward Attention in Sequence-to-sequence Acoustic Modelling for Speech Synthesis (ICASSP 2018)
- RNN-T for TTS: Initial investigation of an encoder-decoder end-to-end TTS framework using marginalization of monotonic hard latent alignments (2019)
- Location-Relative Attention Mechanisms For Robust Long-Form Speech Synthesis (ICASSP 2020)
- Non-Attentive Tacotron: Robust and Controllable Neural TTS Synthesis Including Unsupervised Duration Modeling (under review ICLR 2021)
- EfficientTTS: EfficientTTS: An Efficient and High-Quality Text-to-Speech Architecture (2020-12)
- Semi-Supervised Training for Improving Data Efficiency in End-to-End Speech Synthesis (2018)
- Almost Unsupervised Text to Speech and Automatic Speech Recognition (ICML 2019)
- Unsupervised Learning For Sequence-to-sequence Text-to-speech For Low-resource Languages (Interspeech 2020)
- Multilingual Speech Synthesis: One Model, Many Languages: Meta-learning for Multilingual Text-to-Speech (Interspeech 2020)
- Low-resource expressive text-to-speech using data augmentation (2020-11)
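Several of the alignment-study papers above (Glow-TTS in particular) replace soft attention with a dynamic-programming search for the single best monotonic alignment between mel frames and text tokens. A minimal pure-Python sketch of such a monotonic alignment search; the function name and the assumption of a precomputed per-frame log-likelihood matrix are illustrative, not taken verbatim from any one paper:

```python
import math

def monotonic_alignment_search(log_p):
    """Find the monotonic alignment of mel frames to text tokens that
    maximizes total log-likelihood, via dynamic programming.

    log_p[t][s]: log-likelihood of mel frame t under text token s.
    Each frame attends to exactly one token; the token index never
    decreases and advances by at most one per frame (as in Glow-TTS's
    monotonic alignment search). Returns align with align[t] = token index.
    """
    T, S = len(log_p), len(log_p[0])
    NEG = -math.inf
    # Q[t][s]: best cumulative score of any valid path ending at (t, s)
    Q = [[NEG] * S for _ in range(T)]
    Q[0][0] = log_p[0][0]
    for t in range(1, T):
        for s in range(S):
            stay = Q[t - 1][s]                      # keep the same token
            move = Q[t - 1][s - 1] if s > 0 else NEG  # advance one token
            best = max(stay, move)
            if best > NEG:
                Q[t][s] = best + log_p[t][s]
    # Backtrack from (T-1, S-1): the path must consume all tokens.
    align = [0] * T
    s = S - 1
    align[T - 1] = s
    for t in range(T - 1, 0, -1):
        if s > 0 and Q[t - 1][s - 1] >= Q[t - 1][s]:
            s -= 1
        align[t - 1] = s
    return align
```

At training time Glow-TTS alternates this hard-alignment search with likelihood maximization; at inference the alignment is produced by a duration predictor instead.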
## Vocoder

- WaveNet: WaveNet: A Generative Model for Raw Audio (2016)
- WaveRNN: Efficient Neural Audio Synthesis (ICML 2018)
- WaveGAN: Adversarial Audio Synthesis (ICLR 2019)
- LPCNet: LPCNet: Improving Neural Speech Synthesis Through Linear Prediction (ICASSP 2019)
- Towards achieving robust universal neural vocoding (Interspeech 2019)
- GAN-TTS: High Fidelity Speech Synthesis with Adversarial Networks (2019)
- MultiBand-WaveRNN: DurIAN: Duration Informed Attention Network For Multimodal Synthesis (2019)
- Parallel-WaveNet: Parallel WaveNet: Fast High-Fidelity Speech Synthesis (2017)
- WaveGlow: WaveGlow: A Flow-based Generative Network for Speech Synthesis (2018)
- Parallel-WaveGAN: Parallel WaveGAN: A fast waveform generation model based on generative adversarial networks with multi-resolution spectrogram (2019)
- MelGAN: MelGAN: Generative Adversarial Networks for Conditional Waveform Synthesis (NeurIPS 2019)
- MultiBand-MelGAN: Multi-band MelGAN: Faster Waveform Generation for High-Quality Text-to-Speech (2020)
- VocGAN: VocGAN: A High-Fidelity Real-time Vocoder with a Hierarchically-nested Adversarial Network (Interspeech 2020)
- WaveGrad: WaveGrad: Estimating Gradients for Waveform Generation (2020)
- DiffWave: DiffWave: A Versatile Diffusion Model for Audio Synthesis (2020)
- HiFi-GAN: HiFi-GAN: Generative Adversarial Networks for Efficient and High Fidelity Speech Synthesis (NeurIPS 2020)
- Parallel-WaveGAN (New): Parallel waveform synthesis based on generative adversarial networks with voicing-aware conditional discriminators (2020-10)
- Improved parallel WaveGAN vocoder with perceptually weighted spectrogram loss (SLT 2021)
- Universal Vocoder Based on Parallel WaveNet: Universal Neural Vocoding with Parallel WaveNet (ICASSP 2021)
- LightSpeech: LightSpeech: Lightweight and Fast Text to Speech with Neural Architecture Search (ICASSP 2021)
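The diffusion vocoders listed above (WaveGrad, DiffWave) are trained to invert a fixed forward process that gradually noises the waveform. The closed-form marginal of that process is easy to state; a small sketch, with `alpha_bar` standing for the cumulative product of the noise schedule at the chosen step and the noise `eps` passed in explicitly for determinism:

```python
import math

def diffuse(x0, alpha_bar, eps):
    """Sample x_t ~ q(x_t | x_0) for the forward diffusion process used by
    diffusion vocoders such as WaveGrad and DiffWave:

        x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * eps

    x0:        clean waveform samples (list of floats)
    alpha_bar: cumulative product of the noise schedule at step t, in [0, 1]
    eps:       standard-normal noise, same length as x0
    """
    a = math.sqrt(alpha_bar)
    b = math.sqrt(1.0 - alpha_bar)
    return [a * x + b * e for x, e in zip(x0, eps)]
```

The network is then trained to predict `eps` from `x_t` (and the mel conditioning), which is what makes sampling a sequence of learned denoising steps.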
## TTS towards Stylization

- ReferenceEncoder-Tacotron: Towards End-to-End Prosody Transfer for Expressive Speech Synthesis with Tacotron (ICML 2018)
- GST-Tacotron: Style Tokens: Unsupervised Style Modeling, Control and Transfer in End-to-End Speech Synthesis (ICML 2018)
- Predicting Expressive Speaking Style From Text In End-To-End Speech Synthesis (2018)
- GMVAE-Tacotron2: Hierarchical Generative Modeling for Controllable Speech Synthesis (ICLR 2019)
- BERT-TTS: Towards Transfer Learning for End-to-End Speech Synthesis from Deep Pre-Trained Language Models (2019)
- (Multi-style Decouple): Multi-Reference Neural TTS Stylization with Adversarial Cycle Consistency (2019)
- (Multi-style Decouple): Multi-reference Tacotron by Intercross Training for Style Disentangling, Transfer and Control in Speech Synthesis (Interspeech 2019)
- Mellotron: Mellotron: Multispeaker expressive voice synthesis by conditioning on rhythm, pitch and global style tokens (2019)
- Flowtron (flow based): Flowtron: an Autoregressive Flow-based Generative Network for Text-to-Speech Synthesis (2020)
- (local style): Fully-hierarchical fine-grained prosody modeling for interpretable speech synthesis (ICASSP 2020)
- Controllable Neural Prosody Synthesis (Interspeech 2020)
- GraphSpeech: GraphSpeech: Syntax-Aware Graph Attention Network For Neural Speech Synthesis (2020-10)
- BERT-TTS: Improving Prosody Modelling with Cross-Utterance BERT Embeddings for End-to-end Speech Synthesis (2020-11)
- (Global Emotion Style Control): Controllable Emotion Transfer For End-to-End Speech Synthesis (2020-11)
- (Phone Level Style Control): Fine-grained Emotion Strength Transfer, Control and Prediction for Emotional Speech Synthesis (2020-11)
- (Phone Level Prosody Modelling): Mixture Density Network for Phone-Level Prosody Modelling in Speech Synthesis (ICASSP 2021)
- Meta-Learning for TTS: Sample Efficient Adaptive Text-to-Speech (ICLR 2019)
- SV-Tacotron: Transfer Learning from Speaker Verification to Multispeaker Text-To-Speech Synthesis (NeurIPS 2018)
- Deep Voice V3: Deep Voice 3: Scaling Text-to-Speech with Convolutional Sequence Learning (ICLR 2018)
- Zero-Shot Multi-Speaker Text-To-Speech with State-of-the-art Neural Speaker Embeddings (ICASSP 2020)
- MultiSpeech: MultiSpeech: Multi-Speaker Text to Speech with Transformer (2020)
- SC-WaveRNN: Speaker Conditional WaveRNN: Towards Universal Neural Vocoder for Unseen Speaker and Recording Conditions (Interspeech 2020)
- MultiSpeaker Dataset: AISHELL-3: A Multi-speaker Mandarin TTS Corpus and the Baselines (2020)
## Voice Conversion

- (introduce PPG into voice conversion): Phonetic posteriorgrams for many-to-one voice conversion without parallel data training (2016)
- A Vocoder-free WaveNet Voice Conversion with Non-Parallel Data (2019)
- TTS-Skins: TTS Skins: Speaker Conversion via ASR (2019)
- One-shot Voice Conversion by Separating Speaker and Content Representations with Instance Normalization (InterSpeech 2019)
- Cotatron (combine text information with voice conversion system): Cotatron: Transcription-Guided Speech Encoder for Any-to-Many Voice Conversion without Parallel Data (Interspeech 2020)
- (TTS & ASR): Voice Conversion by Cascading Automatic Speech Recognition and Text-to-Speech Synthesis with Prosody Transfer (Interspeech 2020)
- FragmentVC (wav to vec): FragmentVC: Any-to-Any Voice Conversion by End-to-End Extracting and Fusing Fine-Grained Voice Fragments With Attention (2020)
- Towards Natural and Controllable Cross-Lingual Voice Conversion Based on Neural TTS Model and Phonetic Posteriorgram (ICASSP 2021)
- VAE-VC (VAE based): Voice Conversion from Non-parallel Corpora Using Variational Auto-encoder (2016)
- (Speech representation learning by VQ-VAE): Unsupervised speech representation learning using WaveNet autoencoders (2019)
- Blow (Flow based): Blow: a single-scale hyperconditioned flow for non-parallel raw-audio voice conversion (NeurIPS 2019)
- AutoVC: AUTOVC: Zero-Shot Voice Style Transfer with Only Autoencoder Loss (2019)
- F0-AutoVC: F0-consistent many-to-many non-parallel voice conversion via conditional autoencoder (ICASSP 2020)
- One-Shot Voice Conversion by Vector Quantization (ICASSP 2020)
- SpeechFlow (auto-encoder): Unsupervised Speech Decomposition via Triple Information Bottleneck (ICML 2020)
- CycleGAN-VC V1: Parallel-Data-Free Voice Conversion Using Cycle-Consistent Adversarial Networks (2017)
- StarGAN-VC: StarGAN-VC: Non-parallel Many-to-Many Voice Conversion with Star Generative Adversarial Networks (2018)
- CycleGAN-VC V2: CycleGAN-VC2: Improved CycleGAN-based Non-parallel Voice Conversion (2019)
- CycleGAN-VC V3: CycleGAN-VC3: Examining and Improving CycleGAN-VCs for Mel-spectrogram Conversion (2020)
## Singing

- XiaoIce Band: XiaoIce Band: A Melody and Arrangement Generation Framework for Pop Music (KDD 2018)
- Mellotron: Mellotron: Multispeaker expressive voice synthesis by conditioning on rhythm, pitch and global style tokens (2019)
- ByteSing: ByteSing: A Chinese Singing Voice Synthesis System Using Duration Allocated Encoder-Decoder Acoustic Models and WaveRNN Vocoders (2020)
- JukeBox: Jukebox: A Generative Model for Music (2020)
- XiaoIce Sing: XiaoiceSing: A High-Quality and Integrated Singing Voice Synthesis System (2020)
- HiFiSinger: HiFiSinger: Towards High-Fidelity Neural Singing Voice Synthesis (2020)
- Sequence-to-sequence Singing Voice Synthesis with Perceptual Entropy Loss (2020)
- Learn2Sing: Learn2Sing: Target Speaker Singing Voice Synthesis by learning from a Singing Teacher (2020-11)
- A Universal Music Translation Network (2018)
- Unsupervised Singing Voice Conversion (Interspeech 2019)
- PitchNet: PitchNet: Unsupervised Singing Voice Conversion with Pitch Adversarial Network (ICASSP 2020)
- DurIAN-SC: DurIAN-SC: Duration Informed Attention Network based Singing Voice Conversion System (Interspeech 2020)
- Speech-to-Singing Conversion based on Boundary Equilibrium GAN (Interspeech 2020)
- PPG-based singing voice conversion with adversarial representation learning (2020)
## Speech Processing Related

- Audio-Word2Vec: Audio Word2Vec: Unsupervised Learning of Audio Segment Representations using Sequence-to-sequence Autoencoder (2016)
- SpeechBERT: SpeechBERT: An Audio-and-text Jointly Learned Language Model for End-to-end Spoken Question Answering (2019)
- Improving Transformer-based Speech Recognition Using Unsupervised Pre-training (2019)
- TasNet: TasNet: time-domain audio separation network for real-time, single-channel speech separation (ICASSP 2018)
- Conv-TasNet: Conv-TasNet: Surpassing Ideal Time-Frequency Magnitude Masking for Speech Separation (2019)
- DeepSpeaker: Deep Speaker: an End-to-End Neural Speaker Embedding System (2017)
- GE2E Loss: Generalized End-to-End Loss for Speaker Verification (ICASSP 2018)
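The GE2E loss listed above scores every utterance embedding against each speaker centroid (excluding the utterance itself from its own centroid) and applies a softmax over speakers. A minimal pure-Python sketch under that reading; in the paper `w` and `b` are learnable scale/bias parameters (initialized to 10 and -5), fixed here as constants for illustration:

```python
import math

def _cos(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def ge2e_softmax_loss(emb, w=10.0, b=-5.0):
    """GE2E softmax loss, averaged over utterances.

    emb[j][i]: embedding vector of utterance i of speaker j
    (N speakers, M utterances per speaker, M >= 2).
    """
    N, M, dim = len(emb), len(emb[0]), len(emb[0][0])
    # Full per-speaker centroids.
    cent = [[sum(emb[j][i][d] for i in range(M)) / M for d in range(dim)]
            for j in range(N)]
    total = 0.0
    for j in range(N):
        for i in range(M):
            sims = []
            for k in range(N):
                if k == j:
                    # Exclude the utterance itself from its own centroid.
                    c = [(cent[j][d] * M - emb[j][i][d]) / (M - 1)
                         for d in range(dim)]
                else:
                    c = cent[k]
                sims.append(w * _cos(emb[j][i], c) + b)
            # Softmax cross-entropy against the true speaker j.
            total += -sims[j] + math.log(sum(math.exp(s) for s in sims))
    return total / (N * M)
```

Well-separated speaker clusters drive the loss toward zero, which is what makes the resulting embeddings usable for speaker verification and for speaker conditioning in multi-speaker TTS (e.g. SV-Tacotron above).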
## Natural Language Processing

- LSTM: Long Short-term Memory (1997)
- GRU: Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation (EMNLP 2014)
- TCN: An Empirical Evaluation of Generic Convolutional and Recurrent Networks for Sequence Modeling (2018)
- Transformer: Attention Is All You Need (NIPS 2017)
- Transformer-XL: Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context (ACL 2019)
- Reformer: Reformer: The Efficient Transformer (ICLR 2020)
- Awesome Repositories: transformers
- BERT: BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding (NAACL 2019)
- XLNET: XLNet: Generalized Autoregressive Pretraining for Language Understanding (NeurIPS 2019)
- ALBERT: ALBERT: A Lite BERT for Self-supervised Learning of Language Representations (ICLR 2020)
- A Study of Non-autoregressive Model for Sequence Generation (ACL 2020)
- Deterministic Non-Autoregressive Neural Sequence Modeling by Iterative Refinement (EMNLP 2018)
- Non-Autoregressive Neural Machine Translation (ICLR 2018)
- Non-Autoregressive Machine Translation with Auxiliary Regularization (AAAI 2019)
- Mask-Predict: Parallel Decoding of Conditional Masked Language Models (EMNLP 2019)
- Awesome Paper List: awesome-speech-translation
- Direct speech-to-speech translation with a sequence-to-sequence model (Interspeech 2019)
- NeurST: NeurST: Neural Speech Translation Toolkit (2020-12)
- Review 2019: Neural Machine Reading Comprehension: Methods and Trends (2019)
- Review 2020: A Survey on Machine Reading Comprehension: Tasks, Evaluation Metrics, and Benchmark Datasets (2020)
- NMRC first: Teaching Machines to Read and Comprehend (NIPS 2015)
- RACE dataset: RACE: Large-scale ReAding Comprehension Dataset From Examinations (EMNLP 2017)
- Cloze test: Large-scale Cloze Test Dataset Created by Teachers (EMNLP 2018)
- HuggingFace: HuggingFace's Transformers: State-of-the-art Natural Language Processing (2019)
## VAE & GAN

- VAE: Auto-Encoding Variational Bayes (ICLR 2014)
- GM-VAE: Deep Unsupervised Clustering with Gaussian Mixture Variational Autoencoders (ICLR 2017)
- VQ-VAE: Neural Discrete Representation Learning (NIPS 2017)
- VQ-VAE 2: Generating Diverse High-Fidelity Images with VQ-VAE-2 (NeurIPS 2019)
- GAN: Generative Adversarial Networks (NIPS 2014)
- Condition-GAN: Conditional Generative Adversarial Nets (2014)
- Info-GAN: InfoGAN: Interpretable Representation Learning by Information Maximizing Generative Adversarial Nets (2016)
- SeqGAN: SeqGAN: Sequence Generative Adversarial Nets with Policy Gradient (AAAI 2017)
- Cycle-GAN: Unpaired Image-to-Image Translation using Cycle-Consistent Adversarial Networks (ICCV 2017)
- Star-GAN: StarGAN: Unified Generative Adversarial Networks for Multi-Domain Image-to-Image Translation (CVPR 2018)
- BigGAN: Large Scale GAN Training for High Fidelity Natural Image Synthesis (ICLR 2019)
- Style-GAN: A Style-Based Generator Architecture for Generative Adversarial Networks (CVPR 2019)
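The VAE papers above all train by maximizing the ELBO, whose regularizer is a closed-form KL divergence between the diagonal-Gaussian posterior and a standard normal. A small sketch of that term (the function name is illustrative):

```python
import math

def gaussian_kl(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ) in closed form, the
    regularizer in the VAE's ELBO:

        KL = -0.5 * sum(1 + log_var - mu^2 - exp(log_var))

    mu, log_var: per-dimension posterior mean and log-variance.
    """
    return -0.5 * sum(1.0 + lv - m * m - math.exp(lv)
                      for m, lv in zip(mu, log_var))
```

The term vanishes exactly when the posterior equals the prior and grows as the posterior drifts away, which is the pressure that shapes the latent spaces exploited by GMVAE-Tacotron, VQ-VAE-style models, and the VAE-based voice-conversion work listed above.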
## Others

- (Forgetting learning): An Empirical Study of Example Forgetting during Deep Neural Network Learning (ICLR 2019)
- ScaNN (search accelerating): Accelerating Large-Scale Inference with Anisotropic Vector Quantization (ICML 2020)
- (memory management): Efficient Memory Management for Deep Neural Net Inference (2020)