

Last updated: 2022-04-25

