References
- A. de Cheveigné and H. Kawahara, "YIN, a fundamental frequency estimator for speech and music", Journal of the Acoustical Society of America, 2002.
- A. Camacho and J. G. Harris, "A sawtooth waveform inspired pitch estimator for speech and music", Journal of the Acoustical Society of America, 2008.
- J. W. Kim et al., "CREPE: A convolutional representation for pitch estimation", ICASSP, 2018.
- B. Gfeller et al., "SPICE: Self-supervised pitch estimation", IEEE Transactions on Audio, Speech and Language Processing, 2020.
- D. Griffin and J. Lim, "Signal estimation from modified short-time Fourier transform", IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 32, no. 2, pp. 236–243, 1984. doi:10.1109/TASSP.1984.1164317
- H. Kawahara, "Speech representation and transformation using adaptive interpolation of weighted spectrum: vocoder revisited", ICASSP, 1997, vol. 2, pp. 1303–1306.
- M. Morise, F. Yokomori, and K. Ozawa, "WORLD: a vocoder-based high-quality speech synthesis system for real-time applications", IEICE Transactions on Information and Systems, vol. 99, no. 7, pp. 1877–1884, 2016.
- J. Engel, L. H. Hantrakul, C. Gu, and A. Roberts, "DDSP: Differentiable digital signal processing", ICLR, 2020.
- G. Yang, S. Yang, K. Liu, et al., "Multi-band MelGAN: Faster waveform generation for high-quality text-to-speech", IEEE Spoken Language Technology Workshop (SLT), 2021, pp. 492–498.
- J. Latorre, C. Bailleul, T. Morrill, et al., "Combining speakers of multiple languages to improve quality of neural voices", arXiv preprint arXiv:2108.07737, 2021.