Abstract: Tibetan speech synthesis in informatization applications suffers from rigid prosody, scarce corpora, and difficulty adapting vocoders to high sampling rates. To address these problems, this paper proposes a technology-evolution analysis framework that integrates multi-level prosody prediction with high-fidelity waveform reconstruction. It systematically analyzes the paradigm shift from cascaded models to end-to-end (VITS) models and compares the performance of the Griffin-Lim and HiFi-GAN vocoders on 24 kHz Tibetan data. On this basis, it verifies that emphasis-level prosody modeling and neural vocoders are effective and necessary for eliminating mechanical noise in synthesized speech, improving naturalness, and supporting real-time interaction on intelligent terminals.
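Basic information:
CLC number: TN912.33
Citation information:
[1] 德却措, 更太加. A review of key technologies for intelligent Tibetan speech synthesis oriented to informatization applications [J]. 信息化研究 (Informatization Research), 2026, 52(01): 14-19.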
2026-02-20