
Open Access Article

Advances in International Computer Science, 2025; 5(1): 18-22; DOI: 10.12208/j.aics.20250004.

A Feature-Enhanced Diffusion Model for Text-Guided Sound Effect Generation

Author: 苗向阳 (Miao Xiangyang)*

Hunan University of Technology and Business, Changsha, Hunan, China

*Corresponding author: 苗向阳, Hunan University of Technology and Business, Changsha, Hunan, China

Published: 2025-04-18


Abstract

Sound effects play a crucial role in games, films, and virtual reality, describing events through sound and enhancing the listener's sense of immersion. With the development of deep learning and the emergence of large language models, sound effect generation has seen revolutionary advances, particularly text-guided sound effect generation, which can automatically produce sounds that match a scene from a textual description alone. However, existing generation models and methods still suffer from insufficient audio realism and low relevance between text and audio. To address these problems, this paper proposes a novel Feature-Enhanced Diffusion Model (FEDM): (1) it uses the Haar wavelet transform for downsampling, effectively retaining high-frequency feature information; (2) it designs a multi-scale feature extraction module that captures multi-level features through convolutional kernels of different sizes. Experimental results show that the proposed method improves the FAD and KL metrics by 33.3% and 18.1%, respectively, over the baseline model on the AudioCaps dataset.

Key words: Sound effect generation; Text guidance; Diffusion model; Wavelet transform; Multi-scale extraction
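The Haar-wavelet downsampling mentioned in the abstract can be illustrated with a short PyTorch sketch. This is a minimal, hypothetical re-implementation for illustration only, not the paper's code: a single-level 2D Haar transform splits a spectrogram-like feature map into a low-frequency sub-band (LL) and three high-frequency sub-bands (LH, HL, HH), halving the spatial resolution while keeping high-frequency detail as extra channels instead of discarding it. The module name `HaarDownsample` and the channel layout are assumptions.

```python
import torch
import torch.nn as nn

class HaarDownsample(nn.Module):
    """Single-level 2D Haar wavelet transform used as a downsampling step.

    Illustrative sketch only (not the authors' implementation): splits an
    input (B, C, H, W) into four sub-bands of size (B, C, H/2, W/2) and
    stacks them along the channel axis, so no high-frequency information
    is lost during downsampling.
    """

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Assume H and W are even; pad beforehand if they are not.
        a = x[:, :, 0::2, 0::2]  # top-left sample of each 2x2 block
        b = x[:, :, 0::2, 1::2]  # top-right sample
        c = x[:, :, 1::2, 0::2]  # bottom-left sample
        d = x[:, :, 1::2, 1::2]  # bottom-right sample

        ll = (a + b + c + d) / 2.0  # low-frequency approximation
        lh = (a - b + c - d) / 2.0  # horizontal detail
        hl = (a + b - c - d) / 2.0  # vertical detail
        hh = (a - b - c + d) / 2.0  # diagonal detail

        # (B, 4*C, H/2, W/2): downsampled, high-frequency bands preserved.
        return torch.cat([ll, lh, hl, hh], dim=1)


if __name__ == "__main__":
    spec = torch.randn(1, 8, 64, 128)   # e.g. a mel-spectrogram feature map
    out = HaarDownsample()(spec)
    print(out.shape)                    # torch.Size([1, 32, 32, 64])
```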
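Similarly, the multi-scale feature extraction module can be sketched as a small Inception-style block that applies convolution kernels of several sizes in parallel and fuses the responses. The kernel sizes (3, 5, 7), the GroupNorm/SiLU activations, and the 1x1 fusion convolution are assumptions chosen for illustration, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class MultiScaleBlock(nn.Module):
    """Parallel convolutions with different kernel sizes capture features at
    several receptive-field scales; a 1x1 convolution fuses them back to the
    requested channel count. Hyperparameters are illustrative."""

    def __init__(self, in_ch: int, out_ch: int, kernel_sizes=(3, 5, 7)):
        super().__init__()
        self.branches = nn.ModuleList(
            nn.Sequential(
                nn.Conv2d(in_ch, out_ch, k, padding=k // 2),
                nn.GroupNorm(8, out_ch),
                nn.SiLU(),
            )
            for k in kernel_sizes
        )
        self.fuse = nn.Conv2d(out_ch * len(kernel_sizes), out_ch, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Concatenate multi-scale responses along channels, then project.
        return self.fuse(torch.cat([branch(x) for branch in self.branches], dim=1))


if __name__ == "__main__":
    feats = torch.randn(1, 32, 32, 64)  # e.g. output of the Haar downsampling step
    block = MultiScaleBlock(in_ch=32, out_ch=64)
    print(block(feats).shape)           # torch.Size([1, 64, 32, 64])
```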


Cite This Article

苗向阳 (Miao Xiangyang). A Feature-Enhanced Diffusion Model for Text-Guided Sound Effect Generation[J]. Advances in International Computer Science, 2025; 5(1): 18-22.