Vision-Integrated High-Quality Neural Speech Coding

Yao Guo, Yang Ai, Rui-Chen Zheng, Hui-Peng Du, Xiao-Hang Jiang, Zhen-Hua Ling
National Engineering Research Center of Speech and Language Information Processing, University of Science and Technology of China, Hefei, P. R. China

Abstract

This paper proposes a novel vision-integrated neural speech codec (VNSC), which aims to enhance speech coding quality by leveraging visual modality information. In VNSC, an image analysis-synthesis module extracts visual features from lip images, and a feature fusion module facilitates interaction between the image analysis-synthesis module and the speech coding module, transmitting visual information to assist the speech coding process. Depending on whether visual information is available during the inference stage, the feature fusion module integrates visual features into the speech encoding module using either an explicit integration or an implicit distillation strategy. Experimental results confirm that integrating visual information effectively improves decoded speech quality and enhances the noise robustness of the neural speech codec, without increasing the bitrate.
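To make the two fusion strategies concrete, the sketch below gives a minimal PyTorch-style illustration of what they could look like. It is a sketch under stated assumptions, not the authors' implementation: every name in it (ExplicitFusion, implicit_distillation_loss, the feature shapes) is hypothetical.

```python
# Illustrative sketch only: hypothetical names and shapes, not the VNSC code.
import torch
import torch.nn as nn
import torch.nn.functional as F


class ExplicitFusion(nn.Module):
    """Explicit integration: when lip video is available at inference,
    visual features are merged into the speech encoder's features
    (here by concatenation + projection, one plausible choice)."""

    def __init__(self, speech_dim: int, visual_dim: int):
        super().__init__()
        self.proj = nn.Linear(speech_dim + visual_dim, speech_dim)

    def forward(self, speech_feat: torch.Tensor, visual_feat: torch.Tensor) -> torch.Tensor:
        # speech_feat: (B, T, speech_dim); visual_feat: (B, T, visual_dim),
        # assumed already aligned to the speech features' frame rate.
        return self.proj(torch.cat([speech_feat, visual_feat], dim=-1))


def implicit_distillation_loss(student_feat: torch.Tensor, teacher_feat: torch.Tensor) -> torch.Tensor:
    """Implicit distillation: when no video is available at inference,
    an audio-only encoder (student) is trained to mimic visually
    informed features (teacher), so visual knowledge is absorbed
    during training without changing what is transmitted."""
    return F.mse_loss(student_feat, teacher_feat.detach())
```

In both cases the visual pathway only shapes the encoder's features, which is consistent with the abstract's claim that quality improves without increasing the bitrate.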


Model Architecture

[Figure: overall architecture of VNSC, comprising the image analysis-synthesis module, the feature fusion module, and the speech coding module.]


Comparison with Advanced Codecs

Each sample below compares the original speech with decoded speech from VNSC (VA, visual information available at inference), VNSC (VUA, visual information unavailable at inference), MDCTCodec, EnCodec, HiFi-Codec, and SoundStream.

Sample 1: [audio clips for all seven systems]
Sample 2: [audio clips for all seven systems]
Sample 3: [audio clips for all seven systems]
Sample 4: [audio clips for all seven systems]
Sample 5: [audio clips for all seven systems]
Sample 6: [audio clips for all seven systems]