Learning Contextualized Representation on Discrete Space via Hierarchical Product Quantization
Authors: Hyung Yong Kim, Byeong-Yeol Kim, Yunkyu Lim, Jihwan Park, Jinseok Park, Youshin Lim, Seung Woo Yu, Hanbin Lee
Conference: ICASSP
Year Published: 2024
Topics: Speech Recognition

Abstract


Self-supervised learning has recently demonstrated significant success in various speech processing applications. Recent studies report that pre-training with contextualized continuous targets plays a crucial role in fine-tuning for downstream speech tasks. However, unlike continuous targets, producing contextualized targets in a discrete space is challenging due to unstable training. To address this issue, we introduce a new hierarchical product quantizer that enables the full utilization of multi-layer features by reducing the number of possible quantized targets and preventing mode collapse through a diversity loss applied to all codebooks. Our ablation study confirms the effectiveness of the proposed quantizer and of the contextualized discrete targets. For supervised ASR, the proposed model outperforms wav2vec 2.0 and achieves results comparable to data2vec. In addition, for unsupervised ASR, the proposed method surpasses two baselines.
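
Below is a minimal PyTorch sketch of the general idea the abstract describes: a Gumbel-softmax product quantizer in the style of wav2vec 2.0, stacked hierarchically so that each stage quantizes the previous stage's codes, with a diversity loss summed over all codebooks to discourage mode collapse. The class name, shapes, group/entry counts, and the exact loss form are illustrative assumptions; the paper's actual architecture is not specified in the abstract.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ProductQuantizer(nn.Module):
    """One product-quantization stage: G groups, each with V code vectors."""
    def __init__(self, dim, groups=2, entries=320, code_dim=384):
        super().__init__()
        self.groups, self.entries = groups, entries
        self.logits = nn.Linear(dim, groups * entries)   # per-group logits
        self.codebook = nn.Parameter(                    # (G, V, code_dim)
            torch.randn(groups, entries, code_dim))

    def forward(self, x, tau=2.0):
        # x: (batch, time, dim) -> categorical logits per group
        b, t, _ = x.shape
        logits = self.logits(x).view(b, t, self.groups, self.entries)
        # Differentiable hard selection via straight-through Gumbel-softmax.
        probs = F.gumbel_softmax(logits, tau=tau, hard=True, dim=-1)
        # Look up one code vector per group and concatenate the groups.
        q = torch.einsum("btgv,gvd->btgd", probs, self.codebook)
        q = q.reshape(b, t, -1)                          # (b, t, G * code_dim)
        # Diversity loss (wav2vec 2.0-style assumption): push the batch-averaged
        # soft assignments toward uniform usage of all V entries in every group.
        avg = F.softmax(logits, dim=-1).mean(dim=(0, 1))  # (G, V)
        entropy = -(avg * torch.log(avg + 1e-7)).sum(-1)  # per-group entropy
        div_loss = (self.entries - entropy.exp()).sum() / (self.groups * self.entries)
        return q, div_loss

# Hierarchical use: each stage re-quantizes the previous stage's codes,
# shrinking the space of possible targets level by level, and the diversity
# losses of all codebooks are accumulated, as the abstract suggests.
stages = nn.ModuleList([
    ProductQuantizer(dim=768),   # stage 1 consumes encoder features
    ProductQuantizer(dim=768),   # stage 2 consumes stage-1 codes (2 * 384 = 768)
])
feats = torch.randn(4, 50, 768)  # toy (batch, time, dim) multi-layer features
total_div = 0.0
for pq in stages:
    feats, div = pq(feats)
    total_div = total_div + div
```

In this sketch, hard Gumbel-softmax selection keeps the targets discrete while remaining differentiable, and summing the diversity term across every stage's codebooks is one plausible reading of the abstract's "diversity loss for all codebooks".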