Abstract
Self-supervised learning has recently demonstrated significant success in various speech processing applications. Recent studies report that pre-training with contextualized continuous targets plays a crucial role in achieving strong downstream speech performance after fine-tuning. However, unlike continuous targets, producing contextualized targets in a discrete space is challenging due to training instability. To address this issue, we introduce a new hierarchical product quantizer that enables full utilization of multi-layer features by reducing the number of possible quantized targets and preventing mode collapse through a diversity loss applied to all codebooks. Our ablation study confirms the effectiveness of the proposed quantizer and contextualized discrete targets. For supervised ASR, the proposed model outperforms wav2vec2 and achieves results comparable to data2vec. In addition, for unsupervised ASR, the proposed method surpasses both baselines.
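The abstract itself does not specify the quantizer's implementation, so as background the sketch below illustrates the standard Gumbel-softmax product quantizer with a per-codebook diversity loss in the style of wav2vec2 (the baseline the paper compares against), the mechanism the proposed hierarchical quantizer builds on. The module name, tensor shapes, and hyperparameters are illustrative assumptions, not the paper's actual architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ProductQuantizer(nn.Module):
    """Illustrative Gumbel-softmax product quantizer (wav2vec2-style sketch).

    Each frame is scored against `groups` independent codebooks of `entries`
    vectors; one entry is picked per codebook and the picks are concatenated.
    """

    def __init__(self, dim, groups=2, entries=320, code_dim=128):
        super().__init__()
        assert code_dim % groups == 0
        self.groups, self.entries = groups, entries
        self.logits_proj = nn.Linear(dim, groups * entries)
        self.codebooks = nn.Parameter(
            torch.randn(groups, entries, code_dim // groups))

    def forward(self, x, tau=2.0):
        # x: (batch, time, dim) -> per-group codebook logits
        logits = self.logits_proj(x).view(*x.shape[:2], self.groups, self.entries)
        # Differentiable (straight-through) one-hot selection per codebook.
        onehot = F.gumbel_softmax(logits, tau=tau, hard=True, dim=-1)
        quantized = torch.einsum('btge,gec->btgc', onehot, self.codebooks)
        quantized = quantized.flatten(start_dim=2)  # (batch, time, code_dim)

        # Diversity loss: negative entropy of the batch-averaged softmax over
        # entries, computed for every codebook. Minimizing it pushes codebook
        # usage toward uniform, counteracting mode collapse.
        avg_probs = F.softmax(logits.float(), dim=-1).mean(dim=(0, 1))
        div_loss = (avg_probs * torch.log(avg_probs + 1e-7)).sum() / (
            self.groups * self.entries)
        return quantized, div_loss
```

Applying such a diversity term to every codebook, as the abstract describes, is what keeps all entries in use when quantized targets are produced during unstable early training.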