Abstract
Self-supervised learning (SSL) pre-training generally achieves strong performance on a variety of speech processing tasks. However, this pre-training scheme may yield a sub-optimal starting point for fine-tuning on a specific task, such as automatic speech recognition (ASR). To provide a pre-trained model better suited to ASR, we introduce ASBERT, an ASR-specific hidden-unit BERT with self-training. Motivated by self-training, we extract linguistically related pseudo-labels from the fine-tuned model and use them in the subsequent pre-training procedure. Experimental results on the LibriSpeech test-clean and test-other sets show that ASBERT without a language model (LM) outperforms the conventional SSL and self-training models, achieving relative word error rate reductions (RERR) of 6.3%/2.0% and 15.4%/13.2%, respectively. Moreover, without using pseudo-transcriptions, ASBERT yields performance comparable to the conventional self-training method.