Abstract
Recent self-supervised automatic speech recognition (ASR) models based on transformers achieve state-of-the-art performance, but their footprint is too large to be trained in low-resource environments or deployed on edge devices. Knowledge distillation (KD) can be employed to reduce the model size. However, setting the embedding dimensions of the teacher and student networks to different values makes it difficult to transfer token embeddings effectively. To mitigate this issue, we present a novel KD method in which the student mimics the prediction vector of the teacher under our proposed masked token similarity transfer (MTST) loss, where the temporal relation between a token and the other unmasked tokens is encoded into a dimension-agnostic token similarity vector. Under our transfer learning setting with a fine-tuned teacher, our proposed method reduces the student model size to 28.3% of the teacher's while achieving a word error rate of 4.93% on the test-clean subset of the LibriSpeech corpus, which surpasses prior works. Our source code will be made available.
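To make the idea of a dimension-agnostic token similarity vector concrete, the following is a minimal PyTorch sketch, not the paper's exact formulation: the function name `mtst_loss`, the use of cosine similarity, and the MSE matching criterion are illustrative assumptions. The key point it shows is that each masked token is described by its similarities to the unmasked tokens, so the resulting vectors can be compared between teacher and student even when their embedding dimensions differ.

```python
import torch
import torch.nn.functional as F


def mtst_loss(teacher_repr: torch.Tensor,
              student_repr: torch.Tensor,
              mask: torch.Tensor) -> torch.Tensor:
    """Illustrative sketch of a masked token similarity transfer loss.

    teacher_repr: (T, D_t) teacher token representations
    student_repr: (T, D_s) student token representations (D_s may differ from D_t)
    mask:         (T,) boolean tensor, True at masked positions
    """
    def token_similarity(repr_: torch.Tensor) -> torch.Tensor:
        # Cosine similarity of each masked token to every unmasked token.
        # The result has shape (num_masked, num_unmasked) and does not
        # depend on the embedding dimension of the network.
        masked = F.normalize(repr_[mask], dim=-1)
        unmasked = F.normalize(repr_[~mask], dim=-1)
        return masked @ unmasked.T

    sim_teacher = token_similarity(teacher_repr)
    sim_student = token_similarity(student_repr)

    # The student mimics the teacher's dimension-agnostic similarity vectors.
    return F.mse_loss(sim_student, sim_teacher)


# Usage sketch: teacher and student with different embedding dimensions.
T, D_t, D_s = 50, 768, 384
mask = torch.zeros(T, dtype=torch.bool)
mask[::5] = True  # hypothetical masking pattern
loss = mtst_loss(torch.randn(T, D_t), torch.randn(T, D_s), mask)
```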