Joint Unsupervised and Supervised Learning for Context-aware Language Identification

Publication

2023.04.19

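We introduce the paper “Joint unsupervised and supervised learning for context-aware language identification” by Jinseok Park, Hyung Yong Kim, Jihwan Park, Byeong-Yeol Kim, Shukjae Choi, and Yunkyu Lim, which will be presented at the 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2023). ICASSP is a top-tier international conference in acoustics, speech, and signal processing, where researchers in speech signal processing share their latest technologies and research results.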

Conference

• The 48th IEEE International Conference on Acoustics, Speech, & Signal Processing (ICASSP) will be held in Rhodes Island, Greece, from June 4 to June 10, 2023, at the Rodos Palace Luxury Convention Center (https://2023.ieeeicassp.org/).

• The paper “Joint unsupervised and supervised learning for context-aware language identification”, written by Jinseok Park, Hyung Yong Kim, Jihwan Park, Byeong-Yeol Kim, Shukjae Choi, and Yunkyu Lim, has been accepted to ICASSP 2023.

• Click the link below for details.

➠ https://arxiv.org/abs/2303.16511

Publication

• Title: Joint unsupervised and supervised learning for context-aware language identification

• Authors: Jinseok Park, Hyung Yong Kim, Jihwan Park, Byeong-Yeol Kim, Shukjae Choi, Yunkyu Lim

• Abstract: Language identification (LID) recognizes the language of a spoken utterance automatically. According to recent studies, LID models trained with an automatic speech recognition (ASR) task perform better than those trained with a LID task only. However, training the model to recognize speech requires additional text labels, and acquiring those labels is costly. To overcome this problem, we propose context-aware language identification that combines unsupervised and supervised learning without any text labels. The proposed method learns the context of speech through a masked language modeling (MLM) loss and is simultaneously trained to determine the language of the utterance with a supervised learning loss. The proposed joint learning was found to reduce the error rate by 15.6% compared to a model with the same structure trained by supervised-only learning, on a subset of the VoxLingua107 dataset consisting of sub-three-second utterances in 11 languages.
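
The joint objective described in the abstract can be illustrated with a minimal sketch. This page does not give the exact formulation, so the following are assumptions made only for illustration: the MLM term is a cross-entropy over discrete target ids at masked frames, the two losses are combined as a weighted sum, and the names `joint_loss` and `mlm_weight` are hypothetical. It is not the authors' implementation.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits, target):
    """Negative log-probability assigned to the target class."""
    return -math.log(softmax(logits)[target])


def joint_loss(frame_logits, frame_targets, masked, lang_logits, lang_target,
               mlm_weight=1.0):
    """Return (total, lid, mlm) for one utterance.

    Assumption: the MLM term averages cross-entropy over masked frames only,
    and the total is lid + mlm_weight * mlm. Both choices are illustrative.
    """
    masked_idx = [i for i, is_masked in enumerate(masked) if is_masked]
    if masked_idx:
        mlm = sum(cross_entropy(frame_logits[i], frame_targets[i])
                  for i in masked_idx) / len(masked_idx)
    else:
        mlm = 0.0  # nothing was masked, so there is no unsupervised term
    # Supervised term: utterance-level language classification loss.
    lid = cross_entropy(lang_logits, lang_target)
    return lid + mlm_weight * mlm, lid, mlm


# Toy utterance: 3 frames, 4 hypothetical discrete target ids, 11 languages.
frame_logits = [[0.0] * 4, [0.0] * 4, [0.0] * 4]
frame_targets = [1, 3, 0]
masked = [True, False, True]
lang_logits = [0.0] * 11
total, lid, mlm = joint_loss(frame_logits, frame_targets, masked,
                             lang_logits, lang_target=2, mlm_weight=0.5)
```

Because both terms are computed from the same utterance, a single backward pass in a real training loop would update the shared encoder with both the context-learning signal and the language label, which is the sense in which the learning is "joint". No text transcripts appear anywhere in the inputs.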