A Study of the Effect of Interference Levels on the Transcription Performance of the Wav2Vec2-Large-XLSR-Indonesian Model
Abstract
Meetings generally require Minutes of Meeting (MoM) to record the main discussion points. Creating MoM manually is time-consuming and labor-intensive, but Automatic Speech Recognition (ASR) technology can assist by converting recorded conversations into text, i.e., speech-to-text (STT). However, the use of this technology often raises confidentiality concerns because it relies on third-party services. Another challenge arises from interfering voices that are inevitably mixed into the main conversation. This study therefore investigates the extent to which interference affects the performance of ASR and MoM generation performed locally, without third-party services. The ASR model used is Wav2Vec2 XLSR Indonesian, fine-tuned on the Few-shot Learning Evaluation of Universal Representations of Speech (FLEURS) dataset. Interfering sounds were generated under several scenarios (ideal, whisper, equal RMS, and overpower) and mixed into the input audio. Model performance was then evaluated using the Word Error Rate (WER) metric. Simulation results show that the higher the audio level of the interference, the lower the transcription model's performance. However, summarization results for MoM generation using a Large Language Model (LLM) show that the whisper scenario, with audio levels up to $-40$ dBFS, yields performance comparable to the ideal condition (no interference), as indicated by the BERTScore metric. This demonstrates that LLMs can compensate for less accurate STT transcription outputs.
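The abstract describes mixing interference at specific dBFS levels (e.g., a whisper at $-40$ dBFS or noise at equal RMS) and scoring transcripts with WER. As a rough illustration only, the sketch below shows one way such a pipeline could be set up in Python; the helper names are hypothetical, the waveforms are assumed to be mono floats in $[-1, 1]$, and the `jiwer` package is assumed for WER. This is not the paper's actual implementation.

```python
# Minimal sketch (assumed, not the authors' published code) of mixing an
# interfering signal into clean speech at a target dBFS level, then scoring
# an ASR transcript with WER. Assumes mono float waveforms in [-1, 1].
import numpy as np
import jiwer  # pip install jiwer; a standard WER implementation

def rms_dbfs(x: np.ndarray) -> float:
    """RMS level of a full-scale float signal, in dBFS."""
    rms = np.sqrt(np.mean(x ** 2))
    return 20.0 * np.log10(rms + 1e-12)  # epsilon guards against log(0)

def scale_to_dbfs(x: np.ndarray, target_dbfs: float) -> np.ndarray:
    """Apply a gain so the signal's RMS level equals target_dbfs."""
    gain = 10.0 ** ((target_dbfs - rms_dbfs(x)) / 20.0)
    return x * gain

def mix_interference(speech: np.ndarray, noise: np.ndarray,
                     noise_dbfs: float) -> np.ndarray:
    """Add interference at a fixed level, e.g. -40 dBFS for a 'whisper'
    scenario; for an 'equal RMS' scenario, pass noise_dbfs=rms_dbfs(speech)."""
    n = min(len(speech), len(noise))
    mixed = speech[:n] + scale_to_dbfs(noise[:n], noise_dbfs)
    return np.clip(mixed, -1.0, 1.0)  # keep the mix within full scale

# Example evaluation: reference text vs. a (hypothetical) ASR transcript.
reference = "rapat dibuka oleh ketua panitia"
hypothesis = "rapat dibuka oleh ketua pantia"
print(jiwer.wer(reference, hypothesis))  # 0.2: one substitution in five words
```

Under this convention, higher (less negative) `noise_dbfs` values correspond to the stronger interference scenarios the abstract associates with degraded transcription performance.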