ResearchHub | Open Science Community

Unveiling the Linguistic Capabilities of a Self-Supervised Speech Model Through Cross-Lingual Benchmark and Layer-Wise Similarity Analysis

Takanori Ashihara et al., Jan 1, 2024

Self-supervised learning (SSL), an unsupervised representation learning technique, has received widespread attention across various modalities. Speech, with its inherent complexity encompassing acoustic (e.g., speaker, phoneme, and paralinguistic cues) and linguistic (e.g., words, semantics, and syntax) aspects, prompts a fundamental question: how well can speech SSL models capture linguistic knowledge solely from speech data? This study comprehensively analyzes off-the-shelf SSL models utilizing three methods: probing tasks, layer contribution examinations, and layer-wise similarity analysis. For the probing task, to elucidate cross-lingual conditions, we introduce SpeechGLUE/SpeechJGLUE, the speech version of General Language Understanding Evaluation (GLUE) and its Japanese variant (JGLUE), both of which comprise diverse natural language understanding tasks. The probing system incorporates a weighted sum with trainable weights of all SSL layers' outputs into downstream models, offering insight into which layer predominantly contributes to addressing tasks. The results reveal that speech SSL models can encode linguistic information, albeit less sophisticated information than with text SSL models. Moreover, later layers are mainly utilized to tackle the benchmark tasks. To highlight their primary linguistic encoding role, we call them linguistic encoding layers (LELs). However, in cross-lingual scenarios, e.g., assessing English SSL models on SpeechJGLUE, the layer contributions equalize, suggesting challenges in determining suitable layers or relying on diverse cues. Nevertheless, some English SSL models can outperform Japanese models on SpeechJGLUE, implying their robustness against language variation. Similarity analysis reveals a block structure within LELs, particularly evident in English WavLM, where the structure becomes unclear with non-English/noise input, reaffirming the presence of LELs.
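The weighted-sum probing described in the abstract can be illustrated with a minimal sketch. It assumes the trainable weights are scalar logits, one per layer, normalized with a softmax before mixing; the abstract only says "a weighted sum with trainable weights", so the softmax, the function names, and the toy shapes below are illustrative assumptions, not the authors' implementation. Random arrays stand in for real SSL hidden states.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over a 1-D array.
    z = np.exp(x - np.max(x))
    return z / z.sum()

def weighted_layer_sum(layer_outputs, layer_logits):
    """Mix per-layer features into a single feature sequence.

    layer_outputs: array of shape (num_layers, num_frames, feature_dim)
    layer_logits:  array of shape (num_layers,), the trainable scalars
                   (hypothetical parameterization, see lead-in)
    """
    weights = softmax(layer_logits)  # non-negative, sums to 1
    # Contract the layer axis: result has shape (num_frames, feature_dim).
    mixed = np.tensordot(weights, layer_outputs, axes=1)
    return mixed, weights

# Toy stand-in for SSL hidden states: 4 layers, 5 frames, 3 dimensions.
rng = np.random.default_rng(0)
layer_outputs = rng.normal(size=(4, 5, 3))

# Untrained logits: every layer contributes equally.
layer_logits = np.zeros(4)
mixed, weights = weighted_layer_sum(layer_outputs, layer_logits)

# Logits that favor the last layer, as training might produce
# when later layers carry the useful information.
late_logits = np.array([0.0, 0.0, 1.0, 3.0])
_, late_weights = weighted_layer_sum(layer_outputs, late_logits)
```

In a real probing setup the logits would be learned jointly with the downstream model, and inspecting the normalized weights after training is what indicates which layers a task relies on.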

Philosophy

Artificial Intelligence

Probing Self-Supervised Learning Models With Target Speech Extraction

Junyi Peng et al., Apr 14, 2024

Artificial Intelligence

Computer Science

Estimating Pitch Information From Simulated Cochlear Implant Signals With Deep Neural Networks

Takanori Ashihara et al., Jan 1, 2024

Cochlear implant (CI) users, even with substantial speech comprehension, generally have poor sensitivity to pitch information (or fundamental frequency, F0). This insensitivity is often attributed to limited spectral and temporal resolution in the CI signals. However, pitch sensitivity varies markedly among individuals, and some users exhibit fairly good sensitivity. This indicates that the CI signal contains sufficient information about F0, and users’ sensitivity is predominantly limited by other physiological conditions such as neuroplasticity or neural health. We estimated the upper limit of F0 information that a CI signal can convey by decoding F0 from simulated CI signals (multi-channel pulsatile signals) with a deep neural network model (referred to as the CI model). We varied the number of electrode channels and the pulse rate, which should respectively affect the spectral and temporal resolutions of stimulus representations. The F0-estimation performance generally improved as the number of channels and the pulse rate increased. For sounds presented under quiet conditions, the model performance was at best comparable to that of a control waveform model, which received raw-waveform inputs. Under conditions in which background noise was imposed, the performance of the CI model generally degraded to a greater degree than that of the waveform model. The pulse rate had a particularly large effect on predicted performance. These observations indicate that the CI signal contains some information for predicting F0, which is sufficient particularly for targets under quiet conditions. The temporal resolution (represented as pulse rate) plays a critical role in pitch representation under noisy conditions.

Artificial Intelligence

Cognitive Neuroscience
