Automatic assessment of dysarthric severity level using audio-video cross-modal approach in deep learning

Tong, Han
Master of Computing
Unitec Institute of Technology
Sharifzadeh, Hamid
McLoughlin, Ian
Masters Thesis
motor speech disorders
dysarthric patients
audio data processing systems
video data processing systems
deep-learning algorithms
Tong, H. (2020). Automatic assessment of dysarthric severity level using audio-video cross-modal approach in deep learning. (Unpublished document submitted in partial fulfilment of the requirements for the degree of Master of Computing). Unitec Institute of Technology, Auckland, New Zealand. Retrieved from
Dysarthria is a motor speech disorder that can have a significant impact on a person's daily life. Early detection allows patients to begin therapy sooner. Researchers have established various approaches to detect the disorder automatically. Traditional computational approaches commonly analysed acoustic features such as Mel-Frequency Cepstral Coefficients (MFCC), spectral centroid, Linear Prediction Cepstral Coefficients (LPCC) and Perceptual Linear Prediction (PLP), extracted from patients' speech samples, to detect dysarthric speech characteristics such as slow speech rate, short pauses and misarticulated sounds. Recent research has shown that some machine learning algorithms can also be deployed to extract speech features and detect the severity level automatically. In machine learning, feature extraction is a crucial step in classification and prediction problems. Well-established frameworks have been developed to extract and classify features for different data formats. For example, in an image data processing system, a Convolutional Neural Network (CNN) can provide the underlying network structure for analysing video data to obtain visual features. In contrast, in an audio data processing system, Natural Language Processing (NLP) algorithms can be applied to obtain acoustic features. The choice of framework therefore depends mainly on the modality of the input. As early steps in the development of machine learning approaches for automatic assessment of dysarthric patients, classification systems based on audio features have been considered in the literature; however, recent research in other fields has shown that an audio-video cross-modal framework can improve the performance of classification systems.
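As an illustration of one of the acoustic features named above, the spectral centroid of a frame is the magnitude-weighted mean frequency of its spectrum. The following is a minimal sketch using only the Python standard library; the function names, the naive DFT, the frame length and the sample rate are assumptions made for illustration and are not taken from the thesis.

```python
import cmath
import math


def magnitude_spectrum(frame):
    """Naive DFT magnitudes for bins 0..N/2 of a real-valued frame."""
    n = len(frame)
    mags = []
    for k in range(n // 2 + 1):
        acc = sum(x * cmath.exp(-2j * math.pi * k * t / n)
                  for t, x in enumerate(frame))
        mags.append(abs(acc))
    return mags


def spectral_centroid(frame, sample_rate):
    """Magnitude-weighted mean frequency (Hz) of one frame."""
    mags = magnitude_spectrum(frame)
    total = sum(mags)
    if total == 0.0:
        return 0.0  # silent frame: centroid is undefined, report 0 Hz
    n = len(frame)
    return sum((k * sample_rate / n) * m for k, m in enumerate(mags)) / total


# Illustrative input (assumed): a 1 kHz tone sampled at 8 kHz.
# 64 samples hold exactly 8 cycles, so the energy sits in a single bin.
SAMPLE_RATE = 8000
tone = [math.sin(2 * math.pi * 1000 * t / SAMPLE_RATE) for t in range(64)]
centroid_hz = spectral_centroid(tone, SAMPLE_RATE)
```

For this pure tone the centroid coincides with the tone frequency. A practical system would use an FFT over windowed, overlapping frames rather than the naive DFT shown here.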
In this thesis, for the first time, an audio-video cross-modal deep-learning framework is proposed in which the network takes both audio and video data as input to detect severity levels of dysarthria. Within the deep-learning framework, we also propose two network architectures that use audio-only or video-only input to detect dysarthria severity levels automatically. Compared with current single-modality systems, the deep-learning framework yields satisfactory results. More importantly, compared with systems based only on audio data for automatic dysarthria severity level assessment, the proposed audio-video cross-modal deep-learning system can speed up training, improve accuracy and reduce the amount of training data required.
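The abstract does not state how the audio and video streams are combined. One common option, assumed here purely for illustration, is feature-level fusion: concatenate an audio embedding and a video embedding, then pass the result through a linear layer and a softmax over severity levels. In the sketch below the embedding sizes, the number of severity levels, the random weights and all function names are hypothetical placeholders, not the thesis architecture.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def linear(weights, bias, features):
    """One dense layer: one weight row and one bias term per output class."""
    return [sum(w * x for w, x in zip(row, features)) + b
            for row, b in zip(weights, bias)]


def fuse_and_classify(audio_embedding, video_embedding, weights, bias):
    """Concatenate both modalities, then score each severity level."""
    fused = list(audio_embedding) + list(video_embedding)
    return softmax(linear(weights, bias, fused))


# Hypothetical sizes and untrained random parameters, for illustration only.
rng = random.Random(0)
NUM_LEVELS = 4   # assumed number of severity levels
AUDIO_DIM = 8    # assumed audio embedding size
VIDEO_DIM = 6    # assumed video embedding size
audio_embedding = [rng.uniform(-1, 1) for _ in range(AUDIO_DIM)]
video_embedding = [rng.uniform(-1, 1) for _ in range(VIDEO_DIM)]
weights = [[rng.uniform(-0.5, 0.5) for _ in range(AUDIO_DIM + VIDEO_DIM)]
           for _ in range(NUM_LEVELS)]
bias = [0.0] * NUM_LEVELS

probabilities = fuse_and_classify(audio_embedding, video_embedding, weights, bias)
predicted_level = probabilities.index(max(probabilities))
```

In a trained system the embeddings would come from audio and video sub-networks and the weights would be learned; decision-level fusion, where each modality is classified separately and the scores are merged, is an alternative design.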
Copyright notice
All rights reserved