
dc.contributor.author: Tong, Han
dc.description.abstract: Dysarthria is a motor speech disorder that can significantly affect a person's daily life. Early detection allows patients to begin therapy sooner. Researchers have established various approaches to detect the disorder automatically. Traditional computational approaches commonly analysed acoustic features such as Mel-Frequency Cepstral Coefficients (MFCC), Spectral Centroid, Linear Prediction Cepstral Coefficients (LPCC) and Perceptual Linear Prediction (PLP) from patients' speech samples to detect dysarthric speech characteristics such as slow speech rate, short pauses and mis-articulated sounds. Recent research has shown that some machine learning algorithms can also be deployed to extract speech features and detect the severity level automatically. In machine learning, feature extraction is a crucial step in classification and prediction problems. Well-established frameworks have been developed to extract and classify features for each data format. For example, in an image data processing system, a Convolutional Neural Network (CNN) can provide the underlying network structure for analysing image or video data to obtain visual features. In contrast, in an audio data processing system, Natural Language Processing (NLP) algorithms can be applied to obtain acoustic features. The choice of framework therefore depends mainly on the modality of the input. As early steps in the development of machine learning approaches for the automatic assessment of dysarthric patients, classification systems based on audio features have been considered in the literature; however, recent research in other fields has shown that an audio-video cross-modal framework can improve the performance of classification systems.
This thesis proposes, for the first time, an audio-video cross-modal deep-learning framework in which the network takes both audio and video data as input to detect dysarthria severity levels. Within the same deep-learning framework, we also propose two network architectures that use audio-only or video-only input to detect dysarthria severity levels automatically. Compared with current single-modality systems, the deep-learning framework yields satisfactory results. More importantly, compared with systems that assess dysarthria severity level from audio data alone, the audio-video cross-modal deep-learning system proposed in this research can speed up training, improve accuracy and reduce the amount of training data required. (en_NZ)
dc.rights: All rights reserved (en_NZ)
dc.subject: motor speech disorders (en_NZ)
dc.subject: dysarthric patients (en_NZ)
dc.subject: audio data processing systems (en_NZ)
dc.subject: video data processing systems (en_NZ)
dc.subject: deep-learning algorithms (en_NZ)
dc.title: Automatic assessment of dysarthric severity level using audio-video cross-modal approach in deep learning (en_NZ)
dc.type: Masters Thesis (en_NZ)
dc.rights.holder: Author (en_NZ)
Master of Computing (en_NZ)
Unitec Institute of Technology (en_NZ)
dc.subject.marsden: 080108 Neural, Evolutionary and Fuzzy Computation (en_NZ)
dc.subject.marsden: 1199 Other Medical and Health Sciences (en_NZ)
dc.identifier.bibliographicCitation: Tong, H. (2020). Automatic assessment of dysarthric severity level using audio-video cross-modal approach in deep learning. (Unpublished document submitted in partial fulfilment of the requirements for the degree of Master of Computing). Unitec Institute of Technology, Auckland, New Zealand. Retrieved from
dc.contributor.affiliation: Unitec Institute of Technology (en_NZ)
unitec.publication.place: Auckland, New Zealand (en_NZ)
unitec.advisor.principal: Sharifzadeh, Hamid
unitec.advisor.associated: McLoughlin, Ian


 Unitec Institute of Technology, Private Bag 92025, Victoria Street West, Auckland 1142