Analysis and reconstruction of distorted speech using deep learning

Authors
Patel, Raj
Degree
Master of Applied Technologies (Computing)
Grantor
Unitec, Te Pūkenga - New Zealand Institute of Skills and Technology
Date
2024
Supervisors
Sharifzadeh, Hamid
Erfanian Sabaee, Maryam
Type
Masters Thesis
Keyword
laryngectomy
voice reconstruction
artificial phonation
speech synthesis
deep learning
Citation
Patel, R. (2024). Analysis and reconstruction of distorted speech using deep learning (Unpublished document submitted in partial fulfilment of the requirements for the degree of Master of Applied Technologies (Computing)). Unitec, Te Pūkenga - New Zealand Institute of Skills and Technology. https://hdl.handle.net/10652/6465
Abstract
Communication through whispering is essential for individuals who have undergone laryngectomy, a surgical procedure that removes part or all of the voice box (larynx). Whispering is unique in its absence of fundamental frequency (pitch) and can serve as an alternative mode of communication for laryngectomy patients. While whispering is simply a quiet form of communication in healthy people, in laryngectomised individuals it often results in hushed (and sometimes unintelligible) speech, necessitating prosthetics or specialised treatments. Current prosthetic solutions have inherent limitations: medical treatments carry a risk of post-surgical infection, and the speech generated by prosthetics sounds mechanical. To address these limitations, computational methods, including state-of-the-art deep learning algorithms, have been developed to generate natural-sounding speech. However, these have focused on reconstructing normal speech from whispered speech, not on laryngectomised or otherwise distorted speech. This thesis analyses these deep learning algorithms using objective evaluation metrics and applies the existing algorithms to a laryngectomised dataset for the first time in the literature. We discuss the results of these evaluations and perform a comparative analysis of the models. We begin our analysis with GAN-based models, then move to WESPER, a prediction-based model, and finally analyse voice conversion-based models developed to convert speech from one speaker style to another and to translate one language into another. Our initial analysis comprises 198 tests across 11 models and 6 objective evaluation metrics. The evaluation is performed on a testing dataset comprising three patient categories: Partial Laryngectomy (PL), Total Laryngectomy (TL), and Total Laryngectomy with Tracheoesophageal Puncture (TLTEP).
Based on the results of this evaluation, we propose modifications to the architecture of five GAN-based models, in particular adjusting the models and loss functions to improve outcomes for laryngectomy patients. The modified components are also cross-compatible with one another, yielding a total of 25 proposed models. To better capture the features of laryngectomised speech, we include the laryngectomised dataset, combined with the wTIMIT dataset, in the training process. Under the same set of objective evaluation metrics, the proposed models demonstrate better denoising of the reconstructed speech, improved spectral features, and higher intelligibility than the existing models.
Copyright holder
Author
Copyright notice
All rights reserved