Deep correlation learning for urban air quality: Analysis and prediction in New Zealand

Andrews, Regan Lee
Doctor of Computing
Unitec Institute of Technology
Ramirez-Prado, Guillermo
Sharifzadeh, Hamid
Doctoral Thesis
Ngā Upoko Tukutuku (Māori subject headings)
Auckland (N.Z.)
New Zealand
air pollution analysis
air quality
in-depth canonical correlation analysis
deep canonical correlation analysis
Andrews, R. L. (2021). Deep correlation learning for urban air quality: Analysis and prediction in New Zealand. (Unpublished document submitted in partial fulfilment of the requirements for the degree of Doctor of Computing). Unitec Institute of Technology, New Zealand. Retrieved from
RESEARCH QUESTIONS:
• RQ1: Can we relate all the variables involved in air-quality monitoring to establish which ones depend on the others?
• RQ2: Can we further model this inter-relationship and automate the modelling using the available data?
• RQ3: Can we use this inter-relationship to further increase the accuracy of air quality prediction for PM2.5?
• RQ4: In the event that RQ3 is successful, can we apply the deep correlation architecture to urban air quality monitoring?

ABSTRACT:
Air pollution and air quality are a very high priority for governments, organisations and people of all countries, with developing countries such as India, China and Brazil being especially affected by high levels of pollutants. A large number of studies have shown that air quality has a direct impact on public health and preventable diseases. For example, particulate matter in the air (microscopic particles that have a diameter of less than 2.5 μm) can be inhaled into the lungs, increasing the risk of respiratory and general health disorders. High levels of particulate matter increase a person’s risk of developing a number of serious long-term disorders and ultimately fatal diseases. Air quality has been predicted in many ways over the years, with methods falling into the two categories of statistical models and computational models. With advances in technology allowing us to measure more, generate more data and process more, we are now in an era where Artificial Intelligence has become the gold standard for generating efficient and accurate predictions. For these predictions to be accurate, we require not only accurate data but also the correct data to be used in the first place. Many methods have been devised and used in environmental studies to discover the associations and correlations between the different types of variables involved.
Correlation analysis is the focus of our research, so that we can study the factors that influence changes in air quality, use those that have an impact and ignore those that do not. This is hypothesised to increase not only the accuracy but also the efficiency with which air quality can be computed. We chose Canonical Correlation Analysis (CCA) because of its relative obscurity in environmental science studies, despite its increasingly common use in other fields such as the biological sciences and social sciences such as psychology. The most central statistic in a CCA is the canonical correlation between the two synthetic variables, and this statistic is, in effect, a Pearson r. The computations involved in a CCA are done with the goal of maximising this simple and common correlation [Sherry and Henson, 2005]. The solution in this research was implemented in Python, using minor alterations of an established and published deep-learning-based deep canonical correlation analysis solution [Andrew et al., 2013], to good effect. The best correlations were defined as > +0.50 or < -0.50, which reduced the dataset of 73 variables, spread over 12 monitoring regions, to a more manageable set of variables that were indicative of changes actually occurring in the air mixture. Using the common performance measures MSE, RMSE, MAE and R², the variables selected by the correlation analysis were evaluated with five multi-layer neural networks of different sizes, linear regression, XGBoost, AdaBoost, SVM and kNN, to determine whether the resulting accuracy is better or worse than when using all 73 variables in the original dataset. In almost all instances, the performance is considerably better. Using the MAE as the indicator, AdaBoost appears to be the most accurate, with a 97.142% decrease relative to the original, followed by an MLP neural network with hidden layers of 1024, 1024 and 1024 units.
The time taken for learning is considerable even on very high-end computing hardware, and the comparison against a fairly unsophisticated baseline may explain an improvement of this size. Overall, Deep Canonical Correlation Analysis (DCCA) can be seen to increase both the efficiency and accuracy of existing air quality prediction activities under most use-cases.
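
The correlation-threshold selection described in the abstract can be illustrated with a minimal Python sketch. This is not the thesis implementation: it assumes plain pairwise Pearson r against the PM2.5 series in place of the canonical correlations produced by DCCA, and the function names (pearson_r, select_by_threshold), variable names and readings are all hypothetical toy values chosen for illustration.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # assumption: a constant series is treated as uncorrelated
    return cov / math.sqrt(var_x * var_y)


def select_by_threshold(features, target, threshold=0.50):
    """Keep the variables whose correlation with the target is
    greater than +threshold or less than -threshold."""
    kept = {}
    for name, values in features.items():
        r = pearson_r(values, target)
        if abs(r) > threshold:
            kept[name] = r
    return kept


# Hypothetical toy readings, not data from the thesis.
pm25 = [10.0, 12.0, 15.0, 11.0, 18.0, 20.0]
features = {
    "wind_speed": [5.0, 4.5, 3.0, 5.0, 2.0, 1.5],
    "temperature": [14.0, 15.0, 17.0, 14.5, 19.0, 21.0],
    "day_parity": [1.0, 2.0, 1.0, 2.0, 1.0, 2.0],
}

kept = select_by_threshold(features, pm25)
for name, r in sorted(kept.items()):
    print(f"{name}: r = {r:+.3f}")
```

In this toy example the strongly negative and strongly positive variables are both retained, while the variable with near-zero correlation is dropped, which mirrors the reduction from the full variable set to a smaller set of influential variables.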
Copyright holder
Copyright notice
All rights reserved
Copyright license
Available online at