Introduction
Image obtained from Kenneth E. Okedu
In a world facing unprecedented challenges like climate change, pandemics, social inequality, and biodiversity loss, understanding and improving the sustainability of cities becomes paramount. Cities have the potential to drive positive change, and it’s vital to determine what makes a city more sustainable and how underperforming cities can enhance their sustainability.
The dataset comes from the “Urban Typologies” project [1], which provides 65 indicators covering demographics, mobility, economy, and city form. It was built by combining multiple sources, with the general objective of classifying the cities of the world according to a typology, and is in itself an interesting data science exploration. The aim of this project is to predict “CO2 Emissions per capita (metric tonnes)” for each city from any of the other variables in the dataset, with the exception of “Pollution index”. The project consists of two parts, each with its own way of splitting the data into training and test sets.
The project was carried out in a group of four for the course Data Science for Mobility 2, taught at the Technical University of Denmark (DTU).
Part I
The training set consists of the first 75% of the rows in the dataset and the test set of the last 25%, without shuffling.
The aim is to reach a prediction R^2 score of at least 0.6.
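As a rough illustration of this setup, the sketch below splits a table in row order and scores a model with R^2. It runs on synthetic data rather than the Urban Typologies file. The helper `ordered_split`, the columns `feature_a` and `feature_b`, and the choice of plain linear regression are assumptions made for the example; only the two quoted column names come from the project description.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

TARGET = "CO2 Emissions per capita (metric tonnes)"
EXCLUDED = "Pollution index"  # not allowed as a predictor in this project


def ordered_split(df, train_fraction=0.75):
    """Split rows in their existing order: first part trains, the rest tests."""
    cut = int(len(df) * train_fraction)
    return df.iloc[:cut], df.iloc[cut:]


# Synthetic stand-in for the real dataset (hypothetical features and values).
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "feature_a": rng.normal(size=100),
    "feature_b": rng.normal(size=100),
    EXCLUDED: rng.normal(size=100),
})
demo[TARGET] = 2.0 * demo["feature_a"] - demo["feature_b"] + rng.normal(scale=0.1, size=100)

train, test = ordered_split(demo)

# Every column except the target and the excluded index may be used as input.
features = [c for c in demo.columns if c not in (TARGET, EXCLUDED)]

model = LinearRegression().fit(train[features], train[TARGET])
score = r2_score(test[TARGET], model.predict(test[features]))
print(f"train rows: {len(train)}, test rows: {len(test)}, R^2: {score:.3f}")
```

Because the rows are not shuffled, any ordering in the file (for example by region) carries over into the split, so the test score also reflects how different the last quarter of the rows is from the first three quarters.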
The models applied are listed below.
- Linear Regression with PCA
- Linear Regression with Correlation Filtering
- Random Forest Regression with PCA
- Random Forest Regression with Correlation Filtering
- Support Vector Regression with PCA
- Support Vector Regression with Correlation Filtering (Normalized)
- Neural Network with PCA
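The model variants above can be sketched with scikit-learn roughly as follows. This is a minimal sketch on synthetic data, not the group's actual code. The number of principal components, the correlation threshold, the hyperparameters, and the reading of "correlation filtering" as "keep features whose absolute Pearson correlation with the target exceeds a threshold" are all assumptions. The neural network is left out for brevity; a regressor such as `sklearn.neural_network.MLPRegressor` could be passed to `pca_pipeline` in the same way.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR


def pca_pipeline(regressor, n_components=5):
    """Standardize, project onto principal components, then regress."""
    return make_pipeline(StandardScaler(), PCA(n_components=n_components), regressor)


def correlated_features(X_train, y_train, threshold=0.3):
    """Assumed filter: keep features whose |Pearson r| with the target >= threshold."""
    corr = X_train.corrwith(y_train).abs()
    return corr[corr >= threshold].index.tolist()


# Synthetic stand-in data (hypothetical features f0..f7; only f0 and f1 matter).
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(120, 8)), columns=[f"f{i}" for i in range(8)])
y = 3.0 * X["f0"] - 2.0 * X["f1"] + rng.normal(scale=0.5, size=120)

# Ordered 75/25 split, as in Part I.
cut = int(len(X) * 0.75)
X_train, X_test = X.iloc[:cut], X.iloc[cut:]
y_train, y_test = y.iloc[:cut], y.iloc[cut:]

# Factories so that each variant gets a fresh, unfitted estimator.
regressors = {
    "Linear Regression": lambda: LinearRegression(),
    "Random Forest Regression": lambda: RandomForestRegressor(n_estimators=100, random_state=0),
    "Support Vector Regression": lambda: SVR(),
}

# The filter is computed on training rows only, so the test set stays unseen.
selected = correlated_features(X_train, y_train)

scores = {}
for name, make in regressors.items():
    with_pca = pca_pipeline(make()).fit(X_train, y_train)
    scores[f"{name} with PCA"] = with_pca.score(X_test, y_test)

    filtered = make_pipeline(StandardScaler(), make()).fit(X_train[selected], y_train)
    scores[f"{name} with Correlation Filtering"] = filtered.score(X_test[selected], y_test)

for name, r2 in scores.items():
    print(f"{name}: R^2 = {r2:.3f}")
```

Wrapping the scaler and PCA in a pipeline means both are fitted on the training rows only and then reused unchanged on the test rows, which avoids leaking test-set statistics into the model.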
Part II
The test set consists of all cities in North America and South America, while the training set consists of the remaining cities.
The aim is to explore generalizability (transferability): what does a model need in order to generalize properly to a new data point?
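A geographic hold-out of this kind could be written as below. The column name `Continent`, its label values, and the placeholder city names are assumptions for illustration; the real dataset may encode regions differently.

```python
import pandas as pd

# Regions held out for testing in Part II (label spelling is assumed).
HELD_OUT = {"North America", "South America"}


def continent_split(df, continent_col="Continent", held_out=HELD_OUT):
    """Return (train, test), where test holds every row from a held-out continent."""
    is_test = df[continent_col].isin(held_out)
    return df[~is_test], df[is_test]


# Tiny hypothetical table with placeholder city names.
demo = pd.DataFrame({
    "City": ["A", "B", "C", "D", "E"],
    "Continent": ["Europe", "North America", "Asia", "South America", "Africa"],
})

train, test = continent_split(demo)
print("train:", train["City"].tolist())
print("test:", test["City"].tolist())
```

Unlike the ordered split in Part I, the model here never sees a single city from the test regions, so the test score measures how well patterns learned elsewhere transfer to the Americas.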
