Predicting sustainable development indices from geolocated text

tl;dr:

In this project, we leverage readily available natural language data, scraped from Wikipedia, to predict localized indices (asset, sanitation, women’s education) relevant to the UN’s Sustainable Development Goals. We explore how different text embedding extraction methods and model architectures (logistic regression models, feedforward DNNs, and NLP-CNNs) affect performance on this small-data task. Using embeddings of geolocated, extracted “relevant” sentences, we achieve ROC-AUC scores of 0.80 (logistic regression), 0.70 (logistic regression), and 0.81 (feedforward DNN) for asset, sanitation, and women’s education index classification, respectively.
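
As a rough illustration of the setup described above (not the project's actual code), the sketch below fits a small logistic regression on feature vectors and scores it with ROC-AUC. Everything in it is an assumption made for illustration: the 2-D toy vectors stand in for real sentence embeddings, the binary labels assume each index has been thresholded into two classes, the function names are hypothetical, and plain gradient descent is used in place of whatever library the experiments relied on.

```python
import math
from typing import List, Sequence, Tuple


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def score(w: Sequence[float], b: float, x: Sequence[float]) -> float:
    """Predicted probability of the positive class for one feature vector."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)


def fit_logreg(
    xs: Sequence[Sequence[float]],
    ys: Sequence[int],
    l2: float = 0.1,
    lr: float = 0.5,
    epochs: int = 200,
) -> Tuple[List[float], float]:
    """L2-regularized logistic regression trained with batch gradient descent."""
    dim, n = len(xs[0]), len(xs)
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for x, y in zip(xs, ys):
            err = score(w, b, x) - y  # gradient of the log loss w.r.t. the logit
            for j in range(dim):
                grad_w[j] += err * x[j] / n
            grad_b += err / n
        # The L2 penalty is applied to the weights only, not to the bias.
        w = [wi - lr * (gj + l2 * wi) for wi, gj in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b


def roc_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Fraction of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("ROC-AUC needs at least one positive and one negative")
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0 for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


# Toy stand-ins for embeddings (assumption): label 1 = "high" index, 0 = "low".
train_x = [(1.0, 0.8), (0.9, 1.2), (1.1, 0.7), (-1.0, -0.9), (-0.8, -1.1), (-1.2, -0.6)]
train_y = [1, 1, 1, 0, 0, 0]
val_x = [(0.7, 0.9), (1.3, 0.5), (0.2, 0.2), (-0.9, -0.7), (0.5, 0.5)]
val_y = [1, 1, 1, 0, 0]

w, b = fit_logreg(train_x, train_y)
val_auc = roc_auc(val_y, [score(w, b, x) for x in val_x])
print(f"validation ROC-AUC: {val_auc:.2f}")
```

On this toy validation set the model ranks five of the six positive/negative pairs correctly, so the printed score is 0.83. A score of 0.5 would correspond to random ranking, which gives some context for the 0.70 to 0.81 range reported above.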

I contributed the following:

performed exploratory data analysis,
generated text embeddings,
built the dataloader,
performed logistic regression experiments (implementation, hyperparameter optimization, and evaluation),
and performed the simple feedforward neural network experiments (implementation, hyperparameter optimization with Ray Tune, and evaluation).

See more here: