Unsupervised Learning with Hierarchical Clustering using Kano, Nigeria Education Data
We use our Fulcrum data collection platform to conduct field surveys across a wide variety of locales. In preparation for our human geography analytical efforts this year, I wanted to get my hands dirty with some of our existing holdings in Kano, Nigeria. This dataset was collected in the fall of 2014 and represents a number of feature types across several human geography themes. For this exploratory data analysis, I chose a survey of education features within the study area, consisting of over 400 school locations and their associated attributes. For each school, a number of variables were collected, including administrative data, capacity and infrastructure measures, and socioeconomic indicators. Having no a priori information about the study area, I wanted to use unsupervised learning techniques to get a sense of the variation and natural groupings within the dataset. In this post I demonstrate the use of hierarchical clustering to group schools within Kano, Nigeria using a variety of socioeconomic and infrastructure attributes.
Distribution of Schools in Kano, Nigeria
Hierarchical clustering is an unsupervised method used in exploratory data analysis to identify subgroups within a dataset. It is useful when the analyst does not know in advance how many clusters the data contain. The method uses a dissimilarity measure (e.g. Euclidean distance) to quantify differences between observations. It produces a dendrogram, a tree diagram that gives a visual way to choose a reasonable number of clusters or categories within the data. It’s relatively straightforward to run the method in R using the hclust() function.
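# Load the school survey attributes
schools <- read.csv("kano_schools.csv")
# Use every column except the first as a clustering variable
school_vars <- colnames(schools)[-1]
# Center each variable to mean 0 and scale it to standard deviation 1
scaled_vars <- scale(schools[, school_vars])
# Pairwise Euclidean distances between schools
d <- dist(scaled_vars, method = "euclidean")
# Agglomerative clustering using the "ward.D" method
school_hclust <- hclust(d, method = "ward.D")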
In the code block above you can see that the school variables were centered to a mean of 0 and scaled to a standard deviation of 1. When conducting unsupervised learning with input variables that differ in units and value ranges, it is important to account for these scale differences; otherwise the variables with the largest ranges dominate the distance calculation and bias the clustering output. You can see the results from the hclust operation in the dendrogram plot below. The dendrogram gives the analyst a visual means to select the number of clusters in the data by observing vertical distances between subgroups. A horizontal line drawn across the dendrogram at a given height defines the number of clusters by the number of branches it intersects. For demonstration purposes, I have selected four clusters for further examination. You can plot the boundaries of the clusters using the rect.hclust() function.
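To make the scaling point concrete, here is a minimal sketch on synthetic data. The columns `enrollment` and `has_fence` and the helper `distance_share()` are hypothetical stand-ins, not fields or functions from the Kano survey. It measures how much of the total squared Euclidean distance each variable contributes before and after `scale()`.

```r
# Synthetic stand-in for two school attributes on very different scales
# (hypothetical columns, not the survey's actual fields).
set.seed(42)
toy <- data.frame(
  enrollment = c(rnorm(10, mean = 200, sd = 20), rnorm(10, mean = 1200, sd = 100)),
  has_fence  = rep(c(0, 1), times = 10)
)

# scale() centers each column to mean 0 and rescales it to standard deviation 1.
toy_scaled <- scale(toy)
col_means <- colMeans(toy_scaled)
col_sds   <- apply(toy_scaled, 2, sd)

# Share of the total squared pairwise Euclidean distance contributed by each variable.
distance_share <- function(m) {
  per_var <- apply(m, 2, function(v) sum(dist(v)^2))
  per_var / sum(per_var)
}
share_raw    <- distance_share(as.matrix(toy))
share_scaled <- distance_share(toy_scaled)

print(round(share_raw, 4))     # enrollment dominates on the raw scale
print(round(share_scaled, 4))  # both variables contribute equally after scaling
```

On the raw values nearly all of the distance comes from the large-range variable, so the binary attribute would have almost no say in the clustering. After scaling, each variable carries the same weight.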
Hierarchical Cluster Dendrogram for Schools in Kano, Nigeria
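A dendrogram like the one above can be drawn and cut along the following lines. This is a sketch on synthetic data (four artificial groups in three made-up variables), not the code or data behind the figure. It uses `plot()` on the hclust result, `rect.hclust()` to outline the clusters, and `cutree()` to extract cluster labels.

```r
# Synthetic stand-in data: four well-separated groups of 25 observations
# in three variables (hypothetical, not the Kano survey).
set.seed(1)
centers <- rbind(c(0, 0, 0), c(6, 0, 0), c(0, 6, 0), c(0, 0, 6))
toy <- do.call(rbind, lapply(1:4, function(i) {
  matrix(rnorm(25 * 3, sd = 0.5), ncol = 3) +
    matrix(centers[i, ], nrow = 25, ncol = 3, byrow = TRUE)
}))

toy_hclust <- hclust(dist(scale(toy), method = "euclidean"), method = "ward.D")

# Draw the dendrogram and outline the clusters obtained by cutting it into k groups.
k <- 4
plot(toy_hclust, labels = FALSE, hang = -1, main = "Toy dendrogram")
rect.hclust(toy_hclust, k = k, border = "red")

# cutree() returns the cluster label of each observation at that cut.
clusters <- cutree(toy_hclust, k = k)
print(table(clusters))
```

One design note: R's hclust documentation states that "ward.D2", not "ward.D", is the option that implements Ward's criterion when given unsquared Euclidean distances, so it may be worth comparing the two methods on your own data.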
One of the challenges of unsupervised learning is making sense of the patterns depicted in the output. After choosing the number of clusters in the data it is important to examine their descriptive statistics in order to describe the clusters. Below I discuss the results of the clustering output and describe the major similarities within each cluster.
Cluster 1 (Islamic, mixed gender lower grade level) - This subgroup is dominated by Islamic schools of mixed gender and makes up the largest proportion of the dataset (193 observations). Its schools tend to be much smaller than those in other clusters and have less security and utility infrastructure. They have the lowest mean socioeconomic indicator values and tend to be located in areas with the highest perceived crime.
Cluster 2 (Secular, mixed gender private higher grade level) - This cluster (133 observations) is dominated by mixed gender private and international secondary schools. Its schools tend to have the highest levels of security and utility infrastructure and are located in areas of high socioeconomic status and low perceived crime.
Cluster 3 (Segregated, mostly secular higher grade level) - This subgroup is the smallest (42 observations) and is best characterized as segregated by gender and consisting mostly of secondary schools. Its schools have the largest average number of classrooms and are located in areas of average socioeconomic status and perceived crime.
Cluster 4 (Secular, mixed gender lower grade level) - This final grouping of schools (88 observations) is predominantly secular and mixed gender. It is composed mainly of kindergarten and primary schools, which tend to be located in areas of average socioeconomic status and perceived crime.
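Per-cluster summaries like those above could be produced with something like the sketch below. It runs on synthetic data, and the column names (`classrooms`, `has_fence`, `crime_rating`) are hypothetical rather than the survey's actual fields. It attaches the `cutree()` labels to the table, then computes cluster sizes and per-cluster means with `table()` and `aggregate()`.

```r
# Synthetic stand-in table; column names are hypothetical, not the survey's fields.
set.seed(7)
toy <- data.frame(
  classrooms   = c(rpois(30, 4), rpois(30, 12)),
  has_fence    = c(rbinom(30, 1, 0.2), rbinom(30, 1, 0.9)),
  crime_rating = c(rnorm(30, 4, 0.5), rnorm(30, 2, 0.5))
)

# Cluster on the scaled variables, then attach each row's cluster label.
toy_hclust <- hclust(dist(scale(toy)), method = "ward.D")
toy$cluster <- cutree(toy_hclust, k = 2)

# Cluster sizes, then the mean of every variable within each cluster.
cluster_sizes <- table(toy$cluster)
cluster_means <- aggregate(. ~ cluster, data = toy, FUN = mean)

print(cluster_sizes)
print(cluster_means)
```

For 0/1 indicator columns the mean is simply the proportion of schools in the cluster with that attribute, which makes statements such as "less security infrastructure" easy to read off the table.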
Hierarchical clustering provides a useful method for learning about your data before conducting more advanced analyses. In this example I selected a small number of clusters to illustrate the technique, but the analyst may wish to select more clusters if the dendrogram supports it. We’ll take a look at some more formal ways of selecting the number of clusters and assessing their stability in a future post. The same analysis can be carried out just as easily in a number of other tools, including SciPy.