Introduction to BIRCH Clustering & Python Implementation

Clustering is one of the most widely used unsupervised machine learning techniques for finding patterns in data. The most popular algorithms for this purpose are K-Means and Hierarchical Clustering. These algorithms do not handle large datasets well when resources (memory and CPU) are limited. This is where BIRCH comes into the picture and improves performance over the aforementioned algorithms.

Balanced Iterative Reducing and Clustering using Hierarchies, aka BIRCH, deals with the large-dataset problem described above by first creating a compact summary of the data that retains as much of the distribution-related information as possible, and then clustering that summary instead of the raw data. The second step of BIRCH can use any existing clustering method.


[Figure: Flowchart of the steps followed in the BIRCH algorithm]

Following is a high-level description of the algorithm:

  1. Scan the data and build an in-memory CF (Clustering Feature) tree, where each node summarizes a sub-cluster of points (illustrated below).
  2. Optionally condense the CF tree into a smaller one by rebuilding it with a larger threshold.
  3. Run a global clustering algorithm (e.g. agglomerative clustering or K-Means) on the leaf entries of the CF tree.
  4. Optionally refine the clusters by reassigning each data point to its closest centroid.
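To make the summary step concrete, here is a minimal illustration of the Clustering Feature that BIRCH maintains at each node; the helper functions are hypothetical and not part of sklearn. A CF stores the point count N, the per-feature linear sum LS, and the sum of squared norms SS, and two sub-clusters can be merged by simply adding their CFs:

```python
import numpy as np

# A Clustering Feature (CF) summarizes a set of points as (N, LS, SS):
# N = number of points, LS = per-feature linear sum, SS = sum of squared norms.
def make_cf(points):
    points = np.asarray(points, dtype=float)
    return len(points), points.sum(axis=0), (points ** 2).sum()

def merge_cf(cf1, cf2):
    # CF additivity: two sub-clusters merge by adding their components
    return cf1[0] + cf2[0], cf1[1] + cf2[1], cf1[2] + cf2[2]

def centroid_and_radius(cf):
    # Centroid = LS / N; radius R satisfies R^2 = SS/N - ||centroid||^2
    n, ls, ss = cf
    centroid = ls / n
    radius = np.sqrt(max(ss / n - centroid @ centroid, 0.0))
    return centroid, radius

cf_a = make_cf([[0, 0], [1, 1]])
cf_b = make_cf([[2, 2]])
print(centroid_and_radius(merge_cf(cf_a, cf_b)))  # centroid [1. 1.], radius ~1.15
```

This additivity is what lets BIRCH build its tree in a single scan: a new point is absorbed into the closest leaf entry by updating three numbers, without ever revisiting the raw data.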

For more details on the algorithm's nuances, you can refer to the original research paper [1].

Implementation

Let's now implement BIRCH on a dummy dataset and compare its performance with K-Means.

Dataset

We will use dummy data created with the sklearn library. Note that we are taking a simple case to illustrate the implementation and the comparison with K-Means. As the snippet below shows, we generate data with 10,000 samples, 3 features, and 6 clusters with a cluster standard deviation of 1.5. Let's observe the data characteristics visually in the next section.
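A minimal sketch of the data generation, assuming sklearn's make_blobs; the random_state is my addition for reproducibility:

```python
from sklearn.datasets import make_blobs

# Generate 10,000 samples with 3 features spread across 6 clusters
X, y = make_blobs(n_samples=10000, n_features=3, centers=6,
                  cluster_std=1.5, random_state=42)
```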

Data Exploration

The scatter plot below shows the relationships among the variables. Clearly, there are 6 clusters, and most of them are separated out quite well.
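A minimal sketch for producing such a plot, assuming matplotlib and the X and y arrays from the previous snippet:

```python
import matplotlib.pyplot as plt

# 3D scatter of the three features, coloured by the ground-truth labels
fig = plt.figure(figsize=(8, 6))
ax = fig.add_subplot(projection='3d')
ax.scatter(X[:, 0], X[:, 1], X[:, 2], c=y, s=5, cmap='viridis')
ax.set_xlabel('Feature 1')
ax.set_ylabel('Feature 2')
ax.set_zlabel('Feature 3')
plt.show()
```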

Experiment and Observations

We will now perform clustering on the data using the K-Means and BIRCH algorithms. Python classes for both techniques are available in the sklearn library. We will first investigate the improvement in computation time from using BIRCH on this dataset.
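A sketch of the timing comparison, assuming sklearn's KMeans and Birch classes with default hyper-parameters apart from the cluster count; the variable names are illustrative:

```python
from time import time
from sklearn.cluster import KMeans, Birch

# Fit K-Means and time it
start = time()
kmeans = KMeans(n_clusters=6, n_init=10, random_state=42)
kmeans_labels = kmeans.fit_predict(X)
print(f"K-Means time: {time() - start:.2f}s")

# Fit BIRCH and time it (n_clusters is passed to the final global-clustering step)
start = time()
birch = Birch(n_clusters=6)
birch_labels = birch.fit_predict(X)
print(f"BIRCH time: {time() - start:.2f}s")
```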

We will now compare the silhouette score for both approaches. The silhouette score indicates how well separated the clusters are; comparing each approach's score with the score under the original (ground-truth) labels gives us a good idea of which approach comes closer to the true clustering.
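A sketch of the score computation, assuming sklearn's silhouette_score and the label arrays from the previous snippet:

```python
from sklearn.metrics import silhouette_score

# Silhouette scores under ground-truth labels and both clusterings
print("Ground truth:", silhouette_score(X, y))
print("K-Means:     ", silhouette_score(X, kmeans_labels))
print("BIRCH:       ", silhouette_score(X, birch_labels))
```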

This will be followed by visually inspecting the results using a 3D scatter diagram of the clusters.
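A sketch of that visual inspection, again assuming matplotlib and the fitted label arrays:

```python
# Side-by-side 3D scatter plots of the K-Means and BIRCH cluster assignments
fig = plt.figure(figsize=(12, 5))
for i, (name, labels) in enumerate([('K-Means', kmeans_labels),
                                    ('BIRCH', birch_labels)]):
    ax = fig.add_subplot(1, 2, i + 1, projection='3d')
    ax.scatter(X[:, 0], X[:, 1], X[:, 2], c=labels, s=5, cmap='viridis')
    ax.set_title(name)
plt.show()
```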

Conclusion

From the results above, we can make the following inferences on BIRCH for the given dataset.

  1. Visually, K-Means and BIRCH provide quite similar results.
  2. There is a significant improvement in time elapsed (~50%) over K-Means. With hyper-parameter tuning this might improve further.
  3. The silhouette scores are very similar for all three labelings (K-Means, BIRCH, and the ground truth), implying that BIRCH is not compromising cluster quality. One might expect a reduction in silhouette score with BIRCH due to its inherent methodology of summarizing the data before clustering, but that is not the case here.

You can access the notebook at: https://www.kaggle.com/code/blackburn1/birch-implementation-and-comparision-with-kmeans?scriptVersionId=91993527

You can also check out my other articles on data science and computing on Medium. If you like my work and want to support my journey, you can always buy me a coffee :)

References

[1] Original Paper: https://www2.cs.sfu.ca/CourseCentral/459/han/papers/zhang96.pdf

[2] Sklearn: https://scikit-learn.org/stable/modules/clustering.html#birch

[3] Explanation on BIRCH phases: https://towardsdatascience.com/machine-learning-birch-clustering-algorithm-clearly-explained-fb9838cbeed9

[4] K-Means Clustering example: https://scikit-learn.org/stable/auto_examples/cluster/plot_cluster_iris.html#sphx-glr-auto-examples-cluster-plot-cluster-iris-py

[5] Silhouette score: https://scikit-learn.org/stable/modules/generated/sklearn.metrics.silhouette_score.html
