My last blog on machine learning discussed different types of artificial intelligence that can be applied to Big Data. This article will discuss cluster analysis, a form of Unsupervised Pattern Recognition.
Let’s start with a basic definition. Pattern recognition algorithms detect regularities in data, and they come in two basic flavors: supervised and unsupervised. In supervised pattern recognition, the algorithm is trained on pre-selected example data so that it learns which patterns to detect. In unsupervised pattern recognition, no training data is provided; patterns are detected by other means, such as statistical analysis.
What are the benefits of supervised versus unsupervised pattern recognition? To answer this question, bear in mind that some prior knowledge must go into designing supervised pattern recognition software, because the data used to train it must be pre-selected.
In unsupervised pattern recognition, this is unnecessary. A set of data is simply run through an algorithm to see what’s “interesting.” We can ask questions about data on the fly, without pre-thinking potential relationships.
With supervised pattern recognition, if it becomes apparent a few weeks down the road that other data should have been accounted for, the algorithm will need to be re-trained, which involves additional software development. With unsupervised pattern recognition, the algorithm is simply re-run against the new data.
Cluster Analysis is a form of unsupervised pattern recognition, and is defined by Wikipedia as follows:
“Cluster analysis or clustering is the task of grouping a set of objects in such a way that objects in the same group (called a cluster) are more similar (in some sense or another) to each other than to those in other groups (clusters).”
This is easily explained visually. Please see the following diagram (By Chire – Own work, CC BY-SA 3.0, https://commons.wikimedia.org/w/index.php?curid=17087089).
Note that cluster analysis works on numeric representations of data, so a conversion must often take place first.
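As a quick illustration of such a conversion (the department names below are hypothetical), categorical values can be mapped to numeric codes before clustering:

```python
# Hypothetical department names to be clustered alongside spending figures.
departments = ["Sales", "Legal", "HR", "Sales", "Facilities"]

# Assign each distinct department a numeric code (in first-seen order).
codes = {d: i for i, d in enumerate(dict.fromkeys(departments))}
numeric = [codes[d] for d in departments]
# numeric is now [0, 1, 2, 0, 3], usable as an axis value
```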
Think of each point as a relationship between two pieces of data. For example, a point may represent yearly spending by a department (the y-axis representing spending in hundreds of thousands of dollars; the x-axis a numeric code for the department), or sales by geographic location (the y-axis representing sales in hundreds of thousands of dollars; the x-axis a numeric representation of geographic coordinates). The preceding diagram illustrates clustering behavior in the data, which in and of itself may not immediately lead to insight.
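To make the idea concrete, here is a minimal sketch of a centroid-based clustering run (k-means) on hypothetical spending points like those just described; the data, the simple initialization, and the fixed iteration count are all simplifications for illustration:

```python
def kmeans(points, k, iterations=10):
    """Minimal k-means: repeatedly assign each point to its nearest
    centroid, then move each centroid to the mean of its cluster."""
    centroids = list(points[:k])  # simple deterministic initialization
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for x, y in points:
            d = [(x - cx) ** 2 + (y - cy) ** 2 for cx, cy in centroids]
            clusters[d.index(min(d))].append((x, y))
        centroids = [
            (sum(x for x, _ in c) / len(c), sum(y for _, y in c) / len(c))
            if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
    return centroids, clusters

# Hypothetical points: (department code, yearly spending in $100k).
points = [(1.0, 1.2), (1.1, 0.9), (0.9, 1.1),
          (5.0, 5.2), (5.1, 4.8), (4.9, 5.0)]
centroids, clusters = kmeans(points, k=2)
```

With these well-separated points the algorithm settles on two centroids, one near each visible group, without ever being told what the groups mean.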
The next step is for an analyst to look at the data comprising each cluster. For instance, an examination of the green cluster might reveal a concentration of expenses made by departments connected to sales, or the blue cluster might turn out to comprise geographic locations in the Northeast. The analyst is asking: 1) what is interesting about the clusters, and 2) what data attributes could be causing them to cluster this way? By running a cluster analysis on data that one wouldn’t necessarily expect to be related, we can determine whether relationships do in fact exist.
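Programmatically, that inspection amounts to joining cluster labels back to the original records. In this hypothetical sketch the records and labels are hard-coded; in practice the labels would come from a clustering run:

```python
from collections import defaultdict

# Hypothetical expense records: (department, yearly spending in $100k).
records = [("Sales-East", 5.2), ("Sales-West", 4.8), ("Sales-Online", 5.0),
           ("Legal", 1.2), ("HR", 0.9), ("Facilities", 1.1)]

# Suppose a clustering run has already assigned each record a cluster
# label; the labels below are hard-coded for illustration.
labels = [0, 0, 0, 1, 1, 1]

# Join the labels back to the records so each cluster can be inspected.
groups = defaultdict(list)
for (dept, _), label in zip(records, labels):
    groups[label].append(dept)

# Inspecting groups[0] reveals it holds the sales-related departments.
```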
Several types of clustering algorithms are available, such as connectivity-, centroid-, distribution-, and density-based algorithms. I will leave it to the reader to research the various algorithms and how they work. Hopefully this blog has given you an idea of the practical applications of clustering.
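As a small taste of the density-based family, here is a bare-bones DBSCAN-style sketch on hypothetical data (real implementations add spatial indexing and other refinements):

```python
def dbscan(points, eps, min_pts):
    """Bare-bones density-based clustering (DBSCAN-style): points with
    at least min_pts neighbors within eps are core points; clusters
    grow by chaining core points' neighborhoods. Sparse points that
    no core point reaches are labeled -1 (noise)."""
    def neighbors(i):
        xi, yi = points[i]
        return [j for j, (xj, yj) in enumerate(points)
                if (xi - xj) ** 2 + (yi - yj) ** 2 <= eps ** 2]

    labels = [None] * len(points)  # None = unvisited, -1 = noise
    cluster = 0
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        nbrs = neighbors(i)
        if len(nbrs) < min_pts:
            labels[i] = -1         # too sparse to start a cluster
            continue
        labels[i] = cluster
        queue = list(nbrs)
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # noise reachable from a core point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            jn = neighbors(j)
            if len(jn) >= min_pts:   # j is itself a core point: expand
                queue.extend(jn)
        cluster += 1
    return labels

# Two dense groups plus one isolated outlier (hypothetical data).
pts = [(1.0, 1.0), (1.1, 1.0), (1.0, 1.1),
       (5.0, 5.0), (5.1, 5.0), (5.0, 5.1), (9.0, 0.0)]
labels = dbscan(pts, eps=0.5, min_pts=3)
```

Unlike centroid-based k-means, a density-based algorithm discovers the number of clusters on its own and can flag sparse points as noise rather than forcing them into a group.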
In summary, cluster analysis is an unsupervised way to gain insight into Big Data. It can reveal relationships in your data that you may not realize are there. jKool is a Big Data analysis solution that takes advantage of clustering. Stay tuned for follow-on blogs with more machine learning examples for Big Data at the jKool website, www.jkoolcloud.com.