
  1. Clustering


Transcript

- Hello again. This time we'll be discussing cluster analysis. Clustering is the process of organizing elements into self-similar groups. Cluster analysis does not make any distinction between dependent and independent variables. Let's dive in.

Let us review once again the distinction between clustering and classification. In clustering, the goal is to group elements together without labeling them. For example, grouping rock samples with similar characteristics without knowing the implications these characteristics may have on production. In classification, the goal is to assign elements to predefined classes. For example, deciding whether a particular rock type can be associated with a specific class.

There are several concepts related to clustering. Attributes: each element to be clustered represents a point with a certain number of attributes or dimensions. Distances and similarities: in general, these are reciprocal concepts. Similarity measures describe quantitatively how close two data points or clusters are. Dissimilarity is the opposite concept: the larger the dissimilarity, the farther apart the two data points or clusters are. Distances can be represented by different metrics, such as the Euclidean distance, indicated in green, or the Manhattan distance, indicated in blue. The centroid is given by the mean value of the points belonging to a cluster; that is, the centroid is the center of mass of a cluster. Hard and soft clustering: in hard clustering, each element belongs to one, and only one, cluster. In soft clustering, an element can belong to one or more clusters with a certain probability. This is necessary when there is no clear separation among clusters. Validation: although challenging, we need some sort of performance evaluation that assesses the clustering results. We will discuss a couple of these later in this video lecture.

In our business, clustering can be used for multiple purposes. The most common one is to separate a set of properties into distinct groups. Another application is that the mean value of each cluster could be used as a representative entity. Also, within each cluster, it is possible to develop a particular predictive model with a tight range of possible outcomes. Another useful aspect of clustering is that we can isolate data outliers. Lastly, elements sharing a cluster could be used to reconstruct missing data values.

In the pictures on the right, we can see how clustering can help us group elements of similar characteristics, for example, production curves resulting from different geological data sections. Moreover, in each of these cases, we can select a representative element by looking at the average behavior of each cluster, namely, the production curve or the different concentration panels that we can see in those pictures.
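To make the distance and centroid concepts concrete, here is a minimal sketch in Python. The lecture does not prescribe any particular library or data, so NumPy and the small made-up arrays below are assumptions for illustration only.

```python
import numpy as np

# Two hypothetical data points, each described by three attributes.
a = np.array([2.0, 1.0, 4.0])
b = np.array([5.0, 3.0, 0.0])

# Euclidean distance: straight-line ("green") distance between the points.
euclidean = np.sqrt(np.sum((a - b) ** 2))

# Manhattan distance: sum of absolute coordinate differences ("blue" path).
manhattan = np.sum(np.abs(a - b))

# The centroid of a cluster is the mean of its member points,
# i.e. the cluster's center of mass.
cluster = np.array([[2.0, 1.0, 4.0],
                    [3.0, 2.0, 5.0],
                    [2.5, 1.5, 4.5]])
centroid = cluster.mean(axis=0)

print(f"Euclidean distance: {euclidean:.3f}")
print(f"Manhattan distance: {manhattan:.3f}")
print(f"Centroid: {centroid}")
```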
The k-means algorithm is the most popular approach for performing clustering. This algorithm aims at separating the observations into tight clusters in which each observation belongs to the cluster with the nearest mean. The algorithm can be described in a few lines, and we illustrate the process for the particular case where k equals three. In step one of the algorithm, each observation is randomly assigned to a cluster. In step two a, the cluster centroids are computed. These are shown as large colored disks. Initially, the centroids are almost completely overlapping, because the initial cluster assignments were chosen at random. In step two b, each observation is assigned to the nearest centroid. We then repeat step two a, leading to new cluster centroids. We keep repeating the process until no major changes in the location of the cluster centroids occur.

Hierarchical clustering provides a more comprehensive picture of the cluster structure than a single k-means run can generate. Moreover, in contrast to k-means, we don't need to specify the number of clusters upfront. The algorithm proceeds as follows. One, identify each individual sample with a cluster. Two, compute the mutual distances between each pair of clusters. This generates a symmetric distance matrix containing pairwise dissimilarities. These dissimilarities can, in fact, be chosen in different ways; this is the role of the linkage, for example, for the sake of generating clusters of similar size. Three, detect the closest pair and merge them into one cluster. Four, repeat steps two and three until converging into a single cluster that holds all samples. This approach is called agglomerative, and it builds a tree in a bottom-up fashion. That is, we take each sample, give it a cluster representing a leaf of the tree, and merge the closest clusters until only one cluster remains, which represents the tree root. An opposite approach, namely divisive hierarchical clustering, which proceeds top down, is also possible.

Hence, hierarchical clustering has the attractive feature that it results in a graphical tree-based representation, comprising all cluster relations, called a dendrogram. This allows selecting a desired number of clusters by just trimming the tree at a certain height. At the left, we can see the dendrogram obtained from hierarchical clustering. At the center, the dendrogram from the left side is cut at a height of nine, indicated by the dashed line. This cut results in two distinct clusters, shown in green and red. At the right, the dendrogram from the left side is now cut at a height of five. This cut results in three distinct clusters, shown in red, yellow, and green. Hence, the lower we cut the tree, the greater the number of clusters we can generate.

It is worth comparing the two previous clustering algorithms. The k-means algorithm is the most popular, and we find it as a recurring method in public and commercial solutions. However, the effectiveness of the method is very dependent on the number of clusters and the distance metric specified. In contrast, the hierarchical clustering method is very appealing for visual purposes via dendrograms, and for processing multiple numbers of clusters at the same time. Nevertheless, results can change significantly depending on the distance metric, the linkage strategy, or what we consider the best dendrogram cutting point. Of course, there are many more methods for clustering, but there is no clear winner as to which one may be best for tackling a given problem.
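The following sketch runs both algorithms on a small synthetic data set. The lecture does not tie the material to any particular library, so the use of scikit-learn and SciPy, as well as the made-up data, are assumptions for illustration; cutting the linkage tree at a height plays the role of trimming the dendrogram described above.

```python
import numpy as np
from sklearn.cluster import KMeans
from scipy.cluster.hierarchy import linkage, fcluster

# Synthetic 2-D data with three loose groups (made up for illustration).
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(loc=(0, 0), scale=0.5, size=(30, 2)),
    rng.normal(loc=(4, 0), scale=0.5, size=(30, 2)),
    rng.normal(loc=(2, 3), scale=0.5, size=(30, 2)),
])

# k-means: alternate between assigning points to the nearest centroid
# and recomputing centroids, as in steps two a and two b above.
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print("k-means cluster sizes:", np.bincount(km.labels_))

# Agglomerative hierarchical clustering: repeatedly merge the closest
# pair of clusters; 'ward' is one possible linkage strategy.
Z = linkage(X, method="ward")

# Cutting the tree at a chosen height yields flat clusters;
# the lower the cut, the more clusters we obtain.
labels_at_height = fcluster(Z, t=10.0, criterion="distance")
print("clusters when cutting at height 10:", len(set(labels_at_height)))

# Alternatively, request a fixed number of clusters directly.
labels_3 = fcluster(Z, t=3, criterion="maxclust")
print("hierarchical cluster sizes:", np.bincount(labels_3)[1:])
```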
Validation and evaluation of clustering results is as difficult as the clustering itself. The main issue is that there are no labels to compare the results with. This is why finding the correct number of clusters is often ambiguous, with the interpretation depending on the distribution of the observations and the desired clustering resolution as specified by the user. We will present a couple of methods for validating clusters, but it's important to remark that the user needs to complement these metrics with engineering or scientific judgment to assess how useful the clustering results are.

One simple metric is the Dunn index. This takes the ratio of the smallest distance between any pair of clusters to the largest internal distance between elements within a cluster, taken over all clusters. This distance should be consistent with the one used in the clustering. Clearly, the greater the Dunn index, the larger the separation among clusters. A more recent resource is the silhouette plot. This plot contrasts the average distance to elements in the same cluster, cohesion, with the average distance to elements in other clusters, separation. The silhouette ranges from minus one to one, where a high value indicates that the object is well matched to its own cluster and poorly matched to neighboring clusters, as illustrated in the figure below for five clusters. If most elements have a high positive value, then the clustering configuration is appropriate. If many points have a low or negative value, then the clustering configuration may have too many or too few clusters.

Clustering is a very useful tool for data analysis in an unsupervised setting. However, there are a number of issues that arise when performing clustering, and each of them can have a strong impact on the results obtained. Let's start by mentioning that clustering results are sensitive to many factors, including the algorithm used. We know that the results can differ in k-means for different initial guesses, or in the hierarchical approach for different linkage strategies. Regarding the distance metric, if its definition varies from element to element, especially if each element has a mix of real and categorical values, the results may change. Results can also change according to how we weight or select attributes. Given the lack of labels, or of an association to particular outputs, it is challenging to rank or discriminate these attributes. Normalization, or variable transformation, are indirect ways to change the distance metric. That's why clustering results are also subject to changes in this case too. While clustering is good for isolating outliers, clustering results can vary dramatically as outliers are removed. Data imputation and different levels of noise removal can also affect results, as the concentration of points will change as well. We already discussed that there is no ultimate evaluation method that holds for any clustering approach. Visualization of clustering results in high dimensions demands the use of dimensionality reduction techniques, as discussed in one of our lectures. This introduces additional computational work and analysis. At the end of this list of challenges, we cannot separate the interpretation of clustering results from the underlying engineering or scientific principles behind the problem.

In practice, we should try several different choices and look for the clustering with the most useful or interpretable solution. With this mindset, there is no single right answer. Any solution that exposes some interesting aspects of the data should be considered. With this message, we conclude this video lecture. See you in the next one.
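As a rough follow-up to the silhouette evaluation discussed above, here is a minimal sketch. The lecture does not prescribe any tooling, so scikit-learn and the synthetic data below are assumptions for illustration only.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score, silhouette_samples

# Synthetic data with three groups (made up for illustration).
rng = np.random.default_rng(1)
X = np.vstack([
    rng.normal(loc=(0, 0), scale=0.5, size=(40, 2)),
    rng.normal(loc=(4, 0), scale=0.5, size=(40, 2)),
    rng.normal(loc=(2, 3), scale=0.5, size=(40, 2)),
])

# Compare the average silhouette for several candidate numbers of clusters.
# Values close to 1 mean points sit well inside their own cluster; low or
# negative values suggest too many or too few clusters.
for k in range(2, 6):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    print(f"k={k}: average silhouette = {silhouette_score(X, labels):.3f}")

# Per-point silhouette values, the quantities shown in a silhouette plot.
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
per_point = silhouette_samples(X, labels)
print("fraction of points with negative silhouette:", np.mean(per_point < 0))
```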