by idouba

116 slides


Published Feb 27, 2014 in
Direct Link :

11ClusAdvanced.ppt... Read more

Read less


comments powered by Disqus

Presentation Slides & Transcript

Presentation Slides & Transcript

1Data Mining: Concepts and Techniques (3rd ed.)— Chapter 11 —Jiawei Han, Micheline Kamber, and Jian PeiUniversity of Illinois at Urbana-Champaign &Simon Fraser University©2012 Han, Kamber & Pei. All rights reserved.1

February 27, 2014Data Mining: Concepts and Techniques2

3Review: Basic Cluster Analysis Methods (Chap. 10)Cluster Analysis: Basic ConceptsGroup data so that object similarity is high within clusters but low across clustersPartitioning MethodsK-means and k-medoids algorithms and their refinementsHierarchical MethodsAgglomerative and divisive method, Birch, CameleonDensity-Based MethodsDBScan, Optics and DenCLuGrid-Based MethodsSTING and CLIQUE (subspace clustering)Evaluation of ClusteringAssess clustering tendency, determine # of clusters, and measure clustering quality3

K-Means ClusteringK=2Arbitrarily partition objects into k groupsUpdate the cluster centroidsUpdate the cluster centroidsReassign objectsLoop if needed4The initial data setPartition objects into k nonempty subsetsRepeatCompute centroid (i.e., mean point) for each partition Assign each object to the cluster of its nearest centroid Until no change

Hierarchical ClusteringUse distance matrix as clustering criteria. This method does not require the number of clusters k as an input, but needs a termination condition 5

Distance between ClustersSingle link: smallest distance between an element in one cluster and an element in the other, i.e., dist(Ki, Kj) = min(tip, tjq)Complete link: largest distance between an element in one cluster and an element in the other, i.e., dist(Ki, Kj) = max(tip, tjq)Average: avg distance between an element in one cluster and an element in the other, i.e., dist(Ki, Kj) = avg(tip, tjq)Centroid: distance between the centroids of two clusters, i.e., dist(Ki, Kj) = dist(Ci, Cj)Medoid: distance between the medoids of two clusters, i.e., dist(Ki, Kj) = dist(Mi, Mj)Medoid: a chosen, centrally located object in the cluster6

BIRCH and the Clustering Feature (CF) Tree StructureCF1child1CF3child3CF2child2CF5child5CF1CF2CF6prevnextCF1CF2CF4prevnextB = 7L = 6RootNon-leaf nodeLeaf nodeLeaf node7

Overall Framework of CHAMELEONConstruct (K-NN)Sparse GraphPartition the GraphMerge PartitionFinal ClustersData SetK-NN GraphP and q are connected if q is among the top k closest neighbors of pRelative interconnectivity: connectivity of c1 and c2 over internal connectivityRelative closeness: closeness of c1 and c2 over internal closeness8

Density-Based Clustering: DBSCANTwo parameters:Eps: Maximum radius of the neighbourhoodMinPts: Minimum number of points in an Eps-neighbourhood of that pointNEps(p): {q belongs to D | dist(p,q) ≤ Eps}Directly density-reachable: A point p is directly density-reachable from a point q w.r.t. Eps, MinPts if p belongs to NEps(q)core point condition: |NEps (q)| ≥ MinPts 9

10Density-Based Clustering: OPTICS & Its Applications

DENCLU: Center-Defined and Arbitrary11

STING: A Statistical Information Grid ApproachWang, Yang and Muntz (VLDB’97)The spatial area is divided into rectangular cellsThere are several levels of cells corresponding to different levels of resolution12

Evaluation of Clustering QualityAssessing Clustering TendencyAssess if non-random structure exists in the data by measuring the probability that the data is generated by a uniform data distributionDetermine the Number of ClustersEmpirical method: # of clusters ≈√n/2 Elbow method: Use the turning point in the curve of sum of within cluster variance w.r.t # of clustersCross validation methodMeasuring Clustering QualityExtrinsic: supervisedCompare a clustering against the ground truth using certain clustering quality measureIntrinsic: unsupervisedEvaluate the goodness of a clustering by considering how well the clusters are separated, and how compact the clusters are13

14Outline of Advanced Clustering AnalysisProbability Model-Based ClusteringEach object may take a probability to belong to a clusterClustering High-Dimensional DataCurse of dimensionality: Difficulty of distance measure in high-D spaceClustering Graphs and Network DataSimilarity measurement and clustering methods for graph and networksClustering with ConstraintsCluster analysis under different kinds of constraints, e.g., that raised from background knowledge or spatial distribution of the objects

15Chapter 11. Cluster Analysis: Advanced MethodsProbability Model-Based ClusteringClustering High-Dimensional DataClustering Graphs and Network DataClustering with ConstraintsSummary15

Fuzzy Set and Fuzzy ClusterClustering methods discussed so farEvery data object is assigned to exactly one clusterSome applications may need for fuzzy or soft cluster assignment Ex. An e-game could belong to both entertainment and softwareMethods: fuzzy clusters and probabilistic model-based clustersFuzzy cluster: A fuzzy set S: FS : X → [0, 1] (value between 0 and 1)Example: Popularity of cameras is defined as a fuzzy mapping Then, A(0.05), B(1), C(0.86), D(0.27)16

Fuzzy (Soft) ClusteringExample: Let cluster features beC1 :“digital camera” and “lens”C2: “computer“Fuzzy clustering k fuzzy clusters C1, …,Ck ,represented as a partition matrix M = [wij]P1: for each object oi and cluster Cj, 0 ≤ wij ≤ 1 (fuzzy set)P2: for each object oi, , equal participation in the clusteringP3: for each cluster Cj , ensures there is no empty clusterLet c1, …, ck as the center of the k clustersFor an object oi, sum of the squared error (SSE), p is a parameter: For a cluster Ci, SSE:Measure how well a clustering fits the data:17

Probabilistic Model-Based ClusteringCluster analysis is to find hidden categories.A hidden category (i.e., probabilistic cluster) is a distribution over the data space, which can be mathematically represented using a probability density function (or distribution function).Ex. 2 categories for digital cameras soldconsumer line vs. professional linedensity functions f1, f2 for C1, C2obtained by probabilistic clusteringA mixture model assumes that a set of observed objects is a mixture of instances from multiple probabilistic clusters, and conceptually each observed object is generated independentlyOut task: infer a set of k probabilistic clusters that is mostly likely to generate D using the above data generation process18

19Model-Based ClusteringA set C of k probabilistic clusters C1, …,Ck with probability density functions f1, …, fk, respectively, and their probabilities ω1, …, ωk.Probability of an object o generated by cluster Cj is Probability of o generated by the set of cluster C isSince objects are assumed to be generated independently, for a data set D = {o1, …, on}, we have,Task: Find a set C of k probabilistic clusters s.t. P(D|C) is maximizedHowever, maximizing P(D|C) is often intractable since the probability density function of a cluster can take an arbitrarily complicated formTo make it computationally feasible (as a compromise), assume the probability density functions being some parameterized distributions

20Univariate Gaussian Mixture ModelO = {o1, …, on} (n observed objects), Θ = {θ1, …, θk} (parameters of the k distributions), and Pj(oi| θj) is the probability that oi is generated from the j-th distribution using parameter θj, we have Univariate Gaussian mixture model Assume the probability density function of each cluster follows a 1-d Gaussian distribution. Suppose that there are k clusters.The probability density function of each cluster are centered at μj with standard deviation σj, θj, = (μj, σj), we have

The EM (Expectation Maximization) AlgorithmThe k-means algorithm has two steps at each iteration: Expectation Step (E-step): Given the current cluster centers, each object is assigned to the cluster whose center is closest to the object: An object is expected to belong to the closest clusterMaximization Step (M-step): Given the cluster assignment, for each cluster, the algorithm adjusts the center so that the sum of distance from the objects assigned to this cluster and the new center is minimizedThe (EM) algorithm: A framework to approach maximum likelihood or maximum a posteriori estimates of parameters in statistical models.E-step assigns objects to clusters according to the current fuzzy clustering or parameters of probabilistic clustersM-step finds the new clustering or parameters that maximize the sum of squared error (SSE) or the expected likelihood21

Fuzzy Clustering Using the EM AlgorithmInitially, let c1 = a and c2 = b1st E-step: assign o to c1,w. wt =1st M-step: recalculate the centroids according to the partition matrix, minimizing the sum of squared error (SSE)Iteratively calculate this until the cluster centers converge or the change is small enough

23Univariate Gaussian Mixture ModelO = {o1, …, on} (n observed objects), Θ = {θ1, …, θk} (parameters of the k distributions), and Pj(oi| θj) is the probability that oi is generated from the j-th distribution using parameter θj, we have Univariate Gaussian mixture model Assume the probability density function of each cluster follows a 1-d Gaussian distribution. Suppose that there are k clusters.The probability density function of each cluster are centered at μj with standard deviation σj, θj, = (μj, σj), we have

24Computing Mixture Models with EMGiven n objects O = {o1, …, on}, we want to mine a set of parameters Θ = {θ1, …, θk} s.t.,P(O|Θ) is maximized, where θj = (μj, σj) are the mean and standard deviation of the j-th univariate Gaussian distribution We initially assign random values to parameters θj, then iteratively conduct the E- and M- steps until converge or sufficiently small changeAt the E-step, for each object oi, calculate the probability that oi belongs to each distribution,At the M-step, adjust the parameters θj = (μj, σj) so that the expected likelihood P(O|Θ) is maximized

Advantages and Disadvantages of Mixture ModelsStrengthMixture models are more general than partitioning and fuzzy clustering Clusters can be characterized by a small number of parametersThe results may satisfy the statistical assumptions of the generative modelsWeaknessConverge to local optimal (overcome: run multi-times w. random initialization)Computationally expensive if the number of distributions is large, or the data set contains very few observed data pointsNeed large data setsHard to estimate the number of clusters25

26Chapter 11. Cluster Analysis: Advanced MethodsProbability Model-Based ClusteringClustering High-Dimensional DataClustering Graphs and Network DataClustering with ConstraintsSummary26

27Clustering High-Dimensional DataClustering high-dimensional data (How high is high-D in clustering?)Many applications: text documents, DNA micro-array dataMajor challenges: Many irrelevant dimensions may mask clustersDistance measure becomes meaningless—due to equi-distanceClusters may exist only in some subspacesMethodsSubspace-clustering: Search for clusters existing in subspaces of the given high dimensional data spaceCLIQUE, ProClus, and bi-clustering approachesDimensionality reduction approaches: Construct a much lower dimensional space and search for clusters there (may construct new dimensions by combining some dimensions in the original data)Dimensionality reduction methods and spectral clustering

Traditional Distance Measures May Not Be Effective on High-D DataTraditional distance measure could be dominated by noises in many dimensionsEx. Which pairs of customers are more similar?By Euclidean distance, we get, despite Ada and Cathy look more similarClustering should not only consider dimensions but also attributes (features)Feature transformation: effective if most dimensions are relevant (PCA & SVD useful when features are highly correlated/redundant)Feature selection: useful to find a subspace where the data have nice clusters28

29The Curse of Dimensionality (graphs adapted from Parsons et al. KDD Explorations 2004)Data in only one dimension is relatively packedAdding a dimension “stretch” the points across that dimension, making them further apartAdding more dimensions will make the points further apart—high dimensional data is extremely sparseDistance measure becomes meaningless—due to equi-distance

30Why Subspace Clustering?(adapted from Parsons et al. SIGKDD Explorations 2004)Clusters may exist only in some subspacesSubspace-clustering: find clusters in all the subspaces

Subspace Clustering MethodsSubspace search methods: Search various subspaces to find clusters Bottom-up approachesTop-down approachesCorrelation-based clustering methodsE.g., PCA based approachesBi-clustering methodsOptimization-based methodsEnumeration methods

Subspace Clustering Method (I): Subspace Search MethodsSearch various subspaces to find clusters Bottom-up approachesStart from low-D subspaces and search higher-D subspaces only when there may be clusters in such subspacesVarious pruning techniques to reduce the number of higher-D subspaces to be searchedEx. CLIQUE (Agrawal et al. 1998)Top-down approachesStart from full space and search smaller subspaces recursivelyEffective only if the locality assumption holds: restricts that the subspace of a cluster can be determined by the local neighborhoodEx. PROCLUS (Aggarwal et al. 1999): a k-medoid-like method32

33CLIQUE: SubSpace Clustering with Aprori Pruning

Subspace Clustering Method (II): Correlation-Based MethodsSubspace search method: similarity based on distance or densityCorrelation-based method: based on advanced correlation modelsEx. PCA-based approach:Apply PCA (for Principal Component Analysis) to derive a set of new, uncorrelated dimensions, then mine clusters in the new space or its subspacesOther space transformations:Hough transform Fractal dimensions34

Subspace Clustering Method (III): Bi-Clustering MethodsBi-clustering: Cluster both objects and attributes simultaneously (treat objs and attrs in symmetric way)Four requirements:Only a small set of objects participate in a clusterA cluster only involves a small number of attributesAn object may participate in multiple clusters, or does not participate in any cluster at allAn attribute may be involved in multiple clusters, or is not involved in any cluster at all35Ex 1. Gene expression or microarray data: a gene sample/condition matrix. Each element in the matrix, a real number, records the expression level of a gene under a specific conditionEx. 2. Clustering customers and productsAnother bi-clustering problem

Types of Bi-clustersLet A = {a1, ..., an} be a set of genes, B = {b1, …, bn} a set of conditionsA bi-cluster: A submatrix where genes and conditions follow some consistent patterns4 types of bi-clusters (ideal cases)Bi-clusters with constant values:for any i in I and j in J, eij = cBi-clusters with constant values on rows:eij = c + αi Also, it can be constant values on columnsBi-clusters with coherent values (aka. pattern-based clusters)eij = c + αi + βjBi-clusters with coherent evolutions on rowseij (ei1j1− ei1j2)(ei2j1− ei2j2) ≥ 0i.e., only interested in the up- or down- regulated changes across genes or conditions without constraining on the exact values36

Bi-Clustering MethodsReal-world data is noisy: Try to find approximate bi-clustersMethods: Optimization-based methods vs. enumeration methodsOptimization-based methodsTry to find a submatrix at a time that achieves the best significance as a bi-clusterDue to the cost in computation, greedy search is employed to find local optimal bi-clustersEx. δ-Cluster Algorithm (Cheng and Church, ISMB’2000)Enumeration methodsUse a tolerance threshold to specify the degree of noise allowed in the bi-clusters to be minedThen try to enumerate all submatrices as bi-clusters that satisfy the requirementsEx. δ-pCluster Algorithm (H. Wang et al.’ SIGMOD’2002, MaPle: Pei et al., ICDM’2003)37

38Bi-Clustering for Micro-Array Data AnalysisLeft figure: Micro-array “raw” data shows 3 genes and their values in a multi-D space: Difficult to find their patternsRight two: Some subsets of dimensions form nice shift and scaling patternsNo globally defined similarity/distance measureClusters may not be exclusiveAn object can appear in multiple clusters

Bi-Clustering (I): δ-Bi-ClusterFor a submatrix I x J, the mean of the i-th row:The mean of the j-th column:The mean of all elements in the submatrix isThe quality of the submatrix as a bi-cluster can be measured by the mean squared residue valueA submatrix I x J is δ-bi-cluster if H(I x J) ≤ δ where δ ≥ 0 is a threshold. When δ = 0, I x J is a perfect bi-cluster with coherent values. By setting δ > 0, a user can specify the tolerance of average noise per element against a perfect bi-clusterresidue(eij) = eij − eiJ − eIj + eIJ39

Bi-Clustering (I): The δ-Cluster AlgorithmMaximal δ-bi-cluster is a δ-bi-cluster I x J such that there does not exist another δ-bi-cluster I′ x J′ which contains I x JComputing is costly: Use heuristic greedy search to obtain local optimal clustersTwo phase computation: deletion phase and additional phaseDeletion phase: Start from the whole matrix, iteratively remove rows and columns while the mean squared residue of the matrix is over δAt each iteration, for each row/column, compute the mean squared residue:Remove the row or column of the largest mean squared residueAddition phase: Expand iteratively the δ-bi-cluster I x J obtained in the deletion phase as long as the δ-bi-cluster requirement is maintainedConsider all the rows/columns not involved in the current bi-cluster I x J by calculating their mean squared residuesA row/column of the smallest mean squared residue is added into the current δ-bi-clusterIt finds only one δ-bi-cluster, thus needs to run multiple times: replacing the elements in the output bi-cluster by random numbers 40

Bi-Clustering (II): δ-pCluster Enumerating all bi-clusters (δ-pClusters) [H. Wang, et al., Clustering by pattern similarity in large data sets. SIGMOD’02]Since a submatrix I x J is a bi-cluster with (perfect) coherent values iff ei1j1 − ei2j1 = ei1j2 − ei2j2. For any 2 x 2 submatrix of I x J, define p-scoreA submatrix I x J is a δ-pCluster (pattern-based cluster) if the p-score of every 2 x 2 submatrix of I x J is at most δ, where δ ≥ 0 is a threshold specifying a user's tolerance of noise against a perfect bi-clusterThe p-score controls the noise on every element in a bi-cluster, while the mean squared residue captures the average noiseMonotonicity: If I x J is a δ-pClusters, every x x y (x,y ≥ 2) submatrix of I x J is also a δ-pClusters. A δ-pCluster is maximal if no more row or column can be added into the cluster and retain δ-pCluster: We only need to compute all maximal δ-pClusters.41

MaPle: Efficient Enumeration of δ-pClustersPei et al., MaPle: Efficient enumerating all maximal δ-pClusters. ICDM'03Framework: Same as pattern-growth in frequent pattern mining (based on the downward closure property)For each condition combination J, find the maximal subsets of genes I such that I x J is a δ-pClustersIf I x J is not a submatrix of another δ-pClustersthen I x J is a maximal δ-pCluster.Algorithm is very similar to mining frequent closed itemsetsAdditional advantages of δ-pClusters:Due to averaging of δ-cluster, it may contain outliers but still within δ-thresholdComputing bi-clusters for scaling patterns, take logarithmic onwill lead to the p-score form42

Dimensionality-Reduction MethodsDimensionality reduction: In some situations, it is more effective to construct a new space instead of using some subspaces of the original data43Ex. To cluster the points in the right figure, any subspace of the original one, X and Y, cannot help, since all the three clusters will be projected into the overlapping areas in X and Y axes.Construct a new dimension as the dashed one, the three clusters become apparent when the points projected into the new dimensionDimensionality reduction methodsFeature selection and extraction: But may not focus on clustering structure findingSpectral clustering: Combining feature extraction and clustering (i.e., use the spectrum of the similarity matrix of the data to perform dimensionality reduction for clustering in fewer dimensions)Normalized Cuts (Shi and Malik, CVPR’97 or PAMI’2000)The Ng-Jordan-Weiss algorithm (NIPS’01)

Spectral Clustering: The Ng-Jordan-Weiss (NJW) AlgorithmGiven a set of objects o1, …, on, and the distance between each pair of objects, dist(oi, oj), find the desired number k of clustersCalculate an affinity matrix W, where σ is a scaling parameter that controls how fast the affinity Wij decreases as dist(oi, oj) increases. In NJW, set Wij = 0Derive a matrix A = f(W). NJW defines a matrix D to be a diagonal matrix s.t. Dii is the sum of the i-th row of W, i.e.,Then, A is set toA spectral clustering method finds the k leading eigenvectors of A A vector v is an eigenvector of matrix A if Av = λv, where λ is the corresponding eigen-valueUsing the k leading eigenvectors, project the original data into the new space defined by the k leading eigenvectors, and run a clustering algorithm, such as k-means, to find k clustersAssign the original data points to clusters according to how the transformed points are assigned in the clusters obtained44

Spectral Clustering: Illustration and CommentsSpectral clustering: Effective in tasks like image processing Scalability challenge: Computing eigenvectors on a large matrix is costlyCan be combined with other clustering methods, such as bi-clustering45

46Chapter 11. Cluster Analysis: Advanced MethodsProbability Model-Based ClusteringClustering High-Dimensional DataClustering Graphs and Network DataClustering with ConstraintsSummary46

Clustering Graphs and Network DataApplicationsBi-partite graphs, e.g., customers and products, authors and conferencesWeb search engines, e.g., click through graphs and Web graphsSocial networks, friendship/coauthor graphsSimilarity measuresGeodesic distancesDistance based on random walk (SimRank)Graph clustering methodsMinimum cuts: FastModularity (Clauset, Newman & Moore, 2004)Density-based clustering: SCAN (Xu et al., KDD’2007) 47

Similarity Measure (I): Geodesic DistanceGeodesic distance (A, B): length (i.e., # of edges) of the shortest path between A and B (if not connected, defined as infinite)Eccentricity of v, eccen(v): The largest geodesic distance between v and any other vertex u ∈ V − {v}. E.g., eccen(a) = eccen(b) = 2; eccen(c) = eccen(d) = eccen(e) = 3Radius of graph G: The minimum eccentricity of all vertices, i.e., the distance between the “most central point” and the “farthest border” r = min v∈V eccen(v)E.g., radius (g) = 2Diameter of graph G: The maximum eccentricity of all vertices, i.e., the largest distance between any pair of vertices in Gd = max v∈V eccen(v)E.g., diameter (g) = 3A peripheral vertex is a vertex that achieves the diameter.E.g., Vertices c, d, and e are peripheral vertices48

SimRank: Similarity Based on Random Walk and Structural ContextSimRank: structural-context similarity, i.e., based on the similarity of its neighborsIn a directed graph G = (V,E),individual in-neighborhood of v: I(v) = {u | (u, v) ∈ E}individual out-neighborhood of v: O(v) = {w | (v, w) ∈ E}Similarity in SimRank: Initialization:Then we can compute si+1 from si based on the definitionSimilarity based on random walk: in a strongly connected componentExpected distance:Expected meeting distance:Expected meeting probability:49P[t] is the probability of the tour

Graph Clustering: Sparsest CutG = (V,E). The cut set of a cut is the set of edges {(u, v) ∈ E | u ∈ S, v ∈ T } and S and T are in two partitionsSize of the cut: # of edges in the cut setMin-cut (e.g., C1) is not a good partitionA better measure: Sparsity:A cut is sparsest if its sparsity is not greater than that of any other cutEx. Cut C2 = ({a, b, c, d, e, f, l}, {g, h, i, j, k}) is the sparsest cutFor k clusters, the modularity of a clustering assesses the quality of the clustering: The modularity of a clustering of a graph is the difference between the fraction of all edges that fall into individual clusters and the fraction that would do so if the graph vertices were randomly connectedThe optimal clustering of graphs maximizes the modularityli: # edges between vertices in the i-th clusterdi: the sum of the degrees of the vertices in the i-th cluster50

Graph Clustering: Challenges of Finding Good CutsHigh computational costMany graph cut problems are computationally expensiveThe sparsest cut problem is NP-hardNeed to tradeoff between efficiency/scalability and quality Sophisticated graphsMay involve weights and/or cycles.High dimensionalityA graph can have many vertices. In a similarity matrix, a vertex is represented as a vector (a row in the matrix) whose dimensionality is the number of vertices in the graphSparsityA large graph is often sparse, meaning each vertex on average connects to only a small number of other verticesA similarity matrix from a large sparse graph can also be sparse51

Two Approaches for Graph ClusteringTwo approaches for clustering graph dataUse generic clustering methods for high-dimensional dataDesigned specifically for clustering graphs Using clustering methods for high-dimensional dataExtract a similarity matrix from a graph using a similarity measureA generic clustering method can then be applied on the similarity matrix to discover clustersEx. Spectral clustering: approximate optimal graph cut solutionsMethods specific to graphsSearch the graph to find well-connected components as clustersEx. SCAN (Structural Clustering Algorithm for Networks)X. Xu, N. Yuruk, Z. Feng, and T. A. J. Schweiger, “SCAN: A Structural Clustering Algorithm for Networks”, KDD'0752

SCAN: Density-Based Clustering of NetworksHow many clusters?What size should they be?What is the best partitioning?Should some points be segregated?53An Example NetworkApplication: Given simply information of who associates with whom, could one identify clusters of individuals with common interests or special relationships (families, cliques, terrorist cells)?

A Social Network ModelCliques, hubs and outliersIndividuals in a tight social group, or clique, know many of the same people, regardless of the size of the groupIndividuals who are hubs know many people in different groups but belong to no single group. Politicians, for example bridge multiple groupsIndividuals who are outliers reside at the margins of society. Hermits, for example, know few people and belong to no groupThe Neighborhood of a Vertex54Define () as the immediate neighborhood of a vertex (i.e. the set of people that an individual knows )

Structure SimilarityThe desired features tend to be captured by a measure we call Structural SimilarityStructural similarity is large for members of a clique and small for hubs and outliers55

Structural Connectivity [1]-Neighborhood:Core:Direct structure reachable:Structure reachable: transitive closure of direct structure reachabilityStructure connected:[1] M. Ester, H. P. Kriegel, J. Sander, & X. Xu (KDD'96) “A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases56

Structure-Connected ClustersStructure-connected cluster CConnectivity:Maximality:Hubs:Not belong to any clusterBridge to many clustersOutliers:Not belong to any clusterConnect to less clustershuboutlier57

Algorithm = 2 = 0.758

139101178126401523Algorithm  = 2 = 0.70.6359

139101178126401523Algorithm = 2 = 0.70.750.670.8260

Algorithm = 2 = 0.761

139101178126401523Algorithm = 2 = 0.70.6762

139101178126401523Algorithm = 2 = 0.70.730.730.7363

139101178126401523Algorithm = 2 = 0.764

139101178126401523Algorithm = 2 = 0.70.5165

139101178126401523Algorithm = 2 = 0.70.6866

139101178126401523Algorithm = 2 = 0.70.5167

139101178126401523Algorithm = 2 = 0.768

139101178126401523Algorithm = 2 = 0.70.510.510.6869

139101178126401523Algorithm = 2 = 0.770

Running TimeRunning time = O(|E|)For sparse networks = O(|V|)[2] A. Clauset, M. E. J. Newman, & C. Moore, Phys. Rev. E 70, 066111 (2004).71

Chapter 11. Cluster Analysis: Advanced MethodsProbability Model-Based ClusteringClustering High-Dimensional DataClustering Graphs and Network DataClustering with ConstraintsSummary72

73Why Constraint-Based Cluster Analysis?Need user feedback: Users know their applications the bestLess parameters but more user-desired constraints, e.g., an ATM allocation problem: obstacle & desired clusters

74Categorization of ConstraintsConstraints on instances: specifies how a pair or a set of instances should be grouped in the cluster analysisMust-link vs. cannot link constraints must-link(x, y): x and y should be grouped into one clusterConstraints can be defined using variables, e.g., cannot-link(x, y) if dist(x, y) > dConstraints on clusters: specifies a requirement on the clustersE.g., specify the min # of objects in a cluster, the max diameter of a cluster, the shape of a cluster (e.g., a convex), # of clusters (e.g., k)Constraints on similarity measurements: specifies a requirement that the similarity calculation must respectE.g., driving on roads, obstacles (e.g., rivers, lakes)Issues: Hard vs. soft constraints; conflicting or redundant constraints

75Constraint-Based Clustering Methods (I):Handling Hard ConstraintsHandling hard constraints: Strictly respect the constraints in cluster assignmentsExample: The COP-k-means algorithm Generate super-instances for must-link constraintsCompute the transitive closure of the must-link constraintsTo represent such a subset, replace all those objects in the subset by the mean.The super-instance also carries a weight, which is the number of objects it representsConduct modified k-means clustering to respect cannot-link constraints Modify the center-assignment process in k-means to a nearest feasible center assignment An object is assigned to the nearest center so that the assignment respects all cannot-link constraints

Constraint-Based Clustering Methods (II):Handling Soft ConstraintsTreated as an optimization problem: When a clustering violates a soft constraint, a penalty is imposed on the clusteringOverall objective: Optimizing the clustering quality, and minimizing the constraint violation penaltyEx. CVQE (Constrained Vector Quantization Error) algorithm: Conduct k-means clustering while enforcing constraint violation penaltiesObjective function: Sum of distance used in k-means, adjusted by the constraint violation penaltiesPenalty of a must-link violationIf objects x and y must-be-linked but they are assigned to two different centers, c1 and c2, dist(c1, c2) is added to the objective function as the penaltyPenalty of a cannot-link violationIf objects x and y cannot-be-linked but they are assigned to a common center c, dist(c, c′), between c and c′ is added to the objective function as the penalty, where c′ is the closest cluster to c that can accommodate x or y76

77Speeding Up Constrained ClusteringIt is costly to compute some constrained clustering Ex. Clustering with obstacle objects: Tung, Hou, and Han. Spatial clustering in the presence of obstacles, ICDE'01K-medoids is more preferable since k-means may locate the ATM center in the middle of a lakeVisibility graph and shortest path Triangulation and micro-clusteringTwo kinds of join indices (shortest-paths) worth pre-computationVV index: indices for any pair of obstacle verticesMV index: indices for any pair of micro-cluster and obstacle indices

78An Example: Clustering With Obstacle ObjectsTaking obstacles into accountNot Taking obstacles into account

79User-Guided Clustering: A Special Kind of ConstraintsX. Yin, J. Han, P. S. Yu, “Cross-Relational Clustering with User's Guidance”, KDD'05 User usually has a goal of clustering, e.g., clustering students by research areaUser specifies his clustering goal to CrossClus

80Comparing with ClassificationUser-specified feature (in the form of attribute) is used as a hint, not class labelsThe attribute may contain too many or too few distinct values, e.g., a user may want to cluster students into 20 clusters instead of 3Additional features need to be included in cluster analysisAll tuples for clusteringUser hint

81Comparing with Semi-Supervised ClusteringSemi-supervised clustering: User provides a training set consisting of “similar” (“must-link) and “dissimilar” (“cannot link”) pairs of objectsUser-guided clustering: User specifies an attribute as a hint, and more relevant features are found for clusteringAll tuples for clusteringSemi-supervised clusteringAll tuples for clusteringUser-guided clusteringx

82Why Not Semi-Supervised Clustering?Much information (in multiple relations) is needed to judge whether two tuples are similarA user may not be able to provide a good training setIt is much easier for a user to specify an attribute as a hint, such as a student’s research areaTuples to be compared

83CrossClus: An OverviewMeasure similarity between features by how they group objects into clustersUse a heuristic method to search for pertinent featuresStart from user-specified feature and gradually expand search rangeUse tuple ID propagation to create feature valuesFeatures can be easily created during the expansion of search range, by propagating IDsExplore three clustering algorithms: k-means, k-medoids, and hierarchical clustering

84Multi-Relational FeaturesA multi-relational feature is defined by: A join path, e.g., Student → Register → OpenCourse → CourseAn attribute, e.g., Course.area(For numerical feature) an aggregation operator, e.g., sum or averageCategorical feature f = [Student → Register → OpenCourse → Course, Course.area, null]areas of courses of each studentValues of feature f

85Representing FeaturesSimilarity between tuples t1 and t2 w.r.t. categorical feature fCosine similarity between vectors f(t1) and f(t2) Most important information of a feature f is how f groups tuples into clustersf is represented by similarities between every pair of tuples indicated by fThe horizontal axes are the tuple indices, and the vertical axis is the similarityThis can be considered as a vector of N x N dimensionsSimilarity vector Vf

86Similarity Between FeaturesValues of Feature f and gSimilarity between two features – cosine similarity of two vectorsVfVg

87Computing Feature SimilarityTuplesFeature fFeature gDBAITHInfo sysCog sciTheorySimilarity between feature values w.r.t. the tuplessim(fk,gq)=Σi=1 to N f(ti).pk∙g(ti).pqTuple similarities, hard to computeFeature value similarities, easy to computeCompute similarity between each pair of feature values by one scan on data

88Searching for Pertinent FeaturesDifferent features convey different aspects of informationFeatures conveying same aspect of information usually cluster tuples in more similar waysResearch group areas vs. conferences of publicationsGiven user specified featureFind pertinent features by computing feature similarity

89Heuristic Search for Pertinent FeaturesOverall procedure1. Start from the user- specified feature2. Search in neighborhood of existing pertinent features3. Expand search range graduallyTarget of clusteringUser hint12Tuple ID propagation is used to create multi-relational featuresIDs of target tuples can be propagated along any join path, from which we can find tuples joinable with each target tuple

90Clustering with Multi-Relational FeaturesGiven a set of L pertinent features f1, …, fL, similarity between two tuplesWeight of a feature is determined in feature search by its similarity with other pertinent featuresClustering methodsCLARANS [Ng & Han 94], a scalable clustering algorithm for non-Euclidean spaceK-meansAgglomerative hierarchical clustering

91Experiments: Compare CrossClus withBaseline: Only use the user specified featurePROCLUS [Aggarwal, et al. 99]: a state-of-the-art subspace clustering algorithmUse a subset of features for each clusterWe convert relational database to a table by propositionalizationUser-specified feature is forced to be used in every clusterRDBC [Kirsten and Wrobel’00]A representative ILP clustering algorithmUse neighbor information of objects for clusteringUser-specified feature is forced to be used

92Measure of Clustering AccuracyAccuracyMeasured by manually labeled dataWe manually assign tuples into clusters according to their properties (e.g., professors in different research areas)Accuracy of clustering: Percentage of pairs of tuples in the same cluster that share common labelThis measure favors many small clustersWe let each approach generate the same number of clusters

93DBLP Dataset

94Chapter 11. Cluster Analysis: Advanced MethodsProbability Model-Based ClusteringClustering High-Dimensional DataClustering Graphs and Network DataClustering with ConstraintsSummary94

95SummaryProbability Model-Based ClusteringFuzzy clusteringProbability-model-based clusteringThe EM algorithmClustering High-Dimensional DataSubspace clustering: bi-clustering methodsDimensionality reduction: Spectral clusteringClustering Graphs and Network DataGraph clustering: min-cut vs. sparsest cutHigh-dimensional clustering methodsGraph-specific clustering methods, e.g., SCANClustering with ConstraintsConstraints on instance objects, e.g., Must link vs. Cannot LinkConstraint-based clustering algorithms

96References (I)R. Agrawal, J. Gehrke, D. Gunopulos, and P. Raghavan. Automatic subspace clustering of high dimensional data for data mining applications. SIGMOD’98C. C. Aggarwal, C. Procopiuc, J. Wolf, P. S. Yu, and J.-S. Park. Fast algorithms for projected clustering. SIGMOD’99S. Arora, S. Rao, and U. Vazirani. Expander flows, geometric embeddings and graph partitioning. J. ACM, 56:5:1–5:37, 2009.J. C. Bezdek. Pattern Recognition with Fuzzy Objective Function Algorithms. Plenum Press, 1981.K. S. Beyer, J. Goldstein, R. Ramakrishnan, and U. Shaft. When is ”nearest neighbor” meaningful? ICDT’99 Y. Cheng and G. Church. Biclustering of expression data. ISMB’00 I. Davidson and S. S. Ravi. Clustering with constraints: Feasibility issues and the k-means algorithm. SDM’05I. Davidson, K. L. Wagstaff, and S. Basu. Measuring constraint-set utility for partitional clustering algorithms. PKDD’06C. Fraley and A. E. Raftery. Model-based clustering, discriminant analysis, and density estimation. J. American Stat. Assoc., 97:611–631, 2002.F. H¨oppner, F. Klawonn, R. Kruse, and T. Runkler. Fuzzy Cluster Analysis: Methods for Classification, Data Analysis and Image Recognition. Wiley, 1999.G. Jeh and J. Widom. SimRank: a measure of structural-context similarity. KDD’02 H.-P. Kriegel, P. Kroeger, and A. Zimek. Clustering high dimensional data: A survey on subspace clustering, pattern-based clustering, and correlation clustering. ACM Trans. Knowledge Discovery from Data (TKDD), 3, 2009.U. Luxburg. A tutorial on spectral clustering. Statistics and Computing, 17:395–416, 2007

References (II)G. J. McLachlan and K. E. Bkasford. Mixture Models: Inference and Applications to Clustering. John Wiley & Sons, 1988.B. Mirkin. Mathematical classification and clustering. J. of Global Optimization, 12:105–108, 1998.S. C. Madeira and A. L. Oliveira. Biclustering algorithms for biological data analysis: A survey. IEEE/ACM Trans. Comput. Biol. Bioinformatics, 1, 2004.A. Y. Ng, M. I. Jordan, and Y. Weiss. On spectral clustering: Analysis and an algorithm. NIPS’01J. Pei, X. Zhang, M. Cho, H. Wang, and P. S. Yu. Maple: A fast algorithm for maximal pattern-based clustering. ICDM’03 M. Radovanovi´c, A. Nanopoulos, and M. Ivanovi´c. Nearest neighbors in high-dimensional data: the emergence and influence of hubs. ICML’09S. E. Schaeffer. Graph clustering. Computer Science Review, 1:27–64, 2007.A. K. H. Tung, J. Hou, and J. Han. Spatial clustering in the presence of obstacles. ICDE’01A. K. H. Tung, J. Han, L. V. S. Lakshmanan, and R. T. Ng. Constraint-based clustering in large databases. ICDT’01A. Tanay, R. Sharan, and R. Shamir. Biclustering algorithms: A survey. In Handbook of Computational Molecular Biology, Chapman & Hall, 2004.K. Wagstaff, C. Cardie, S. Rogers, and S. Schr¨odl. Constrained k-means clustering with background knowledge. ICML’01 H. Wang, W. Wang, J. Yang, and P. S. Yu. Clustering by pattern similarity in large data sets. SIGMOD’02X. Xu, N. Yuruk, Z. Feng, and T. A. J. Schweiger. SCAN: A structural clustering algorithm for networks. KDD’07X. Yin, J. Han, and P.S. Yu, “Cross-Relational Clustering with User's Guidance”, KDD'05


Slides Not to Be Used in Class99

100Conceptual ClusteringConceptual clusteringA form of clustering in machine learningProduces a classification scheme for a set of unlabeled objectsFinds characteristic description for each concept (class)COBWEB (Fisher’87) A popular a simple method of incremental conceptual learningCreates a hierarchical clustering in the form of a classification treeEach node refers to a concept and contains a probabilistic description of that concept

101COBWEB Clustering MethodA classification tree

102More on Conceptual ClusteringLimitations of COBWEBThe assumption that the attributes are independent of each other is often too strong because correlation may existNot suitable for clustering large database data – skewed tree and expensive probability distributionsCLASSITan extension of COBWEB for incremental clustering of continuous datasuffers similar problems as COBWEB AutoClass (Cheeseman and Stutz, 1996)Uses Bayesian statistical analysis to estimate the number of clustersPopular in industry

103Neural Network ApproachesNeural network approachesRepresent each cluster as an exemplar, acting as a “prototype” of the clusterNew objects are distributed to the cluster whose exemplar is the most similar according to some distance measureTypical methodsSOM (Soft-Organizing feature Map)Competitive learningInvolves a hierarchical architecture of several units (neurons)Neurons compete in a “winner-takes-all” fashion for the object currently being presented

104Self-Organizing Feature Map (SOM)SOMs, also called topological ordered maps, or Kohonen Self-Organizing Feature Map (KSOMs) It maps all the points in a high-dimensional source space into a 2 to 3-d target space, s.t., the distance and proximity relationship (i.e., topology) are preserved as much as possibleSimilar to k-means: cluster centers tend to lie in a low-dimensional manifold in the feature spaceClustering is performed by having several units competing for the current objectThe unit whose weight vector is closest to the current object winsThe winner and its neighbors learn by having their weights adjustedSOMs are believed to resemble processing that can occur in the brainUseful for visualizing high-dimensional data in 2- or 3-D space

105Web Document Clustering Using SOMThe result of SOM clustering of 12088 Web articlesThe picture on the right: drilling down on the keyword “mining”Based on Web page

Why Semi-Supervised Learning?Sparsity in data: training examples cannot cover the data space well unlabeled data can help to address sparsity106

Semi-Supervised Learning MethodsMany methods exist: EM with generative mixture models, self-training, co-training, data-based methods, transductive SVM, graph-based methods, …Inductive methods and Transductive methodsTransductive methods: only label the available unlabeled data – not generating a classifier Inductive methods: not only produce labels for unlabeled data, but also generate a classifier Algorithmic methodsClassifier-based methods: start from an initial classifier, and iteratively enhance itData-based methods: find an inherent geometry in the data, and use the geometry to find a good classifier107


Graph MincutsPositive samples as sources and negative samples as sinksUnlabeled samples are connected to other samples with weights based on similarityObjective: find a minimum set of edges to remove so that all flows from sources to sinks are blocked109

Bi-clustering and Co-clusteringBiclustering, co-clustering, or two-mode clustering allows simultaneous clustering of the rows and columns of a matrixGiven a set of m rows in n columns (i.e., an m×n matrix), a biclustering algorithm generates biclusters – a subset of rows which exhibit similar behavior across a subset of columns, or vice versa110

111Chapter 11. Cluster Analysis: Advanced MethodsStatistics-Based ClusteringClustering High-Dimensional DataSemi-Supervised Learning and Active LearningConstraint-Based ClusteringBi-Clustering and co-ClusteringCollaborative filteringSpectral ClusteringEvaluation of Clustering QualitySummary111

Collaborative FilteringCollaborative filtering (CF) is the process of filtering for information or patterns using techniques involving collaboration among multiple agents, viewpoints, data sources, etc.Applications involving very large data sets: sensing and monitoring data, financial data, electronic commerce and web 2.0 applicationsExample: a method of making automatic predictions (filtering) about the interests of a user by collecting taste information from many users (collaborating) Assumption: those who agreed in the past tend to agree again in the future112

FrameworkA general 2-step processLook for users who share the same rating patterns with the active user (the one for which the prediction is made)Use the ratings of those found in step 1 to calculate a predictionItem-based filtering (used in an item-to-item matrix determining the relationships between each pair of itemsInfer the user’s taste (i.e., the prediction) using the matrix113

Spectral ClusteringGiven a set of data points A, a similarity matrix S may be defined where Sij represents a measure of the similarity between points i and j (i, j ∊ A)Spectral clustering makes use of the spectrum of the similarity matrix of the data to perform dimensionality reduction for clustering in fewer dimensionsIn functional analysis, the spectrum of a bounded operator is a generalization of eigenvalues for matricesA complex number λ is said to be in the spectrum of a bounded linear operator T if λI − T is not invertible, where I is the identity operator114

Shi-Malik AlgorithmGiven a set S of points, the algorithm partitions the points into two sets S1 and S2 Let v be the eigenvector v corresponding to the second-smallest eigenvalue of the Laplacian matrix L of SL = I – D–1/2 SD–1/2, where D is the diagonal matrix Let m be the median of the components in vPlace all points whose component in v is greater than m in S1, and the rest in S2115

Example116Extraction from