Spectral Embedding Norm: Looking Deep into the Spectrum of the Graph Laplacian

The extraction of clusters from a dataset which includes multiple clusters and another significant portion of "background" samples is a task of practical importance. The traditional spectral clustering algorithm, relying on the leading $K$ eigenvectors to detect the $K$ clusters, fails in such cases. This paper proposes the spectral embedding norm which sums the squared values of the first $I$ (normalized) eigenvectors, where $I$ can be larger than $K$. We prove that the quantity can be used to separate clusters from the background under generic conditions motivated by applications such as anomaly detection. The performance of the algorithm is not sensitive to the choice of $I$, and we present experiments on synthetic and real-world datasets.


Introduction
In unsupervised data analysis and machine learning, one of the most common goals is to group the data points into clusters. A variant task is to extract interesting clusters from the data when, in practice, data points do not perfectly fall into $K$ clusters. Often, data points can be roughly divided into two parts: one consists of interesting sub-groups, namely "clusters", and the other contains points which are less structured or of less interest, which we call "background". Theoretically, one may view the background component as an extra cluster, and this poses challenges for traditional clustering algorithms. In this paper, we consider the situation where the background takes a significant portion of the data compared to the size of the clusters. This unbalanced case is a classical scenario where traditional spectral clustering fails to identify the existing clusters [1][2][3]. For example, the task of anomaly (or outlier) detection is such a case, where anomalous samples (small clusters) in the dataset differ from the normal ones (background) and indicate that something important has happened or a problem has occurred. By the very nature of the problem, most data points belong to the background and only a small fraction of data points are anomalies. Other examples include image segmentation and saliency detection, where the clusters are certain regions of interest to detect and the background consists of the rest of the image; see applications in [4,5].
As a popular clustering method, the spectral clustering algorithm [6][7][8] reduces the dimensionality of the data using a spectral embedding, and then performs clustering in the low-dimensional space. The method was originally proposed to cluster data into $K$ clusters with the leading $K$ eigenvectors (the low-lying eigenvectors of the graph Laplacian) computed from an affinity matrix built from the data [6,7,9,10]. A question is then how to set the parameter $K$, and this is especially important for exploratory data analysis when the number of clusters underlying the data is not known a priori. The traditional solution is to use the spectral gap of the eigenvalues to determine $K$ [11], yet in practical settings such a gap may not exist. In particular, as has been pointed out by [1] and others, when the cluster sizes are unbalanced, or a large background component is present, there is no spectral gap after the $K$-th eigenvalue and the leading eigenvectors tend to be supported mostly on the large component due to the slow mixing time of the diffusion process restricted to it. This is indeed the case for anomaly detection, which poses a challenge for spectral methods. In practice, spectral embeddings have been used for anomaly detection in modified ways, based on the first non-trivial eigenvector of an affinity matrix [12,13], eigenvector selection [14,15], out-of-sample extension [16][17][18], and multiscale constructions of spectral embeddings [16,19]; these approaches usually require tuning of multiple parameters.
In this paper we consider a cluster-background splitting model of the graph, including anomaly detection as a special case. The model is motivated by applications, and experiments on real-world datasets will be shown. We propose a quantity called the spectral embedding norm, which maps each node in the graph to a positive number, and separates clusters from background with a theoretical guarantee. The idea is closely related to the "localization" pattern of the eigenvectors, namely where they are supported (specifically, either mainly on the cluster block or mainly on the background block), and this pattern persists even when the spectral gap vanishes. Viewing the affinity matrix as a perturbation of a baseline affinity where the background and clusters are completely disconnected, one can analyze the consequent deformation of the spectrum of the graph Laplacian matrix. However, the instability of eigenvectors under the deformation makes it difficult to use individual eigenvectors in this environment. The spectral embedding norm, on the other hand, improves the stability by using a summation over multiple eigenmodes and provides guaranteed detection of the clusters by a simple thresholding of the quantity. The algorithm involves a parameter which is the number of eigenvectors summed over, and we will show that the method is not sensitive to the choice of this parameter.
This paper is organized as follows. We introduce the spectral embedding norm in Section 2, starting with a toy example which illustrates the main idea of the paper. The theoretical result is presented in Section 3, including extensions of the basic result. Experiments on synthetic and real-world datasets are shown in Section 4, proofs in Section 5 and further remarks in the final section.

Related works
As spectral clustering and its variants have been intensively studied in the literature, we only list the most relevant works, and the references given here may not be exhaustive.
Spectral clustering and limitations. Different papers have highlighted the limitations of spectral clustering, mainly that, in practice, the top $K$ eigenvectors may not localize on $K$ given clusters, causing spectral clustering to fail [1][2][3]. Previous works have analyzed the limitations and successes of spectral clustering in view of a diffusion process in a multi-well potential or a mixture of Gaussians model [1,10,20], or a nonparametric mixture model [21]. While Schiebinger et al. [21] showed that under certain conditions the embedded points lie in an orthogonal cone structure and k-means succeeds in clustering the data, Nadler and Galun [1] showed that even for well-separated Gaussians the top $K$ eigenvectors do not necessarily localize on $K$ clusters. Different methods, like [9,22], attempted to align the eigenvector axes with the different clusters and improve the robustness of cluster identification. Zelnik-Manor and Perona [9] proposed to estimate the number of clusters from the eigenvectors instead of from the spectral gap, and empirically demonstrated improved performance when a background cluster is present (called "clutter"), while Damle, Minden and Ying [22] considered the case of a balanced block-like affinity matrix. The embedding norm studied in the current work differs from the above approaches, and it involves a simple algorithm with a theoretical guarantee under the specified settings.
Selection of eigenvectors. Several papers proposed to perform eigenvector selection to determine eigenvectors that localize on clusters or specifically on anomalies, using unsupervised spectral ranking [13], kurtosis [15], relevance learning [3], entropy [2,23], the L1 norm [14], and local linear regression [24]. Both [2,3] proposed calculating $Km$ eigenvectors, where $m > 1$, and then selecting informative eigenvectors. The theoretical analysis of [15] focused on the adjacency matrix of a graph, while [14] considered the modularity matrix of a graph. We will analyze the spectrum of the (normalized) random-walk Laplacian matrix, which has a more stable spectrum with finite samples [25]. In [26], the authors presented a two-level tensor product construction to model the localization of eigenvectors on low-order eigenvectors and produced empirical improvements, but no theoretical analysis was provided to explain the model.
Spectral methods for anomaly detection. The following papers have employed spectral embeddings for anomaly detection in different applications: in images [16,27], fraud detection [13], graphs [14,15], sensor networks [19], computer systems [12] and general applications [17,18,28]. For a general review of anomaly detection methods, the interested reader is referred to [29][30][31]. As has been shown in [13][14][15]32], and as we will demonstrate, the eigenvectors in an anomaly setting can lie deep within the spectrum of the affinity matrix. This gives rise to the notion of abandoning the guideline of focusing on $K$ eigenvectors and rather choosing to look deeper into the spectrum, both for clustering and for anomaly detection. Similar to the embedding norm which we propose, the square sum of principal components (called "leverage score") has been used in a different setting as an indicator of outlier sample points for linear regression and other applications in statistics [33].
Notations. $|\cdot|$ stands for the cardinality of a set. $A^c$ denotes the complement of a set $A$.

Cluster-Background splitting in the graph
Given $n$ data points in the feature space, an undirected weighted graph can be constructed which has $n$ nodes, denoted by $V$, and the weight on edge $(x, y)$ is the affinity between nodes $x$ and $y$, denoted by $W(x, y)$, with $W(x, y) = W(y, x) \ge 0$. $W$ is an $n$-by-$n$ real-symmetric matrix of non-negative entries, called the graph affinity matrix. In applications, $W$ is built as a pairwise affinity between data points in a feature space, e.g., $W(x, y) = k(x, y)$, where $k$ is a symmetric kernel function applied to the feature vectors of data points $x$ and $y$. In our analysis we assume that $W$ has been constructed. Suppose that $V$ can be divided into two disjoint subsets, background and clusters, denoted by $B$ and $C$ respectively. The typical scenario which we consider is when data points in $C$ are concentrated in the feature space and well-clustered into $K$ sub-clusters, whereas those in $B$ can be "manifold-like" and spread over the space. The precise assumptions will be formulated in terms of the graph-Laplacian spectra constrained to the subgraphs of $C$ and $B$ (see Assumption 1). We also assume the connections between $C$ and $B$ are weak. As a result, the submatrix of $W$ constrained to $C$ is close to having $K$ blocks, and is almost separated from the submatrix of $B$. Define the matrix $W_0$ by removing all the connections between $B$ and $C$ from $W$, i.e., $W_0$ is a block-diagonal matrix consisting of the two blocks of $C$ and $B$. We introduce a pseudo-dynamic parametrized by time $t$ as
$$W(t) = W_0 + tE, \quad t \in [0, 1], \tag{1}$$
where $E := W - W_0$ collects the connections between $B$ and $C$, so that $W(0) = W_0$ and $W(1) = W$. The analysis concerns the deformation of the spectrum of $W(t)$ (specifically, that of the normalized graph Laplacian) over time $t$.
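The deformation model can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the helper names `split_affinity` and `deformed_affinity` and the boolean mask `in_C` (marking which nodes belong to $C$) are hypothetical, and $E$ is taken to be $W - W_0$, as implied by $W(1) = W$.

```python
import numpy as np

def split_affinity(W, in_C):
    """Split W into a block-diagonal part W0 and the C-B coupling E = W - W0.

    `in_C` is a boolean mask over the nodes (hypothetical input: True for
    nodes in the cluster set C, False for nodes in the background B).
    """
    W = np.asarray(W, dtype=float)
    in_C = np.asarray(in_C, dtype=bool)
    # cross[x, y] is True exactly when one endpoint is in C and the other in B.
    cross = in_C[:, None] != in_C[None, :]
    E = np.where(cross, W, 0.0)   # keeps only the C-B connections
    W0 = W - E                    # C-B connections removed
    return W0, E

def deformed_affinity(W0, E, t):
    """W(t) = W0 + t * E for t in [0, 1]."""
    return W0 + t * E

# Small demonstration on a random symmetric non-negative matrix.
rng = np.random.default_rng(0)
A = rng.random((6, 6))
W_demo = (A + A.T) / 2
in_C_demo = np.array([True, True, False, False, False, False])
W0_demo, E_demo = split_affinity(W_demo, in_C_demo)
```

At $t = 0$ the C-B blocks of `deformed_affinity` vanish, and at $t = 1$ the original matrix is recovered.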
To simplify the analysis, we assume that the $K$ sub-clusters are of equal size; that is, $|C| = \delta |V|$ and each of the $K$ clusters in $C$ has $\delta|V|/K$ nodes. Our result extends to the unequal-size case (Section 3.3).

Graph Laplacian and embedding norm
Consider the random-walk graph Laplacian of $W$, $L = I - D^{-1}W := I - P$, where $D$ is a diagonal matrix defined by $D_{ii} = \sum_j W_{ij}$, and $P$ is a Markov matrix. We shall see that $D$ is always invertible. $P$ is similar to $D^{-1/2} W D^{-1/2}$, which is real-symmetric; thus $P$ is diagonalizable and has $n$ real eigenvalues. Let
$$P\psi_k = \lambda_k \psi_k, \quad \psi_k^T D \psi_j = \delta_{kj}, \tag{2}$$
where $\{\lambda_k\}_k$ are the eigenvalues of $P$, $\{\psi_k\}_k$ are the corresponding right eigenvectors, and $\delta_{kj} = 1$ when $k = j$ and 0 otherwise. We have $|\lambda_k| \le 1$ (Perron-Frobenius theorem), and when $W$ is a positive-definite kernel matrix all the eigenvalues lie between 0 and 1. The largest eigenvalue is 1 and the associated eigenvector is the constant vector. Note that time dependence is omitted in the above notations: as we introduced the deformation of $W$ in (1), $D$, $P$ and consequently $\psi_k$ and $\lambda_k$ also depend on $t$. We assume that at $t = 0$ the eigenvalues are sorted in decreasing order, and for other $t$ the indexing $k$ is arranged so that $\psi_k$ and $\lambda_k$ are differentiable with respect to $t$ (see Section 3).
The spectral embedding norm of every node $x \in V$ is defined to be
$$S(x) = \sum_{k \in I} \psi_k(x)^2, \tag{3}$$
where $I$ is a subset of the eigenvalue indices $\{1, \cdots, n\}$. $S(x)$ is the (squared) Euclidean norm of the embedded vector of $x$ in the spectral embedding space using eigenvectors of indices in $I$. When needed, we include the time dependence in the notation, written as $S(x, t)$, $t \in [0, 1]$. A typical choice is $I = \{1, \cdots, |I|\}$ when the eigenvalues are sorted in descending order. The cardinality $|I|$ is chosen to be around $K$ and up to a few multiples of $K$, and in practice estimates of $K$ can be used if $K$ is not known. We will show that the result is not sensitive to the choice of $|I|$, in analysis and experiments. Under certain assumptions, the embedding norm $S$ is able to separate $C$ from $B$ by a provable margin (Theorem 3.1). A general weighted form of $S$ is introduced in Section 3.3, and the main result extends directly. The quantity (3) is naturally related to the diffusion distance [34]; we elaborate on this in the last section.
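As an illustration of the definition, the following sketch computes the embedding norm from a dense affinity matrix. It assumes the $D$-weighted normalization $\psi_k^T D \psi_j = \delta_{kj}$ and takes $I$ to be the indices of the `num_eig` largest eigenvalues; the function name `spectral_embedding_norm` is hypothetical. The eigenvectors of $P$ are obtained from the symmetric matrix $D^{-1/2} W D^{-1/2}$, to which $P$ is similar.

```python
import numpy as np

def spectral_embedding_norm(W, num_eig):
    """Return (S, lam, Psi) for the Markov matrix P = D^{-1} W.

    S[x] sums psi_k(x)^2 over the `num_eig` largest eigenvalues of P.
    Assumes W is symmetric, non-negative, with strictly positive row sums.
    """
    W = np.asarray(W, dtype=float)
    d = W.sum(axis=1)                              # degrees D_ii
    d_isqrt = 1.0 / np.sqrt(d)
    A = d_isqrt[:, None] * W * d_isqrt[None, :]    # D^{-1/2} W D^{-1/2}
    lam, V = np.linalg.eigh(A)                     # ascending eigenvalues
    order = np.argsort(lam)[::-1]                  # re-sort to descending
    lam, V = lam[order], V[:, order]
    Psi = d_isqrt[:, None] * V                     # right eigenvectors of P
    S = np.sum(Psi[:, :num_eig] ** 2, axis=1)
    return S, lam, Psi

# Small demonstration on a random symmetric positive matrix.
rng = np.random.default_rng(0)
A_demo = rng.random((8, 8))
W_demo = (A_demo + A_demo.T) / 2
S_demo, lam_demo, Psi_demo = spectral_embedding_norm(W_demo, num_eig=3)
```

A useful sanity check under this normalization: when all $n$ eigenvectors are included, $\sum_k \psi_k \psi_k^T = D^{-1}$, so the embedding norm of node $x$ equals the reciprocal of its degree. For large sparse graphs one would presumably replace the dense eigensolver with an iterative one; that choice is outside the scope of this sketch.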

A prototypical toy example
The prototypical scenario which motivates the proposed method is illustrated in the toy example in Figure 1. Panel (a) shows data points in $\mathbb{R}^2$ consisting of two groups: a large group, denoted by $B$, which lies close to the unit circle (blue), and a small one, denoted by $C$, which forms a small cluster lying close to the circle (red), with $|C|/|V| = \delta = 0.01$. Variations of this model involve multiple sub-clusters in $C$ (see Figure 4 and the application in Section 4.3), and the qualitative picture is the same. The affinity matrix $W$ built from the data is shown in (b). The first few eigenvalues, changing over time $t$, are shown in (c), and the associated eigenvectors evaluated on two nodes, one in $C$ and one in $B$, are plotted in (d) and (e). We sort the eigenvalues of $P$ from large to small, and the first eigenvalue is always 1. The embedding norm $S(x)$ takes the squared sum of the first 40 eigenvectors on each node (cf. (3)) and is plotted in (f) over time.
The figure demonstrates that
1. Though two blocks $C$ and $B$ exist in the graph, there is no clear eigen-gap between the second and third eigenvalues. Actually, the leading eigenvalues are all very close to 1 throughout time (the eighth eigenvalue is greater than 0.998).
2. While the second eigenvector $\psi_2$ distinguishes $C$ at short time $t$, once $t$ is greater than 0.01, $\psi_2$ fails to indicate the cluster $C$ ((d) shows the value on one node, and it is typical for the values of the $\psi_k$'s on $C$). The transition actually happens when the initial gap between $\lambda_2$ and $\lambda_3$ almost vanishes. However, the high-index eigenvectors may take large values on $C$ (the eighth eigenvector starts to take large magnitude on $C$ around $t = 0.1$, and the trend of high-indexed eigenvectors localizing on $C$ continues, which is not shown). This is evidenced by $S(x)$ consistently distinguishing $C$ from $B$ over time (shown in (f)).
This suggests that when the leading eigenvectors fail to identify the cluster $C$, the information of the location of $C$ may be contained in higher-indexed eigenvectors, and looking "deep" into the spectrum may be helpful. However, the selection of "informative" eigenvectors is generally a challenging problem. In particular, as shown in (d)(e), the deformation of eigenvectors is not stable when eigenvalues get close, which makes it difficult to study them individually. Instead, we consider in the embedding norm the summation over a group of eigenvectors, and intuitively this should improve the stability over time. This is indeed the case here: the quantity $S$ varies smoothly over time and preserves a gap between $C$ and $B$, as shown in (f); thus one can detect $C$ from $B$ by thresholding the value of $S$. Note that 40 eigenvectors are used in the summation, which is much larger than 2. We will justify this in the analysis.

Analysis
At $t = 0$ in (1), the matrix $W_0$ has the two-block structure, and the spectrum of the graph Laplacian of $W_0$ also splits into two groups, one residing on $C$ and the other on $B$. However, as $t$ increases, interactions among the eigenvectors develop and the perfect splitting pattern is no longer preserved. In this section we show that the embedding norm varies more stably than individual eigenvectors over time, and serves as a measure by which to separate $C$ from $B$ up to time $t = 1$.
Figure 2: Diagram of the eigenvalue evolution over time. Equation (19) is obtained by the contour integral $\Gamma$, which exists for all $t$ due to the positive $\Delta(t)$.

Initial separation by S and assumptions
Since we will use $S$ to separate $B$ and $C$, we need it to do so at least at $t = 0$, when the two blocks $B$ and $C$ are perfectly separated by removing all the edges connecting them. Note that this does not necessarily happen unless certain assumptions are made: because the eigenvalues of the $B$ block can be close to 1 and the clustering in the $C$ block may not be perfect, the first $|I|$ eigenvectors may be supported either on $C$ or on $B$, and there is generally no guarantee that the squared sum (3) will distinguish the two blocks. The assumptions needed here are the following two: (1) at $t = 0$, the eigenvectors in $I$ which are supported on $B$ are sufficiently delocalized ("flat") and those on $C$ are close to the well-clustered case; (2) the fraction $\delta$ of $|C|$ is sufficiently small so that the eigenvectors on $C$ are of sufficiently larger magnitude than those on $B$, due to the eigenvector normalization (2). The precise condition depends on the choice of $|I|$, the node degrees and so on.
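To proceed, we define the degree function to be
$$d(x, t) = \sum_{y \in V} W(t)(x, y),$$
which depends on time $t$. We introduce the notation $\nu(A, t)$ to stand for the "volume" of a set $A$ at time $t$, where "volume" means the sum of the degrees,
$$\nu(A, t) = \sum_{x \in A} d(x, t).$$
We also define lower and upper bounds $\underline{d}_0$ and $\overline{d}_0$ of the initial degrees, so that $\underline{d}_0 \le d(x, 0) \le \overline{d}_0$ for all $x \in V$, and assume that $\underline{d}_0 > 0$. By construction (1), the degree $d(x, t)$ of any node is monotonically non-decreasing over time. Thus $\underline{d}_0$ is the universal degree lower bound:
Lemma 1. For all $x \in V$ and all $0 \le t \le 1$, $d(x, t) \ge \underline{d}_0 > 0$.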
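The monotonicity behind Lemma 1 can be spelled out in one line. The following is a short worked derivation, assuming the degree is the row sum of $W(t) = W_0 + tE$, that $E = W - W_0$ is entrywise non-negative (it contains only the C-B entries of $W$), and writing $\underline{d}_0$ for the lower bound on the initial degrees:

```latex
d(x,t) \;=\; \sum_{y \in V} \bigl( W_0(x,y) + t\,E(x,y) \bigr)
       \;=\; d(x,0) + t \sum_{y \in V} E(x,y)
       \;\ge\; d(x,0) \;\ge\; \underline{d}_0 ,
       \qquad 0 \le t \le 1 .
```

In particular, $D(t)$ stays invertible for all $t \in [0, 1]$ as soon as $\underline{d}_0 > 0$.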
At $t = 0$, since the affinity matrix falls into two separated blocks $B$ and $C$, so do the eigenvectors. We call them initial eigenvectors; the set of eigenvectors which are only supported on $B$ are called the $B$-eigenvectors, denoted by $\Psi_B$, and similarly for the $C$-eigenvectors and $\Psi_C$. The assumption on these eigenvectors and the index set $I$ is the following:
Assumption 1 ($B$- and $C$-eigenvectors). At $t = 0$,
(a) The index set $I$ includes $K$ $C$-eigenvectors and $|I| - K$ $B$-eigenvectors.
(b) Each of the $K$ eigenvectors in $I \cap \Psi_C$ (up to a $K$-by-$K$ rotation of these $K$ vectors) is associated with one of the $K$ clusters in the following sense: there exists $0 \le \varepsilon_1 < 1$, and for each $\psi \in I \cap \Psi_C$ there is a unique $j$, $1 \le j \le K$, s.t.
(c) There exists $\varepsilon_2 \ge 0$, s.t. for any $\psi \in I \cap \Psi_B$,
The above assumption, while appearing to be complicated, poses only generic conditions on the subgraphs $B$ and $C$: in the perfectly separated case the largest $K$ $C$-eigenvalues are 1, and the $(K+1)$-th one is strictly less than 1 and depends on the mixing time of the Markov chain within each cluster. Since we primarily work with a well-clustered $C$ which takes a small fraction of the nodes and is localized in the graph (typically $C$ is an outlier cluster, or several localized regions of interest; see the applications in Section 4), this spectral gap is usually significant. As a result, even when the clustering is not perfect, the $(K+1)$-th $C$-eigenvalue is still sufficiently far away from the first $K$ ones, and the $C$-eigenvectors beyond the first $K$ can be excluded from the index set $I$, since $I$ selects the largest $|I|$ eigenvalues. This fulfills (a).
If the $K$ clusters in $C$ are perfectly separated, one can verify that $\varepsilon_1 = 0$ in (b). Thus (b) holds when $C$ is not far from being well-clustered.
Assumption 1(c) requires that the eigenvector $\psi$ is sufficiently delocalized, or "flattened", on $B$: recall the normalization (2), under which the first eigenvector (associated with eigenvalue 1) takes the constant value $\psi(x)^2 = 1/\nu(B, 0)$. If all the other eigenvectors are flattened, then (c) holds with some small $\varepsilon_2$. The delocalization widely applies when $B$ is built from data vectors lying on certain regular manifolds: assuming that the discrete eigenvectors well approximate the continuous limits, which are eigenfunctions of the manifold Laplacian, the delocalization of the former is inherited from that of the latter (quantum ergodicity theorem [35,36]). When the spectral convergence is poor, finite-sample effects may create some localized pattern in the "noisy" eigenvectors; however, since $|I|$ is typically a small number compared to $n$, we assume that the selected $B$-eigenvectors are sufficiently close to the population ones.
The second assumption is the following. Recall that $\delta = |C|/|V|$.
Assumption 2. The constants $\delta$, $|I|$ and $K$ satisfy that
where $\varepsilon_1$, $\varepsilon_2$ are as in Assumption 1.
The above two assumptions guarantee that the embedding norm $S(x, 0)$ separates the blocks $C$ and $B$ at time $t = 0$, together with an upper bound of $S(x, 0)$ over $V$:
Proposition 1 (Initial separation by $S(x)$). Under Assumption 1, at time $t = 0$,
If furthermore Assumption 2 holds, then (1) the initial gap between $B$ and $C$ is at least
that is, $\forall x \in C$ and $y \in B$, $S(x) - S(y) \ge g_0 > 0$.

Stability of the embedding norm under deformation
We will prove the stability of $S(x, t)$ over time making use of the Hadamard variation formula for the eigenvalues and eigenvectors, after properly indexing them. Specifically, since we assume that $d_0 > 0$, the diagonal matrix $D$ is invertible throughout time, and the Markov matrix $P = D^{-1}W$ is diagonalizable and similar to $D^{-1/2} W D^{-1/2}$. Under the matrix perturbation model (1), which is linear in $t$, the $n$ eigenvalues of the Markov matrix $P$ can be indexed as $\lambda_1(t), \cdots, \lambda_n(t)$ so that they are descending at $t = 0$, i.e. $\lambda_{k+1}(0) \le \lambda_k(0)$, and differentiable with respect to $t$ for $0 \le t \le 1$ (Chapter 2 of [37]). Similar to the classical Hadamard variation formula, the evolution equation of $\lambda_k$ can be shown to be
and the equation of the associated eigenvector $\psi_k$ is, when valid,
that is, (15) holds on time intervals when no eigen-crossing of any pair of $\lambda_k$ and $\lambda_j$ happens. The derivation of (14), (15) is left to Appendix B. Though the $n$ eigenvalues are ordered from large to small at $t = 0$, an eigen-crossing (or neighboring eigenvalues becoming very close) may happen as $t$ increases, as illustrated in the diagram in Figure 2, and numerically in the toy example in Figure 1. This voids a direct adoption of (15) unless one shows that the singularity does not affect the differentiability of the eigenvector branches before and after the crossing, which is still possible in our setting [37]. However, even if (15) can be made valid with such an effort, when an eigen-crossing or a near crossing happens there is generally no control on the speed of change of the associated pair of eigenvectors. Some steep changes of eigenvectors are shown in the toy example in Figure 1, at times of (near) eigen-crossings. This instability of eigenvectors under matrix perturbation is the main difficulty in justifying the use of leading eigenvectors in this environment, not only for theoretical analysis but for algorithms as well.
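The eigenvalue evolution can be checked numerically. The sketch below is an illustration, not the paper's equation (14) verbatim: it uses the standard first-order perturbation identity for the generalized problem $W\psi = \lambda D\psi$ with $\psi^T D \psi = 1$, namely $\dot\lambda_k = \psi_k^T(\dot W - \lambda_k \dot D)\psi_k$, which is valid for simple eigenvalues, and compares it with a central finite difference. The test matrices are random symmetric matrices (not a cluster-background model), and the helper name `markov_eigs` is hypothetical.

```python
import numpy as np

def markov_eigs(W):
    """Eigenpairs of P = D^{-1} W, sorted descending, with Psi^T D Psi = I."""
    d = W.sum(axis=1)
    s = 1.0 / np.sqrt(d)
    lam, V = np.linalg.eigh(s[:, None] * W * s[None, :])
    return lam[::-1], (s[:, None] * V)[:, ::-1]

rng = np.random.default_rng(1)
n = 6
A = rng.random((n, n))
W0 = (A + A.T) / 2                 # baseline symmetric affinity
B = rng.random((n, n))
E = 0.1 * (B + B.T) / 2            # symmetric non-negative perturbation

t, h = 0.5, 1e-6
lam, Psi = markov_eigs(W0 + t * E)

# For W(t) = W0 + t E: dW/dt = E and dD/dt = diag(row sums of E).
dW = E
dD = np.diag(E.sum(axis=1))
predicted = np.array(
    [Psi[:, k] @ (dW - lam[k] * dD) @ Psi[:, k] for k in range(n)]
)

# Central finite difference of the sorted eigenvalues (assumes no crossing
# occurs inside [t - h, t + h], so sorting preserves the branches).
lam_plus, _ = markov_eigs(W0 + (t + h) * E)
lam_minus, _ = markov_eigs(W0 + (t - h) * E)
finite_diff = (lam_plus - lam_minus) / (2 * h)
```

The top eigenvalue stays at 1 for all $t$, so its predicted derivative is zero; for the remaining eigenvalues the formula and the finite difference agree as long as the eigenvalues are simple, which is the regime in which individual eigen-branches are well behaved.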
The main observation of this work is to overcome such instability by considering the spectral embedding norm instead of individual eigenvectors. A key quantity needed in the stability bounds (of both the eigenvalues and the embedding norm) is the $C$-$B$ "connection strength", measured by
The analysis needs $C$ to be small compared to the magnitude of the node degrees; specifically, $C/\underline{d}_0$ needs to be a small constant. We note that the condition may be much stronger than what is encountered in applications, due to the reliance on a spectral gap between the $I$ and $I^c$ eigenvalues. To be specific, we define the $I$-eigen-gap (depending on time $t$) to be
$$\Delta(t) := \min_{i \in I,\, j \notin I} |\lambda_i(t) - \lambda_j(t)|, \quad t \in [0, 1].$$
Such an "$I$-eigen-gap" prevents eigenvalues from $I$ and $I^c$ from getting too close, but allows arbitrary eigen-crossings within $I$ and within $I^c$. While it is needed in the perturbation analysis, we note that $\Delta(t)$ should be viewed as an artifact due to the limitation of our theory (see Remark 1). Nevertheless, it is essentially different from the traditional spectral gap assumed after the $K$-th eigenvalue.
The following proposition shows that the $I$-eigen-gap is preserved, assuming an initial one, based upon the stable evolution of the eigenvalues (cf. (14)). The proof is in Section 5.
The significance of the preserved $I$-eigen-gap is that we can derive the evolution equation of the embedding norm $S(x, t)$ without being concerned with the eigen-crossings within $I$ (and within $I^c$). This is possible by relying on $S(x)$ being the $(x, x)$-th diagonal entry of the spectral projection matrix $P_I := \sum_{k \in I} \psi_k \psi_k^T$, which can be written in the form of a contour integral of the resolvent in the complex plane, where the contour circles the eigenvalues in $I$ throughout $t \in [0, 1]$, as illustrated in Figure 2. The evolution equation below only requires the eigenvalue difference $\lambda_k - \lambda_j$ to be non-vanishing when one is from $I$ and the other is from $I^c$. Actually, this difference is bounded from below by the constant $\Delta$ by Proposition 2. The proof is in Section 5.
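The contour-integral representation can be illustrated numerically. The sketch below is written under stated assumptions: the resolvent is taken as $(zI - P)^{-1}$ (the paper's own definition, equation (37), is not reproduced here), the eigenvectors are $D$-normalized so that the Riesz projector equals $\bigl(\sum_{k \in I}\psi_k\psi_k^T\bigr) D$, and the toy affinity with three noisy blocks is hypothetical. The contour is a circle through the midpoint of the gap between the $|I|$-th and $(|I|+1)$-th eigenvalues, discretized with the midpoint rule.

```python
import numpy as np

# Three noisy blocks of 4 nodes: three eigenvalues of P near 1, the rest near 0.
rng = np.random.default_rng(2)
n, K = 12, 3
labels = np.repeat(np.arange(K), n // K)
W = np.where(labels[:, None] == labels[None, :], 1.0, 0.05)
N = rng.random((n, n))
W = W + 0.01 * (N + N.T)                      # small symmetric noise

d = W.sum(axis=1)
P = W / d[:, None]                            # Markov matrix D^{-1} W
s = 1.0 / np.sqrt(d)
lam, V = np.linalg.eigh(s[:, None] * W * s[None, :])
lam, Psi = lam[::-1], (s[:, None] * V)[:, ::-1]   # descending order

# Direct definition: sum of squared eigenvector entries over I = {1, ..., K}.
S_direct = np.sum(Psi[:, :K] ** 2, axis=1)

# Circle enclosing lam[0..K-1] and crossing the real axis mid-gap.
gap = lam[K - 1] - lam[K]
center = 0.5 * (lam[0] + lam[K - 1])
radius = 0.5 * (lam[0] - lam[K - 1]) + 0.5 * gap

M = 400                                        # quadrature nodes on the circle
theta = 2 * np.pi * (np.arange(M) + 0.5) / M
Pi = np.zeros((n, n), dtype=complex)
for th in theta:
    z = center + radius * np.exp(1j * th)
    dz = 1j * radius * np.exp(1j * th) * (2 * np.pi / M)
    Pi += np.linalg.inv(z * np.eye(n) - P) * dz
Pi = (Pi / (2j * np.pi)).real                  # Riesz projector onto span of I

# With Psi^T D Psi = I, the projector is (sum_k psi_k psi_k^T) D.
S_contour = np.diag(Pi) / d
```

The two computations agree without ever identifying individual eigenvectors inside $I$, which is why eigen-crossings (or exactly repeated eigenvalues) within $I$ do not matter: only the gap between $I$ and $I^c$ determines whether the contour exists.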
Then the two parts $B$ and $C$ can be separated by thresholding the embedding norm, i.e., there exists a constant $\tau$ s.t. at $t = 1$, $S(x) > \tau$, $\forall x \in C$, and $S(x) < \tau$, $\forall x \in B$.
The r.h.s. of (20) is technical and appears to be complicated. However, we show that in the typical setting it is not more restrictive than (18) (and it implies the latter, see below): note that the r.h.s. is greater than (using that $\log(1 + \frac{x}{2}) > \frac{2}{5}x$ for 0
and thus unless $(\#)$ is too small, this term would be comparable to $\frac{\Delta}{8}$. To be specific, suppose that $\delta$ is so small that the first term in the formula of $(\#)$ (12) dominates, which makes $(\#)$ approximately $\underline{d}_0/\overline{d}_0$, assuming that $\varepsilon_1$ and $\varepsilon_2$ are small constants. This is reasonable since we typically apply the proposed method when the initial separation is large, where the initial gap is $g_0 = \frac{K}{n \underline{d}_0}\frac{(\#)}{\delta}$. Furthermore, in such cases, if the graph has balanced degree, i.e., $\underline{d}_0 \approx \overline{d}_0$, then $(\#)$ would be close to 1. Combined with $\Delta$ being small, e.g., $\Delta < 0.1$, (21) is then approximately $0.39 \cdot \frac{\Delta}{8}$.
Remark 1. In practice, the requirement of a spectral gap $\Delta$ between the $|I|$-th and $(|I|+1)$-th eigenvalues may not be satisfied, yet the embedding norm $S$ can still produce a good separation of $C$ from $B$; see e.g. Fig. 4. This is because the analysis is not tight: for one thing, the relaxation of the term $\frac{1}{\lambda_k - \lambda_j}$ for $k \in I$, $j \in I^c$ by $\frac{1}{\Delta}$ is crude, and can be improved, e.g., under proper assumptions on the eigenvalue distribution. In this respect, the need for an $I$-eigen-gap in the theory is still somewhat artificial, and further analysis should be able to relax the constraint. This also means that the requirement on $C$ in (20) may be more restrictive than what occurs in practical applications.

Extension of the results
The main result, Theorem 3.1, can be directly extended to cover the following cases, with proofs omitted:
1. Weighted embedding norm. The definition of the embedding norm $S$ can be generalized as
$$S(x) = \sum_{k \in I} f(\lambda_k)\, \psi_k(x)^2,$$
where $f(\lambda)$ is a (complex) analytic function which is real-valued on real $\lambda$. We have been addressing the special case where $f = 1$. The results extend to analytic functions $f(\lambda)$, e.g., $f(\lambda) = \lambda^p$ where $p$ is a positive integer. To see this, consider the contour integral of $f(z)R(z)$, $R$ being the resolvent (defined in (37)), and then the time-evolution equation of $S(x, t)$ can be shown to be
where, in case the eigenvalues $\lambda_k$ and $\lambda_j$ coincide, the term $\frac{f(\lambda_k) - f(\lambda_j)}{\lambda_k - \lambda_j}$ is replaced by $f'(\lambda_k)$ and $\frac{\lambda_k f(\lambda_k) - \lambda_j f(\lambda_j)}{\lambda_k - \lambda_j}$ by $(zf(z))'(\lambda_k)$. So the r.h.s. is well-defined when an eigen-crossing within $I$ happens, and the terms $\frac{f(\lambda_k) - f(\lambda_j)}{\lambda_k - \lambda_j}$ and $\frac{\lambda_k f(\lambda_k) - \lambda_j f(\lambda_j)}{\lambda_k - \lambda_j}$ are uniformly bounded due to the analyticity of $f$. When $f = 1$, the equation reduces to (19). Proceeding with the same technique as in the proof of the main result, the deformation bound of $S(x, t)$ will then involve constant factors which depend on the boundedness of $f$ and $f'$ on $[0, 1]$. Specifically, the corresponding constant in the bound will need to be redefined to be $\left(c_1 + \frac{c_2}{\Delta}\right)\frac{2C}{\underline{d}_0}$, where $c_1$ and $c_2$ are absolute constants. E.g., when $f(\lambda) = \lambda^p$, $p > 0$, $c_2$ remains 4 (which is the dominating term with small $\Delta$) and $c_1 = (p + 1)$. We compare to the diffusion distance [34] in the last section.
2. Unequal cluster size in $C$. The requirement of equal cluster size of the $K$ clusters in $C$ can be relaxed. Specifically, suppose that the $K$ clusters have varying sizes $|C_j| = \delta_j n$, and $\sum_{j=1}^{K} \delta_j = \delta$. Let $\delta_{\min} = \min_{1 \le j \le K} \delta_j$, and similarly define $\delta_{\max}$. Then under Assumption 1, (10) and (11) become
Define g j,0 :" 1 nd 0 ˆd0 p1 ´ε1 q
and the minimum of $g_{j,0}$ is
Modify Assumption 2 to be that $g_{\min,0} > 0$; then the initial separation of $S(x, 0)$ on $C$ and $B$ is at least $g_{\min,0}$ (and more precisely $g_{j,0}$ between $C_j$ and $B$), and (13) becomes
Note that Proposition 2, Proposition 3 and claims (1) (2) in the proof of Theorem 3.1 do not rely on Assumption 1 or Assumption 2 and remain valid. As a result, it can be shown that the $t = 1$ separation between $C$ and $B$ by $S$ holds as long as
This condition is more restrictive when the cluster sizes in $C$ are less balanced, namely when the difference $\delta_{\max} - \delta_{\min}$ becomes larger. In our numerical experiments, all the sub-clusters are of comparable sizes (in the outlier detection in images, $K = 1$ or 2, and in image segmentation the clusters are of similar sizes), while we note that an extremely unbalanced cluster size, e.g., very small $\delta_{\min}$, could affect the performance of the method.
3. Detection of parts of $C$. The above argument leads to a "personalized" detection condition for each cluster $C_j$ in $C$; that is, even when $S(x, 1)$ fails to separate some clusters in $C$ from $B$, it may still successfully detect the rest. To see this, note that the proof of the theorem actually gives the following: For any subsets E
The previous results correspond to $E_1 = C$ and $E_2 = B$. Let $E_1$ be any individual cluster $C_j$; then since $S(0)$ is upper-bounded by (26), we have that each cluster $C_j$ can be separated from $B$ by $S(x, 1)$ if $g_{j,0}$ as defined in (25) is larger than the r.h.s. of (27).
4. Initial inclusion of $I$. Assumption 1(a) can be relaxed by only requiring $K'$ $C$-eigenvectors in $I$, $0 < K' \le K$, as long as they contribute to a sufficiently large $S(x, 0)$ on $C$, or on any subset of $C$ such as an individual cluster $C_j$. The separation guarantee at time 1 follows the same argument as in item 2 above, where the quantities (23) (24) and consequently (25) (26) need to be modified. The precise condition is not pursued here. In practice, this means that even if fewer than $K$ "nearly" $C$-eigenvectors are included in $I$, the method may still be able to detect part of $C$ from $B$.
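The weighted embedding norm of item 1 can be sketched as follows. This assumes the weighted form $S_f(x) = \sum_{k \in I} f(\lambda_k)\psi_k(x)^2$ with $D$-normalized eigenvectors and $I$ taken as the indices of the `num_eig` largest eigenvalues; the function name `weighted_embedding_norm` and its signature are hypothetical.

```python
import numpy as np

def weighted_embedding_norm(W, num_eig, f=lambda lam: np.ones_like(lam)):
    """Weighted embedding norm: sum over the top `num_eig` eigenpairs of
    f(lambda_k) * psi_k(x)^2, for the Markov matrix P = D^{-1} W.

    The default f = 1 gives the unweighted embedding norm.
    """
    W = np.asarray(W, dtype=float)
    d = W.sum(axis=1)
    s = 1.0 / np.sqrt(d)
    lam, V = np.linalg.eigh(s[:, None] * W * s[None, :])
    lam, Psi = lam[::-1], (s[:, None] * V)[:, ::-1]   # descending order
    weights = f(lam[:num_eig])
    return (Psi[:, :num_eig] ** 2) @ weights

# Example: f(lambda) = lambda^2 on a small random symmetric affinity.
rng = np.random.default_rng(3)
A_w = rng.random((7, 7))
W_w = (A_w + A_w.T) / 2
S_p2 = weighted_embedding_norm(W_w, num_eig=4, f=lambda lam: lam ** 2)
```

As a consistency check, when all eigenvectors are used with $f(\lambda) = \lambda^p$, the weighted norm at node $x$ equals $(P^p)(x, x)/d(x)$, which is one way to see the link to powers of the diffusion operator.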

Experiments
In this section we will apply the spectral embedding norm to both synthetic and real-world datasets, in scenarios of both single outliers and multiple clusters in a cluttered background.

Manifold data toy example
We begin with a simulated dataset composed of a manifold-like background $B$ and clusters in $C$ following the model: in $\mathbb{R}^2$, $B$ consists of i.i.d. samples $x_i$ distributed as $x_i = y_i + n_i$, where the $y_i$ are uniformly distributed on the unit circle, which is a one-dimensional manifold, and $n_i \sim \mathcal{N}(0, \sigma_B^2 I)$, with $\sigma_B = 0.01$. $C$ contains $K$ equal-sized sub-clusters, each of which has i.i.d. samples drawn from $\mathcal{N}(\mu_j, \sigma_C^2 I)$, where the $\mu_j$ are centered close to the circle, and $\sigma_C = 0.02$. $n = 5000$ points are generated, and the number of points in $C$ is set to be $\delta n$ for positive $\delta$, rounded so that each sub-cluster has an integer number of points. To measure the accuracy of the detection of $C$ we compute the F1 score:
$$F1 = \frac{2pr}{p + r}, \quad p := \frac{TP}{TP + FP}, \quad r := \frac{TP}{TP + FN}.$$
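The data model and the F1 score can be sketched as below. Several details are assumptions made only for illustration: the noise scales are treated as standard deviations named `sigma_B` and `sigma_C`, the cluster centers are placed at equally spaced angles on a circle of radius 1.1 (the text only says they are "close to the circle"), and the function names are hypothetical.

```python
import numpy as np

def make_circle_with_clusters(n=5000, K=2, delta=0.02,
                              sigma_B=0.01, sigma_C=0.02, seed=0):
    """Noisy unit-circle background B plus K small Gaussian clusters C.

    Returns the points X (n x 2) and a boolean mask in_C.
    """
    rng = np.random.default_rng(seed)
    per_cluster = int(round(delta * n / K))      # integer size per sub-cluster
    n_C = per_cluster * K
    n_B = n - n_C
    angles = rng.uniform(0.0, 2 * np.pi, size=n_B)
    X_B = np.column_stack([np.cos(angles), np.sin(angles)])
    X_B += sigma_B * rng.standard_normal((n_B, 2))
    # Hypothetical placement of the cluster centers mu_j, near the circle.
    center_angles = 2 * np.pi * np.arange(K) / K
    centers = 1.1 * np.column_stack([np.cos(center_angles),
                                     np.sin(center_angles)])
    X_C = np.repeat(centers, per_cluster, axis=0)
    X_C = X_C + sigma_C * rng.standard_normal((n_C, 2))
    X = np.vstack([X_B, X_C])
    in_C = np.concatenate([np.zeros(n_B, dtype=bool),
                           np.ones(n_C, dtype=bool)])
    return X, in_C

def f1_score(pred, truth):
    """F1 = 2pr / (p + r) with precision p and recall r for detecting C."""
    pred = np.asarray(pred, dtype=bool)
    truth = np.asarray(truth, dtype=bool)
    tp = np.sum(pred & truth)
    fp = np.sum(pred & ~truth)
    fn = np.sum(~pred & truth)
    if tp == 0:
        return 0.0                                # convention when nothing is detected
    p = tp / (tp + fp)
    r = tp / (tp + fn)
    return 2 * p * r / (p + r)

X_toy, in_C_toy = make_circle_with_clusters()
```

Given an embedding norm vector `S` and a threshold `tau`, the detection accuracy would then be `f1_score(S > tau, in_C_toy)`; how the threshold is selected is not specified here.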
Figure 3 shows results for $K = 2$, $\delta = 0.02$. For $K = 2$, the eigenvalues do not reveal any clear eigen-gap, and the first 5 eigenvectors do not all give a clear indication of the cluster $C$, but contain a significant amount of "component" from $B$-eigenvectors. Examining the first $k = 20$ eigenvectors, certain eigenvectors are more localized on $C$, e.g. $k = 8$ and 10 in this example. The embedding norm $S_I$ clearly separates $C$ from $B$. The results are not sensitive to parameter choices. Let $k_{ST}$ be the index of the nearest neighbor used to set the local self-tuning scale [9] in constructing the affinity matrix $W$. Then, across varying values of the parameter $k_{ST}$, the F1 score of the detection by thresholding $S$ reveals a "plateau" of valid values of $|I|$; e.g., when $k_{ST} = 8$, the range of $|I|$ is about 18 to 30, with the optimal F1 score obtained at $|I| = 23$. The best F1 scores for $k_{ST} = 4$, 8 and 16 are all greater than 0.99.
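For reference, a self-tuning affinity in the style of [9] can be sketched as follows: each point gets a local scale equal to the distance to its $k_{ST}$-th nearest neighbor, and the affinity is $\exp(-\|x_i - x_j\|^2 / (\sigma_i \sigma_j))$. This is a dense $O(n^2)$ illustration with a hypothetical function name; whether the experiments here additionally sparsify the graph or treat the diagonal differently is not stated in this section, so those choices are assumptions.

```python
import numpy as np

def self_tuning_affinity(X, k_st=8):
    """Dense self-tuning affinity W_ij = exp(-||x_i - x_j||^2 / (s_i * s_j)).

    s_i is the distance from x_i to its k_st-th nearest neighbor.
    Assumes the points are distinct, so that every local scale is positive.
    """
    X = np.asarray(X, dtype=float)
    sq = np.sum(X ** 2, axis=1)
    # Squared pairwise distances, clipped at 0 to absorb round-off.
    D2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * (X @ X.T), 0.0)
    # In each sorted row, entry 0 is the point itself (distance 0),
    # so entry k_st is the k_st-th nearest neighbor.
    sigma = np.sqrt(np.sort(D2, axis=1)[:, k_st])
    return np.exp(-D2 / (sigma[:, None] * sigma[None, :]))

rng = np.random.default_rng(0)
X_small = rng.standard_normal((50, 2))
W_small = self_tuning_affinity(X_small, k_st=8)
```

The resulting matrix is symmetric with unit diagonal and can be passed directly to an eigen-decomposition of $D^{-1/2} W D^{-1/2}$; for $n = 5000$ the dense version needs a few hundred megabytes, so a nearest-neighbor sparsification may be preferable in practice.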
Similar results are obtained for K " 10, δ " 0.1, as shown in Figure 4.The condition in Assumption 2, (9) suggests that |I| is chosen to be proportional to K δ , and this is revealed in Figure 3 and Figure 4 (in these two examples K δ is kept to be the same) as the "plateaus" of valid |I| are at about the same range, across values of k ST .
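The computation of the embedding norm with a self-tuning affinity can be sketched as follows. This is a minimal dense version, not the implementation used for the figures: the function name `spectral_embedding_norm` is hypothetical, the affinity is a full Gaussian kernel with local scales $\sigma_i$ set to the distance to the $k_{ST}$-th neighbor, and the constant eigenvector with eigenvalue 1 is skipped (an assumption; including it only shifts $S$ by a constant).

```python
import numpy as np

def spectral_embedding_norm(X, n_eig=30, k_st=8):
    """S(x) = sum of psi_k(x)^2 over the leading n_eig nontrivial eigenvectors of P = D^{-1} W."""
    # Dense pairwise squared distances (adequate for a few thousand points).
    sq = np.sum(X ** 2, axis=1)
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * X @ X.T, 0.0)
    # Self-tuning scale: distance to the k_st-th nearest neighbor.
    # Column 0 of each sorted row is the point itself (distance 0).
    sigma = np.sqrt(np.sort(d2, axis=1)[:, k_st])
    W = np.exp(-d2 / (sigma[:, None] * sigma[None, :]))
    deg = W.sum(axis=1)
    # D^{-1/2} W D^{-1/2} is symmetric and shares its eigenvalues with P.
    A = W / np.sqrt(deg[:, None] * deg[None, :])
    lam, U = np.linalg.eigh(A)             # eigenvalues in ascending order
    lam, U = lam[::-1], U[:, ::-1]         # reorder so that lam[0] = 1 comes first
    Psi = U / np.sqrt(deg)[:, None]        # normalization Psi^T D Psi = I
    S = np.sum(Psi[:, 1:n_eig + 1] ** 2, axis=1)   # skip the constant eigenvector
    return S, lam, Psi
```

Detection then amounts to thresholding the returned `S`; sweeping `n_eig` and `k_st` and recording the F1 score would reproduce the kind of plateau analysis described above, with the exact ranges depending on the kernel and threshold choices.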

Side-scan sonar imagery
Anomaly detection can be seen as a special case of clustering in which there is a vast imbalance in the size of the clusters, i.e., background vs. anomaly, and in the density of each cluster. In image anomaly detection, the goal is to detect a small region (subset of pixels) that differs from the "normal" image background. This is an important task in a variety of applications, such as target detection in remote sensing imagery, detection of abnormalities such as tumors in biomedical imagery, and quality inspection in production lines. An automated solution highlighting only suspicious regions to be reviewed by an expert would save considerable time. We demonstrate on real-world images that eigenvectors localizing on the anomaly can be buried deep within the spectrum of the image, and that by calculating the spectral embedding norm we can identify the anomalies. We examine two side-scan sonar images, each containing a single sea-mine, displayed in Fig. 5, where we consider the sea-mine to be an anomaly (indicated by a red circle). The first image contains a noisy periodic background, and the presence of the sea-mine is indicated by its shadow, which appears as a dark cluster. The second image contains a more varying background, and the sea-mine appears as a bright highlight with a dark shadow to its right. To construct the graph we extract all overlapping patches of size 8, build a nearest neighbor graph with 64 neighbors, and set $k_{ST} = 32$. In Fig. 5(b) and (d) we display $S(x)$ for $|I| = 50$, reshaped as an image, separating the sea-mine from the background. In Fig. 6 we plot examples of eigenvectors of the Laplacian, where the eigenvectors in the top row localize on the background $B$, revealing its periodic nature at different scales and orientations, while the eigenvectors in the bottom row localize on the sea-mine $C$. To demonstrate the stability of $S(x)$, we plot it for increasing values of $|I|$ in Figures 7-8.

Calcium Imaging
Calcium imaging is an experimental method in neuroscience that enables imaging the individual activity of hundreds of neurons in an awake, behaving animal at cellular resolution [38]. The acquired data is composed of a spatiotemporal volume, where, after motion correction, the neuron locations are fixed and the temporal activity consists of hundreds to tens of thousands of time-frames. There is also varying temporal activity in the background (neuropil). Thus, this data can also be viewed as an image whose pixels lie in a high-dimensional space (time-frames), consisting of hundreds of clusters (neurons) in an image plane with a non-trivial background, which matches our problem setting. It is of interest to provide a 2D image that depicts the structure that exists in this volume and highlights the existing neurons. Such images serve for manual segmentation, to align imaged volumes across days (where the field of view may shift), to display neurons detected by automatic and manual means, etc. Several ROI extraction algorithms are also initialized based on such images [39,40]. Typically the temporal correlation image [41] or the temporal mean image is used. Here we show that the spectral embedding norm provides a meaningful visualization of the data, with sharp morphology and suppression of noise from the background clutter.
We analyze a publicly available dataset from Neurofinder [42]. The images are $512 \times 512$ pixels, and 8000 time frames were recorded at 8 Hz. Ground truth labels provided with the dataset include 197 identified neurons; note, however, that recent papers point out that the ground truth on Neurofinder datasets is probably incomplete, i.e., not all neurons are labeled [39,43]. To calculate the affinity matrix, we compute a nearest neighbor graph for all pixels, represented as high-dimensional vectors in time, using 50 nearest neighbors. To accelerate the nearest neighbor search, the dimensionality is reduced from 8000 to 300 using PCA. In Fig. 9 we display examples of eigenvectors, both supported on the background (top) and localizing on single neurons (bottom).
Figure 10 compares the spectral embedding norm (right) to the temporal mean (left) and temporal correlation image (middle). In each image, the values (mean/correlation/norm) appear in the green channel, while we overlay in the red channel a mask of the ground truth labels that were manually selected (where the two overlap it appears as yellow). The mean image exhibits a strong background, while neurons appear as typical "donuts" [40]. In the correlation image, the background mostly appears as noise. In comparison, the background has been suppressed in the spectral embedding norm image, while neurons which are barely or not at all visible in the correlation image appear as bright clusters.
To quantify the separation of background and clusters, we segment the spectral embedding norm image for increasing $|I|$, and compare the overlap between the segmented clusters and the given ground-truth mask. In Fig. 11 we plot the F1 score and demonstrate a plateau of stable F1-score values for $|I|$ around 200-250. We also display the image of $S(x)$ for multiple values of $|I|$, to demonstrate the property of the spectral embedding norm to detect parts of $C$. Note that we are not performing clustering here, but rather demonstrating how the embedding norm can be used to separate meaningful structure from background clutter. Beyond visualization, this approach can then serve to remove the background and focus only on the remaining clusters in $C$, thus simplifying a clustering task.
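The PCA step used to speed up the neighbor search can be sketched as below. This is a hedged illustration only: `pca_reduce` is a hypothetical helper, rows are assumed to be pixels and columns time frames, and a plain thin SVD is used, whereas for a matrix of roughly $512^2 \times 8000$ entries a randomized or incremental PCA would likely be needed in practice.

```python
import numpy as np

def pca_reduce(X, n_components=300):
    """Project the rows of X (pixels x time frames) onto the top principal components."""
    Xc = X - X.mean(axis=0, keepdims=True)            # center each time frame
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)  # rows of Vt are principal directions
    return Xc @ Vt[:n_components].T                    # coordinates in the reduced space
```

Nearest neighbors would then be searched among the rows of the reduced matrix instead of the original 8000-dimensional traces.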

Proposition 1 (Initial separation by S)
Proof of Proposition 1. It suffices to prove (10) and (11), because Assumption 2 implies that the r.h.s. of (11) is strictly less than the l.h.s. of (10) by $g_0$, and then claims (1) and (2) directly follow. To prove (10): Note that for any $x \in C$, at $t = 0$, By Assumption 1 (a), up to a possible $K$-by-$K$ rotation among the $K$ eigenvectors in $I \cap \Psi_C$, we assume that $\psi_j$ is the eigenvector associated with the sub-cluster $C_j$, $j = 1, \cdots, K$, and then as the rotation preserves the squared sum. Furthermore, suppose that $x \in C_{j_x}$, Assumption 1 (b) gives Plugging into (28), it shows that $S(x) \le \frac{1 + \varepsilon_1}{\nu(C_{j_x}, 0)} + (K - 1)\frac{\varepsilon_1}{\nu(C, 0)}$, and together with $\nu(C_j) = \sum_{x \in C_j} d(x, 0) \ge d_0 |C_j| = d_0 \frac{\delta n}{K}$ for any $j$ (the $K$ sub-clusters are equal-size) and similarly $\nu(C) \ge d_0 |C| = d_0 \delta n$, it gives the upper bound in (10). For the lower bound, (28) continues as $S(x) \ge \psi_{j_x}(x)^2 \ge \frac{1 - \varepsilon_1}{\nu(C_{j_x}, 0)}$.

Proposition 2 (Preservation of the I-eigen-gap)
Proof of Proposition 2. We will establish that for any $k = 1, \ldots, n$, Given that this inequality holds, then by (18), This means that it is impossible for $\Delta(t) < \Delta$: Otherwise, there exist $|\lambda_{k_1}(t) - \lambda_{k_2}(t)| < \Delta$, where $k_1 \in I$ and $k_2 \notin I$, and then (31) implies that $|\lambda_{k_1}(0) - \lambda_{k_2}(0)| < 2\Delta$, which contradicts the assumption that $\Delta(0) \ge 2\Delta$. It suffices to show (30) to finish the proof. To do so, we prove the following bound From (14), As $|\lambda_k| \le 1$ (Perron-Frobenius), then If the following claim is true, then (32) follows directly from (33): Proof of (34): To bound $|\psi_l^T \dot{W} \psi_k|$, note that $\dot{W} = E$, and then We use the following more relaxed bound where the last inequality relies on $d(x) \ge d_0$ throughout time (Lemma 1). The second equality relies on $D\Psi\Psi^T = I$, thus $\Psi\Psi^T = D^{-1}$. Then (35) continues as where the bound (36) is used to bound each $|\psi_l(x)|$ and $|\psi_k(x)|$ in the 2nd inequality. Note that while time dependence has been omitted in all the notations, the above arguments hold throughout time $t \in [0, 1]$.

Proposition 3 (Evolution of S)
Proof of Proposition 3. As explained in the text, one may first establish the validity of (15) and then verify the formula (19) based on the former by observing the cancellation of terms. As an alternative approach, we use the contour integral of the resolvent.
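For $z \in \mathbb{C}$ that is not an eigenvalue of $P = D^{-1}W$, define $R(z) = (W - zD)^{-1}$, where the time dependence is omitted. Since $P = \Psi\Lambda\Phi^T$, $\Lambda = \mathrm{diag}\{\lambda_1, \cdots, \lambda_n\}$, $\Phi = D\Psi$ and $\Psi^T\Phi = I$, one can verify the equivalent form of $R$ as $R(z) = \Psi(\Lambda - zI)^{-1}\Psi^T = \sum_{k=1}^{n} \frac{\psi_k\psi_k^T}{\lambda_k - z}$. This means that $P_I := \sum_{k \in I}\psi_k\psi_k^T = -\frac{1}{2\pi i}\oint_\Gamma R(z)\,dz$, where the contour $\Gamma$, traversed counterclockwise, is such that the eigenvalues in $I$ ($I^c$) stay inside (outside) $\Gamma$ throughout time $t$ (Figure 2), and such $\Gamma$ exists due to Proposition 2. Thus the above expression of $P_I$ holds for all time $t$, and as a result $\dot{P}_I = -\frac{1}{2\pi i}\oint_\Gamma \dot{R}(z)\,dz$.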
By differentiating both sides of $(W - zD)R = I$ with respect to $t$, one obtains $\dot{R} = -R(\dot{W} - z\dot{D})R$.
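The resolvent identities used in this proof can be checked numerically on a small random example. The sketch below is illustrative only and rests on assumptions not taken from the paper: a linear path $W(t) = W_0 + tE$ with symmetric positive $W_0$ and $E$, a counterclockwise circular contour around the top eigenvalue $\lambda = 1$ alone (so $I$ is a single index), and the normalization $\Psi^T D \Psi = I$.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 6
W0 = rng.uniform(0.1, 1.0, (n, n)); W0 = (W0 + W0.T) / 2   # symmetric affinity
E = rng.uniform(0.0, 0.1, (n, n)); E = (E + E.T) / 2       # symmetric perturbation, dW/dt = E

def resolvent(W, z):
    """R(z) = (W - z D)^{-1}, with D the diagonal degree matrix of W."""
    D = np.diag(W.sum(axis=1))
    return np.linalg.inv(W - z * D)

# (1) Central finite difference versus dR/dt = -R (dW/dt - z dD/dt) R.
t, h, z = 0.4, 1e-5, 0.3 + 0.5j          # z is non-real, hence not an eigenvalue of P
R = resolvent(W0 + t * E, z)
D_dot = np.diag(E.sum(axis=1))
fd = (resolvent(W0 + (t + h) * E, z) - resolvent(W0 + (t - h) * E, z)) / (2 * h)
err_derivative = np.max(np.abs(fd + R @ (E - z * D_dot) @ R))

# (2) Spectral form R(z) = Psi (Lambda - z)^{-1} Psi^T.
W = W0 + t * E
deg = W.sum(axis=1)
lam, U = np.linalg.eigh(W / np.sqrt(np.outer(deg, deg)))   # ascending, lam[-1] = 1
Psi = U / np.sqrt(deg)[:, None]                            # Psi^T D Psi = I
err_spectral = np.max(np.abs(R - (Psi / (lam - z)) @ Psi.T))

# (3) Contour integral -(1 / 2 pi i) * integral of R(z) dz over a circle around lambda = 1.
center, radius, N = 1.0, (1.0 - lam[-2]) / 2, 400
P_I = np.zeros((n, n), dtype=complex)
for th in 2 * np.pi * np.arange(N) / N:
    zc = center + radius * np.exp(1j * th)
    # dz = i * radius * exp(i th) dth, with dth = 2 pi / N
    P_I -= resolvent(W, zc) * radius * np.exp(1j * th) / N
err_contour = np.max(np.abs(P_I - np.outer(Psi[:, -1], Psi[:, -1])))
```

All three error quantities should come out at the level of numerical round-off or finite-difference error, which is consistent with the derivative formula for $R$ and with reading $P_I$ as the sum of $\psi_k\psi_k^T$ over $k \in I$.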
This means that where By Cauchy's integral formula, one can verify the following: (a) When $k \in I$, $l \in I$, $\alpha_{kl} = 0$, Since $S(x) = P_I(x, x)$, the claim follows by evaluating at the entry $(x, x)$ on both sides.

Theorem 3.1 (Separation at $t = 1$)
Proof of Theorem 3.1. We first show that condition (ii) implies (18): Note that $(\#) \le d_0(1 - \varepsilon_1)$ Together with where in the integrand all the variables involving time take value at time $\tau$. Introducing the notation $S(t) := \sup$ we are going to prove the following two claims: For any $t \in [0, 1]$, (suppose that $S(t) = S(x_0, t)$ for some $x_0$, then $S(x_0, t) - S(0) \le S(x_0, t) - S(x_0, 0)$). Since Claim (1) holds uniformly for $x$, this implies that $S(t) - S(0) \le C \int_0^t S(\tau)\,d\tau$, and the claim then follows by Gronwall's inequality.
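For reference, the Gronwall step can be written out explicitly. The following is a sketch of the standard integral form applied to the last inequality, assuming that $S(t)$ is continuous and nonnegative on $[0, 1]$ and that the constant $C$ from Claim (1) is nonnegative:

```latex
S(t) \;\le\; S(0) + C \int_0^t S(\tau)\, d\tau \quad \text{for all } t \in [0, 1]
\qquad \Longrightarrow \qquad
S(t) \;\le\; S(0)\, e^{Ct} \;\le\; S(0)\, e^{C}.
```

In particular, the supremum at $t = 1$ grows by at most a factor $e^{C}$ relative to its initial value.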

Further Comments
Viewed as diffusion distance. With $f(\lambda) = \lambda^p$, $S(x)$ can be interpreted as the (squared) diffusion distance between node $x$ and the origin at diffusion time $p/2$ [34]. The diffusion distance can be interpreted as a "geometrical" distance between two nodes when the affinity graph is built from data points lying on a manifold embedded in the ambient space, and the distance is intrinsic to the manifold geometry and invariant to the specific embedding. Since $\lambda_k^p \to 0$ when $p$ is large (except for $\lambda_k = 1$), the origin is the limiting point of the diffusion map embedding. Thus for $x$ in a sub-cluster in $C$, the weighted norm $S(x)$ with positive $p$ can be viewed as a measurement of the extent of metastability (the "depth" of the well) of the potential well associated with the sub-cluster. In view of the diffusion distance, under the setting of this paper, nodes in $B$ are very similar to one another, and in comparison nodes in $C$ are distinct from those in $B$ (as well as from other sub-clusters in $C$, which is not reflected in $S$). A similar weighted form has also been studied in [28] for graph-based outlier detection. In the primary application considered in this paper, the leading eigenvalues are all close to 1, which means that the weighted form (22) is not very different unless $p$ is large. On the other hand, setting $p$ to be large may suppress the high-index eigenvectors by small weights while they are actually the informative ones to indicate $C$. For these reasons, we mainly consider $f = 1$ in the current paper, though the analysis directly extends.
Indexing eigenvectors by support regions. The phenomenon being studied here also suggests that sorting by the magnitude of eigenvalues may not be the most informative way to index the eigenvectors, a problem recently addressed in [44]. Our paper studies the special case where eigenvectors can be grouped by where they are mainly supported. In the pseudo-dynamic (1), the eigenvectors begin by being exactly supported on either $C$ or $B$ at $t = 0$, and as time develops this pattern is nearly preserved as long as the $C$-$B$ inter-block connections are not too strong. The distinct support regions of the eigenvectors appear to be unrelated to either the magnitude of the eigenvalues or the existence of spectral gaps. This suggests that grouping eigenvectors by their localization regions may be a better way to arrange them in such cases. However, one still needs to be careful with the instability of eigenvectors: As shown in the numerical example, when two eigenvalues get close in the pseudo-dynamic, the associated pair of eigenvectors "swap" their values. (The swapping may be analyzed by the differential equation (15): assuming that among all the pairs of neighboring eigenvalues only one pair $(\lambda_k - \lambda_j)$ is approaching zero, the dynamic of the evolution of $\psi_k$ is dominated by that pair, which approximates a "rotation" among the indices $j$ and $k$.) Our analysis in the current paper handles this by the summation in $S$ over the index group $I$, which makes $S$ invariant to such "swaps" as long as $j$ and $k$ both belong to $I$. Generally, since eigen-crossings only happen at isolated times in the deformation dynamics [37], these special times can be excluded. Then one can say that the eigenvectors continue to almost localize on one of the two blocks most of the time.
Eigenvector selection. Related works have devised different methods to perform eigenvector selection to identify anomalies [14,15]. The spectral embedding norm can also be used for eigenvector selection. [45] demonstrated that it can be used to identify pixels which define clusters and find the embedding coordinates that best separate them from the background. An example in the anomaly detection case is given in the Appendix.
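The relation between the weighted norm and the diffusion map can be made concrete with a short sketch. It assumes that the weighted form (22) reads $S(x) = \sum_{k \in I} f(\lambda_k)\psi_k(x)^2$ and that the diffusion map at time $t$ has coordinates $\lambda_k^t\psi_k(x)$; the helper names `weighted_embedding_norm` and `diffusion_map` are hypothetical.

```python
import numpy as np

def weighted_embedding_norm(Psi, lam, idx, p=0.0):
    """sum over k in idx of lam_k**p * Psi[x, k]**2, for every node x (p = 0 gives f = 1)."""
    idx = np.asarray(idx)
    return (Psi[:, idx] ** 2) @ (lam[idx] ** p)

def diffusion_map(Psi, lam, idx, t):
    """Diffusion-map coordinates lam_k**t * psi_k(x), restricted to the indices in idx."""
    idx = np.asarray(idx)
    return Psi[:, idx] * lam[idx] ** t
```

Under these assumptions, `weighted_embedding_norm(Psi, lam, idx, p)` equals the squared Euclidean norm of `diffusion_map(Psi, lam, idx, p / 2)` row by row, i.e., the squared distance to the origin in the diffusion-map embedding. The sketch assumes nonnegative eigenvalues in `idx`; negative ones with non-integer `p` would produce NaN.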
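The invariance of $S$ to a rotation (and hence a swap) between two eigenvectors inside $I$ is easy to see numerically. The sketch below uses a random matrix as a stand-in for eigenvector values; the indices, the rotation angle and the variable names are arbitrary choices made for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
Psi = rng.standard_normal((5, 4))    # stand-in for eigenvector values psi_k(x), one column per k
j, k, angle = 1, 2, 0.7              # mix columns j and k by a planar rotation
Q = np.eye(4)
Q[[j, j, k, k], [j, k, j, k]] = [np.cos(angle), -np.sin(angle),
                                 np.sin(angle), np.cos(angle)]
Psi_rot = Psi @ Q                    # angle = pi / 2 would be an exact swap up to sign

# Index set containing both j and k: the sum of squares is unchanged.
S_before = np.sum(Psi[:, [0, 1, 2]] ** 2, axis=1)
S_after = np.sum(Psi_rot[:, [0, 1, 2]] ** 2, axis=1)

# Index set containing only j: the sum of squares changes.
S_split_before = np.sum(Psi[:, [0, 1]] ** 2, axis=1)
S_split_after = np.sum(Psi_rot[:, [0, 1]] ** 2, axis=1)
```

Here `S_before` and `S_after` agree up to round-off, while the two `S_split_*` vectors differ, which mirrors the requirement that both indices of a crossing pair belong to $I$.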

Figure 1: Plots of eigenvalues and eigenvectors of $P = D^{-1}W$ over time. (a) $n = 5000$ data points in $\mathbb{R}^2$ sampled on $B \cup C$, where points in $B$ lie close to a circle (blue) and points in $C$ form the small cluster lying close to the circle (red). In this case, $K = 1$, $\delta = 0.01$. (b) The affinity matrix $W(t)$ at $t = 1$, cf. (1). (c) Plot of the first 8 eigenvalues as $t$ increases from 0 to 1 (excluding $\lambda_1 = 1$). (d) The absolute values of the associated first 8 eigenvectors at $x_1 \in C$ over time. (e) Same plot at $x_2 \in B$. (f) The values of the embedding norm $S(x)$ defined in (3) at $x_1$ and $x_2$ over time, where $|I| = 40$.

Figure 2: Diagram showing the evolution of eigenvalues of the Markov matrix as $W(t)$ changes over time as in (1). At $t = 0$, circles indicate $B$-eigenvalues (eigenvalues of the $B$-submatrix) and crosses indicate $C$-eigenvalues (eigenvalues of the $C$-submatrix). In this example, $K = 3$ and $|I| = 10$. Note that $B$ can have eigenvalues close to 1 even at $t = 0$. As $t$ increases, at most times the eigenvalues are all of multiplicity one. Eigen-crossings may happen within $I$ and within $I^c$ but not between them, and the $I$-spectral gap, denoted by $\Delta(t)$, is preserved (Proposition 2). The differential equation (19) is obtained via the contour integral over $\Gamma$, which exists for all $t$ due to the positivity of $\Delta(t)$.

Figure 3: Detection of $C$ from a manifold-like $B$: (a) $n = 5000$ data points in $\mathbb{R}^2$ sampled on $B \cup C$, where $B$ points lie close to a circle (blue) and $C$ points form $K = 2$ clusters lying nearby (red), $\delta = 0.02$. (b) The affinity matrix $W(t)$ at $t = 1$, cf. (1). (c) The first 100 eigenvalues of the Markov matrix. (d) The plot of $S$. (e) The $k$-th eigenvectors of the Markov matrix for multiple values of $k$. (f) F1 score of the detection of $C$ by thresholding the values of $S_I$, where $|I|$ varies from 2 to 100, and for multiple choices of the self-tuning parameter ($k$-nearest neighbor in self-tuning, denoted by $k_{ST}$). Mean and standard deviation of the F1 score are shown, and the optimal values of $|I|$ are indicated by a red cross.

Figure 5: Detecting anomalies in images. (a,c) Side-scan sonar images with sea-mines indicated by a red circle. (b,d) Spectral embedding norm with $|I| = 50$ reveals the sea-mine.

Figure 6: Examples of Laplacian eigenvectors on the side-scan sonar image in Fig. 5(a), with eigenvectors mainly supported on the background (top row) or localizing on the sea-mine (bottom row).

Figure 9: Example of Laplacian eigenvectors on a calcium imaging dataset from Neurofinder, either mainly supported on the background clutter (top) or localizing on neurons which are clusters (bottom).

Figures 7-8: $S(x)$ for increasing values of $|I|$, demonstrating that for both images the sea-mine is revealed consistently for a wide range of values, while the background is suppressed. Note that in both cases this requires looking deep enough into the spectrum, and the first few eigenvectors bring out background structures.

Figure 10: Images of the temporal mean (left), temporal correlation (middle) and spectral embedding norm (right) for a Neurofinder dataset. The spectral embedding norm has both removed the background (which is present in the mean image) and enhanced the appearance of the structure in the image: neuronal soma and dendrites, with sharp morphology. The correlation image is much noisier, with fewer visible neurons.

Figure 11: F1 score of segmenting neurons from background based on the spectral embedding norm image, for increasing $|I|$ values. We can see that for a range of values (200-250) the F1 score plateaus, and then decreases as the number of included eigenvectors increases. We display $S(x)$ for $|I| = 20, 49, 100, 250$ to demonstrate how more and more clusters are revealed.