Network modularity in the presence of covariates

We characterize the large-sample properties of network modularity in the presence of covariates, under a natural and flexible nonparametric null model. This provides for the first time an objective measure of whether or not a particular value of modularity is meaningful. In particular, our results quantify the strength of the relation between observed community structure and the interactions in a network. Our technical contribution is to provide limit theorems for modularity when a community assignment is given by nodal features or covariates. These theorems hold for a broad class of network models over a range of sparsity regimes, as well as weighted, multi-edge, and power-law networks. This allows us to assign $p$-values to observed community structure, which we validate using several benchmark examples in the literature. We conclude by applying this methodology to investigate a multi-edge network of corporate email interactions.

A fundamental challenge in modern science is to understand and explain network structure: in particular, the tendency of nodes in a network to connect in communities based on shared characteristics or function. Scientists inevitably observe not only network nodes and their connections, but also additional information in the form of covariates. Most analysis methods fail to exploit this information when attempting to explain network structure, and instead assign communities based solely on the network itself. This leads to a loss of interpretability and presents a barrier to understanding. We solve this problem by showing how to decide whether communities defined by covariates lead to a valid summary of network structure. In the student friendship network shown in Fig. 1, for example, this means we can evaluate whether communities based on common gender, race, or year in school can explain the observed structure of the friendships.
The strength of community structure in networks is most often measured by modularity [1], which is intuitive and practically effective but until now has lacked a sound theoretical basis. We derive modularity from first principles, give it a formal statistical interpretation, and show why it works in practice. Moreover, by acknowledging that different community assignments may explain different aspects of a network's observed structure, we extend the applicability of modularity beyond its typical use to find a single "best" community assignment.
We use covariates to define community assignments, and then prove that modularity quantifies how well these covariates explain network structure. We show a fundamental limit theorem for modularity in this context: in the presence of covariates, it behaves like a Normal random variable for large networks whenever there is a lack of community structure. This allows us to translate modularity into a probability (a p-value), enabling for the first time its use to draw defensible, repeatable conclusions from network analysis.
Our main technical contribution is a flexible, nonparametric approach to quantify the strength of observed community structure. Most work assumes a single unobserved or latent community assignment (e.g., stochastic block models [2] and latent space models [3]). Hoff et al. [3] and Zhang et al. [4] both estimate latent community structure, while adjusting for the varying effects of covariates. Fosdick and Hoff [5] simultaneously model covariates and latent structure, providing a test for independence. In contrast, we derive limit theorems to evaluate observed community structure implied by the covariates themselves.
The existing statistical literature on modularity has focused on more basic parametric approaches. For example, the authors of [6] and [7] model all edges as equally likely Bernoulli random variables. In contrast, we take a nonparametric approach: using a single parameter per node, we model only the expectation of each edge [8]. This allows for individual node-specific differences but avoids specific distributional assumptions on the edges. Our results apply to a broad class of network models, allowing us to treat (among others) power-law networks, weighted networks, and those with multiple edges.

Network modularity in the presence of covariates
Two essential ingredients are necessary to understand modularity in the presence of covariates: first, a framework to allow for a formal interpretation of modularity as a measure of statistical significance; and second, the use of this framework to evaluate a covariate-based community assignment. We now describe each of these ingredients in turn. First, to interpret modularity as a measure of statistical significance, we must recognize it as an estimator of a population quantity. Let g(·) denote an assignment of nodes into groups (i.e., communities), and write δ_{g(i)=g(j)} = 1 when nodes i and j are assigned to the same group, and 0 otherwise. Denote by A_ij the strength of an edge (e.g., a count or a weight) between nodes i and j, and by d_i = Σ_{j≠i} A_ij the degree of the ith node. Then, modularity as defined
Figure 1: A student friendship network illustrated for four different community assignments, each defined by a covariate [5,9].
in [1] is
Q̂ = (1/Σ_l d_l) Σ_{i≠j} ( A_ij − d_i d_j / Σ_l d_l ) δ_{g(i)=g(j)}.  (1)
Modularity contrasts an observed edge A_ij with the ratio d_i d_j / Σ_l d_l whenever nodes i and j are in the same community. Now consider replacing d_i d_j / Σ_l d_l by E A_ij, the expected value of an edge under a given model:
Q = (1/Σ_l d_l) Σ_{i≠j} ( A_ij − E A_ij ) δ_{g(i)=g(j)}.  (2)
We recognize Q in Eq. (2) as a normalized sum of signed residuals (observed minus expected values) A_ij − E A_ij. If the model for each E A_ij posits the absence of community structure, then a large positive value of Q indicates the presence of such structure (more within-group edges than expected). Figure 1 illustrates this effect: the visible community structure in Figs. 1a-c is obscured in Fig. 1d, when communities are assigned at random. Moreover, using d_i d_j / Σ_l d_l as a proxy for E A_ij, we see that modularity Q̂ as defined in Eq. (1) is an estimator of Q in Eq. (2). We will return to this point in the next section. Second, to interpret covariate-based community structure, we must recognize that different community assignments reveal different structural aspects of a network. Figures 1a-c illustrate this point using a student friendship network grouped by gender, race, and year in school. Covariates such as these define distinct community assignments, each of which relates the covariate in question to the observed network structure.
A key insight is that rather than maximizing modularity to obtain a single "best" community assignment, we may instead use modularity to measure the strength of an observed community structure. If a particular community assignment is given by a covariate, then modularity allows us to quantify the explanatory value of this covariate for the observed structure of the network.
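To make this concrete, the modularity of Eq. (1) for a fixed covariate-based assignment can be computed directly from an adjacency matrix. The following is a minimal numpy sketch; the function name and the toy two-triangle graph are ours, not from the paper, and we assume the normalization by Σ_l d_l used above.

```python
import numpy as np

def modularity(A, g):
    """Modularity of Eq. (1) for a fixed community assignment g,
    computed from a symmetric adjacency (or weight) matrix A."""
    A = np.asarray(A, dtype=float)
    d = A.sum(axis=1)                   # degrees d_i = sum_{j != i} A_ij
    total = d.sum()                     # sum_l d_l
    same = np.equal.outer(g, g)         # indicator delta_{g(i) = g(j)}
    np.fill_diagonal(same, False)       # the sum runs over i != j only
    resid = A - np.outer(d, d) / total  # observed minus null expectation
    return resid[same].sum() / total

# Toy example: two triangles joined by a single edge.
A = np.zeros((6, 6))
A[:3, :3] = 1
A[3:, 3:] = 1
np.fill_diagonal(A, 0)
A[2, 3] = A[3, 2] = 1

print(round(modularity(A, [0, 0, 0, 1, 1, 1]), 3))  # → 0.531
```

Grouping by triangle yields a large positive value, whereas an assignment that mixes the two triangles drives modularity to zero or below, mirroring the contrast between Figs. 1a-c and Fig. 1d.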

Main result: A limit theorem for modularity
Our main result is a practical tool to understand objectively whether a covariate captures the structure of the interactions in a network. Technically, we derive a theorem quantifying the large-sample behavior of modularity in the setting above. In particular, if the null model of Definition 1 below is in force, then modularity in the presence of covariates behaves like a Normal random variable. This enables us to associate a p-value with any observed community structure, quantifying how unlikely it is (under the null) to observe a community structure at least as extreme as the one we observe.
Theorem 1 (Central limit theorem for modularity). Suppose the null model of Definition 1 below is in force, and consider a sequence of networks where for each n we observe a fixed (non-random) group assignment g(1), g(2), . . . , g(n). Then, as long as the number of groups grows strictly more slowly than n, there exist constants b and s for each n such that as n → ∞,
(Q̂ − b) / s →_d Normal(0, 1).
Proof. Proofs of all results are given in the Appendices.
Thus, when appropriately shifted and scaled, modularity converges in distribution to a standard Normal random variable. In the sequel we explain this result and give explicit formulations for b and s² (Eqs. (4) and (5) below).

The network model underlying modularity
To understand Theorem 1, we must establish a technical foundation for modularity in the presence of covariates. Different models for the network edges A_ij will imply different estimators for Q in Eq. (2). Estimating Q using Q̂ in Eq. (1), we implicitly assume a model for the absence of community structure, in which nodes connect independently based on the product of their individual propensities to form connections [8,10,11].
Definition 1 (The network model underlying modularity). Consider an undirected random graph on n nodes without self-loops. We model its (possibly weighted) edges A_ij ≥ 0 as independent random variables with expectations given by the product of node-specific parameters π_1, π_2, . . . , π_n > 0:
E A_ij = π_i π_j.
Furthermore, considering a sequence of such networks as n grows, we assume they are well behaved asymptotically:
1. No single node dominates the network: max_i π_i / π̄, with π̄ = (1/n) Σ_{l=1}^n π_l, is bounded asymptotically;
2. The network is not too sparse: min_i π_i · √n diverges as n grows;
3. The expectation of each edge E A_ij does not diverge too quickly as n grows: max_i π_i / √n goes to 0;
4. The variance of each edge does not vary too much from its expectation: Var A_ij / E A_ij is bounded from above and away from 0 asymptotically; and
5. The skewness of each edge A_ij is controlled: the third central moment E|A_ij − E A_ij|³ is bounded asymptotically relative to (Var A_ij)^{3/2}.
We make no further assumptions on the distribution of A_ij, and so our results apply in many settings, including weighted networks and those with multiple edges. Assumptions 1-3 are structural: the first excludes star-like networks; the second ensures that the network is not too sparse; and the third controls the growth of E A_ij with n in the weighted or multi-edge setting. Assumptions 4 and 5 are technical; they exclude extreme behavior of the edge variables. For instance, both are fulfilled whenever A_ij ∼ Bernoulli(π_i π_j) or A_ij ∼ Poisson(π_i π_j).
Each parameter π_i describes the relative popularity of node i. Thus, to fit the degree-based model of Definition 1 to a network, we estimate the parameters π_i using the nodes' degrees d_i as follows [8,10,11]:
π̂_i = d_i / √( Σ_l d_l ).  (3)
The estimator π̂_i is both more natural and more computationally efficient than the corresponding maximum-likelihood estimator for π_i, which follows from the theory of generalized linear models and cannot be written explicitly in closed form. In many settings the difference between these estimators is provably small [10], and so properties of maximum-likelihood estimation can also be expected to hold for Eq. (3).
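The null model and the estimator of Eq. (3) are easy to exercise numerically. The following is a minimal simulation sketch assuming Poisson edges (one of the admissible choices above); the sample size, propensity range, and seed are arbitrary choices of ours.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulate the null model of Definition 1 with Poisson(pi_i pi_j) edges,
# then recover the propensities via Eq. (3).
n = 2000
pi = rng.uniform(0.5, 1.5, size=n)        # node propensities pi_i > 0

iu = np.triu_indices(n, k=1)              # independent upper-triangle edges
A = np.zeros((n, n))
A[iu] = rng.poisson(pi[iu[0]] * pi[iu[1]])
A = A + A.T                               # undirected, no self-loops

d = A.sum(axis=1)                         # degrees d_i
pi_hat = d / np.sqrt(d.sum())             # Eq. (3)

print(round(float(np.mean(np.abs(pi_hat - pi))), 3))  # typical error is small
```

At this network size the estimates track the true propensities closely, consistent with the √n-rate concentration that Theorem 2 below formalizes.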
Most importantly, we show that any finite collection of estimators defined by Eq. (3) tends toward a multivariate Normal distribution when n is large and Definition 1 is in force. This generalizes a univariate result in [11], which assumes Bernoulli(π_i π_j) edges and a power-law degree distribution.
Theorem 2 (Multivariate central limit theorem for Eq. (3)). Assume the model of Definition 1 and any finite set of estimators from Eq. (3). Relabeling the indices of these estimators from 1 to r without loss of generality, we have that as n → ∞, the standardized differences
√( E Σ_l d_l / Var d_i ) · (π̂_i − π_i), i = 1, . . . , r,
converge jointly in distribution to a Normal(0, I_r) random vector.
Figure 2: Within- and between-group edges in a network of political books frequently purchased together, where groups are defined by political alignment [12]. Note that only within-group edges appear in Q (Eq. (2)); by contrast, both types of edges contribute to modularity Q̂ (Eq. (1)).
Figure 2 illustrates the second main insight into the limiting behavior of modularity: its variability reduces asymptotically to that of a centered sum of within- and between-group edges. More specifically, every network degree d_i = Σ_{j≠i} A_ij decomposes into within- and between-group components:
d_i = d_i^w + d_i^b, with d_i^w = Σ_{j≠i: g(j)=g(i)} A_ij and d_i^b = Σ_{j: g(j)≠g(i)} A_ij.

Modularity reflects within- and between-group edges
This decomposition is surprisingly powerful, in part because the model of Definition 1 asserts that d_i^w and d_i^b are statistically independent for any fixed group assignment g(1), g(2), . . . , g(n). After separating the systematic bias term b in modularity from its random variation, we obtain the following decomposition.
Theorem 3 (Bias-variance decomposition for modularity). Under the null model of Definition 1 and for a fixed (non-random) group assignment g(1), g(2), . . . , g(n), it holds that
Q̂ − b = (1/Σ_l d_l) Σ_{i=1}^n [ α_i (d_i^w − E d_i^w) − β_i (d_i^b − E d_i^b) ] + ε,
where ε is a random error term and α_i = 1/2 + β_i.
[Table 1: Summary of the four benchmark networks: number of nodes, covariate and number of groups, degree quartiles, estimated bias and scale, observed modularity, test statistic, and p-value. Rows include Books (105 nodes) [12] by political alignment (3 groups) and Jazz bands (198 nodes) [13] by recording location (17 groups); all four networks yield p < 10^−6.]
Theorem 3 quantifies the random variability inherent in modularity under the model of Definition 1. It establishes that a main term contributing to the variability of Q̂ − b in this setting is a linear combination of centered within- and between-group degrees (d_i^w, d_i^b), which for each i are statistically independent. The weights α_i and β_i associated with this linear combination are determined by the global proportion of expected within-group edges in the network, relative to the local proportion of expected within-group edges specific to node i.
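The degree decomposition underlying this result is straightforward to verify numerically. The sketch below (Bernoulli edges, two equal groups; all parameter choices are ours) checks that d_i = d_i^w + d_i^b, and notes why the two parts are independent: they are sums over disjoint sets of independent edges.

```python
import numpy as np

rng = np.random.default_rng(1)

# Decompose degrees into within- and between-group parts for a fixed
# assignment g, under independent Bernoulli(pi_i pi_j) edges.
n = 400
pi = np.full(n, 0.5)
g = np.repeat([0, 1], n // 2)            # fixed two-group assignment
same = np.equal.outer(g, g)              # delta_{g(i) = g(j)}

P = np.outer(pi, pi)                     # E A_ij = pi_i pi_j
A = np.triu((rng.random((n, n)) < P).astype(float), k=1)
A = A + A.T                              # undirected, no self-loops

d_w = (A * same).sum(axis=1)             # within-group degree d_i^w
d_b = (A * ~same).sum(axis=1)            # between-group degree d_i^b

# The two parts sum exactly to the degree; since they depend on
# disjoint edge sets, they are independent under the null model.
assert np.allclose(d_w + d_b, A.sum(axis=1))
```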
Combining these two insights, we first shift modularity Q̂ by its approximate bias b and then scale it by the standard deviation s of the main term in Theorem 3. Recalling Theorem 3, we then know that we are left with a linear combination of centered within- and between-group degrees that are now also scaled by s. This leads directly to a central limit theorem for modularity Q̂ as stated in Theorem 1: (Q̂ − b)/s converges in distribution to a standard Normal random variable.

Applying the limit theorem to benchmark examples
Having established a central limit theorem for modularity in the presence of covariates, we now show how to apply this result in practice. To turn our theory into a methodology suitable for a specific network dataset, we first need to elicit a model for the data based on Definition 1. We then fit this model, leading ultimately to a p-value based on Theorem 1. We now illustrate the complete analysis procedure for four binary networks which, along with their covariates, frequently serve as benchmarks for community detection [12,16]. Table 1 summarizes all data and results.
1. First, we must further specify the null model of Definition 1, so that the parameter s² in Eq. (5) can be estimated. This can be done either by assuming sets of the variances Var A_ij to be equal, or by assuming a distribution for the edges A_ij. Since the benchmark networks we consider here are binary (A_ij ∈ {0, 1}), we model their edges as A_ij ∼ Bernoulli(π_i π_j).
2. Second, we must assess whether the five asymptotic assumptions of Definition 1 appear to hold for our data. Assumptions 3-5 are automatically satisfied for Bernoulli edges, and so we are left to assess Assumptions 1 (max_i π_i/π̄ bounded) and 2 (min_i π_i · √n growing). We do this by substituting π̂_i for π_i, noting that max_i π̂_i / π̄̂ = max_i d_i / d̄ and min_i π̂_i · √n = min_i d_i / √(d̄), where d̄ denotes the average degree. Replacing min_i d_i, d̄, and max_i d_i respectively by the first, second, and third degree quartiles shown in Table 1, we observe that for all four benchmark networks these ratios are of order one. This indicates that these networks are neither too star-like nor too sparse for Theorem 1 to apply.
3. Third, we estimate the parameters b and s needed to shift and scale Q̂ in accordance with Theorem 1. To obtain an estimator b̂, we substitute π̂ for π in Eq. (4). The estimator ŝ depends on the assumption added in Step 1 above. Here, with A_ij ∼ Bernoulli(π_i π_j), we have Var A_ij = π_i π_j (1 − π_i π_j). Then, ŝ follows directly by substituting π̂ for π in Eq. (5).
4. Finally, we compute and interpret the resulting approximate p-value. We first define community assignments g(1), g(2), . . . , g(n) based on a covariate, and calculate Q̂ as per Eq. (1). We next estimate (Q̂ − b)/s using b̂ and ŝ. Then, by Theorem 1, we compute an approximate one-sided p-value as follows:
p = 1 − Φ( (Q̂ − b̂)/ŝ ),  (6)
where Φ denotes the standard Normal distribution function. A small p-value implies that the observed value of modularity (or any larger value) is unlikely under the null. Table 1 shows the results of applying this procedure to four benchmark datasets: a network of books [12], where books are connected if they have frequently been purchased together, categorized by political affiliation (Fig. 2); a network of jazz bands [13], where bands are connected if they have at least one band member in common, categorized by recording location; a network of political commentary websites (weblogs) [14], where weblogs are connected if they refer to each other, categorized by political affiliation; and a network of physicists [15], where physicists are connected if they have co-authored a manuscript, categorized by manuscript subject category.
The first conclusion of our benchmark analysis is as follows: when we fit the null model of Definition 1 to each of these four networks, and then simulate from the fitted model (parametric bootstrap), each simulated network results in (via Eq. (6)) a p-value with empirical mean near 1/2 and standard deviation near 1/ √ 12. This empirical result aligns with Theorem 1, which predicts the p-values to be uniformly distributed with exactly that mean and standard deviation in the limit.
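A parametric bootstrap of this kind can be sketched as follows for binary networks, assuming the Bernoulli specification of Step 1; the helper names and the planted example are ours, not the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(2)

def modularity(A, g):
    """Modularity of Eq. (1) for a fixed assignment g."""
    d = A.sum(axis=1)
    total = d.sum()
    same = np.equal.outer(g, g)
    np.fill_diagonal(same, False)
    return ((A - np.outer(d, d) / total)[same]).sum() / total

def bootstrap_pvalue(A, g, B=199):
    """Fit pi_hat_i = d_i / sqrt(sum_l d_l) (Eq. (3)), simulate B
    Bernoulli(pi_hat_i pi_hat_j) networks under the null, and compare
    the observed modularity against the simulated null distribution."""
    n = A.shape[0]
    pi_hat = A.sum(axis=1) / np.sqrt(A.sum())
    P = np.clip(np.outer(pi_hat, pi_hat), 0.0, 1.0)  # edge probabilities
    q_obs = modularity(A, g)
    exceed = 0
    for _ in range(B):
        S = np.triu((rng.random((n, n)) < P).astype(float), k=1)
        S = S + S.T                      # symmetric null replicate
        exceed += modularity(S, g) >= q_obs
    return (1 + exceed) / (B + 1)        # empirical one-sided p-value
```

Applied to a network with planted within-group structure, this returns a small p-value; applied to a network drawn from the fitted null itself, the p-values are approximately uniform, as the text describes.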
Our second conclusion is that, when using the observed data rather than data simulated under the null, each of the covariates leads (again via Eq. (6)) to a very small p-value (< 10^−6; see Table 1). This suggests that the data as observed are extremely unlikely under the null. Furthermore, since the null itself cannot explain any community structure, the conclusion we obtain agrees with the use of these covariates by other researchers as ground truth in community detection settings.

Evaluating communities in a multi-edge email network
We now illustrate how our methodology can identify covariates that reflect a network's community structure. This analysis goes beyond the four benchmark examples considered above, where we validated our methodology but did not reach any new data-analytic conclusions. Here we evaluate the effects of employee seniority, gender, and company department on community structure in a multi-edge corporate email network (see Fig. 3). Table 2 summarizes all results, showing that each of these covariates results in a small p-value, while covariates based on grouping the first- or last-name initials of the employees do not. We will return to this analysis in more detail below, after describing the data and eliciting a suitable model. This network and its covariates form a substantially richer dataset than those treated above. The data come from the Enron Corporation [18]: as part of a U.S. government investigation following allegations of fraud, the email activities of senior employees from 1998 to 2002 were made public. Following the analysis in [18], we exclude all emails that have been sent en masse (to more than five recipients), leading to 32,261 pairwise email exchanges between 153 employees. To model this network we will use the full flexibility afforded by Definition 1, following the four steps described in the previous section to determine a p-value corresponding to each covariate.
Step 1: To specify the null model of Definition 1 for these multi-edge data, we consider four candidate edge distributions: Poisson(π_i π_j), NegativeBinomial(π_i π_j, r) with common shape parameter r, and zero-inflated versions of both. Figure 4 shows how well these distributions model the multi-edges. Even without zero-inflation, the negative Binomial distribution yields a good fit, particularly in the right tail. A formal model comparison via suitable likelihood ratio tests [19] confirms this: as Table 3 shows, the negative Binomial achieves the best balance between fitting the observed data (residual deviance) and model complexity (degrees of freedom).
Table 3: Goodness-of-fit versus model complexity for the models in Fig. 4 (starting from the 1-parameter model Poisson(λ), relative to a saturated negative Binomial model with r → ∞).
We thus choose the model
A_ij ∼ NegativeBinomial(π_i π_j, r).  (7)
Step 2: To verify the assumptions of Definition 1 for our data, we first assess Assumptions 1 and 2 exactly as before. Computing the quartiles Q_1-Q_3 of the degrees (68, 200, and 564), we see that Q_3/Q_2 and Q_1/√(Q_2) are both of order one. Assumption 3 (max_i π_i/√n shrinking) can be assessed analogously via max_i π̂_i/√n = max_i d_i/(n √(d̄)). Assumptions 4 and 5 require Var A_ij / E A_ij to be bounded above and away from zero, and the skewness of each edge to be controlled. Under the parametrization of Eq. (7), Var A_ij / E A_ij = 1 + π_i π_j / r. To assess this, we observe that a maximum-likelihood estimate of r [19] yields r̂ = 0.047, while the first three quartiles of Ê A_ij are respectively 0.16, 0.59, and 2.1.
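These numbers can be turned into an explicit check of the variance-to-mean ratio. The snippet below assumes the standard mean/shape parametrization of the negative Binomial, for which Var A_ij = μ + μ²/r and hence Var/E = 1 + μ/r with μ = π_i π_j; r̂ and the quartiles are the values reported in the text.

```python
# Variance-to-mean ratio 1 + mu/r of a negative Binomial with mean mu
# and shape r, evaluated at the reported quartiles of E A_ij.
r_hat = 0.047                       # reported ML estimate of the shape
mu_quartiles = [0.16, 0.59, 2.1]    # reported quartiles of the fitted E A_ij
ratios = [1 + mu / r_hat for mu in mu_quartiles]
print([round(x, 1) for x in ratios])  # → [4.4, 13.6, 45.7]
```

The small value of r̂ thus corresponds to strong overdispersion across the empirical range of edge means: the ratio is bounded at any fixed mean, as Assumption 4 requires, but far from one.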
Step 3: To estimate b and s in Theorem 1, we substitute π̂_i for π_i in Eqs. (4) and (5) exactly as before. Recall, however, that to estimate s we also require an estimate of Var A_ij in Eq. (5). Under the parametrization of Eq. (7), it follows that
Var A_ij = π_i π_j (1 + π_i π_j / r).  (8)
Thus, Var A_ij can be estimated by substituting π̂_i for π_i and r̂ for r in Eq. (8). This yields the required estimators b̂ and ŝ.
Step 4: To calculate p-values, we must first compute (Q̂ − b̂)/ŝ for each covariate. In advance of our analysis, we would expect that employee gender, seniority, and department might reflect aspects of community structure in email interactions. In contrast, we would expect covariates based on the first or last name of each individual to be non-informative. Figure 3 illustrates, in decreasing order of (Q̂ − b̂)/ŝ, the observed structure of our data when grouped by covariate. Table 2 reports two approximate p-values per covariate, in contrast to the previous section. The first derives (via Eq. (6)) from Theorem 1, which shows the limiting distribution of (Q̂ − b̂)/ŝ under the assumed model to be a standard Normal. The second is based on 10^7 replicates of the parametric bootstrap, whereby we fit a negative Binomial model to the data and then simulate from the fitted values to obtain an empirical finite-sample distribution. Table 2 indicates that our asymptotic theory is somewhat conservative in this setting, leading here to larger p-values than the bootstrap.
Finally, considering these p-values in more detail, we see from Table 2 that for the covariates of department, gender, and seniority, all p-values fall below 1% (leading to a corrected total of 5% after adjusting for multiple comparisons). In contrast, we obtain large p-values for first-and last-name covariates. This matches our expectations that department, gender, and seniority are likely to have an impact on email interactions, while there is no obvious reason why this should hold for name-related covariates.

Discussion
Networks have richer and more varied structure than can be described by a single "best" community assignment. To reflect this, we have introduced an approach which exploits the structural information captured by covariates, each of which may describe different aspects of community structure in the data. In contrast to community detection per se, this approach allows us to assess the significance of a given, interpretable community assignment with respect to the observed network structure. As described in the data analysis examples above, our method leads to the identification of structurally significant community assignments, ultimately yielding a better understanding of the network under study.
In technical terms, we have established a central limit theorem for modularity under a nonparametric null model, yielding p-values to assess the significance of observed community structure. The model we introduce shows explicitly how modularity measures variability in the data that cannot be explained solely by node-specific propensities for connection. What is more, modularity has more explanatory power than a classical (chi-squared) goodness-of-fit statistic: by aggregating the estimated signed residuals A ij − d i d j / l d l within every network community, it measures the global tendency of a given community assignment to explain the observed network structure.
To advance the state of the art in network analysis, we as a research community must use this explanatory power to understand the effects of multiple observed communities on network structure. Our work here represents a first step in this direction: we use the explanatory power of modularity to assess the significance of observed community structure relative to a null model. This opens the door to more advanced uses of multiple observed community assignments within formal statistical modeling frameworks. This is an important next step, since we see clear evidence here that multiple groupings may explain different aspects of a network's community structure.

A Notation and assumptions
For the following proofs we always consider an undirected random graph on n nodes with no self-loops. We model the edges A_ij as independent random variables with expectation E A_ij = π_i π_j, where π = (π_1, . . . , π_n) ∈ R^n_{>0}. We denote the degree of node i by d_i; i.e., d_i = Σ_{j≠i} A_ij. The remaining five assumptions of Definition 1 of the degree-based model are not all needed at all times and will therefore be mentioned explicitly. For convenience we restate the assumptions below, all of which reference a sequence of networks where n → ∞.
1. No node dominates the network; i.e., n max_i π_i / ‖π‖_1 = O(1);
2. The network is not too sparse; i.e., min_i π_i = ω(1/√n);
3. The expectation of each edge does not diverge too quickly; i.e., max_i π_i = o(√n);
4. The ratio of variance to expectation of each edge is controlled; i.e., ∀ i, j: Var A_ij / E A_ij = Θ(1);
5. The skewness of each edge A_ij is controlled; i.e., ∀ i, j: E|A_ij − E A_ij|³ / (Var A_ij)^{3/2} = O(1).
We use bold letters to denote vectors.

B Proof of Theorem 2
We first show a univariate central limit theorem for the scalar estimator π̂_i = d_i / √(‖d‖_1). We then extend this result to the multivariate case, applying the Cramér-Wold theorem.
Preliminaries: Since the edges A_ij, i < j, are independent, it follows as shown in [11] that for finite n, Var d_i = Σ_{j≠i} Var A_ij.
Theorem B.1 (Central limit theorem for π̂_i). Consider Assumptions 1-5. Define π̂_i = d_i / √(‖d‖_1) as an estimator of π_i. Then as n → ∞,
(π̂_i − π_i) / √( Var d_i / E ‖d‖_1 ) →_d Normal(0, 1).
The asymptotic variance satisfies Var d_i / E ‖d‖_1 = O(1/n), and can be consistently estimated using a plug-in estimator for A_ij ∼ Bernoulli(π_i π_j) and A_ij ∼ Poisson(π_i π_j).
Proof. The proof is a generalization of the proof of Theorem 3.2 in [11], which assumes Bernoulli edges and a power-law degree distribution. We write the standardized difference π̂_i − π_i as a combination of three terms T_1, T_2, and T_3 (Eq. (14)). To deduce the required result, we show that T_1 converges in distribution to a Normal(0, 1) random variable, and that T_2 and T_3 converge in probability to 0 and 1, respectively. Slutsky's theorem enables us to combine the results and to obtain the claimed convergence in distribution.
Term T_1: From Assumption 2 (⇒ E d_i → ∞) and Assumption 4 (Var A_ij / E A_ij = Θ(1)), it follows that Var d_i → ∞. Since in addition the skewness of each edge A_ij is asymptotically bounded (Assumption 5), the Lyapunov condition for exponent 1 is satisfied. Hence, the Lindeberg-Feller central limit theorem allows us to conclude that T_1 →_d Normal(0, 1).
Term T_2: We write T_2 as the product of two ratios (Eq. (15)).
Term T_2 converges in probability to 0 since both a) the first ratio converges to 0 and b) the second ratio converges to 0 in probability.
a) This convergence is driven by the fact that E d_i grows (Eqs. (9) and (12)) while Var d_i → ∞. More precisely, considering π̃ = π / max_j π_j, we can conclude from ‖π̃‖_2² ≤ ‖π̃‖_1 that ‖π‖_2² ≤ max_j π_j · ‖π‖_1. Assumption 1 implies that max_j π_j / ‖π‖_1 = O(1/n), and thus we conclude that ‖π‖_2² / ‖π‖_1² = O(1/n). This allows us to apply a convergent Taylor expansion of √(1 − x) at 0 in Eq. (16); it follows that the left-hand side of Eq. (19) converges to 0 as n → ∞.

b) We show below that the second ratio in Eq. (15) converges in probability to 0; this follows since π_i / √(Var d_i) → 0 under Assumptions 1 and 4 (see c) below), and
c) from Assumption 4 it follows that Var d_i = Θ(E d_i).
Lemma B.1. Consider Assumptions 2-5. Then, √( ‖d‖_1 / E ‖d‖_1 ) →_P 1.
Proof. Observe that the square root function has one continuous derivative at 1. A Taylor expansion in probability of ‖d‖_1 / E ‖d‖_1 about 1 requires in addition [20, p. 201] that I. ∃ a ∈ R: ‖d‖_1 / E ‖d‖_1 = a + O_P(r_n); with II. r_n → 0 as n → ∞.
I. It follows from Chebyshev's inequality that ‖d‖_1 / E ‖d‖_1 = 1 + O_P( √(Var ‖d‖_1) / E ‖d‖_1 ).
II. As a consequence of I., r_n = √(Var ‖d‖_1) / E ‖d‖_1. From Eq. (12) and Assumption 2 (⇒ E d_i → ∞) it follows that E ‖d‖_1 → ∞. Since the A_ij are independent for i < j, and since we assume Var A_ij / E A_ij = Θ(1) (Assumption 4), it holds that Var ‖d‖_1 = Θ(E ‖d‖_1). It follows that the ratio √(Var ‖d‖_1) / E ‖d‖_1 → 0. We can now apply a convergent Taylor expansion in probability. Since ‖d‖_1 = 2 Σ_{i<j} A_ij is a sum of independent random variables, we apply the Lindeberg-Feller central limit theorem analogously to Term T_1. As a consequence of Lemma B.1, we now know that the numerator of term b) in Eq. (15) is bounded in probability. Since we show in Eq. (20) that the corresponding denominator diverges, this completes the proof of the convergence of Term T_2 (see Eq. (15)); i.e., T_2 →_P 0.
Term T_3: Combining Eqs. (21) and (22), we see that T_3 converges in probability to 1 because of Assumption 2 (⇒ E ‖d‖_1 → ∞).
Applying the continuous mapping theorem completes the treatment of T_3. Slutsky's theorem enables us to combine the results on the convergence of terms T_1-T_3 to obtain that π̂_i, appropriately standardized, converges in distribution to a Normal(0, 1) random variable. To complete the proof of Theorem B.1, it remains to show that Var d_i / E ‖d‖_1 = O(1/n), and that it can be consistently estimated using a plug-in estimator for A_ij ∼ Bernoulli(π_i π_j) and A_ij ∼ Poisson(π_i π_j). Since Var A_ij / E A_ij = Θ(1) (Assumption 4), we know that Var d_i = Θ(E d_i) = Θ(π_i ‖π‖_1). We know that n π_i / ‖π‖_1 = O(1) (Assumption 1), and we have seen in Eq. (18) that ‖π‖_2² / ‖π‖_1² = O(1/n) (also from Assumption 1). Hence, Var d_i / E ‖d‖_1 = O(1/n). We defer the proof of consistency of the plug-in estimator of Var d_i / E ‖d‖_1 for A_ij ∼ Bernoulli(π_i π_j) and A_ij ∼ Poisson(π_i π_j) to Theorem D.1, where we show a more general statement.
Having shown a univariate central limit theorem for eachπ i , we are now ready to extend this result to the multivariate case. The Corollary below is identical to Theorem 2 in the main text.
Furthermore, for all i, the asymptotic variance Var d_i / E ‖d‖_1 can be consistently estimated for A_ij ∼ Bernoulli(π_i π_j) and A_ij ∼ Poisson(π_i π_j) using a plug-in estimator.
Proof. This proof is the multidimensional equivalent of the proof of Theorem B.1. It is analogously driven by the fact that the vector of standardized differences can be reduced to a sum of independent but not identically distributed random vectors, which in turn converges in distribution to a multivariate standard Normal random vector, as we now show. In direct analogy to the univariate case of Eq. (14), we decompose the standardized vector into terms m_1, m_2, and m_3 corresponding to T_1, T_2, and T_3. We now prove that m_1 →_d Normal(0, I_r). In order to apply a multivariate central limit theorem, we rearrange m_1 so as to extract a sum of independent random vectors (m_12):
m_1 = D_11 m_12 + m_13.
We will show three things: that the matrix D_11 converges to the identity matrix I_r; that m_12 →_d Normal(0, I_r); and that the term m_13 →_P 0. For the term D_11, its ith diagonal entry satisfies, for all i,
√( Var( Σ_{l=r+1}^n A_li ) / Var d_i ).

Furthermore, from Assumption 4 (Var A_ij / E A_ij = Θ(1)) and Assumption 1, it follows that Var( Σ_{l=r+1}^n A_li ) / Var d_i → 1 for all i. Hence, the diagonal matrix D_11 converges to the identity matrix I_r in the operator norm.
The term m_12 →_d Normal(0, I_r), as we now show by applying the Cramér-Wold theorem. The term m_12 is a random vector depending on n, where each component is a sum of independent random variables. We show that, as a consequence, each component converges marginally in distribution to a Normal(0, 1) random variable (by the same argument as for Term T_1 in Theorem B.1). From Assumption 2 (⇒ E d_i → ∞) and Assumption 4 (Var A_ij / E A_ij = Θ(1)), it follows that Var d_i → ∞. Since in addition we assume the skewness of each edge A_ij to be bounded asymptotically (Assumption 5), the Lyapunov condition (for δ = 1) is satisfied for each component. Hence, the Lindeberg-Feller central limit theorem lets us conclude that each component converges marginally in distribution to a Normal(0, 1) random variable [22, p. 362].
Furthermore, the components of m_12 are independent. It follows that for each (c_1, . . . , c_r) ∈ R^r and Y_u iid ∼ Normal(0, 1), u = 1, . . . , r, the linear combination c^T m_12 converges in distribution to Σ_u c_u Y_u. Applying the Cramér-Wold theorem, we conclude that m_12 →_d Normal(0, I_r). Finally, the term m_13 converges in probability to 0. In turn, we deduce the required result (see Eq. (26)) that m_1 →_d Normal(0, I_r). To complete the proof we need to show consistency of the plug-in estimator of Var d_i / E ‖d‖_1 for A_ij ∼ Bernoulli(π_i π_j) and A_ij ∼ Poisson(π_i π_j). We defer this to Theorem D.1, where we show a more general statement.

C Proof of the Corollary of Theorem 2
As a reminder to the reader, the Corollary in the main text is as follows.

Corollary (Central limit theorem for $\operatorname{E} A_{ij}$). Consider Assumptions 1-5, and define the estimator $\widehat{\operatorname{E} A}_{ij} = \hat\pi_i \hat\pi_j$. Then, for all $i, j$, the standardized difference $\sqrt{\operatorname{E}\|d\|_1 \big/ \big(\pi_j^2 \operatorname{Var} d_i + \pi_i^2 \operatorname{Var} d_j\big)}\,\big(\widehat{\operatorname{E} A}_{ij} - \operatorname{E} A_{ij}\big)$ converges in distribution to a $\mathrm{Normal}(0, 1)$ random variable. Furthermore, for all $i, j$, the asymptotic variance $\big(\pi_j^2 \operatorname{Var} d_i + \pi_i^2 \operatorname{Var} d_j\big)\big/\operatorname{E}\|d\|_1$ can be consistently estimated using a plug-in estimator for $A_{ij} \sim \mathrm{Bernoulli}(\pi_i \pi_j)$ and $A_{ij} \sim \mathrm{Poisson}(\pi_i \pi_j)$.
Proof. We show that $\widehat{\operatorname{E} A}_{ij} = \hat\pi_i \hat\pi_j$, once appropriately standardized, converges in distribution to a $\mathrm{Normal}(0, 1)$ random variable. It can easily be seen that
$$\hat\pi_i \hat\pi_j - \pi_i \pi_j = \pi_j(\hat\pi_i - \pi_i) + \pi_i(\hat\pi_j - \pi_j) + (\hat\pi_i - \pi_i)(\hat\pi_j - \pi_j). \tag{29}$$
Under the hypothesis that $(\hat\pi_i - \pi_i)(\hat\pi_j - \pi_j)$ is asymptotically negligible, the asymptotic behavior of $\widehat{\operatorname{E} A}_{ij} - \pi_i \pi_j$ will be dominated by $\pi_j(\hat\pi_i - \pi_i) + \pi_i(\hat\pi_j - \pi_j)$. As a consequence, we standardize all quantities in Eq. (29) by the factor $\sqrt{\operatorname{E}\|d\|_1 \big/ \big(\pi_j^2 \operatorname{Var} d_i + \pi_i^2 \operatorname{Var} d_j\big)}$, the reciprocal of an approximation of the standard deviation of $\pi_j(\hat\pi_i - \pi_i) + \pi_i(\hat\pi_j - \pi_j)$. We then use Eq. (29) to write the standardized difference as the sum of two terms $T_1$ and $T_2$. To deduce the required result, we will show that $T_1 \xrightarrow{d} \mathrm{Normal}(0, 1)$ and $T_2 = o_P(T_1)$; Slutsky's theorem will then enable us to combine these results and obtain the claimed convergence in distribution.

Term $T_2$: It remains to show that $T_2 = o_P(T_1)$; i.e., that $(\hat\pi_i - \pi_i)(\hat\pi_j - \pi_j) = o_P\big(\pi_j(\hat\pi_i - \pi_i) + \pi_i(\hat\pi_j - \pi_j)\big)$. We now use Lemma C.1, which we prove immediately below.

Proof. First, we appeal to a Taylor expansion in probability of $\hat\pi_i$, written as a function $f(A, B)$ of appropriately normalized sums as in Eq. (31). Observe that this function has continuous partial derivatives at $(1, 1)$. A Taylor expansion in probability [20, p. 201] of $f$ requires in addition that $(A - 1)^2 + (B - 1)^2 \xrightarrow{P} 0$. By Chebyshev's inequality, together with Assumptions 2 and 4 (which imply $\operatorname{Var} d_i / (\operatorname{E} d_i)^2 \to 0$), this condition is satisfied.

We can now expand the function $f(A, B)$ in Eq. (31) in a convergent Taylor series around $(1, 1)$. In combination with Assumptions 2 and 4, we obtain the leading-order behavior of $\hat\pi_i - \pi_i$ together with the rate of the remainder. Combining Eqs. (32) and (34), the result of Lemma C.1 follows immediately.

Having established the claimed central limit theorem, it remains to address estimation of the asymptotic variance. To complete the proof of the Corollary, we need to show consistency of the plug-in estimator of $n\big(\pi_j^2 \operatorname{Var} d_i + \pi_i^2 \operatorname{Var} d_j\big)\big/\sum_{l=1}^{n} \operatorname{E} d_l$ for networks with edges $A_{ij} \sim \mathrm{Bernoulli}(\pi_i \pi_j)$ or $A_{ij} \sim \mathrm{Poisson}(\pi_i \pi_j)$. We defer this to Theorem D.1, where we show a more general statement.
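The Corollary can be probed by simulation. In the sketch below (an illustration of ours, not the paper's code), we take $\hat\pi_i = d_i/\sqrt{\|d\|_1}$, so that $\hat\pi_i \hat\pi_j = d_i d_j / \|d\|_1$; this reading of the estimator, the constant $\pi_i = 0.3$, and all variable names are assumptions on our part. We standardize $\hat\pi_i \hat\pi_j - \pi_i \pi_j$ by the factor $\sqrt{\operatorname{E}\|d\|_1/(\pi_j^2 \operatorname{Var} d_i + \pi_i^2 \operatorname{Var} d_j)}$ (our reading of the Corollary's factor) and check that the result is approximately standard Normal.

```python
import numpy as np

rng = np.random.default_rng(1)
n, reps = 250, 1200
pi = np.full(n, 0.3)
i, j = 0, 1                                 # the pair of nodes under study

P = np.outer(pi, pi); np.fill_diagonal(P, 0.0)
mean_d = P.sum(axis=1)                      # E d_i
var_d = (P * (1.0 - P)).sum(axis=1)         # Var d_i (Bernoulli edges)
E_d1 = mean_d.sum()                         # E ||d||_1

scale = np.sqrt(E_d1 / (pi[j] ** 2 * var_d[i] + pi[i] ** 2 * var_d[j]))

iu = np.triu_indices(n, 1)
T = np.empty(reps)
for t in range(reps):
    A = np.zeros((n, n))
    A[iu] = rng.random(iu[0].size) < P[iu]  # one Bernoulli network
    A = A + A.T
    d = A.sum(axis=1)
    EA_hat = d[i] * d[j] / d.sum()          # hat(pi)_i * hat(pi)_j
    T[t] = scale * (EA_hat - pi[i] * pi[j])

print(T.mean(), T.std())                    # ~ 0 and ~ 1
```

The sample mean and standard deviation of the standardized errors should be close to 0 and 1, consistent with the claimed $\mathrm{Normal}(0, 1)$ limit.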
Recall that modularity $Q$ (Eq. [1] in the main text) is an empirical quantity that estimates its population counterpart (Eq. [2] in the main text), in the sense that their difference tends to zero at a rate no slower than $(\pi_i + \pi_j)/\sqrt{n}$ (Assumption 3). More precisely, we have the following.

Proof. Recall the decomposition in Eq. (29) and the rate of convergence established in Eq. (30). From Lemma C.1 and Assumptions 1, 2, and 4, the claimed bound then follows.

D Consistency of the plug-in estimator for $\operatorname{Var} d_i / \operatorname{E}\|d\|_1$
Throughout the Theorem and Corollaries in the main text (and above), we state that $\operatorname{Var} d_i / \operatorname{E}\|d\|_1$ can be consistently estimated using a plug-in estimator for $A_{ij} \sim \mathrm{Bernoulli}(\pi_i \pi_j)$ and $A_{ij} \sim \mathrm{Poisson}(\pi_i \pi_j)$. In fact, this is true more generally, as we show below. Each edge distribution leads to a different variance $\operatorname{Var} d_i$, each of which is $\Theta(\operatorname{E} d_i)$ by Assumption 4. We now show that the term $\operatorname{Var} d_i / \operatorname{E}\|d\|_1$ can be consistently estimated by a plug-in estimator, as long as $\operatorname{Var} d_i$ itself can be. More precisely, we have the following.

Theorem D.1. Suppose that $\operatorname{Var} d_i$ can be consistently estimated by a plug-in estimator $\widehat{\operatorname{Var}}\, d_i$. Then, $\operatorname{Var} d_i / \operatorname{E}\|d\|_1$ can be estimated consistently using the plug-in estimator $\widehat{\operatorname{Var}}\, d_i / \|d\|_1$.

Proof. We first write the estimation error as a product of ratios. From term $T_3$ (Eq. (25)) in the proof of Theorem B.1, we know that under Assumption 4 ($\operatorname{Var} A_{ij} = \Theta(\operatorname{E} A_{ij})$) the denominator concentrates. First, from Chebyshev's inequality and Assumption 4, we know that $\|d\|_1 / \operatorname{E}\|d\|_1 \xrightarrow{P} 1$. We may then apply a convergent Taylor expansion of $f(x) = (1 + x)^{-1}$ at $1$, since its argument converges in probability. Via straightforward algebraic computations, we obtain Eqs. (39)-(41). Combining Eqs. (38), (39), and (41) and applying Assumption 1, the remaining ratio reduces to $\|\pi\|_2^2 / \|\pi\|_1^2$, which vanishes. Finally, recall Eq. (36). The inverse of a random variable which converges in probability to a constant $c$ must in turn converge to $1/c$, as long as $c \neq 0$ [21, Theorem 2.1.3]. Applying this fact and the continuous mapping theorem, we obtain the claimed convergence in probability.

Having established Theorem D.1, we now show for $A_{ij} \sim \mathrm{Bernoulli}(\pi_i \pi_j)$ and $A_{ij} \sim \mathrm{Poisson}(\pi_i \pi_j)$ that $\widehat{\operatorname{Var}}\, d_i / \operatorname{Var} d_i \xrightarrow{P} 1$. This allows us to apply Theorem D.1 to conclude that $\operatorname{Var} d_i / \operatorname{E}\|d\|_1$ can be estimated consistently via its plug-in estimator.
We begin with the Bernoulli case. From Assumptions 2 and 4 ($\operatorname{Var} A_{ij} = \Theta(\operatorname{E} A_{ij})$), it follows that $\frac{\|\pi\|_1}{\pi_i} \cdot \frac{d_i}{\|d\|_1} \xrightarrow{P} 1$, as we will now show. We write this ratio as a product of simpler terms. By Chebyshev's inequality and Assumptions 2 and 4, we know that $d_i / \operatorname{E} d_i \xrightarrow{P} 1$. For the term $E_n$, we first establish the equivalence in Eq. (43). By Eq. (18), Assumption 1 implies $\|\pi\|_2^2 / \|\pi\|_1^2 = O(1/n)$. Furthermore, by Chebyshev's inequality and Assumptions 2 and 4, $\|d\|_1 / \operatorname{E}\|d\|_1 \xrightarrow{P} 1$, and it follows that $E_n$ vanishes in probability. For the non-random sequence $\{c_n; n \in \mathbb{N}\}$ in Eq. (43), it holds that $c_n \to 1$. The inverse of a random variable which converges in probability to a constant $c$ must in turn converge to $1/c$, as long as $c \neq 0$ [21, Theorem 2.1.3]. Furthermore, the product of two random variables, converging in probability to constants $c$ and $d$ respectively, itself converges in probability to $cd$ [21, Theorem 2.1.3]. Combining these facts with Eq. (42), we obtain the required result. From Assumption 2 ($\pi_i = \omega(1/\sqrt{n})$), it follows that $\min_i \operatorname{E} d_i$ diverges. Hence, we have shown that $\operatorname{Var} d_i$ can be consistently estimated by its plug-in estimator $\widehat{\operatorname{Var}}\, d_i$.
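Theorem D.1 combined with the Bernoulli case can be illustrated numerically. In the sketch below (our own, with $\hat\pi_i \hat\pi_j = d_i d_j / \|d\|_1$ taken as the plug-in for $\pi_i \pi_j$; the range of $\pi_i$ and all names are assumptions), we compare the plug-in estimate $\widehat{\operatorname{Var}}\, d_i / \|d\|_1$ with the population quantity $\operatorname{Var} d_i / \operatorname{E}\|d\|_1$ on a single large network.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 2000
pi = rng.uniform(0.3, 0.5, n)                   # heterogeneous pi (illustrative)

P = np.outer(pi, pi); np.fill_diagonal(P, 0.0)
var_d = (P * (1.0 - P)).sum(axis=1)             # true Var d_i for Bernoulli edges
E_d1 = P.sum()                                  # E ||d||_1

# One realized network
A = np.zeros((n, n)); iu = np.triu_indices(n, 1)
A[iu] = rng.random(iu[0].size) < P[iu]
A = A + A.T
d = A.sum(axis=1)

# Plug-in: replace pi_i pi_j by d_i d_j / ||d||_1 in the variance formula
Phat = np.outer(d, d) / d.sum(); np.fill_diagonal(Phat, 0.0)
var_d_hat = (Phat * (1.0 - Phat)).sum(axis=1)

ratio = (var_d_hat / d.sum()) / (var_d / E_d1)  # consistency: ratios cluster near 1
print(np.mean(ratio), np.median(ratio))
```

Individual ratios fluctuate at the scale of $1/\sqrt{\operatorname{E} d_i}$, so aggregate summaries (mean, median over nodes) are the natural consistency check at this network size.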
An analogous argument applies in the remaining case. We write the plug-in estimator in product form; it can easily be seen that $\hat\pi_i \|\hat\pi\|_1 = d_i$. We have seen in Eq. (32) that Assumptions 2 and 4 imply concentration of $d_i$ about its mean, and it follows from identical arguments that the analogous statements hold for the remaining random terms. From Assumption 1, we conclude the corresponding deterministic limits. Combining Eqs. (47) and (48), and then Eq. (46), we decompose the quantity of interest into terms $R_n$ and $S_n$ (Eq. (49)). We show the convergence of $S_n$ from Eq. (49) in two steps.
Step 2: We write the ratio of interest as a product of factors, each of which converges in probability (Assumptions 1 and 4). From Eq. (39), we know that under Assumption 1 the sequence $\{t_n; n \in \mathbb{N}\}$ converges to 1.
The inverse of a random variable which converges in probability to a constant c, must in turn converge to 1/c, as long as c = 0 [21, Theorem 2.1.3]. Furthermore, the product of two random variables, converging in probability to a constant c and a constant d respectively, itself converges to the product of the constants cd [21, Theorem 2.1.3]. Thus, Step 2 follows.
Returning now to Eq. (49) and following the same argument, we conclude that $S_n \xrightarrow{P} 1$, and in turn obtain the required convergence of the plug-in estimator.

E Proof of Theorem 1
We now state and prove Theorem E.1, which is identical to Theorem 1 in the main text, except for the formulation of the weights $\beta_j$, $j = 1, \ldots, n$. In Corollary F.1 below, we introduce the formulation for $\beta_j$ used in Theorem 1 to improve interpretability and show that both formulations are asymptotically equivalent. The proof below expands on the proof sketch given in the main text.
Theorem E.1 (Central limit theorem for modularity). In addition to Assumptions 1-5, suppose that the number $K$ of communities grows strictly more slowly than $n$ (i.e., $K/n \to 0$). Then, as $n \to \infty$, the suitably centered and standardized modularity converges in distribution to a $\mathrm{Normal}(0, 1)$ random variable. The $\beta_i$ are defined in Eq. (52) in Lemma E.1 below and are non-random.
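A Monte Carlo sketch can make the content of Theorem E.1 concrete. The code below is our own illustration: it standardizes $Q$ empirically rather than via the analytic weights $\beta_j$, and the constant $\pi_i = 0.3$, the assignment $g$, and all names are assumptions. It computes the modularity of a fixed covariate-style community assignment over repeated draws from the null model and checks that its distribution is approximately Normal, which is what justifies attaching $p$-values to observed modularity.

```python
import numpy as np

rng = np.random.default_rng(3)
n, K, reps = 200, 4, 1500
pi = np.full(n, 0.3)
g = np.arange(n) % K                    # fixed covariate-based assignment

P = np.outer(pi, pi); np.fill_diagonal(P, 0.0)
iu = np.triu_indices(n, 1)

def modularity(A, g, K):
    d = A.sum(axis=1)
    S = d.sum()                         # ||d||_1 = twice the number of edges
    q = 0.0
    for k in range(K):
        idx = (g == k)
        q += A[np.ix_(idx, idx)].sum() / S - (d[idx].sum() / S) ** 2
    return q

Q = np.empty(reps)
for t in range(reps):
    A = np.zeros((n, n))
    A[iu] = rng.random(iu[0].size) < P[iu]
    A = A + A.T
    Q[t] = modularity(A, g, K)

Z = (Q - Q.mean()) / Q.std()
skew = (Z ** 3).mean()
frac = (np.abs(Z) < 1.0).mean()
print(skew, frac)                       # ~ 0 and ~ 0.68 under approximate normality
```

In practice, one would use the theorem's analytic mean and variance in place of the Monte Carlo moments; the empirical standardization here only serves to exhibit the Gaussian shape.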
Proof. The proof consists of two main steps. First, in Lemma E.1, we will relate modularity to a linear combination of within-group degrees ($d^{w}_i$ in Eq. (50) below) and between-group degrees ($d^{b}_i$ in Eq. (50) below). Second, in Lemma E.2, we will show that this linear combination, when appropriately standardized, converges in distribution to a $\mathrm{Normal}(0, 1)$ random variable.
Let us first note some preliminaries, recalling notation from the main text. Denote $\|\pi^{g(j),j}\|_1 = \sum_{i \neq j} \pi_i \delta_{g(i)=g(j)}$, with the analogous definition for the between-group sum. We are now ready to proceed with our analysis. The following Lemma is identical to Lemma 1 in the main document.
Lemma E.1. The following identity holds, where the non-random quantities $\alpha_j$, $\beta_j$, $b$, and the error term $\epsilon$ are defined in Eqs. (52)-(54):
$$Q = \sum_{j=1}^{n} \beta_j \big(d^{w}_j - \operatorname{E} d^{w}_j\big) + \sum_{j=1}^{n} \alpha_j \big(d^{b}_j - \operatorname{E} d^{b}_j\big) + b + \epsilon.$$

Proof. Since $\operatorname{E} A_{ij}$ is estimated by $d_i d_j / \|d\|_1$, modularity can be written as in Eq. (55). We will show this lemma in six steps. We: 1. Substitute the decomposition of Eq. (29) into Eq. (55); 2. Expand the denominator $\|d\|_1$ around its mean in a convergent Taylor series; 3. Control the lower-order terms of this expansion via Chebyshev's inequality; 4. Rewrite $Q$ in terms of the non-random factors $\alpha_j$ and $\beta_j$ and the centered degrees $d^{w}_j$ and $d^{b}_j$; 5. Collect all higher-order non-random terms in $Q$ into $b$; and 6. Show that the remaining lower-order random and non-random terms can be absorbed into $\epsilon$.
Step 1: Recall from Eq. (29) the decomposition of $\hat\pi_i \hat\pi_j - \pi_i \pi_j$, and from Eq. (30) that, given Assumptions 1, 2, and 4, the associated error term is controlled. As a consequence, we may combine these two results as in Eq. (56). Focusing on the rightmost sum in Eq. (55), we then obtain from Eq. (56) an expression over the pairs $i < j$. Renaming the indices in the first summand from $i$ to $j$ and vice versa, the quantity $\sum_{j=1}^{n} \sum_{i<j} \operatorname{E} A_{ij}\, \delta_{g(i)=g(j)}$ can be substituted into Eq. (55). We now change from a relative error term to an absolute error. We will show in Step 6 below that the resulting remainder is absorbed into the error term $\epsilon$ defined in Eq. (54); this yields Eq. (58).

Step 2: In this step we focus on the penultimate term in Eq. (58). We appeal to a Taylor expansion of $(\|d\|_1 / \operatorname{E}\|d\|_1)^{-1/2} = f(x) = x^{-1/2}$ at $1$, and then control the remainder using Chebyshev's inequality. As a consequence, we obtain from Assumption 4 ($\operatorname{Var} A_{ij} = \Theta(\operatorname{E} A_{ij})$) the expansion in Eq. (59). We will show in Step 6 below that the corresponding remainder involving $\|\pi^{g(j),j}\|_1$ is likewise absorbed into $\epsilon$. Continuing Eq. (59), we arrive at Eq. (61).

Step 3: From Chebyshev's inequality and Assumption 4, we control the fluctuations of $d_j$ about its mean. Inserting this result into the second (i.e., lower-order) term of the Taylor expansion in Eq. (61), and applying Chebyshev's inequality and then Assumption 4 once more, we obtain Eq. (65).

Step 4: We define the non-random factors $\beta_j$ and $\alpha_j$ as in Eqs. (52) and (53). Combining the results from Eqs. (58) and (65), we may rewrite $Q$ in terms of $\alpha_j$ and $\beta_j$. After centering $d^{w}_j$ and $d^{b}_j$ about their respective means, we obtain Eq. (66).

Step 5: We now address the non-random terms in modularity, treating the non-random terms in the two lines of Eq. (66) separately as terms a) and b) (Eq. (67)). Term a): From the definitions of $\alpha_j$ and $\beta_j$, we obtain Eq. (68). We know from Eq. (18) that Assumption 1 implies $\|\pi\|_2^2 / \|\pi\|_1^2 = O(1/n)$. As a consequence, we can apply a convergent Taylor expansion and express Eq. (68) as in Eqs. (70) and (71). We identify the first terms in Eqs. (70) and (71) as the terms of leading order. We will show in Step 6 that the remaining terms can be absorbed into the error term $\epsilon$ defined in Eq. (54). Finally, considering the leading-order terms in Eqs. (70) and (71), Eq. (73) follows from a symmetrization identity over the pairs $i \neq j$. We may then combine terms a) and b) using Eqs. (67) and (73), whence Eq. (74) follows. In order to gain interpretability, we rearrange the term a) + b) even further; we will show in Step 6 that the remainder incurred in Eq. (75) is negligible. Recall the definition of $b$. Then, as a consequence of Eqs. (74) and (75), we obtain Eq. (76). Inserting the results from Eq. (76) into Eq. (66), and provided that all error terms are controlled (see Step 6 below), we obtain the result of this lemma.

Step 6: We now define and address the five error terms cited above; we call these $\epsilon^{(1)}, \epsilon^{(2)}, \ldots, \epsilon^{(5)}$.
As a consequence, we conclude the required result of Lemma E.1. We now derive the asymptotic distribution of modularity $Q$. Recalling the definitions of $\alpha$ and $\beta$ in Eqs. (52) and (53), we define a sequence of random variables $X_n$ accordingly. In Lemma E.2 below we show the asymptotic behavior of $X_n$; the Lemma parallels Lemma 2 in the main text.
Lemma E.2. Consider Assumptions 1-5, and suppose that the number $K$ of communities grows strictly more slowly than $n$, so that $K/n \to 0$. Then, as $n \to \infty$, $X_n$ converges in distribution to a $\mathrm{Normal}(0, 1)$ random variable.

Proof. First we write $X_n$ as a sum of independent, zero-mean random variables. To apply the Lindeberg-Feller central limit theorem to this sum, we show two things: 1. that $\operatorname{Var}(c_{ij} A_{ij}) < \infty$; and 2. that the Lyapunov condition for exponent 1 is satisfied. From Assumption 1 and Eq. (17), we know that $\|\pi\|_2^2 / \|\pi\|_1^2 \le \max_i \pi_i \|\pi\|_1 / \|\pi\|_1^2 = O(1/n)$. Hence, we can apply a convergent Taylor expansion to $f(x) = (1 - x)^{-\alpha}$, $\alpha = 1/2, 1$, at $x = 0$. The first term in Eq. (93) is $O(1)$, and thus we conclude that $c_{ij} = O(1)$. This in turn allows us to combine the relative and additive error terms. Furthermore, we see that $c_{ij}$ is, up to an additive error term of order at most $1/n$, a function only of $g(i)$ and $g(j)$. We are now ready to address the two conditions sufficient for the Lindeberg-Feller central limit theorem. Substituting $a_k$ for $\|\pi^{k,\emptyset}\|_1$ in Eq. (95) (so that $\|a\|_1 = \|\pi\|_1$), we obtain
$$\sum_{k=1}^{K} \sum_{t=1}^{K} c_{tk}^2\, \Theta(a_k a_t) = \sum_{k=1}^{K} \sum_{t=1}^{K} \Big( \delta_{k=t} + 2\,\delta_{k=t} B + B^2 + O\big(\tfrac{1}{n}\big) \Big)\, \Theta(a_k a_t). \tag{96}$$
We now address the two terms on the right-hand side of Eq. (96) separately.