Abstract

Power-law distributions occur in many situations of scientific interest and have significant consequences for our understanding of natural and man-made phenomena. Unfortunately, the detection and characterization of power laws is complicated by the large fluctuations that occur in the tail of the distribution—the part of the distribution representing large but rare events—and by the difficulty of identifying the range over which power-law behavior holds. Commonly used methods for analyzing power-law data, such as least-squares fitting, can produce substantially inaccurate estimates of parameters for power-law distributions, and even in cases where such methods return accurate answers they are still unsatisfactory because they give no indication of whether the data obey a power law at all. Here we present a principled statistical framework for discerning and quantifying power-law behavior in empirical data. Our approach combines maximum-likelihood fitting methods with goodness-of-fit tests based on the Kolmogorov–Smirnov (KS) statistic and likelihood ratios. We evaluate the effectiveness of the approach with tests on synthetic data and give critical comparisons to previous approaches. We also apply the proposed methods to twenty-four real-world data sets from a range of different disciplines, each of which has been conjectured to follow a power-law distribution. In some cases we find these conjectures to be consistent with the data, while in others the power law is ruled out.
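As a rough illustration of the two ingredients the abstract names, maximum-likelihood fitting and the KS statistic, the following is a minimal sketch rather than the authors' implementation. It assumes continuous data and a lower cutoff `xmin` that is already known, so it sidesteps the difficulty, noted above, of identifying the range over which power-law behavior holds. It also omits the likelihood-ratio comparison. The function names `fit_alpha`, `ks_distance`, and `sample_power_law` are hypothetical, chosen only for this example.

```python
import math
import random


def fit_alpha(data, xmin):
    """Continuous power-law MLE for the tail x >= xmin (xmin assumed known)."""
    tail = [x for x in data if x >= xmin]
    n = len(tail)
    log_sum = sum(math.log(x / xmin) for x in tail)
    # Closed-form maximum-likelihood estimate of the scaling exponent.
    return 1.0 + n / log_sum, n


def ks_distance(data, xmin, alpha):
    """Largest gap between the empirical and fitted CDFs on the tail."""
    tail = sorted(x for x in data if x >= xmin)
    n = len(tail)
    d = 0.0
    for i, x in enumerate(tail):
        model_cdf = 1.0 - (x / xmin) ** (1.0 - alpha)
        # The empirical CDF steps from i/n to (i+1)/n at x; check both sides.
        d = max(d, abs(model_cdf - i / n), abs((i + 1) / n - model_cdf))
    return d


def sample_power_law(n, xmin, alpha, rng):
    """Inverse-transform sampling from a continuous power law (synthetic data)."""
    return [xmin * (1.0 - rng.random()) ** (-1.0 / (alpha - 1.0)) for _ in range(n)]


# Synthetic check: draw from a known power law, then recover its exponent.
rng = random.Random(0)
data = sample_power_law(5000, 1.0, 2.5, rng)
alpha_hat, n_tail = fit_alpha(data, 1.0)
d_stat = ks_distance(data, 1.0, alpha_hat)
print(f"alpha_hat = {alpha_hat:.3f}, n_tail = {n_tail}, KS distance = {d_stat:.4f}")
```

On synthetic data of this kind the estimate should land close to the true exponent and the KS distance should be small. A small distance alone does not establish that real data follow a power law, which is why the abstract pairs fitting with formal goodness-of-fit tests.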

MSC codes

  1. 62-07
  2. 62P99
  3. 65C05
  4. 62F99

Keywords

  1. power-law distributions
  2. Pareto
  3. Zipf
  4. maximum likelihood
  5. heavy-tailed distributions
  6. likelihood ratio test
  7. model selection

Information

Published In

SIAM Review
Pages: 661 - 703
ISSN (online): 1095-7200

History

Submitted: 2 December 2007
Accepted: 2 February 2009
Published online: 6 November 2009
