Decoding from Pooled Data: Sharp Information-Theoretic Bounds
Consider a population consisting of $n$ individuals, each of whom has one of $d$ types (e.g., blood types, in which case $d=4$). We are allowed to query this population by specifying a subset of it, and in response we observe a noiseless histogram (a $d$-dimensional vector of counts) of types of the pooled individuals. This measurement model arises in practical situations such as pooling of genetic data and may also be motivated by privacy considerations. We are interested in the number of queries one needs to unambiguously determine the type of each individual. We study this information-theoretic question under the random, dense setting where in each query, a random subset of individuals of size proportional to $n$ is chosen. This makes the problem a particular example of a random constraint satisfaction problem (CSP) with a “planted” solution. We establish upper and lower bounds on the minimum number of queries $m$ such that there is no solution other than the planted one with probability tending to one as $n\to\infty$. The bounds are nearly matching. Our proof relies on the computation of the exact “annealed free energy” of this model in the thermodynamic limit, which corresponds to an exponential rate of decay of the expected number of solutions to this planted CSP. As a by-product of the analysis, we derive an identity of independent interest relating the Gaussian integral over the space of Eulerian flows of a graph to its spanning tree polynomial.
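The measurement model can be made concrete with a small simulation. The following is a minimal sketch, not the paper's method: it assumes each pool is a uniformly random subset of fixed size `round(alpha * n)` (the abstract only states that the size is proportional to $n$), and it checks uniqueness of the planted assignment by exhaustive search, which is feasible only for tiny $n$. The names `histogram`, `random_pools`, `consistent_assignments`, and the parameter `alpha` are illustrative and do not come from the source.

```python
import itertools
import random


def histogram(types, pool, d):
    """Return the d-dimensional vector of type counts for the pooled individuals."""
    counts = [0] * d
    for i in pool:
        counts[types[i]] += 1
    return tuple(counts)


def random_pools(n, m, alpha, rng):
    """Draw m pools, each a uniformly random subset of size round(alpha * n).

    Assumption: fixed-size uniform subsets; the source only says the pool
    size is proportional to n.
    """
    size = round(alpha * n)
    return [rng.sample(range(n), size) for _ in range(m)]


def consistent_assignments(n, d, pools, observed):
    """Brute-force every assignment in {0..d-1}^n that matches all histograms."""
    return [
        cand
        for cand in itertools.product(range(d), repeat=n)
        if all(histogram(cand, p, d) == h for p, h in zip(pools, observed))
    ]


rng = random.Random(0)
n, d, alpha = 8, 2, 0.5

# The planted solution: the true (hidden) type of each individual.
planted = tuple(rng.randrange(d) for _ in range(n))

# Draw the pools once and reveal them incrementally, so that each larger
# value of m only adds constraints to the previous ones.
all_pools = random_pools(n, 16, alpha, rng)
all_observed = [histogram(planted, p, d) for p in all_pools]

counts_by_m = []
for m in (1, 4, 16):
    solutions = consistent_assignments(n, d, all_pools[:m], all_observed[:m])
    counts_by_m.append(len(solutions))
    print(f"m={m}: {len(solutions)} consistent assignment(s), "
          f"planted included: {planted in solutions}")
```

Because the planted assignment always satisfies every query, the solution count is at least one; decoding is unambiguous exactly when it equals one. The question studied here is how large $m$ must be for that to hold with probability tending to one as $n\to\infty$.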