First-order optimization methods, which rely only on gradient information, are commonly used in diverse machine learning (ML) applications, owing to their simplicity of implementation and low per-iteration computational and storage costs. However, they suffer from significant disadvantages; most notably, their performance degrades with increasing problem ill-conditioning. Furthermore, they often involve a large number of hyperparameters, and are notoriously sensitive to choices such as the step-size. By incorporating additional information from the Hessian, second-order methods have been shown to be resilient to many such adversarial effects. However, these advantages come at the expense of higher per-iteration costs, which, in “big data” regimes, can be computationally prohibitive.
In this paper, we show that, contrary to conventional belief, second-order methods, when designed suitably, can be much more efficient than first-order alternatives for large-scale ML applications. In convex settings, we show that variants of the classical Newton's method, in which the Hessian and/or gradient are randomly sub-sampled, coupled with efficient GPU implementations, far outperform state-of-the-art implementations of existing techniques in popular ML software packages such as TensorFlow. We show that our proposed methods (i) achieve better generalization errors in significantly lower wall-clock time, often orders of magnitude faster than first-order alternatives (in TensorFlow), and (ii) offer a significantly smaller (and more easily parameterized) hyperparameter space, making our methods highly robust.
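To make the sub-sampling idea concrete, below is a minimal sketch of one sub-sampled Newton step in Python/NumPy. The function names, sampling scheme, damping term, and fixed step-size are illustrative assumptions for exposition, not the exact algorithms studied in this paper.

```python
import numpy as np

def subsampled_newton_step(w, X, y, grad_fn, hess_fn, sample_size,
                           reg=1e-4, step=1.0, rng=None):
    """Illustrative sub-sampled Newton step (not the paper's exact method).

    grad_fn(w, X_batch, y_batch) -> gradient estimate, shape (d,)
    hess_fn(w, X_batch, y_batch) -> Hessian estimate, shape (d, d)
    Here the full gradient is used and only the Hessian is sub-sampled.
    """
    rng = np.random.default_rng() if rng is None else rng
    n, d = X.shape

    # Gradient on the full data set (could itself be sub-sampled).
    g = grad_fn(w, X, y)

    # Hessian estimated on a uniformly sampled mini-batch, with damping
    # added for numerical stability on ill-conditioned problems.
    idx = rng.choice(n, size=min(sample_size, n), replace=False)
    H = hess_fn(w, X[idx], y[idx]) + reg * np.eye(d)

    # Newton direction: solve H p = g rather than forming the inverse.
    p = np.linalg.solve(H, g)
    return w - step * p
```

In a practical large-scale implementation, the linear system would typically be solved inexactly (e.g., by conjugate gradient) and the step-size chosen adaptively; the sketch above is only meant to illustrate how sub-sampling reduces the per-iteration cost of forming curvature information.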