10 Simple Rules for Designing Learning Machines
When designing an estimator (or a learner, as machine learning people say), there are a number of desiderata to consider. I believe the following is a nearly MECE list (mutually exclusive and collectively exhaustive). However, if you have other desiderata that I have missed, please let me know in the comments. Recall that an estimator takes data as input, and outputs an estimate. Below is my list:
- Sufficiency: The estimate must be sufficient to answer the underlying/motivating question. For example, if the question is about the average height of a population, the estimate must be a positive number, not a vector, not an image, etc. This is related to validity, as well as the classical definition of statistical sufficiency, in that we say that if an estimator is estimating a sufficient statistic, then it is a sufficient estimator. Note, however, that sufficient estimators do not have to estimate sufficient statistics. To evaluate the degree of sufficiency of an estimator for a given problem, one determines how “close” the measurement and action spaces of the estimator are to the desired measurement and action spaces. For example, the action space may be real numbers, but an estimator may only provide integers, which means that it is not quite perfectly sufficient, but it may still do the trick.
- Consistency: We prefer, under the model assumptions, for any finite sample size, that the expected estimate is equal to the assumed “truth”, i.e., that the estimator is unbiased. If this property holds asymptotically, the estimator is said to be consistent. For random variables (and estimates are random variables, since they are functions of the data, which are random variables), there are many different but related notions of convergence, so the extent to which an estimator is consistent is determined by the sense in which it converges to the truth.
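To make the unbiased-versus-consistent distinction concrete, here is a minimal simulation sketch (my own example, assuming a Gaussian model): the plug-in (maximum likelihood) variance estimator is biased at every finite sample size but converges to the truth, while the Bessel-corrected estimator is unbiased at every sample size.

```python
import numpy as np

rng = np.random.default_rng(0)
true_var = 4.0  # variance of the assumed Gaussian "truth"

for n in (10, 100, 1000):
    # Draw many datasets of size n to Monte-Carlo-approximate E[estimate].
    samples = rng.normal(loc=0.0, scale=np.sqrt(true_var), size=(10000, n))
    mle_var = samples.var(axis=1, ddof=0)       # plug-in (MLE) variance: biased for finite n
    unbiased_var = samples.var(axis=1, ddof=1)  # Bessel-corrected variance: unbiased at every n
    print(f"n={n:5d}  E[MLE variance]={mle_var.mean():.3f}  "
          f"E[unbiased variance]={unbiased_var.mean():.3f}  truth={true_var}")
```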
- Efficiency: Assuming that the estimator converges to something, all else being equal, we desire that it converges “quickly”. The typical metric of statistical efficiency is the variance (for a scalar-valued estimate), or the Fisher information more generally. Ideally the estimator is efficient, meaning that, under the model assumptions, it achieves the minimal possible variance. To a first-order approximation, we often simply indicate whether the convergence rate is polynomial, polylogarithmic, or exponential. Note that efficiency can be computed with respect to the sample size, as well as the dimensionality of the data (both the “intrinsic” and the “extrinsic” dimensionality). Note also that there are many different notions of convergence of random variables, and estimators are indeed random variables (because their inputs are random variables). Because perfect efficiency is available only under the simplest models, relative efficiency is often more important in practice. Probably approximately correct (PAC) learning theory, as described in Foundations of Machine Learning, is largely focused on finding estimators that efficiently (i.e., using a small number of samples) obtain low error with high probability.
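As an illustrative sketch of relative efficiency (my example, not one from the list above): under a Gaussian model both the sample mean and the sample median converge to the true location, but the mean does so with smaller variance; their variance ratio approaches the classical 2/π ≈ 0.64.

```python
import numpy as np

rng = np.random.default_rng(0)
n, reps = 200, 20000

# Repeatedly draw Gaussian samples and estimate the location two ways.
x = rng.normal(loc=0.0, scale=1.0, size=(reps, n))
var_mean = x.mean(axis=1).var()
var_median = np.median(x, axis=1).var()

print(f"variance of the sample mean:   {var_mean:.5f}")   # roughly 1/n
print(f"variance of the sample median: {var_median:.5f}")  # roughly (pi/2)/n
print(f"relative efficiency (mean/median): {var_mean / var_median:.2f}")  # ~ 2/pi
```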
- Uncertainty: Point estimates (for example, the maximum likelihood estimate) are often of primary concern, but uncertainty intervals around said point estimates are often required to use the estimates effectively. Point estimates of confidence intervals and densities are also technically point estimates; however, such estimates fundamentally incorporate a notion of uncertainty, so they satisfy this desideratum. We note, however, that while some estimators have uncertainty estimates with theoretical guarantees, those guarantees depend strongly on the model assumptions. To evaluate an estimator in terms of uncertainty, one can determine the strength of the theoretical claims associated with its estimates of uncertainty. For example, certain manifold learning and deep learning techniques lack any theoretical guarantees around uncertainty. Certain methods of Bayesian inference (such as Markov chain Monte Carlo) have strong theoretical guarantees, but those guarantees require infinitely many samples, and provide limited (if any) guarantees for a particular number of samples for all but the simplest problems. We desire estimators with strong guarantees for any finite number of computational operations.
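One generic way to attach uncertainty to a point estimate, sketched below with a nonparametric bootstrap (a standard technique, not one prescribed by this list), is to resample the data, re-estimate, and report percentiles; its guarantees are themselves asymptotic and assumption-dependent, which is exactly the caveat above.

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.exponential(scale=2.0, size=500)  # stand-in for an observed, skewed sample

# Nonparametric bootstrap: resample with replacement, re-estimate the mean,
# and take percentiles of the bootstrap distribution as an uncertainty interval.
boot_means = np.array([
    rng.choice(data, size=data.size, replace=True).mean()
    for _ in range(5000)
])
lower, upper = np.percentile(boot_means, [2.5, 97.5])
print(f"point estimate of the mean: {data.mean():.3f}")
print(f"95% bootstrap interval:     ({lower:.3f}, {upper:.3f})")
```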
- Complexity: The “optimal” estimate, given a dataset, might be quite simple or quite complex. We desire estimators that can flexibly adapt to the “best” complexity, given the context. For example, semiparametric estimators can be arbitrarily complex, yet still achieve parametric rates of convergence.
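One common way to let an estimator adapt its complexity to the data, shown here only as an illustrative scikit-learn sketch (and not specifically a semiparametric method), is to select the complexity parameter, here a polynomial degree, by cross-validation.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(x).ravel() + rng.normal(scale=0.3, size=200)  # the truth is not a polynomial

# Let the data choose the model complexity (polynomial degree) by cross-validation.
scores = {}
for degree in range(1, 11):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    scores[degree] = cross_val_score(
        model, x, y, cv=5, scoring="neg_mean_squared_error"
    ).mean()

best = max(scores, key=scores.get)
print(f"cross-validation selected degree {best}")
```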
- Stability: Statisticians often quote George Box’s quip “all models are wrong”, which means that any assumptions we make about our data (such as independence) are not exactly true in the real world. Given that any theoretical claim one can make holds only under a set of assumptions, it is desirable to be able to show that an estimator’s performance is robust to “perturbations” of those assumptions. There are a number of measures of robustness, including the breakdown point and the influence function. Another simple measure of robustness is: what kinds of perturbations is the estimator invariant to? For example, is it invariant to translation, scale, shear, affine, or monotonic transformations? Nearly everything is invariant to translation, but fewer things are invariant to the other kinds of transformations. Some additional example perturbations that one may desire robustness against include (see the sketch after this list):
- data were assumed Gaussian, but the true distribution has fat tails,
- there are a set of outliers, sampled from some other distribution,
- the model assumed two classes, but there were actually three, and,
- data were not sampled independently.
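Here is the promised sketch of the outlier perturbation above (my own minimal example, assuming a Gaussian base distribution): a small fraction of gross outliers barely moves the sample median, which has a high breakdown point, but drags the sample mean well off target.

```python
import numpy as np

rng = np.random.default_rng(0)

# A nominally Gaussian sample, and the same sample contaminated with a few
# gross outliers drawn from "some other distribution", as in the list above.
clean = rng.normal(loc=0.0, scale=1.0, size=1000)
contaminated = np.concatenate([clean, rng.normal(loc=50.0, scale=1.0, size=20)])

for name, sample in [("clean", clean), ("contaminated", contaminated)]:
    print(f"{name:13s}  mean={sample.mean():+.3f}  median={np.median(sample):+.3f}")
```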
- Scalability: If an estimator requires an exponential amount of storage or computation, as a function of sample size or dimensionality, then it can typically only be applied to extremely small datasets. Similarly, if the data are relatively large (meaning that it takes “a while” to estimate), then estimation can often be sped up by a parallel implementation. However, the “parallelizability” of estimators can vary greatly. The theoretical space and time complexity of an estimator is typically written as a function of sample size, dimensionality (either intrinsic or extrinsic), and number of parallel threads.
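A quick-and-dirty way to check scalability empirically, sketched below with arbitrarily chosen sizes, is to time an estimator’s core computation as the sample size doubles: a linear-time statistic barely notices, while an n-by-n Gram (kernel) matrix grows quadratically in both time and memory.

```python
import time
import numpy as np

rng = np.random.default_rng(0)

# Time two computations as n doubles: a sample mean (linear in n)
# versus an n-by-n Gram/kernel matrix (quadratic in n).
for n in (500, 1000, 2000, 4000):
    x = rng.normal(size=(n, 20))

    t0 = time.perf_counter()
    _ = x.mean(axis=0)          # O(n d) time, O(d) extra space
    t_linear = time.perf_counter() - t0

    t0 = time.perf_counter()
    _ = x @ x.T                 # O(n^2 d) time, O(n^2) space
    t_quadratic = time.perf_counter() - t0

    print(f"n={n:5d}  mean: {t_linear * 1e3:7.3f} ms   Gram matrix: {t_quadratic * 1e3:8.2f} ms")
```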
- Explainability: In many applications, including scientific inquiry, medical practice, and law, explainability and/or interpretability are crucial. While neither of these terms has a generally accepted definition, certain general guidelines have been established. First, the fewer features the estimator uses, the more interpretable it is. Second, passing all the features through a complex data transformation, such as a kernel, is relatively uninterpretable. Third, when using decision trees, shallower trees are more interpretable, and when using decision forests, fewer trees are more interpretable. In that sense, in certain contexts, relative explainability can be explicitly quantified.
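As a rough quantification of the tree-based guideline above (an illustrative scikit-learn sketch, with dataset and settings chosen arbitrarily), one can compare the number of leaves and the rule printout of a shallow tree against an unconstrained one.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Compare a shallow tree (few, readable rules) with an unconstrained deep tree.
shallow = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X_train, y_train)
deep = DecisionTreeClassifier(max_depth=None, random_state=0).fit(X_train, y_train)

for name, tree in [("shallow (depth 2)", shallow), ("deep (unlimited)", deep)]:
    print(f"{name:18s} leaves={tree.get_n_leaves():3d}  "
          f"test accuracy={tree.score(X_test, y_test):.3f}")

# The shallow tree can be printed as a handful of human-readable if/then rules.
print(export_text(shallow))
```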
- Automaticity: In most complex settings, estimators have a number of parameters themselves. For example, when running k-means, one must choose the number of clusters, and when running PCA, one must choose the number of dimensions. The ease and expediency with which one can choose/learn “good” parameter values for a given setting is crucial in real data applications. When the most computationally expensive “process” of an estimator can be nested, leveraging the answers computed for past parameter values, this can greatly accelerate the acquisition of “good” results. For example, in PCA, by projecting into d dimensions, for some d larger than you think you’ll need, you have also projected into every d' ≤ d dimensions, and therefore do not need to explicitly compute each of those smaller projections again. This is in contrast to non-negative matrix factorization, where the resulting matrices are not nested. This desideratum is the least formal of the bunch, and the most difficult to quantify.
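The PCA nestedness mentioned above can be checked directly; the following minimal scikit-learn sketch (my own, with arbitrary synthetic data) fits a rank-d PCA once and confirms that its leading d' components coincide with a separate rank-d' fit, which is precisely what NMF does not give you.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 30))

# Fit PCA once with more components than you expect to need; the leading d'
# components of that single fit match a separate rank-d' fit, so smaller
# projections come "for free" by slicing rather than by re-running the estimator.
d, d_small = 10, 3
big = PCA(n_components=d).fit(X)
small = PCA(n_components=d_small).fit(X)

# Compare the two component sets, allowing for arbitrary sign flips.
overlap = np.abs(big.components_[:d_small] @ small.components_.T)
print("leading components agree:", np.allclose(overlap, np.eye(d_small), atol=1e-6))
```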
- Simplicity: This desideratum is a bit “meta”. Simplicity is something of an over-arching goal: the estimator should have a simple geometric intuition (which admits theoretical analysis and generalization), a simple algorithm with few algorithmic parameters (or tuning knobs), and a straightforward implementation.