In this post I highlight some of the most important books and articles for our group, and likely for other groups as well. This list is certainly imperfect, so please comment with books/articles that you think are missing.

Books as references:

  1. Exploratory Data Analysis (Tukey)
  2. Testing Statistical Hypotheses (Lehmann & Romano)
  3. Theory of Point Estimation (Lehmann & Casella)
  4. Convex Optimization (Boyd & Vandenberghe)
  5. Elements of Information Theory (Cover & Thomas)
  6. The Elements of Statistical Learning (Hastie, Tibshirani & Friedman)
  7. Robust Statistics (Huber)
  8. Density Estimation (Silverman)
  9. Storytelling with Data (Knaflic): Read this before making any publication-quality figures
  10. Writing Science (Schimel): Read this before trying to write a paper

Statistics / pattern recognition / data science:

These articles are not the most cited, nor necessarily the first on their topics. However, each does an excellent job of making a specific point that I believe is important to remember when doing data science.

  1. Model-Based Clustering (Fraley & Raftery, 2002): Explains the relationship between density estimation (via Gaussian mixture modeling) and clustering; see the short sketch after this list. Still the best approach to clustering medium-dimensional data, imho
  2. Model selection is arbitrary (George & Foster, 2000): Demonstrates that all the different model selection criteria are in fact arbitrary special cases of a general prior with two parameters
  3. Energy statistics review (Rizzo & Székely, 2016): Explains energy statistics, which can perform a variety of inference tasks on high-dimensional data
  4. Random Forests (Breiman, 2001): Introduced both the practice and the theory of random forests; still the best “black box” machine learning algorithm
  5. GMRA (Allard, Chen & Maggioni, 2012): Introduces and explains relationships between dictionary learning and machine learning, with theory and implementation to boot
  6. MLE for misspecified models (White, 1982): Shows that under misspecification, the MLE converges to the distribution in the model class that minimizes the KL divergence from the truth
  7. Manifold learning is “just” kernel PCA (Ham et al., 2004): Links several important manifold learning techniques by showing they are special cases of kernel PCA
  8. kNN: Proves that k-nearest neighbor regression is universally consistent
  9. Grazing goat starves in high dimensions: Shows that our geometric intuition in high dimensions is way off
  10. Approximate nearest neighbor review: A well-written discussion of the original LSH paper and subsequent work, demonstrating that randomization is a very useful approximation tool
  11. MSE doesn’t work in >2 dimensions (Stein, 1960): The geometric reason that sparsity and other forms of regularization can help in finite samples
  12. Lasso doesn’t work, even in the true model: Shows that the lasso path includes many false positives, even when the features are uncorrelated and the signal is sparse, and therefore lasso selection shouldn’t be blindly trusted
  13. Generalized linear models: Showed that one can estimate many reasonable nonlinear regression functions with a sum of very simple nonlinear functions; though not used as much in practice these days, this is still a very important concept
  14. Statistical pattern recognition (Jain et al., 2000): A wonderful review; shows the Trunk example, which demonstrates that the optimal parameter/model with finite data is not necessarily the true one
  15. ImageNet classification via deep learning: Shows that with lots of training data and FLOPS, deep learning dramatically outperformed previous methods on image classification
  16. Statistical modeling: The two cultures (with comments and a rejoinder by the author): Explains the difference between machine learning and statistical modeling, from before the term “machine learning” was cool
  17. Classifier Technology and the Illusion of Progress: In particular, it includes this passage about Hoadley: “Hoadley, in the same discussion, coined a phrase called the ‘ping-pong theorem.’ This theorem says that if we revealed to Professor Breiman the performance of our best model and gave him our data, then he could develop an algorithmic model using random forests, which would outperform our model. But if he revealed to us the performance of his model, then we could develop a segmented scorecard, which would outperform his model.”
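
To make item 1 concrete, here is a minimal sketch of model-based clustering in the spirit of Fraley & Raftery: fit Gaussian mixtures with several numbers of components, select one by BIC, and read cluster labels off the chosen density estimate. This is not their mclust implementation; it assumes scikit-learn, and the data are synthetic, purely for illustration.

```python
# Minimal sketch of model-based clustering (cf. Fraley & Raftery, 2002):
# density estimation via Gaussian mixtures + BIC-based model selection.
# Assumes scikit-learn; the data are synthetic, purely for illustration.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Two well-separated Gaussian blobs in 5 dimensions.
X = np.vstack([
    rng.normal(loc=0.0, scale=1.0, size=(200, 5)),
    rng.normal(loc=4.0, scale=1.0, size=(200, 5)),
])

# Fit mixtures with 1..6 components and keep the one with the lowest BIC.
fits = [GaussianMixture(n_components=k, covariance_type="full", random_state=0).fit(X)
        for k in range(1, 7)]
best = min(fits, key=lambda gmm: gmm.bic(X))

print("chosen number of components:", best.n_components)
labels = best.predict(X)  # cluster assignments from the selected mixture
```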

Other people’s neurodata collection:

  1. Array Tomography: for collecting multispectral 3D gene expression maps; see also Knowing a synapse when you see one for more discussion
  2. CLARITY and iDisco: for seeing whole brains with fluorescence without physically sectioning them
  3. Serial EM: for seeing large volumes of nanoscale neuroanatomy
  4. Multimodal MRI: for seeing in vivo, non-invasive brain structure and function at millimeter scale
  5. Calcium imaging: for seeing whole-brain activity in zebrafish, or a whole bunch of activity in other animals
  6. Behavior: for characterizing “natural” behaviors and linking them to neurons
  7. Biomarkers in psychiatry: states that clinical psychiatrists do not yet use any brain-imaging-based biomarkers for any clinical diagnosis (as of July 2012)

Our work:

  1. Incommensurability phenomenon: Shows that if you run PCA twice on two different samples of noisy data, you can get arbitrarily different results
  2. You say, I say: Shows that graph invariants are test statistics, and therefore, we can determine which invariants are optimal for any test by thinking about them in these terms
  3. Graph Matching: Relax at your own risk: Shows that we can use a convex analytic approximation to initialize a non-convex numerical approximation, and get good results on an NP-hard problem, even when the convex approximation is probably bad
  4. Consistency of ASE: Shows that adjacency spectral embedding yields consistent estimators of latent positions for random graph models, so we can use those estimates in typical machine learning algorithms for subsequent inference; a rough sketch appears after this list
  5. HSBM: One of the best statistical models of large graphs that we know of
  6. MGC: Shows that we can use and estimate locality for certain exploitation tasks, improving upon previous dependence tests both theoretically and empirically
  7. FlashGraph: Demonstrates the power of the semi-external memory model to build algorithms on single-node multicore machines that beat cluster implementations with orders of magnitude more resources
  8. Open Connectome paper: Explains the basis of our spatial database
  9. mi-lddmm: The current best approach for nonlinear registration, imho
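
As a rough illustration of item 4 (not our reference implementation), adjacency spectral embedding can be sketched in a few lines of numpy: take the d largest-magnitude eigenpairs of the adjacency matrix and scale the eigenvectors by the square roots of the absolute eigenvalues; each row is then an estimated latent position that can be fed into standard clustering or classification. The function name and the toy two-block SBM below are made up for this sketch.

```python
# Rough sketch of adjacency spectral embedding (ASE) for an undirected graph.
# Not a reference implementation; a minimal numpy version for illustration.
import numpy as np

def adjacency_spectral_embedding(A, d):
    """Embed each node of the graph with (symmetric) adjacency matrix A into R^d."""
    vals, vecs = np.linalg.eigh(A)                     # eigendecomposition of A
    top = np.argsort(np.abs(vals))[::-1][:d]           # d largest eigenvalues in magnitude
    return vecs[:, top] * np.sqrt(np.abs(vals[top]))   # rows = estimated latent positions

# Toy example: a two-block stochastic block model.
rng = np.random.default_rng(0)
n, p_in, p_out = 100, 0.5, 0.1
block = np.repeat([0, 1], n // 2)
P = np.where(block[:, None] == block[None, :], p_in, p_out)
A = rng.binomial(1, P)
A = np.triu(A, 1)
A = A + A.T                                            # symmetric, no self-loops

Xhat = adjacency_spectral_embedding(A, d=2)            # feed the rows to k-means, etc.
print(Xhat.shape)
```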