<p><em>Bits and Brains (NeuroData: enabling data-driven neuroscience)</em></p>
<h2>9 Simple Rules to Write a Paper from Start to Finish</h2>
<p><em>2019-02-10, <a href="https://bitsandbrains.io/2019/02/10/how-to-write-a-paper">how-to-write-a-paper</a></em></p>
<p><strong>tl;dr</strong></p>
<ol>
<li><strong>main result</strong> write draft paper title and draft killer fig that visually makes the main point of your story</li>
<li><strong>outline</strong> write title, abstract, and outline based on <a href="https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1005619">how to structure a paper</a></li>
<li><strong>figures</strong> make the remaining figures, check that they follow <a href="https://bitsandbrains.io/2018/09/08/figures.html">figure guidelines</a></li>
<li><strong>draft</strong> flesh out details of paper.</li>
<li><strong>references</strong> add them and check that they are right</li>
<li><strong>wordsmithing</strong> check <a href="https://bitsandbrains.io/2018/10/14/paragraphs.html">paragraph guidelines</a>, <a href="https://bitsandbrains.io/2018/10/14/words.html">words guidelines</a>, and remove redundant words</li>
<li><strong>feedback</strong> send to co-authors for review & detailed critical feedback</li>
<li><strong>revision</strong> revise</li>
<li><strong>submission</strong> post to arxiv, submit to journal, tweet to world</li>
</ol>
<hr />
<p>Once you believe that you have completed work sufficient to write a peer-reviewed manuscript, follow the steps below in order. If you are simply writing an abstract, just do the section entitled “Outline and Abstract.”</p>
<ol>
<li>
<p><strong>Main Result</strong></p>
<ol>
<li>Write a one-sentence summary of your work (will become your <strong>title</strong>; ~5 min). This sentence describes the main take-home message / main result. It should not contain any words that most of your readership will be unfamiliar with. It should grab attention, and it should be fewer than 88 characters.</li>
<li>Also, draw/make the “killer fig” that makes the point as clearly and concisely as possible. Guidance for making paper quality figures is <a href="https://bitsandbrains.io/2018/09/08/figures.html">here</a>. (~5 min)</li>
</ol>
</li>
<li>
<p><strong>Outline and Abstract</strong></p>
<ol>
<li>Describe the other results, typically 3-5 additional figures or theorems. The goal of each of these is to support the main claim, for example, by further refining, adding controls, etc. Ideally, they are sequenced together in a logical chain, like a proof, each building on the next, to tell the story. (~1 hr)</li>
<li>Write a one-paragraph summary (will become your <strong>abstract</strong>; ~30 min). This will be about 250-300 words; more than 500 words is a page, not an abstract. To include:
<ol>
<li>Big opportunity sentence: what is the grandest opportunity that this work is addressing?</li>
<li>Specific opportunity: what opportunity specifically will this manuscript address?</li>
<li>Challenge sentence: what is hard about addressing this opportunity?</li>
<li>Gap sentence: what is currently missing?</li>
<li>Action sentence: what did you do to address the gap, overcome the challenge, and therefore meet the opportunity? It should provide the <em>key</em> intuition/insight, the magic that makes this work where others failed.</li>
<li>Resolution sentence: what changes for the reader now that you have met this challenge?</li>
</ol>
</li>
</ol>
</li>
<li><strong>Figures</strong>
<ol>
<li>Make a first draft of all the figures and tables with detailed captions (~ 1 week). Captions should each be about a paragraph long. At this point, the figures need not be “camera ready”, but should have all the main points made.</li>
<li>Get feedback on figures (~1 week). Show the figures to your colleagues who are not co-authors on your paper, do not show them the captions, and ask them to tell you what the main point of each figure is. If they don’t get it right on their first try, interrupt them, and ask them to go to the next figure. Then, spend another week updating the figures, and repeat.</li>
</ol>
</li>
<li>
<p><strong>Draft</strong></p>
<ol>
<li>
<p>Make all the figures and captions “camera ready” (~1 week). Consult this <a href="https://bitsandbrains.io/2018/09/08/figures.html">figure checklist</a> to confirm that they are.</p>
</li>
<li>Write a five paragraph <strong>intro</strong> (~1 hr). This will be structured as follows:
<ol>
<li>bulleted list of ~3-5 main factors that create an opportunity for your work, filtering from most general to most specific, and not including anything ancillary (~20 min)</li>
<li>bulleted list of the ~3-5 main challenges that must be overcome (~20 min)</li>
<li>1 sentence summary of the <strong>gap</strong>, that is, the key ingredient that is missing (~5 min)</li>
<li>2-3 sentence summary of what you did (~5 min)</li>
<li>2-3 sentence summary on how your work changes the world (~5 min)</li>
</ol>
</li>
<li>
<p>Outline the methods and results: this is a 1 sentence summary of every point in <a href="https://github.com/neurodata/checklists/blob/master/methods_paper.md">methods_paper</a> (~20 min)</p>
</li>
<li>
<p>Fill in the details of the methods and results (~ 1 week).</p>
</li>
<li>Write the discussion (~1 hr); it is not a summary, but rather includes:
<ol>
<li>bulleted list of previous related work (~20 min)</li>
<li>bulleted list of potential extensions (~20 min)</li>
</ol>
</li>
</ol>
</li>
<li>
<p><strong>References</strong> make sure you have sufficiently cited the literature to place your work in context. For conferences, it is typical to have about 1 page of citations (10-20). For journal articles, 30-50 is more typical. Recall that the authors of these papers are likely to be the reviewers and readers of this paper. So, it is important that you highlight all the important work, and say how great it is.</p>
</li>
<li><strong>Wordsmithing</strong>
<ol>
<li><strong>Paragraphs</strong> check that paragraphs follow <a href="https://bitsandbrains.io/2018/10/14/paragraphs.html">paragraph guidelines</a>.</li>
<li><strong>Words</strong> check that words follow <a href="https://bitsandbrains.io/2018/10/14/words.html">words guidelines</a>.</li>
<li><strong>shorten</strong> read the paper carefully, and remove any words that are unnecessary.</li>
</ol>
</li>
<li>
<p><strong>Feedback</strong> Get lots of feedback from more than one person in the community of potential readers of your published manuscript. Ask them to read it as if they are reviewing it for a journal, and to hold nothing back. Ask them to give you comments in one week. You are not beholden to them, but it would be wise to take their criticism seriously and improve the manuscript on the basis of their feedback.</p>
</li>
<li>
<p><strong>Revision</strong></p>
<ol>
<li>Update the abstract and introduction to match the final pre-feedback draft of the text (~1 day).</li>
<li>Revise manuscript addressing each and every one of their concerns (~1 week). This does not necessarily mean making new figures, rather, it might mean clarifying various points of confusion.</li>
<li>Do another round of feedback, giving your readers another week.</li>
</ol>
</li>
<li><strong>Submission</strong></li>
</ol>
<p>Finalize the manuscript (~1 wk), submit to a journal, make the code open source, make the data anonymized and open access, and post the report to arXiv or bioRxiv.</p>
<hr />
<p>If you follow the above plan, you will have a manuscript ready to submit two months after you start writing.</p>
<p><em>Joshua Vogelstein</em></p>
<h2>What We Do</h2>
<p><em>2019-02-10, <a href="https://bitsandbrains.io/2019/02/10/what-we-do">what-we-do</a></em></p>
<p>This blog post is our attempt to characterize what we, at <a href="https://neurodata.io/">NeuroData</a>, do. All of our actions are motivated by our mission, <code class="highlighter-rouge">to understand and improve intelligences</code>; the motivation behind our approach is described in our <a href="https://neurodata.io/about/">about page</a>. This post summarizes the bulk of our actual work, which can be divided into three complementary threads:</p>
<ol>
<li>big data systems</li>
<li>statistics / machine learning / artificial intelligence</li>
<li>applications</li>
</ol>
<h3 id="big-data-systems">Big Data Systems</h3>
<p>A big data system is a computational ecosystem, including hardware and software, designed to support analysis of “big data”, operationally defined as data too big to fit into a single machine’s working memory. The operations of any big data system include storing, interfacing, pipelining, and visualizing. Existing solutions have been inadequate to support efficient hypothesis generation and scientific discovery. We therefore build, extend, and deploy systems that generalize the functionality of existing systems to support scientific inquiry. We first developed the Open Connectome Project stack in 2011 to host the first ever big dataset in neuroscience, a 10 terabyte electron microscopy dataset from Davi Bock and Clay Reid (<a href="https://doi.org/10.1145/2484838.2484870">ocp</a>). As the data size and complexity increased, the development needs extended beyond our capabilities, so we began collaborating extensively with other teams, including scientists and engineers at <a href="https://ai.google/research/people/VirenJain">Google</a>, the <a href="https://alleninstitute.org/what-we-do/brain-science/">Allen Institute for Brain Science</a>, and <a href="https://www.janelia.org/">Janelia Research Campus</a>. This collaboration led to our existing open source, community-developed computational ecosystem (<a href="https://www.nature.com/articles/s41592-018-0181-1">ndcloud</a>). The core components of this work include <a href="https://www.biorxiv.org/content/10.1101/217745v1">bossDB</a> for big data storage, <a href="https://github.com/neurodata/ndwebtools">NDWebtools</a> for interfacing with the data, <a href="https://github.com/google/neuroglancer">neuroglancer</a> for visualizing, and various pipelines for different modalities, including <a href="https://neurodata.io/ndmg/">ndmg</a> for functional, structural, and diffusion magnetic resonance imaging and <a href="https://neurodata.io/reg/">reg</a> for registration of 2D and 3D volumes.
We continue to extend our collaboration network and ecosystem, to support an ever growing need for big data systems in brain sciences across scales and modalities, ranging from electron microscopy, to whole clear brains, to human and non-human magnetic resonance imaging.</p>
<h3 id="statistics--machine-learning--artificial-intelligence">Statistics / Machine Learning / Artificial Intelligence</h3>
<p>Statistics, machine learning, and artificial intelligence are terms that describe complementary approaches to solving overlapping sets of problems. Central to all of them is the existence of some data samples and a question; these approaches then build tools designed to learn an answer to the question from the data. Existing tools, however, have severe limitations that we must overcome in order to obtain answers to the questions of interest. First, raw data in neuroscience tends to be very high-dimensional (e.g., images can be petabytes), and exhibits many nonlinear relationships (e.g., the input/output function of a neuron). We have therefore developed a number of computational statistics tools for such settings. This includes state-of-the-art methods for dimensionality reduction (<a href="https://arxiv.org/abs/1709.01233">LOL</a>), classification (<a href="http://arxiv.org/abs/1506.03410">rerf</a>), time-series modeling (<a href="https://doi.org/10.1016/J.PATREC.2016.12.012">mr. sid</a>), clustering (<a href="https://arxiv.org/abs/1710.09859">eclust</a>), and hypothesis testing (<a href="https://elifesciences.org/articles/41690">mgc</a>). Second, the eventual representation of data most interesting to us is networks in the brain, or connectomes. We have therefore spent much of the last 10 years building foundational statistical estimators, theories, and algorithms for modeling populations of networks with graph, vertex, and edge attributes. A survey summarizing much of our work was recently published in JMLR (<a href="http://jmlr.org/papers/v18/17-448.html">rdpg</a>). We subsequently developed a python package that implements all of our theoretical developments (<a href="https://neurodata.io/graspy/">graspy</a>), and wrote a review article on our approach to modeling connectomes, called <a href="https://doi.org/10.1016/j.conb.2019.04.005"><code class="highlighter-rouge">connectal coding</code></a>.</p>
<h3 id="applications">Applications</h3>
<p>The potential applications of our work are widespread. We illustrate a couple of applications spanning our most relevant work, including model (non-human) systems and human variation. First, we characterized which sets of neurons were causally involved in which behaviors in <a href="https://doi.org/10.1126/science.1250298">larval Drosophila</a>. This led to an exploratory analysis of the mushroom body of the larval Drosophila, which we describe in detail in a <a href="http://arxiv.org/abs/1705.03297">technical report</a>, with further analysis available in our <a href="http://jmlr.org/papers/v18/17-448.html">JMLR survey paper</a> and our <a href="https://doi.org/10.1016/j.conb.2019.04.005">connectal coding paper</a>. Second, to characterize human variation, we developed a cloud pipeline for estimation and analysis of human connectomes (<a href="https://doi.org/10.1093/gigascience/gix013">sic</a>), and used it to quantify variability present within and across individuals and studies (<a href="https://doi.org/10.1101/188706">ndmg</a>).</p>
<h3 id="open-science">Open Science</h3>
<p>A key principle underlying our work is the democratization of science: we desire that anybody, regardless of resources, is able to both access and contribute to the cutting edge of scientific discovery. To that end, all of the work we do is open science, including both open source code and open access data. Over the last 8 years, our website, https://neurodata.io, has grown from about 100 unique visitors each week to >1,000 unique visitors each month. In the last calendar year, >13,000 unique visitors browsed the site. These visitors span every inhabited continent, including over 2,000 different cities (see image below). Over the 8 years, approximately 80,000 unique individuals have visited our site, over double the number of people at a typical Society for Neuroscience conference, suggesting that many non-neuroscientists have visited our site. We hope that those people visiting the site feel the same sense of awe and inspiration as we do, associated with unraveling the secrets of mental function in these beautiful images of brains.</p>
<p><img src="/assets/post_images/neurodata-visitors.png" alt="significance" /></p>
<p><em>Joshua Vogelstein</em></p>
<h2>Tips for Getting into a Top Graduate Program</h2>
<p><em>2018-10-21, <a href="https://bitsandbrains.io/2018/10/21/getting-into-grad-school">getting-into-grad-school</a></em></p>
<p>As faculty in the <a href="https://www.bme.jhu.edu/">Department of Biomedical Engineering</a> at Johns Hopkins University, the best BME department in the world in terms of both <a href="https://www.usnews.com/best-colleges/rankings/engineering-doctorate-biological-biomedical">undergraduate</a> and <a href="https://www.usnews.com/best-graduate-schools/top-engineering-schools/biomedical-rankings">graduate</a> rankings, I have learned what I and other faculty are looking for in applicants. Before getting into what the most important factors are, I believe it is important to understand our goals, which motivate those factors. From the perspective of the graduate admissions committee, our goal is to estimate whether we believe that BME@JHU is the best place for <em>you</em> to thrive and achieve your ultimate dreams. In other words, we try to ascertain whether the environment that we create at JHU will be maximally supportive of both your strengths and weaknesses. As it turns out, this does not necessarily mean that we accept the best students in some abstract sense (as defined by some metric); rather, we try to accept the students for whom we believe that <em>we</em> will be the best mentors.
Of course, this is a complicated objective function, and one for which we will most likely sometimes make errors. Nonetheless, it is our goal. To make such estimations, we look for the following:</p>
<ol>
<li>
<p>Research Experience: First and foremost, we are a research university. So, the best way for us to determine whether our research environment will support you to flourish is to understand your previous research experience, and in which settings you flourished more than others. Although successful research is difficult to quantify, research artifacts provide some data with which we can evaluate your achievements. Such artifacts include poster presentations, conference proceedings, journal publications, numerical packages, and even patents sometimes. If you are the first (or co-first) author on any of these, we typically assume that much of the work is yours, and thus first author research artifacts are most informative. Middle author works are also informative, especially if you clarify your role in the research in your personal statement. Note, however, strong research experience is <em>not</em> a pre-requisite for admission. Rather, it is an information-rich piece of data for us.</p>
</li>
<li>
<p>Grades: JHU is not just a research university; we are also a teaching university, and we take our teaching responsibilities quite seriously. Moreover, many of our graduate-level BME courses are also serious and time consuming. It is important to us that you perform well in them, because they provide the necessary background upon which our research programs are based. It is <em>not</em> important to us that you get straight A’s. Rather, we care that you perform well in the courses that will be the most relevant for your research during your PhD, typically quantitative and biology classes for us. It is also not crucial that you perform well in every semester. We understand that life happens, and certain things are more important than coursework (family, health, well-being, etc.). Finally, we appreciate that not everybody gets the same opportunities in high school, and therefore not everybody is equally prepared. Thus, the grades in the most recent semesters, in the most relevant courses, are most important. Aim for A’s in them, but GPA alone determines neither admission nor rejection.</p>
</li>
<li>
<p>Recommendations: While these do not come directly from you, they are quite important to us. BME@JHU is like a big extended family. We work closely with one another, sit near each other, have been doing so for a long time, and plan to continue doing so for many years to come. Therefore, our community is quite important to us, and our success comes largely from surrounding ourselves not just with the smartest people in the world, but more importantly, really good people. So, the recommendation letters are a way for us to get information about how pleasant it is to work with you. I particularly look for recommendations from other faculty with successful research programs, as they are the most informative with regards to what it takes to have a successful PhD. Recommendations from industry can be somewhat informative, but less so. In other words, the number of PhD students somebody has mentored matters in our assessment. In terms of content, we are looking for recommendations that say you are pleasant to work with, and amongst the best of their previous students along <em>some</em> dimensions, such as productivity, passion, drive, creativity, organization, etc. In other words, you excel in the kinds of personality traits that we think contribute to successful PhDs. Much like research experience and grades, a good or bad recommendation alone cannot determine your acceptance.</p>
</li>
<li>
<p>Personal Statement: Your personal statement is your opportunity to express yourself. The most important aspect of a personal statement for me is <strong>passion</strong>. Success in our field, I believe, is strongly correlated with passion. Even if that is not the case, it is more fun for me to work with people who are passionate about solving some problems. So, express yourself freely and passionately. And be specific. Find a few faculty members in the department that you are applying to, and write about what, in particular, you find most exciting about their work. In this way, we’ll be able to align your passions with ours in the review process. If you’ve reached out to any of the faculty, or anybody else associated with the department prior to application, mention it, and how it has informed your decision to apply. And don’t forget to spell/grammar check it, especially if you are a <em>native</em> English speaker.</p>
</li>
</ol>
<p>There are a few things that people invest a bunch of energy in that hardly matter at all. First is the GRE. Evidence is building that it is classist, racist, and sexist (see, for example, <a href="http://www.takepart.com/article/2015/11/07/gre-bias">here</a>). JHU will probably stop not just requiring them, but even allowing them. As it currently stands, only if somebody does quite poorly on the quantitative section of the GRE (say, below 70%) does his/her GRE score even come up for discussion. Second is fellowships. In general, we have not heard of them, and do not understand the criteria for winning them, who applies, or what is achieved. Therefore, we are typically unable to consider them in a meaningful fashion. I’ve literally never heard them come up in discussing any applicant, and I’ve now been privy to discussions of literally hundreds, maybe over 1,000, applicants across multiple different departments.</p>
<p>As a final note, my lab, as well as many other successful labs, is <em>always</em> accepting exceptional graduate students. The success of our labs depends on the success of excellent students, so we are always searching for and hoping to find people whose passions align with ours, and whose abilities either align with or complement our own. I hope this is helpful. If anybody disagrees with my assessment, or has other recommendations, or further questions, I’d love to hear from you in the comments.</p>
<p><em>Joshua Vogelstein</em></p>
<h2>11 Simple Rules for Releasing Data Science Tools</h2>
<p><em>2018-10-21, <a href="https://bitsandbrains.io/2018/10/21/numerical-packages">numerical-packages</a></em></p>
<p>These notes were co-written by myself and a number of other people, including <a href="http://gkiar.me/">Greg Kiar</a>, <a href="http://ericwb.me/">Eric Bridgeford</a>, <a href="https://www.mcgill.ca/qls/researchers/jb-poline">JB Poline</a>, and <a href="https://users.encs.concordia.ca/~tglatard/">Tristan Glatard</a>. Inspired by the FAIR Guiding Principles for scientific data management and stewardship (see <a href="https://www.nature.com/articles/sdata201618">here</a>), we devised the FIRM guidelines for scientific software, specifically numerical packages. The FIRM guidelines stipulate that anybody in the world should be able to <strong>F</strong>ind, <strong>I</strong>nstall, <strong>R</strong>un, and <strong>M</strong>odify your code. Below is a working draft of our ideas; as always, your feedback is solicited.</p>
<h3 id="find">Find</h3>
<ol>
<li>To make your code findable, we recommend three steps:
<ol>
<li>Make the code open source on a searchable code repository (e.g., <a href="https://github.com/">github</a> or <a href="https://about.gitlab.com/">gitlab</a>).</li>
<li>Generate a permanent Digital Object Identifier (DOI) so that you can freely move the code to other web services if you so desire without breaking the links (e.g., using <a href="https://zenodo.org/">zenodo</a>).</li>
<li>Add a license so that others can freely use your code without worrying about legal ramifications (see <a href="https://opensource.org/licenses">here</a> for options).</li>
</ol>
</li>
</ol>
<h3 id="install">Install</h3>
<ol>
<li>
<p>Provide installation guidelines, including <em>1-line installation</em> instructions with system requirements (including hardware and OS), software dependencies, and expected install time.</p>
</li>
<li>
<p>Deposit your code into a standard package manager, such as <a href="https://cran.r-project.org/">CRAN</a> for R or <a href="https://pypi.org/">PyPI</a> for Python. You might also provide a container or virtual machine image with your package pre-installed, for example, using <a href="https://www.docker.com/">Docker</a>, <a href="https://www.sylabs.io/docs/">Singularity</a>, or <a href="https://gigantum.com/">Gigantum</a>.</p>
</li>
</ol>
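<p>As a sketch of what depositing into a Python package manager involves, here is a minimal <code class="highlighter-rouge">setup.py</code>. The package name <code class="highlighter-rouge">mypkg</code>, its description, and its dependency list are hypothetical placeholders, not part of any package mentioned in this post.</p>

```python
# Minimal setup.py sketch for a hypothetical package "mypkg", so that
# `pip install .` works locally and `pip install mypkg` would work once
# the package is uploaded to PyPI.
from setuptools import find_packages, setup

setup(
    name="mypkg",                      # hypothetical name, not a real package
    version="0.1.0",
    description="One-line description shown on PyPI",
    packages=find_packages(),
    python_requires=">=3.6",           # state software requirements explicitly
    install_requires=["numpy>=1.15"],  # declare dependencies explicitly
)
```

<p>With something like this in place, the 1-line installation instruction in your README can simply be a single <code class="highlighter-rouge">pip install</code> command.</p>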
<h3 id="run">Run</h3>
<ol>
<li>
<p>Provide a demo, including requisite data, expected results, and runtime on specified hardware. The demo should be simple, intuitive, and fast to run. We recommend using <a href="https://rmarkdown.rstudio.com/">Rmarkdown</a> for R and a <a href="http://jupyter.org/">Jupyter Notebook</a> for Python.</p>
</li>
<li>
<p>Write a readme with a quick start guide, including installation and a simplified (plain text) version of the demo.</p>
</li>
<li>
<p>Make sure each function includes auto-generated documentation. We recommend <a href="https://cran.r-project.org/web/packages/roxygen2/vignettes/roxygen2.html">Roxygen</a> for R and <a href="http://www.sphinx-doc.org/en/master/">Sphinx</a> for Python.</p>
</li>
</ol>
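<p>To make the auto-generated documentation concrete, here is the kind of structured docstring that Sphinx can render directly into HTML reference pages. The function <code class="highlighter-rouge">zscore</code> is a made-up illustrative example, not part of any package mentioned in this post.</p>

```python
def zscore(x):
    """Standardize a list of numbers to zero mean and unit variance.

    Parameters
    ----------
    x : list of float
        Input values; must contain at least two distinct values.

    Returns
    -------
    list of float
        The standardized values, using the sample (n - 1) variance.

    Examples
    --------
    >>> zscore([1.0, 2.0, 3.0])
    [-1.0, 0.0, 1.0]
    """
    n = len(x)
    mean = sum(x) / n
    # Sample standard deviation (divide by n - 1)
    sd = (sum((v - mean) ** 2 for v in x) / (n - 1)) ** 0.5
    return [(v - mean) / sd for v in x]
```

<p>A nice side effect of this style is that the Examples section doubles as a doctest, so the documentation is tested along with the code.</p>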
<h3 id="modify">Modify</h3>
<ol>
<li>Include contribution guidelines, including:
<ol>
<li>style guidelines (<a href="https://google.github.io/styleguide/Rguide.xml">Google’s</a> or <a href="http://adv-r.had.co.nz/Style.html">Hadley’s</a> for R, or <a href="https://www.python.org/dev/peps/pep-0008/">PEP8</a> for Python)</li>
<li>bug reports,</li>
<li>pull requests, and</li>
<li>feature additions.</li>
</ol>
</li>
<li>Write unit tests for each function. Examples are <a href="http://testthat.r-lib.org/">testthat</a> for R and <a href="https://docs.python.org/3/library/unittest.html">unittest</a> for Python.</li>
<li>Incorporate continuous integration, for example, using either <a href="https://travis-ci.org/">TravisCI</a> or <a href="https://circleci.com/">CircleCI</a>.</li>
<li>Add the following <a href="https://shields.io/#/">badges</a> to your repo:
<ol>
<li>DOI,</li>
<li>license,</li>
<li>stable release version so people know which release they are on (from package manager),</li>
<li><a href="https://readthedocs.org/">documentation</a> to indicate that you generated documentation,</li>
<li><a href="https://codeclimate.com/">code quality</a> to indicate that your code is written using modern best practices,</li>
<li><a href="https://coveralls.io/">coverage</a> to indicate the extent to which you have written tests for your functions,</li>
<li><a href="https://www.docker.com/">build status</a> to indicate whether the virtual machine that contains the latest version of your code is running,</li>
<li>total number of downloads.</li>
</ol>
</li>
<li>Finally, provide benchmarks establishing current performance (using appropriate metrics) on standard problems, and, better yet, comparisons to other standard methods. Ideally, the code that generates the benchmark numbers is provided in <a href="http://jupyter.org/">Jupyter notebooks</a> in your <a href="https://gigantum.com/">Gigantum</a> project.</li>
</ol>
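<p>As a minimal sketch of what unit tests look like with Python’s built-in <code class="highlighter-rouge">unittest</code>: the function under test, <code class="highlighter-rouge">relu</code>, is an illustrative stand-in for your package’s own functions.</p>

```python
import unittest

def relu(x):
    """Toy function under test: return x if positive, else 0."""
    return x if x > 0 else 0

class TestRelu(unittest.TestCase):
    def test_positive_passthrough(self):
        self.assertEqual(relu(2.5), 2.5)

    def test_negative_clipped_to_zero(self):
        self.assertEqual(relu(-1.0), 0)

    def test_zero_boundary(self):
        self.assertEqual(relu(0), 0)

# Run the suite programmatically; a continuous integration service such as
# TravisCI or CircleCI would instead invoke `python -m unittest` on each push.
suite = unittest.TestLoader().loadTestsFromTestCase(TestRelu)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```

<p>Coverage tools then report which lines these tests exercise, which is what the coverage badge above summarizes.</p>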
<p>A few examples of numerical packages that we have released that satisfy all (or most of) these rules include:</p>
<ul>
<li><a href="https://github.com/neurodata/graspy">graspy</a></li>
<li><a href="https://github.com/neurodata/mgc">mgc</a></li>
<li><a href="https://github.com/neurodata/rerF/">Randomer Forest</a></li>
<li><a href="https://github.com/neurodata/ndmg">ndmg</a></li>
<li><a href="https://github.com/neurodata/LOL">LOL</a></li>
<li><a href="https://github.com/neurodata/ndreg">ndreg</a></li>
</ul>
<p><em>Joshua Vogelstein</em></p>
<h2>10 Simple Rules for Writing Paragraphs</h2>
<p><em>2018-10-14, <a href="https://bitsandbrains.io/2018/10/14/paragraphs">paragraphs</a></em></p>
<p>Here is a list of tips I check for each paragraph whenever I write anything:</p>
<ol>
<li>Each paragraph is about a single concrete and coherent idea
<ol>
<li>Does the first sentence of the paragraph introduce this idea?</li>
<li>Do all subsequent sentences in the paragraph further clarify the first sentence?</li>
<li>Did I use transitional words to establish logical relationships between sentences?</li>
<li>Did I write anything like, “Next, we …”? If so, remove. The structure of the paragraphs should flow into one another.</li>
<li>Did I sequence light to heavy, so that earlier sentences start light/succinct and later ones follow up with “heavy” details?</li>
<li>Does the last sentence of every paragraph link it to next paragraph?</li>
</ol>
</li>
<li>Tense
<ol>
<li>Is the tense consistent within each paragraph (results are past tense)?</li>
<li>Is it past tense only for things that you or other people did previously?</li>
<li>Are there any passive voice phrases (e.g., “it can be shown”)? If so, revise to active voice.</li>
</ol>
</li>
<li>Any time you introduce a new concept
<ol>
<li>Have other people discussed this concept? If so, did you use familiar notation/naming conventions? If not, have you clarified/justified the difference?</li>
<li>If you’ve introduced 1 new concept, have you used the name of that topic consistently? And did you never use any other term (to bury the concept in their mind)?</li>
<li>When introducing a novel concept/word/equation/notation/etc., did you explain it <em>before</em> usage, rather than after (else the reader will not understand when reading it, and we don’t want that)?</li>
</ol>
</li>
</ol>
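<p>Most of these checks require human judgment, but the purely mechanical ones can be automated with a simple script. Here is a toy sketch; the flagged-phrase list is illustrative, not an official part of the checklist above.</p>

```python
import re

# Phrases the checklist warns about: passive constructions such as
# "it can be shown", and explicit sequencing such as "Next, we ...".
FLAGGED_PATTERNS = [
    r"\bit can be shown\b",
    r"\bit has been shown\b",
    r"\bnext, we\b",
]

def flag_phrases(paragraph):
    """Return the patterns from FLAGGED_PATTERNS found in the paragraph."""
    return [p for p in FLAGGED_PATTERNS
            if re.search(p, paragraph, flags=re.IGNORECASE)]

print(flag_phrases("Next, we prove the bound; it can be shown that..."))
```

<p>Running a checker like this on every paragraph of a draft catches the mechanical slips, leaving the structural checks (one idea per paragraph, light-to-heavy sequencing) for careful rereading.</p>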
<p>See this <a href="https://www.youtube.com/watch?v=rZxaSMzstB8">youtube video</a> for details.</p>
<p>I realize this is actually 12 things to check. I’m ok with that.</p>
<p><em>Joshua Vogelstein</em></p>
<h2>How to Structure a Grant</h2>
<p><em>2018-10-14, <a href="https://bitsandbrains.io/2018/10/14/structuring-a-grant">structuring-a-grant</a></em></p>
<p>I write many grants. Most of them, or at least the parts that I write, are very quantitative in nature. Most of how I think about writing them comes directly from discussions with <a href="http://optimizescience.com">Brett Mensh</a>. He is the world’s expert on grant-writing, and I highly recommend you contact him.</p>
<h3 id="specific-aims">Specific Aims</h3>
<p>This is one page, and it matters more than everything else combined. I do it as follows:</p>
<ol>
<li>A paragraph introducing the problem we are solving</li>
  <li>A paragraph on why it is hard, i.e., why other really smart people (e.g., the review panel) have not yet been able to solve it.</li>
<li>A paragraph motivating our overall approach/philosophy to the problem</li>
<li>The 3-4 aims. For each aim, there is a 1-2 line <em>action</em> statement of what the aim is, and what it will deliver. For example, “Develop nonparametric machine learning techniques to identify brain-imaging biomarkers for depression using the Healthy Brain Network Dataset.” Note that there is a verb (develop), and it is clear to the funders what they’ll get (a new biomarker for depression), and how we will do it (nonparametric machine learning).</li>
</ol>
<h3 id="research-strategy">Research Strategy</h3>
<ol>
<li>
<p>Significance: Up to 1-2 pages, talking about how important your problem is to solve, funneling down from most general to most specific. One sentence in <strong>bold</strong> to highlight the potential impact of your proposed work.</p>
</li>
<li>
<p>Innovation: Up to 0.5 pages, highlighting the novel technical contributions, again with 1 sentence in <strong>bold</strong> to focus on the key innovation of the proposed work.</p>
</li>
<li>Approach: ~9-12 pages (depending on the specific grant), organized into an “overview” section followed by 3-4 aims. The 2-3 page overview section describes commonalities between the aims, any data that are being used, and related things. The overview can also include or be followed by a “general background” that applies to each of the aims. Each aim is about 2-3 pages, and includes the following sections:
<ol>
<li><em>Introduction</em>: a 1 paragraph jargon-free introduction of what you will accomplish in this aim, and how.</li>
<li><em>Justification and Feasibility</em> or <em>Preliminary Results</em>: Up to 1 page describing why you are particularly well-suited to accomplish the goals in the allotted time given the allotted resources.</li>
<li><em>Research Design</em>: ~2 paragraphs on the details of what you’ll actually do.</li>
<li><em>Expected Outcomes</em>: 1 paragraph on what you expect to actually “deliver” back to the funding agency.</li>
<li><em>Potential Pitfalls and Alternative Strategies</em>: A few lines to indicate that you understand which parts are difficult, and have a contingency plan.</li>
</ol>
</li>
  <li>Timeline and Future Direction: ~1/2 page, describing when the activities will happen, including a table organized by Aim, and connecting the work to your long-term agenda.</li>
</ol>
<h3 id="some-other-tips">Some other tips:</h3>
<ol>
<li>Follow my blog post on <a href="/2018/10/14/words.html">words</a></li>
<li>Follow my blog post on <a href="/2018/10/14/paragraphs.html">paragraph</a></li>
<li>Follow my blog post on <a href="/2018/09/08/figures.html">figures</a></li>
<li>The “name” of each aim/task should be an <a href="http://www.quickslide-powerpoint.com/en/blog/action-titles-providing-orientation-well-thought-out-slide-titles">“action title”</a></li>
  <li>Each aim/task should follow OCAR (Opening, Challenge, Action, Resolution)</li>
<li>For each sub-aim/task, include a <strong>bold</strong> sentence precisely and concisely stating its objective (the <em>action</em> part)</li>
<li>Make sure to distinguish your own work from others explicitly every time your work is cited</li>
  <li>Check that formatting is consistent across <em>all</em> documents, both within a document type (eg, biosketches) and across types.</li>
<li>Use <a href="https://www.google.com/docs/about/">google docs</a></li>
<li>Use <a href="https://paperpile.com/">paperpile</a> for references (free version for 2 weeks)</li>
<li>Use <a href="https://chrome.google.com/webstore/detail/auto-latex-equations/iaainhiejkciadlhlodaajgbffkebdog?hl=en-US">auto-latex</a> for equations</li>
<li>Keep figures at very end until last opportunity</li>
</ol>
<h4 id="nsf-specific">NSF specific</h4>
<ol>
<li>For summary, focus on gap and impact</li>
<li>For intellectual merit, use as much language from <a href="http://www.sciencemag.org/sites/default/files/documents/Big%20Ideas%20compiled.pdf">NSF Big Ideas</a> as possible</li>
<li>Broader impacts is about societal, not intellectual benefit, focusing on STEM education, minorities, disabilities, open source, etc.</li>
</ol>Joshua VogelsteinWords/Phrases to Check in Technical Writing2018-10-14T18:27:57+00:00https://bitsandbrains.io/2018/10/14/words<p>Here is a list of tips I check whenever I write anything:</p>
<ol>
<li>check for any misspelled words using spellcheck</li>
<li>replace contractions with complete words (eg, don’t –> do not)</li>
<li>replace abbreviations with complete words (eg, “e.g.” –> for example)</li>
<li>replace colloquialisms with more formal words (eg, nowadays –> recently)</li>
  <li>when referring to figures, be consistent, probably “Figure X” everywhere, although “Fig.~X” is also permissible; “figure x” is not.</li>
<li>punctuate every equation (e.g., it gets a comma or a period after)</li>
<li>In latex, replace all double quotes with `` and ‘’.</li>
</ol>
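Many of these checks can be automated with a short script. Here is a minimal, hypothetical sketch (the flagged-phrase dictionary is a tiny illustrative subset, not an exhaustive list; extend it with the word list below):

```python
import re

# Illustrative subset of the checks above: pattern -> suggested replacement.
FLAGGED = {
    r"\bdon't\b": "do not",
    r"\be\.g\.": "for example",
    r"\bnowadays\b": "recently",
    r"\bfig\.?\b": "Figure",
}

def check_text(text):
    """Return a list of (pattern, suggestion, count) for flagged phrases found in text."""
    hits = []
    for pattern, suggestion in FLAGGED.items():
        count = len(re.findall(pattern, text, flags=re.IGNORECASE))
        if count:
            hits.append((pattern, suggestion, count))
    return hits

print(check_text("Nowadays we don't label figures, e.g., fig. 2."))
```

Running this on each draft makes the mechanical passes below much faster; the real work remains deciding how to reword each hit.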
<p>For the following list of words, literally do a search of every instance of each of the below words, and modify the text as described below.</p>
<ol>
<li>i, we, our, us, you, your –> rewrite sentence (almost always)</li>
<li>in order to/for –> to/for</li>
<li>clearly, obviously –> remove, might not be so clear/obvious to everyone</li>
<li>this, they, it –> be specific, which noun this/they/it is referring to is often vague</li>
<li>very –> use a stronger adjective/adverb</li>
<li>data is –> data are</li>
<li>novel –> remove, novelty should be implied by context. If it is not clear by context, update context</li>
  <li>most/least/best/worst/better/worse/optimal/*est: these are empirical claims, so each requires a citation or evidence, plus the dimension along which the comparison is made</li>
<li>usually/typically –> same deal, either provide a citation/evidence, or don’t say it, replace with “frequently”</li>
<li>no reason / essential/necessary / necessitate / no way / impossible –> remove, these are all too strong, just because you haven’t thought of a reason, or a counter example, or another way, does not mean that nobody else has/can.</li>
<li>done –> completed.</li>
<li>utilized –> used</li>
<li>firstly –> first, and similarly for second, third, etc.</li>
<li>& –> and</li>
<li>arguably –> possibly, likely, perhaps (why argue with your reader?)</li>
<li>as such –> ok sparsely, but often overused, check and revise sentences.</li>
  <li>numbers take a space before the unit, eg 1GB –> 1 GB.</li>
<li>first –> somebody reading this will think they did it first, and they are at least partially correct. first-ness should be implied by context.</li>
<li>in this manuscript –> nothing, it is implied.</li>
<li>a priori –> should be italics</li>
<li>is used –> rewrite sentence, avoid passive tense whenever possible</li>
<li>can be seen / it has been shown –> typically just remove, sometimes replace with “shows”, or reword sentences</li>
<li>we want to –> we (though should be reworded to avoid “we” entirely)</li>
<li>Fig, Fig., fig –> Figure (or at least be consistent)</li>
<li>we chose appropriate –> we chose (let them decide whether it was appropriate)</li>
<li>“note that” or “we note that” or “we highlight” or “we highlight that” –> simply remove.</li>
<li>can be / we think / could be / might be / etc. –> these are always true, and therefore the clause that follows could be anything, and is too weak. be stronger, but only quantitatively if there is evidence, and cite it.</li>
<li>should –> who is the arbiter of what should and shouldn’t be, and how do you have privileged access to that information? (hint: you don’t) (ps - i realize i used should in this document)</li>
<li>methodology –> method</li>
  <li>it appears –> usually just remove this; the sentence is better without it</li>
</ol>Joshua VogelsteinThe Varieties of Hypothesis Tests2018-09-29T18:27:57+00:00https://bitsandbrains.io/2018/09/29/categories-of-testing<p>Hypothesis testing is a procedure for evaluating the likelihood that a given dataset corresponds to a particular “null” model. Formally, all hypothesis testing scenarios can be described as follows.</p>
<p>Let <script type="math/tex">X_i \sim F</script>, for <script type="math/tex">i \in [n]=\{1,\ldots,n\}</script> be a random variable, where each <script type="math/tex">X_i</script> is sampled independently and identically according to some distribution <script type="math/tex">F</script>. Realizations of <script type="math/tex">X_i</script> are <script type="math/tex">x_i \in \mathcal{X}</script>. We further assume that <script type="math/tex">F \in \mathcal{F}</script>, that is, <script type="math/tex">F</script> is a distribution that lives in a model (a set of distributions) <script type="math/tex">\mathcal{F}</script>. Moreover, we partition <script type="math/tex">\mathcal{F}</script> into two complementary sets, <script type="math/tex">\mathcal{F}_0</script> and <script type="math/tex">\mathcal{F}_A</script>, such that <script type="math/tex">\mathcal{F}_0 \cup \mathcal{F}_A = \mathcal{F}</script> and <script type="math/tex">\mathcal{F}_0 \cap \mathcal{F}_A = \emptyset</script>. Given these definitions, all hypothesis tests can be written as:</p>
<p>\begin{align}
H_0: F \in \mathcal{F}_0, \qquad
H_A: F \in \mathcal{F}_A.
\end{align}</p>
<!-- For example, $$\mathcal{F}$$ could be the set of all Gaussians, and therefore $$x_i \in \mathbb{R}$$, and $$\mathcal{F}_0$$ could be $$\mathcal{N}(0,1)$$, and $$\mathcal{F}_A=\mathcal{N}(1,1)$$. -->
<p>Note that a hypothesis can be either <em>simple</em> or <em>composite</em>: according to <a href="https://en.wikipedia.org/wiki/Null_hypothesis">wikipedia</a>, a simple hypothesis is
“any hypothesis which specifies the population distribution completely,” whereas a composite hypothesis is “any hypothesis which does not specify the population distribution completely.” Given this, we consider four different kinds of hypothesis tests:</p>
<ol>
<li>
<p>Simple-Null, Simple-Alternate (S/S)
\begin{align}
H_0: F = F_0, \qquad
H_A: F = F_A.
\end{align}</p>
</li>
<li>
<p>Simple-Null, Composite-Alternate (S/C)
\begin{align}
H_0: F = F_0, \qquad
H_A: F \in \mathcal{F}_A.
\end{align}</p>
</li>
<li>
<p>Composite-Null, Simple-Alternate (C/S)
\begin{align}
H_0: F \in \mathcal{F}_0, \qquad
H_A: F = F_A.
\end{align}</p>
</li>
<li>
<p>Composite-Null, Composite-Alternate (C/C):
\begin{align}
H_0: F \in \mathcal{F}_0, \qquad
H_A: F \in \mathcal{F}_A.
\end{align}</p>
</li>
</ol>
<p>Each of the above can be re-written in terms of parameters. Specifically, we can make the following substitutions:</p>
<p>\begin{align}
F = F_j \Rightarrow \theta(F) = \theta(F_j), \qquad
F \in \mathcal{F}_j \Rightarrow \theta(F) \in \Theta_j,
\end{align}
where <script type="math/tex">\Theta_j = \{ \theta(F) : F \in \mathcal{F}_j \}</script>.</p>
<p>Note that composite tests (those where the null or alternate is composite) can be either one-sided or two-sided. In particular, the simple null hypothesis can be restated: <script type="math/tex">H_0: F-F_0 = 0</script>. Given that, a one-sided simple/composite test is:
\begin{align}
H_0: F - F_0 = 0, \qquad
H_A: F - F_0 > 0.
\end{align}
Of course, we could just as easily replace the <script type="math/tex">></script> in the alternate with <script type="math/tex">% <![CDATA[
< %]]></script>.<br />
The two-sided composite test can be written:
\begin{align}
H_0: F - F_0 = 0, \qquad
H_A: F - F_0 \neq 0,
\end{align}
or equivalently,
\begin{align}
H_0: F = F_0, \qquad
H_A: F \neq F_0.
\end{align}</p>
<p>In other words, the simple null, two-sided composite is equivalent to a test for equality (see below).</p>
<p>Thus, there are essentially three kinds of composite hypotheses:</p>
<ol>
<li>one-sided,</li>
<li>two-sided, or</li>
<li>neither (e.g., the alternative is a more complex set).</li>
</ol>
<h3 id="examples">Examples</h3>
<h4 id="independence-testing">Independence Testing</h4>
<p>An independence test is one of the fundamental tests in statistics. In this case, let <script type="math/tex">Z_i = (X_i, Y_i) \sim F_{XY}</script>. Now we test:
\begin{align}
H_0: F_{X,Y} = F_{X} F_{Y} \qquad
H_A: F_{X,Y} \neq F_{X} F_{Y}.
\end{align}</p>
<p>In other words, we define <script type="math/tex">\mathcal{F}_0</script> as the set of joint distributions on <script type="math/tex">(X,Y)</script> such that the joint equals the product of the marginals: <script type="math/tex">\mathcal{F}_0 = \{ F_{X,Y} = F_{X}F_{Y} \}</script>, and we define <script type="math/tex">\mathcal{F}_A</script> as the complement of that set, <script type="math/tex">\mathcal{F}_0^c = \mathcal{F} \backslash \mathcal{F}_0 = \mathcal{F}_A</script>. Independence tests are special cases of C/C tests.</p>
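For intuition, here is a minimal sketch of one such test (the function name and defaults are ours): a permutation test using the absolute sample correlation as the statistic, which therefore only detects linear dependence:

```python
import numpy as np

def independence_permutation_test(x, y, n_perm=2000, seed=0):
    """Permutation test of H0: F_{X,Y} = F_X F_Y, using |correlation| as the statistic.

    Under H0, shuffling y relative to x does not change the joint distribution,
    so the permuted statistics form a valid null distribution.
    """
    rng = np.random.default_rng(seed)
    observed = abs(np.corrcoef(x, y)[0, 1])
    null = np.array([abs(np.corrcoef(x, rng.permutation(y))[0, 1])
                     for _ in range(n_perm)])
    # Add-one correction keeps the p-value valid (never exactly zero).
    return (1 + np.sum(null >= observed)) / (1 + n_perm)

rng = np.random.default_rng(1)
x = rng.normal(size=200)
y = 0.5 * x + rng.normal(size=200)  # dependent on x
z = rng.normal(size=200)            # independent of x
p_dep = independence_permutation_test(x, y)
p_ind = independence_permutation_test(x, z)
print(p_dep, p_ind)
```

Any statistic sensitive to departures from the product distribution could be substituted for the correlation.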
<h4 id="two-sample-testing">Two-Sample Testing</h4>
<p>Let <script type="math/tex">X_i \sim F_1</script>, for <script type="math/tex">i \in [n]</script> be a random variable, where each <script type="math/tex">X_i</script> is sampled independently and identically according to some distribution <script type="math/tex">F_1</script>. Realizations of <script type="math/tex">X_i</script> are <script type="math/tex">x_i \in \mathcal{X}</script>.</p>
<p>Additionally, let <script type="math/tex">Y_i \sim F_2</script>, for <script type="math/tex">i \in [m]</script> be a random variable, where each <script type="math/tex">Y_i</script> is sampled independently and identically according to some distribution <script type="math/tex">F_2</script>. Realizations of <script type="math/tex">Y_i</script> are <script type="math/tex">y_i \in \mathcal{Y}</script>.</p>
<p>Given these definitions, all two-sample testing can be written as:</p>
<p>\begin{align}
H_0: F_1 = F_2 \qquad
H_A: F_1 \neq F_2.
\end{align}</p>
<p>This can be written as an independence test. First, define the mixture distribution <script type="math/tex">F = \pi_1 F_1 + \pi_2 F_2</script>, where <script type="math/tex">\pi_1,\pi_2 \geq 0</script> and <script type="math/tex">\pi_1+\pi_2=1</script>. Now, sample <script type="math/tex">U_i \sim F</script> for <script type="math/tex">n+m</script> times. To make it exactly equal to the above, set <script type="math/tex">\pi_1 = n/(n+m)</script>. Moreover, define <script type="math/tex">V_i</script> to be the latent “class label”, that is <script type="math/tex">V_i=1</script> if <script type="math/tex">U_i \sim F_1</script> and <script type="math/tex">V_i=2</script> if <script type="math/tex">U_i \sim F_2</script>. Now, we can form the independence test:</p>
<p>\begin{align}
H_0: F_{UV} = F_U F_V \qquad
H_A: F_{UV} \neq F_U F_V.
\end{align}</p>
<p>Moreover, simple goodness-of-fit tests can also be written as two-sample tests, as is readily apparent from the description below. Thus, simple goodness-of-fit tests are also independence tests.</p>
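The mixture/label trick above can be sketched directly (the helper name and the choice of difference-in-means as the dependence statistic are ours):

```python
import numpy as np

def two_sample_as_independence(x, y, n_perm=2000, seed=0):
    """Two-sample test H0: F_1 = F_2, recast as an independence test.

    Pool the samples into U, attach class labels V (1 or 2), and test whether
    U and V are dependent; here the statistic is the absolute difference in
    class-conditional means, a simple measure of U-V dependence.
    """
    rng = np.random.default_rng(seed)
    u = np.concatenate([x, y])
    v = np.concatenate([np.ones(len(x)), 2 * np.ones(len(y))])

    def stat(labels):
        return abs(u[labels == 1].mean() - u[labels == 2].mean())

    observed = stat(v)
    # Permuting the labels simulates the null in which U and V are independent.
    null = np.array([stat(rng.permutation(v)) for _ in range(n_perm)])
    return (1 + np.sum(null >= observed)) / (1 + n_perm)

rng = np.random.default_rng(2)
p = two_sample_as_independence(rng.normal(0, 1, 100), rng.normal(1, 1, 120))
print(p)
```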
<h4 id="goodness-of-fit-testing">Goodness-of-Fit Testing</h4>
<p>The most general kind of a goodness-of-fit test is a C/C test:
\begin{align}
H_0: F \in \mathcal{F}_0 \qquad
H_A: F \notin \mathcal{F}_0.
\end{align}
In other words, <script type="math/tex">\mathcal{F}_A = \mathcal{F}_0^c = \mathcal{F} \backslash \mathcal{F}_0</script>.</p>
<p>The S/C goodness-of-fit test is a special case:
\begin{align}
H_0: F = F_0 \qquad
H_A: F \neq F_0.
\end{align}</p>
<p>This special case is an instance of a two-sample test; the only difference is that the second distribution is not estimated from the data, but rather is provided by the null.</p>
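As a concrete sketch, the Kolmogorov–Smirnov test in scipy performs an S/C goodness-of-fit test when the null CDF is fully specified (the standard-normal null and sample sizes here are illustrative):

```python
import numpy as np
from scipy import stats

# S/C goodness-of-fit: the null specifies F_0 completely (a standard normal),
# so no parameters are estimated from the data.
rng = np.random.default_rng(0)
x_null = rng.normal(0.0, 1.0, size=500)  # sampled from F_0
x_alt = rng.normal(0.4, 1.0, size=500)   # sampled from some F != F_0

p_null = stats.kstest(x_null, stats.norm.cdf).pvalue
p_alt = stats.kstest(x_alt, stats.norm.cdf).pvalue
print(p_null, p_alt)
```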
<h4 id="k-sample-tests">K-Sample Tests</h4>
<p>More generally, assume there exists <script type="math/tex">K</script> different distributions, <script type="math/tex">F_1, \ldots, F_K</script>, the K-Sample test is:</p>
<p>\begin{align}
H_0: F_1 = F_2 = \cdots = F_K \qquad
H_A: \text{any not equal}.
\end{align}</p>
<p>We can do the same trick as above, defining <script type="math/tex">F</script> to be a mixture of <script type="math/tex">K</script> distributions, and <script type="math/tex">V_i</script> to denote the component from which sample <script type="math/tex">i</script> was drawn. Thus, <script type="math/tex">K</script>-sample tests are also independence tests.</p>
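A classical choice for the K-sample null is the Kruskal–Wallis test; a minimal sketch (the group sizes and means are illustrative):

```python
import numpy as np
from scipy import stats

# Three groups; the third has a shifted mean, so H0 (all distributions equal) is false.
rng = np.random.default_rng(0)
samples = [rng.normal(mu, 1.0, size=80) for mu in (0.0, 0.0, 0.7)]

stat, p = stats.kruskal(*samples)
print(stat, p)
```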
<h4 id="paired-k-sample-testing">Paired K-Sample Testing</h4>
<p>Let <script type="math/tex">X_i</script> and <script type="math/tex">Y_i</script> be defined as above, except now we sample matched pairs, <script type="math/tex">(X_i,Y_i) \sim F_{X,Y}</script>. The paired two-sample test is still:
\begin{align}
H_0: F_X = F_Y \qquad
H_A: F_X \neq F_Y,
\end{align}
but there exist more powerful test statistics that consider the “matchedness”. Of course, this also extends to the <script type="math/tex">K</script>-sample setting.</p>
<p>Note that this is a special case of <script type="math/tex">K</script>-sample testing, and thus is also an independence test.</p>
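A sketch of the power gained from exploiting matchedness, comparing an unpaired rank test to the paired Wilcoxon signed-rank test (the data-generating choices below are ours):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
subject = rng.normal(0.0, 5.0, size=30)            # large between-subject spread
x = subject + rng.normal(0.0, 1.0, size=30)        # measurement 1
y = subject + 2.0 + rng.normal(0.0, 1.0, size=30)  # measurement 2, shifted by 2

p_unpaired = stats.mannwhitneyu(x, y).pvalue  # ignores the matching
p_paired = stats.wilcoxon(x, y).pvalue        # signed-rank test on matched pairs
print(p_unpaired, p_paired)
```

Because the between-subject variability cancels within each pair, the paired statistic sees a much cleaner shift.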
<h4 id="test-for-symmetry">Test for Symmetry</h4>
<p>Assume we desire to determine whether the distribution of <script type="math/tex">X</script> is symmetric about zero. Define <script type="math/tex">F_+</script> as the half of the distribution that is positive, and <script type="math/tex">F_-</script> as the half that is negative. Then, we simply have a typical two-sample test:
\begin{align}
H_0: F_+ = -F_- \qquad
H_A: F_+ \neq -F_-,
\end{align}
which is an independence test.</p>
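A minimal sketch of this reduction, using a Kolmogorov–Smirnov two-sample test (the shifted-gamma example distribution is ours; it has mean zero but is skewed, so it is not symmetric about zero):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x = rng.gamma(2.0, 1.0, size=500) - 2.0  # mean zero, but skewed about zero

# Two-sample test of the positive part against the reflected negative part.
p = stats.ks_2samp(x[x > 0], -x[x < 0]).pvalue
print(p)
```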
<h4 id="multinomial-test">Multinomial Test</h4>
<p>Assume <script type="math/tex">X \sim</script>Multinomial<script type="math/tex">(p_1, p_2, p_3; n)</script>, and let:
\begin{align}
H_0: p_1 = p_2 = p_3 \qquad
H_A: \text{not all equal}.
\end{align}</p>
<p>This is again a <script type="math/tex">K</script> sample test, which is an independence test.</p>
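One classical implementation of this test is Pearson’s chi-squared test; a minimal sketch with illustrative counts:

```python
import numpy as np
from scipy import stats

# Observed counts for the three categories; under H0, p1 = p2 = p3 = 1/3.
counts = np.array([48, 35, 67])
stat, p = stats.chisquare(counts)  # expected frequencies default to uniform
print(stat, p)
```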
<p>Considering a slightly different case:
\begin{align}
H_0: p_1 = p_2 \qquad
H_A: p_1 \neq p_2.
\end{align}
This is a slight generalization of the above, but can still be written as a <script type="math/tex">K</script> sample test, and is therefore an independence test.</p>
<h4 id="one-sided-test">One Sided Test</h4>
<p>Assume <script type="math/tex">X \sim \mathcal{N}(\mu,1)</script>, and consider the following test:
\begin{align}
H_0: \mu = 0 \qquad
H_A: \mu \neq 0.
\end{align}</p>
<p>This is a two-sample test, and therefore, an independence test. A slightly more complicated variant is:
\begin{align}
H_0: \mu \leq 0 \qquad
H_A: \mu > 0.
\end{align}</p>
<p>Viewing this as an independence test is a bit more complicated. In particular, if we set it up as the above two-sample test, and we reject, it could be because <script type="math/tex">% <![CDATA[
\mu < 0 %]]></script> or because <script type="math/tex">\mu > 0</script>. In practice, however, this is straightforward to address. Letting our test statistic be <script type="math/tex">\bar{x}= \frac{1}{n} \sum_i x_i</script>, we consider four scenarios:</p>
<ol>
<li>If <script type="math/tex">% <![CDATA[
\bar{x} < 0 %]]></script>, and we reject, then we know that <script type="math/tex">\bar{x}</script> is significantly below <script type="math/tex">0</script>, and therefore, we should <em>not</em> reject the real null.</li>
<li>If <script type="math/tex">% <![CDATA[
\bar{x} < 0 %]]></script> and we fail to reject, then we have failed to reject our original hypothesis, and we are ok.</li>
  <li>If <script type="math/tex">\bar{x} > 0</script>, and we reject, then we know that <script type="math/tex">\bar{x}</script> is significantly above <script type="math/tex">0</script>, and therefore we can safely reject the null.</li>
<li>If <script type="math/tex">\bar{x} > 0</script>, and we fail to reject, that means <script type="math/tex">\bar{x}</script> is not significantly bigger than zero, but we cannot reject the original null.</li>
</ol>
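The four scenarios above amount to a simple decision rule; a minimal sketch (the helper name and simulated means are ours):

```python
import numpy as np
from scipy import stats

def reject_one_sided(x, alpha=0.05):
    """Valid (if conservative) test of the one-sided null that mu is at most 0.

    Run the two-sided test of mu = 0, and reject only when it rejects
    AND the sample mean lies on the alternative's side.
    """
    p_two_sided = stats.ttest_1samp(x, popmean=0.0).pvalue
    return bool(p_two_sided < alpha and x.mean() > 0)

rng = np.random.default_rng(0)
above = reject_one_sided(rng.normal(0.5, 1.0, size=100))   # true mean above 0
below = reject_one_sided(rng.normal(-0.5, 1.0, size=100))  # true mean below 0
print(above, below)
```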
<p>Thus, we can use the test for whether <script type="math/tex">\mu=0</script>, probably at a fairly severe reduction in power, but nonetheless, maintain a valid test.</p>Joshua Vogelstein10 Methods for Linear Dimensionality Reduction2018-09-25T18:27:57+00:00https://bitsandbrains.io/2018/09/25/linear-dimensionality-reduction<p>Colleagues and trainees have asked for a basic summary of different linear dimensionality reduction methods. Here is my take on them.</p>
<ol>
<li>
<p>PCA - finds the directions of maximal variance in your data matrix. It <strong>actually solves for the right answer</strong>, that is, finds the true optimal dimensions. They are optimal in the sense that they are the best dimensions for finding a low-dimensional representation that minimizes squared error loss. This means it cannot be beat by fancy things like manifold learning, etc., <strong>for squared error loss</strong>. In practice, squared error loss is typically not what we really want to optimize.</p>
</li>
<li>
    <p>ICA - As it turns out, PCA finds directions that are orthogonal (perpendicular to one another). If one simply relaxes this constraint, one obtains ICA. Unlike PCA however, <strong>there is no guarantee that the algorithm finds the optimal answer</strong>. The objective function is non-convex, so the initialization matters. I am not familiar with any theory stating that any particular approach has guaranteed performance, meaning one can never say that, with high probability, the answer is very good (when running ICA). It is commonly used for blind signal separation in some disciplines. Often initialized by PCA. There are many extensions/generalizations of ICA.</p>
</li>
<li>
    <p>NMF - Related to ICA in certain ways, in that it changes the constraints of PCA. For this, it helps me to think of PCA as a matrix factorization problem, where there are loading vectors and principal components. The PC’s are constrained to be orthonormal, meaning orthogonal and each of unit norm. In NMF, <strong>one or both of the matrices are constrained to be non-negative</strong>. This is particularly useful when you have domain knowledge that the “truth” is non-negative, such as images. However, in practice, simply running PCA and then truncating things below zero to equal zero often works as well with much less hassle.</p>
</li>
<li>
<p>LDA - Here, rather than merely a “data matrix”, we also have a target variable that is categorical (eg, binary). Whereas PCA tries to find the directions that maximize the variance, <strong>LDA finds the 1 dimension that maximizes the “separability” of the two classes</strong>, under a gaussian assumption. It is consistent under the gaussian assumption, meaning if you have enough data, it will eventually achieve the optimal result.</p>
</li>
<li>
<p>Linear regression - Same as LDA, but here the target variable is continuous, rather than categorical. <strong>So, linear regression finds the 1 dimension that “is closest” to the target variable</strong>. As it turns out, in a very real sense, linear regression is a strict generalization of LDA, meaning that if one runs linear regression on a two-class classification problem, one will recover the same vector that LDA would provide. Like LDA, it is optimal under a gaussian assumption if you have enough data.</p>
</li>
<li>
    <p>CCA - Is a generalization of linear regression and LDA, which applies when you have multiple target variables (eg, height and weight). CCA finds the dimensions of both the “data matrix” and the “target matrix” that <strong>jointly maximize their correlation</strong>. If the target matrix is one-dimensional, it reduces to linear regression. If it is one-dimensional and categorical, it reduces to LDA. Again, it achieves optimality under linear gaussian assumptions with enough data.</p>
  </li>
  <li>
    <p>PLS - Like CCA, but rather than maximizing correlation, it <strong>maximizes covariance</strong> between the learned dimensions. Often this works better than CCA and other methods. The geometry and theory underlying PLS are much less clear, however, and efficient implementations are less readily available.</p>
</li>
<li>
<p>JIVE - Also like CCA, in the sense that there are multiple modalities, but now there are also multiple subjects, and each subject has matrix valued data (and each matrix is the same size/shape/features). JIVE explicitly tries to model the variance that is <strong>shared across individuals</strong>, and separate it out from the subject specific signal, which is also separated from subject specific noise. It does so by imposing several constraints, specifically that features are orthogonal to one another.</p>
</li>
<li>
<p>LOL - We made this one up; the arxiv version is <a href="https://arxiv.org/abs/1709.01233">here</a>. It is designed to find a low-dimensional representation <strong>when dimensionality is much larger than sample size</strong>. We proved that, under gaussian assumptions, it works better than PCA and other linear methods for subsequent classification.</p>
</li>
<li>
    <p>Random Projections - These have support from an elegant theory, the <a href="https://en.wikipedia.org/wiki/Johnson%E2%80%93Lindenstrauss_lemma">Johnson–Lindenstrauss lemma</a>, which states that “random” projections can achieve low reconstruction error with high probability. There are many possible ways to generate these random projections; my favorite is <a href="https://dl.acm.org/citation.cfm?id=1150436">very sparse random projections</a>, which require only a few non-zero entries, each of which can be +1 or -1. We use this result to accelerate LOL.</p>
</li>
</ol>
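A minimal numpy sketch of the PCA claim in item 1: the SVD of the centered data yields the rank-d approximation that minimizes squared error (the Eckart–Young theorem), so the residual equals the energy of the discarded singular values (the data shapes below are illustrative):

```python
import numpy as np

# PCA via the SVD: the top-d right singular vectors span the directions of
# maximal variance, and give the best rank-d approximation in squared error.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10)) @ rng.normal(size=(10, 10))
Xc = X - X.mean(axis=0)  # center first, so singular vectors are PCA directions

U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
d = 3
X_hat = (U[:, :d] * s[:d]) @ Vt[:d]  # rank-d reconstruction

# Eckart-Young: the squared reconstruction error equals the sum of the
# discarded squared singular values, and no rank-d approximation does better.
err = np.sum((Xc - X_hat) ** 2)
print(err, np.sum(s[d:] ** 2))
```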
<p>Others have written/compared all the various linear methods in greater detail. One example is <a href="https://stat.columbia.edu/~cunningham/pdf/CunninghamJMLR2015.pdf">here</a>, which addresses many of the methods I discussed above, as well as some others (but misses PLS and JIVE).</p>Joshua Vogelstein10 Simple Rules for Designing Learning Machines2018-09-24T18:27:57+00:00https://bitsandbrains.io/2018/09/24/modeling-desiderata<p>When designing an estimator (or a learner, as machine learning people say), there are a number of desiderata to consider. I believe the following is a nearly <a href="https://en.wikipedia.org/wiki/MECE_principle">MECE list</a> (mutually exclusive and collectively exhaustive). However, if you have other desiderata that I have missed, please let me know in the comments. Recall that an estimator takes data as input, and outputs an <em>estimate</em>. Below is my list:</p>
<ol>
<li>
<p><strong>Sufficiency</strong>: The estimate must be sufficient to answer the underlying/motivating question. For example, if the question is about the average height of a population, the estimate must be a positive number, not a vector, not an image, etc. This is related to <a href="https://en.wikipedia.org/wiki/Validity_(statistics)">validity</a>, as well as the classical definition of <a href="https://en.wikipedia.org/wiki/Sufficient_statistic">statistical sufficiency</a>, in that we say that if an estimator is estimating a sufficient statistic, then it is a sufficient estimator. Note, however, that sufficient estimators do not have to estimate sufficient statistics. To evaluate the degree of <em>sufficiency</em> of an estimator for a given problem one determines how “close” the measurement and action spaces of the estimator are to the desired measurement and action space. For example, the action space may be real numbers, but an estimator may only provide integers, which means that it is not quite perfectly sufficient, but it may still do the trick.</p>
</li>
<li>
    <p><strong>Consistency</strong>: We prefer, under the model assumptions, for any finite sample size, that the expected estimate is equal to the assumed “truth”, i.e., is <em>unbiased</em>. If this unbiased property holds asymptotically, the estimator is said to be <em>consistent</em>. For random variables (estimators are random variables, since they are functions of random variables), there are many different but related notions of <a href="https://en.wikipedia.org/wiki/Convergence_of_random_variables">convergence</a>, so the extent to which an estimator is consistent is determined by the sense in which it converges to an unbiased result.</p>
</li>
<li>
    <p><strong>Efficiency</strong>: Assuming that the estimator converges to <em>something</em>, all else being equal, we desire that it converges “quickly”. The typical metric of statistical efficiency is the variance (for a scalar-valued estimate), or the Fisher information more generally. Ideally the estimator is <em>efficient</em>, meaning, under the model assumptions, it achieves the minimal possible variance. To a first order approximation, we often simply indicate whether the convergence rate is polynomial, polylog, or exponential. Note that efficiency can be computed with respect to the sample size, as well as the dimensionality of the data (and both the “intrinsic” and “extrinsic” dimensionality). Note also that there are many different notions of <a href="https://en.wikipedia.org/wiki/Convergence_of_random_variables">convergence of random variables</a>, and the estimators are indeed random variables (because their inputs are random variables). Because perfect efficiency is available only under the simplest models, <em>relative efficiency</em> is often more important, in practice. Probably approximately correct (PAC) learning theory, as described in <a href="http://a.co/d/bYJlTWA">Foundations of Machine Learning</a>, is largely focused on finding estimators that efficiently (i.e., using a small number of samples) obtain low errors with high probability.</p>
</li>
<li>
    <p><strong>Uncertainty</strong>: Point estimates (for example, the maximum likelihood estimate) are often of primary concern, but uncertainty intervals around said point estimates are often required for effectively using the estimates. Estimates of confidence intervals and of densities are also technically point estimates. However, such point estimates fundamentally incorporate a notion of uncertainty, so they satisfy this desideratum. We note, however, that while some estimators have uncertainty estimates with theoretical guarantees, those guarantees depend strongly on the model assumptions. To evaluate an estimator in terms of <em>uncertainty</em>, one can determine the strength of the theoretical claims associated with its estimates of uncertainty. For example, certain manifold learning and deep learning techniques lack any theoretical guarantees around uncertainty. Certain methods of Bayesian inference (such as Markov Chain Monte Carlo) have strong theoretical guarantees, but those guarantees require infinitely many samples, and provide limited (if any) guarantees given a particular number of samples for all but the simplest problems. We desire estimators with strong guarantees for any finite number of computational operations.</p>
</li>
<li>
<p><strong>Complexity</strong>: The “optimal” estimate, given a dataset, might be quite simple, or quite complex. We desire estimators that can flexibly adapt to the “best” complexity, given the context. <a href="https://www.amazon.com/Efficient-Adaptive-Estimation-Semiparametric-Models/dp/0387984739/ref=sr_1_6?s=books&ie=UTF8&qid=1537811338&sr=1-6&keywords=semiparametric"><em>Semiparametric</em></a> estimators can be arbitrarily complex, but achieve parametric rates of convergence.</p>
</li>
<li><strong>Stability</strong>: Statisticians often quote George Box’s quip “all models are wrong”, which means that any assumptions that we make about our data (such as independence) are not exactly true in the real world. Given that any theoretical claim one can make holds only under a set of assumptions, it is desirable to show that an estimator’s performance is <em>robust</em> to “perturbations” of those assumptions. There are a number of <a href="https://en.wikipedia.org/wiki/Robust_statistics#Measures_of_robustness">measures of robustness</a>, including the breakdown point and the influence function. Another simple measure of robustness is: which kinds of perturbations is the estimator invariant to? For example, is it invariant to translation, scale, shear, affine, or monotonic transformations? Nearly everything is invariant to translation, but fewer things are invariant to other kinds of transformations. Some additional example perturbations that one may desire robustness against include:
<ul>
<li>data were assumed Gaussian, but the true distribution has fat tails,</li>
<li>there are a set of outliers, sampled from some other distribution,</li>
<li>the model assumed two classes, but there were actually three, and,</li>
<li>data were not sampled independently.</li>
</ul>
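<p>The outlier perturbation above can be made concrete with a toy comparison of the mean against the median (an estimator with a high breakdown point); the numbers are purely illustrative:</p>

```python
import statistics

# Robustness sketch: contaminate a clean sample with two gross outliers
# "sampled from some other distribution" and see how far each estimate moves.
clean = [1.0, 2.0, 3.0, 4.0, 5.0]
contaminated = clean + [1000.0, 2000.0]

shift_mean = abs(statistics.fmean(contaminated) - statistics.fmean(clean))
shift_median = abs(statistics.median(contaminated) - statistics.median(clean))
print(shift_median < shift_mean)  # the median barely moves; the mean blows up
```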
</li>
<li>
<p><strong>Scalability</strong>: If an estimator requires an exponential amount of storage or computation, as a function of sample size or dimensionality, then it can typically only be applied to extremely small datasets. Similarly, if the data are relatively large (meaning that it takes “a while” to estimate), then estimation can often be sped up by a parallel implementation. However, the “parallelizability” of estimators can vary greatly. The theoretical space and time complexity of an estimator is typically written as a function of sample size, dimensionality (either intrinsic or extrinsic), and number of parallel threads.</p>
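<p>A minimal sketch of “parallelizability”: the sample mean decomposes into partial sums over chunks, so extra workers split the work without changing the answer. The chunk and worker counts are illustrative, and note that in CPython, threads will not actually speed up pure-Python arithmetic; the point here is only the decomposition:</p>

```python
from concurrent.futures import ThreadPoolExecutor

# Embarrassingly parallel estimator sketch: split the data into chunks,
# sum each chunk on its own worker, then combine the partial sums.
data = list(range(1_000_000))
chunks = [data[i::4] for i in range(4)]  # four interleaved chunks

with ThreadPoolExecutor(max_workers=4) as pool:
    partial_sums = list(pool.map(sum, chunks))

parallel_mean = sum(partial_sums) / len(data)
serial_mean = sum(data) / len(data)
print(parallel_mean == serial_mean)  # same estimate, parallel computation
```

<p>An estimator like the median parallelizes far less cleanly, since it cannot be computed from independent chunk summaries of fixed size; this is the sense in which parallelizability varies by estimator.</p>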
</li>
<li>
<p><strong>Explainability</strong>: In many applications, including scientific inquiry, medical practice, and law, explainability and/or interpretability are crucial. While neither of these terms has an accepted definition, certain general guidelines have been established. First, the fewer features the estimator uses, the more interpretable it is. Second, passing all the features through a complex data transformation, such as a kernel, is relatively uninterpretable. Third, when using decision trees, shallower trees are more interpretable, and when using decision forests, fewer trees are more interpretable. In that sense, in certain contexts, relative explainability can be explicitly quantified.</p>
</li>
<li>
<p><strong>Automaticity</strong>: In most complex settings, estimators have a number of parameters themselves. For example, when running k-means, one must choose the number of clusters, and when running PCA, one must choose the number of dimensions. The ease and expediency with which one can choose/learn “good” parameter values for a given setting is crucial in real data applications. When the most computationally expensive “process” of an estimator can be nested, leveraging the answers from past values, this can greatly accelerate the acquisition of “good” results. For example, in PCA, by projecting into <script type="math/tex">D</script> dimensions, larger than you think you’ll need, you have also projected into <script type="math/tex">1, \ldots, D-1</script> dimensions, and therefore do not need to explicitly compute each of those projections again. This is in contrast to non-negative matrix factorization, where the resulting matrices are not nested. This desideratum is the least formal of the bunch, and the most difficult to quantify.</p>
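<p>The nested property of PCA can be checked directly. In this sketch (random data, with dimensions chosen arbitrarily for illustration), one SVD of the centered data yields every smaller projection as a prefix of the larger one:</p>

```python
import numpy as np

# One SVD of the centered data gives all principal components at once;
# projecting into d < D dimensions is just taking the first d columns.
rng = np.random.default_rng(0)
X = rng.standard_normal((100, 10))
Xc = X - X.mean(axis=0)

_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
proj_5 = Xc @ Vt[:5].T  # 5-dimensional projection
proj_3 = Xc @ Vt[:3].T  # 3-dimensional projection

# The 3-d projection is exactly the first 3 columns of the 5-d one.
print(np.allclose(proj_3, proj_5[:, :3]))
```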
</li>
<li><strong>Simplicity</strong>: This desideratum is a bit “meta”. Simplicity is an over-arching goal: a simple geometric intuition, which admits theoretical analysis and generalizes well; a simple algorithm, lacking in many algorithmic parameters (or tuning knobs); and a straightforward implementation.</li>
</ol>
<p>Joshua Vogelstein</p>
<p>When designing an estimator (or a learner, as machine learning people say), there are a number of desiderata to consider. I believe the following is a nearly MECE list (mutually exclusive and collectively exhaustive). However, if you have other desiderata that I have missed, please let me know in the comments. Recall that an estimator takes data as input, and outputs an estimate. Below is my list:</p>