Bits and Brains
NeuroData: enabling data-driven neuroscience
https://bitsandbrains.io/feed.xml (updated 2019-03-21T12:11:25+00:00)
How to Write a Scientific Abstract/Introduction/Paper
2019-02-10T18:27:57+00:00
https://bitsandbrains.io/2019/02/10/how-to-write-a-paper
<p>Once you believe you have completed enough work to write a peer-reviewed manuscript, follow the steps below in order. If you are simply writing an abstract, just do steps 1 and 7.</p>
<ol>
<li>
<p>Write a one sentence summary of your work (will become your <strong>title</strong>; ~5 min). This sentence describes the main take-home message / main result. It should not contain any words unfamiliar to most of your readership. It should grab attention, and it should have fewer than 88 characters.</p>
</li>
<li>
<p>Describe the “killer fig” that makes the point as clearly and concisely as possible. Guidance for making paper quality figures is <a href="https://bitsandbrains.io/2018/09/08/figures.html">here</a>. (~5 min)</p>
</li>
<li>
<p>Describe the other results, typically 3-5 additional figures or theorems. The goal of each of these is to support the main claim, for example, by further refining it, adding controls, etc. Ideally, they are sequenced together in a logical chain, like a proof, each building on the previous one, to tell the story. (~1 hr)</p>
</li>
<li>
<p>Make a first draft of all the figures and tables with detailed captions (~ 1 week). Captions should each be about a paragraph long. At this point, the figures need not be “camera ready”, but should have all the main points made.</p>
</li>
<li>
<p>Get feedback on figures (~1 week). Show the figures to colleagues who are not co-authors on your paper, do not show them the captions, and ask them to tell you what the main point of each figure is. If they don’t get it right on their first try, interrupt them, and ask them to go to the next figure. Then, spend another week updating the figures, and repeat.</p>
</li>
<li>
<p>Make all the figures and captions “camera ready” (~1 week). Consult <a href="https://bitsandbrains.io/2018/09/08/figures.html">this</a> to confirm that they are.</p>
</li>
<li>Write a one paragraph summary (will become your <strong>abstract</strong>; ~30 min). This will be about 250-300 words; more than 500 words is a page, not an abstract. It should include:
<ol>
<li>Big opportunity sentence: what is the grandest opportunity that this work is addressing?</li>
<li>Specific opportunity: what opportunity specifically will this manuscript address?</li>
<li>Challenge sentence: what is hard about addressing this opportunity?</li>
<li>Gap sentence: what is currently missing?</li>
<li>Action sentence: what did you do to address the gap, overcome the challenge, and therefore meet the opportunity? It should provide the <em>key</em> intuition/insight, the magic that makes this work where others failed.</li>
<li>Resolution sentence: what changes for the reader now that you have met this challenge?</li>
</ol>
</li>
<li>Write a five paragraph <strong>intro</strong> (~1 hr). This will be structured as follows:
<ol>
<li>bulleted list of ~3-5 main factors that create an opportunity for your work, filtering from most general to most specific, and not including anything ancillary (~20 min)</li>
<li>bulleted list of the ~3-5 main challenges that must be overcome (~20 min)</li>
<li>1 sentence summary of the <strong>gap</strong>, that is, the key ingredient that is missing (~5 min)</li>
<li>2-3 sentence summary of what you did (~5 min)</li>
<li>2-3 sentence summary on how your work changes the world (~5 min)</li>
</ol>
</li>
<li>
<p>Outline the methods and results: this is a 1 sentence summary of every point in <a href="https://github.com/neurodata/checklists/blob/master/methods_paper.md">methods_paper</a> (~20 min)</p>
</li>
<li>
<p>Fill in the details of the methods and results (~ 1 week).</p>
</li>
<li>Write the discussion (~1 hr), which is not a summary and should include:
<ol>
<li>bulleted list of previous related work (~20 min)</li>
<li>bulleted list of potential extensions (~20 min)</li>
</ol>
</li>
<li>Update abstract and introduction to final pre-feedback draft on text (~1 day).</li>
<li>
<p>Get lots of feedback from more than one person in the community of potential readers of your published manuscript. Ask them to read it as if they are reviewing it for a journal, and to hold nothing back. Ask them to give you comments within one week. You are not beholden to them, but it would be wise to take their criticism seriously and improve the manuscript accordingly.</p>
</li>
<li>
<p>Revise manuscript addressing each and every one of their concerns (~1 week). This does not necessarily mean making new figures, rather, it might mean clarifying various points of confusion.</p>
</li>
<li>Do another round of feedback, give them another week.</li>
<li>Finalize manuscript (~1 wk).</li>
</ol>
<p>If you follow the above plan, you will have a manuscript ready to submit 2 months after you start writing.</p>
Joshua Vogelstein
Tips for Getting into a Top Graduate Program
2018-10-21T18:27:57+00:00
https://bitsandbrains.io/2018/10/21/getting-into-grad-school
<p>As faculty in the <a href="https://www.bme.jhu.edu/">Department of Biomedical Engineering</a> at Johns Hopkins University, the best BME department in the world, both in terms of <a href="https://www.usnews.com/best-colleges/rankings/engineering-doctorate-biological-biomedical">undergraduate</a> and <a href="https://www.usnews.com/best-graduate-schools/top-engineering-schools/biomedical-rankings">graduate</a> schools, I have learned what I and other faculty are looking for in applicants. Before getting into what the most important factors are, I believe it is important to understand our goals, which motivate those factors. From the perspective of the graduate admissions committee, our goal is to estimate whether we believe that BME@JHU is the best place for <em>you</em> to thrive and achieve your ultimate dreams. In other words, we try to ascertain whether the environment that we create at JHU will be maximally supportive of both your strengths and weaknesses. As it turns out, this does not necessarily mean that we accept the best students in some abstract sense (as defined by some metric), but rather, we try to accept the students for whom we believe that <em>we</em> will be the best mentors. Of course, this is a complicated objective function, and one for which we will most likely sometimes make errors. Nonetheless, it is our goal. To make such estimations, we look for the following:</p>
<ol>
<li>
<p>Research Experience: First and foremost, we are a research university. So, the best way for us to determine whether our research environment will support you to flourish is to understand your previous research experience, and in which settings you flourished more than others. Although successful research is difficult to quantify, research artifacts provide some data with which we can evaluate your achievements. Such artifacts include poster presentations, conference proceedings, journal publications, numerical packages, and even patents sometimes. If you are the first (or co-first) author on any of these, we typically assume that much of the work is yours, and thus first author research artifacts are most informative. Middle author works are also informative, especially if you clarify your role in the research in your personal statement. Note, however, strong research experience is <em>not</em> a pre-requisite for admission. Rather, it is an information-rich piece of data for us.</p>
</li>
<li>
<p>Grades: JHU is not just a research university; we are also a teaching university, and we take our teaching responsibilities quite seriously. Moreover, many of our graduate level BME courses are also serious and time consuming. It is important to us that you perform well in them, because they provide the necessary background upon which our research programs are based. It is <em>not</em> important to us that you get straight A’s. Rather, we care that you perform well in the courses that will be the most relevant for your research during your PhD, typically quantitative and biology classes for us. It is also not crucial that you perform well in every semester. We understand that life happens, and certain things are more important than coursework (family, health, well-being, etc.). Finally, we appreciate that not everybody gets the same opportunities in high school, and therefore not everybody is equally prepared. Thus, the grades in the most recent semesters, in the most relevant courses, are most important. Aim for getting A’s in them, but GPA alone gets students neither admitted nor rejected.</p>
</li>
<li>
<p>Recommendations: While these do not come directly from you, they are quite important to us. BME@JHU is like a big extended family. We work closely with one another, sit near each other, have been doing so for a long time, and plan to continue doing so for many years to come. Therefore, our community is quite important to us, and our success comes largely from surrounding ourselves not just with the smartest people in the world, but more importantly, with really good people. So, the recommendation letters are a way for us to get information about how pleasant it is to work with you. I particularly look for recommendations from other faculty with successful research programs, as they are the most informative with regard to what it takes to have a successful PhD. Recommendations from industry can be somewhat informative, but less so. In other words, the number of PhD students somebody has mentored matters in our assessment. In terms of content, we are looking for recommendations that say that you are pleasant to work with, and amongst the best of the recommender’s previous students along <em>some</em> dimensions, such as productivity, passion, drive, creativity, organization, etc. In other words, you excel in the kinds of personality traits that we think contribute to successful PhDs. Much like research experience and grades, a good or bad recommendation alone cannot determine your acceptance.</p>
</li>
<li>
<p>Personal Statement: Your personal statement is your opportunity to express yourself. The most important aspect of a personal statement for me is <strong>passion</strong>. Success in our field, I believe, is strongly correlated with passion. Even if that is not the case, it is more fun for me to work with people that are passionate about solving some problems. So, express yourself freely and passionately. And be specific. Find a few faculty members in the department that you are applying to, and write about what, in particular, you find most exciting about their work. In this way, we’ll be able to align your passions with ours in the review process. If you’ve reached out to any of the faculty, or anybody else associated with the department prior to application, mention it, and how it has informed your decision to apply. And don’t forget to spell/grammar check it, especially if you are a <em>native</em> English speaker.</p>
</li>
</ol>
<p>There are a few things that people invest a bunch of energy in that hardly matter at all. First is the GRE. Evidence is building that it is classist, racist, and sexist (see, for example, <a href="http://www.takepart.com/article/2015/11/07/gre-bias">here</a>). JHU will probably stop not just requiring GRE scores, but even allowing them. As it currently stands, only if somebody does quite poorly on the quantitative aspect of the GRE (say, below 70%) does their GRE score even come up for discussion. Second is fellowships. In general, we have not heard of them, and do not understand the criteria for winning them, who applies, or what is achieved. Therefore, we are typically unable to consider them in a meaningful fashion. I’ve literally never heard them come up in discussing any applicant, and I’ve now been privy to discussions of literally hundreds, maybe over 1,000, applicants across multiple different departments.</p>
<p>As a final note, my lab, as well as many other successful labs, is <em>always</em> accepting exceptional graduate students. The success of our labs depends on the success of excellent students, so we are always searching for and hoping to find people whose passions align with ours, and whose abilities either align with or complement our own. I hope this is helpful. If anybody disagrees with my assessment, or has other recommendations, or further questions, I’d love to hear from you in the comments.</p>
Joshua Vogelstein
11 Simple Rules for Releasing Data Science Tools
2018-10-21T18:27:57+00:00
https://bitsandbrains.io/2018/10/21/numerical-packages
<p>These notes were co-written by me and a number of other people, including <a href="http://gkiar.me/">Greg Kiar</a>, <a href="http://ericwb.me/">Eric Bridgeford</a>, <a href="https://www.mcgill.ca/qls/researchers/jb-poline">JB Poline</a>, and <a href="https://users.encs.concordia.ca/~tglatard/">Tristan Glatard</a>. Inspired by the FAIR Guiding Principles for scientific data management and stewardship (see <a href="https://www.nature.com/articles/sdata201618">here</a>), we devised the FIRM guidelines for scientific software, specifically numerical packages. The FIRM guidelines stipulate that anybody in the world should be able to: <strong>F</strong>ind, <strong>I</strong>nstall, <strong>R</strong>un, and <strong>M</strong>odify your code. Below is a working draft of our ideas; as always, your feedback is solicited.</p>
<h3 id="find">Find</h3>
<ol>
<li>To make your code findable, we recommend three steps:
<ol>
<li>Make the code open source on a searchable code repository (e.g., <a href="https://github.com/">github</a> or <a href="https://about.gitlab.com/">gitlab</a>).</li>
<li>Generate a permanent Digital Object Identifier (DOI) so that you can freely move the code to other web services if you so desire without breaking the links (e.g., using <a href="https://zenodo.org/">zenodo</a>).</li>
<li>Add a license so that others can freely use your code without worrying about legal ramifications (see <a href="https://opensource.org/licenses">here</a> for options).</li>
</ol>
</li>
</ol>
<h3 id="install">Install</h3>
<ol>
<li>
<p>Provide installation guidelines, including <em>1-line installation</em> instructions with system requirements (including hardware and OS), software dependencies, and expected install time.</p>
</li>
<li>
<p>Deposit your code into a standard package manager, such as <a href="https://cran.r-project.org/">CRAN</a> for R or <a href="https://pypi.org/">PyPI</a> for Python. You might also provide a container or virtual machine image with your package pre-installed, for example, using <a href="https://www.docker.com/">Docker</a>, <a href="https://www.sylabs.io/docs/">Singularity</a> or <a href="https://gigantum.com/">Gigantum</a>.</p>
</li>
</ol>
<h3 id="run">Run</h3>
<ol>
<li>
<p>Provide a demo, including requisite data, expected results, and runtime on specified hardware. The demo should be simple, intuitive, and fast to run. We recommend using <a href="https://rmarkdown.rstudio.com/">Rmarkdown</a> for R and a <a href="http://jupyter.org/">Jupyter Notebook</a> for Python.</p>
</li>
<li>
<p>Write a readme with a quick start guide, including installation and a simplified (plain text) version of the demo.</p>
</li>
<li>
<p>Make sure each function includes auto-generated documentation. We recommend <a href="https://cran.r-project.org/web/packages/roxygen2/vignettes/roxygen2.html">Roxygen</a> for R and <a href="http://www.sphinx-doc.org/en/master/">Sphinx</a> for Python.</p>
</li>
</ol>
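<p>To make the documentation recommendation concrete, here is a minimal sketch of a Python function whose docstring uses reStructuredText field lists, which Sphinx's autodoc extension can render. The function name <code>standardize</code> and its behavior are hypothetical, chosen only for illustration; they are not taken from any of the packages mentioned in this post.</p>

```python
def standardize(values):
    """Rescale numbers to zero mean and unit variance.

    :param values: a non-empty sequence of numbers that are not all equal
    :type values: list[float]
    :returns: the rescaled values, in the same order
    :rtype: list[float]
    :raises ValueError: if ``values`` is empty or has zero variance

    >>> standardize([1.0, 2.0, 3.0])[1]
    0.0
    """
    if not values:
        raise ValueError("values must be non-empty")
    mean = sum(values) / len(values)
    # Population variance (divide by n), kept simple for the illustration.
    variance = sum((v - mean) ** 2 for v in values) / len(values)
    if variance == 0:
        raise ValueError("values must not all be equal")
    scale = variance ** 0.5
    return [(v - mean) / scale for v in values]
```

<p>The short example inside the docstring doubles as a plain text demo of the kind suggested for the readme, and Python's standard <code>doctest</code> module can check it.</p>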
<h3 id="modify">Modify</h3>
<ol>
<li>Include contribution guidelines, including:
<ol>
<li>style guidelines (<a href="https://google.github.io/styleguide/Rguide.xml">Google’s</a> or <a href="http://adv-r.had.co.nz/Style.html">Hadley’s</a> for R, or <a href="https://www.python.org/dev/peps/pep-0008/">PEP8</a> for Python),</li>
<li>bug reports,</li>
<li>pull requests, and</li>
<li>feature additions.</li>
</ol>
</li>
<li>Write unit tests for each function. Examples are <a href="http://testthat.r-lib.org/">testthat</a> for R and <a href="https://docs.python.org/3/library/unittest.html">unittest</a> for Python.</li>
<li>Incorporate continuous integration, for example, using either <a href="https://travis-ci.org/">TravisCI</a> or <a href="https://circleci.com/">CircleCI</a>.</li>
<li>Add the following <a href="https://shields.io/#/">badges</a> to your repo:
<ol>
<li>DOI,</li>
<li>license,</li>
<li>stable release version so people know which release they are on (from package manager),</li>
<li><a href="https://readthedocs.org/">documentation</a> to indicate that you generated documentation,</li>
<li><a href="https://codeclimate.com/">code quality</a> to indicate that your code is written using modern best practices,</li>
<li><a href="https://coveralls.io/">coverage</a> to indicate the extent to which you have written tests for your functions,</li>
<li><a href="https://www.docker.com/">build status</a> to indicate whether the virtual machine that contains the latest version of your code is running,</li>
<li>total number of downloads.</li>
</ol>
</li>
<li>Finally, provide benchmarks establishing current performance (using appropriate metrics) on standard problems, and better yet, also compare to other standard methods. Ideally, the code to generate the benchmark numbers is provided in <a href="http://jupyter.org/">Jupyter notebooks</a> in your <a href="https://gigantum.com/">Gigantum</a> project.</li>
</ol>
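<p>As a rough illustration of the unit testing rule, the sketch below pairs one small function with a <code>unittest</code> test case covering typical input, boundary behavior, and invalid input. The function <code>clip</code> and the class <code>TestClip</code> are hypothetical names used only for this example.</p>

```python
import unittest


def clip(value, lower, upper):
    """Return ``value`` limited to the closed interval [lower, upper]."""
    if lower > upper:
        raise ValueError("lower must not exceed upper")
    return max(lower, min(value, upper))


class TestClip(unittest.TestCase):
    def test_inside_interval_is_unchanged(self):
        self.assertEqual(clip(0.5, 0.0, 1.0), 0.5)

    def test_outside_interval_is_clipped(self):
        self.assertEqual(clip(-2.0, 0.0, 1.0), 0.0)
        self.assertEqual(clip(3.0, 0.0, 1.0), 1.0)

    def test_invalid_interval_raises(self):
        with self.assertRaises(ValueError):
            clip(0.5, 1.0, 0.0)
```

<p>Saved in a test module, these tests can be run with <code>python -m unittest</code>, which is also the kind of command a continuous integration service would typically execute on each commit.</p>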
<p>A few examples of numerical packages that we have released that satisfy all (or most of) these rules include:</p>
<ul>
<li><a href="https://github.com/neurodata/ndmg">ndmg</a></li>
<li><a href="https://github.com/neurodata/lumberjack">lumberjack</a></li>
<li><a href="https://github.com/neurodata/mgc">mgc</a></li>
<li><a href="https://github.com/neurodata/LOL">LOL</a></li>
<li><a href="https://github.com/neurodata/ndreg">ndreg</a></li>
<li><a href="https://github.com/neurodata/knorR">knor</a></li>
<li><a href="https://github.com/boutiques/boutiques">boutiques</a></li>
<li><a href="https://github.com/gkiar/clowdr">clowdr</a></li>
</ul>
Joshua Vogelstein
10 Simple Rules for Writing paragraphs
2018-10-14T18:27:57+00:00
https://bitsandbrains.io/2018/10/14/paragraphs
<p>Here is a list of tips I check for each paragraph whenever I write anything:</p>
<ol>
<li>Each paragraph is about a single concrete and coherent idea
<ol>
<li>Does the first sentence of the paragraph introduce this idea?</li>
<li>Do all subsequent sentences in the paragraph further clarify the first sentence?</li>
<li>Did I use transitional words to establish logical relationships between sentences?</li>
<li>Did I write anything like, “Next, we …”? If so, remove. The structure of the paragraphs should flow into one another.</li>
<li>Did I sequence light to heavy, so that the earlier sentences are light/succinct and the later sentences follow up with “heavy” details?</li>
<li>Does the last sentence of every paragraph link it to the next paragraph?</li>
</ol>
</li>
<li>Tense
<ol>
<li>Is the tense consistent within each paragraph (results are past tense)?</li>
<li>Is it past tense only for things that you or other people did previously?</li>
<li>Are there any passive voice phrases (e.g., “it can be shown”)? If so, revise to active voice.</li>
</ol>
</li>
<li>Any time you introduce a new concept
<ol>
<li>Have other people discussed this concept? If so, did you use familiar notation/naming conventions? If not, have you clarified/justified the difference?</li>
<li>If you’ve introduced a new concept, have you used the same name for it consistently, never using any other term (to bury the concept in the reader’s mind)?</li>
<li>When introducing a novel concept/word/equation/notation/etc., did you explain it <em>before</em> usage, rather than after (else the reader will not understand when reading it, and we don’t want that)?</li>
</ol>
</li>
</ol>
<p>See this <a href="https://www.youtube.com/watch?v=rZxaSMzstB8">youtube video</a> for details.</p>
<p>I realize this is actually 12 things to check. I’m ok with that.</p>
Joshua Vogelstein
How to Structure a Grant
2018-10-14T18:27:57+00:00
https://bitsandbrains.io/2018/10/14/structuring-a-grant
<p>I write many grants. Most of them, or at least the parts that I write, are very quantitative in nature. Most of how I think about writing them comes directly from discussions with <a href="http://optimizescience.com">Brett Mensh</a>. He is the world’s expert on grant-writing, and I highly recommend you contact him.</p>
<h3 id="specific-aims">Specific Aims</h3>
<p>This is one page, and it matters more than everything else combined. I do it as follows:</p>
<ol>
<li>A paragraph introducing the problem we are solving</li>
<li>A paragraph on why it is hard, ie, why other really smart people (e.g., the review panel) have not yet been able to solve it.</li>
<li>A paragraph motivating our overall approach/philosophy to the problem</li>
<li>The 3-4 aims. For each aim, there is a 1-2 line <em>action</em> statement of what the aim is, and what it will deliver. For example, “Develop nonparametric machine learning techniques to identify brain-imaging biomarkers for depression using the Healthy Brain Network Dataset.” Note that there is a verb (develop), and it is clear to the funders what they’ll get (a new biomarker for depression), and how we will do it (nonparametric machine learning).</li>
</ol>
<h3 id="research-strategy">Research Strategy</h3>
<ol>
<li>
<p>Significance: Up to 1-2 pages, talking about how important your problem is to solve, funneling down from most general to most specific. One sentence in <strong>bold</strong> to highlight the potential impact of your proposed work.</p>
</li>
<li>
<p>Innovation: Up to 0.5 pages, highlighting the novel technical contributions, again with 1 sentence in <strong>bold</strong> to focus on the key innovation of the proposed work.</p>
</li>
<li>Approach: ~9-12 pages (depending on the specific grant), organized into an “overview” section followed by 3-4 aims. The 2-3 page overview section describes commonalities between the aims, any data that are being used, and related things. The overview can also include or be followed by a “general background” that applies to each of the aims. Each aim is about 2-3 pages, and includes the following sections:
<ol>
<li><em>Introduction</em>: a 1 paragraph jargon-free introduction of what you will accomplish in this aim, and how.</li>
<li><em>Justification and Feasibility</em> or <em>Preliminary Results</em>: Up to 1 page describing why you are particularly well-suited to accomplish the goals in the allotted time given the allotted resources.</li>
<li><em>Research Design</em>: ~2 paragraphs on the details of what you’ll actually do.</li>
<li><em>Expected Outcomes</em>: 1 paragraph on what you expect to actually “deliver” back to the funding agency.</li>
<li><em>Potential Pitfalls and Alternative Strategies</em>: A few lines to indicate that you understand which parts are difficult, and have a contingency plan.</li>
</ol>
</li>
<li>Timeline and Future Direction: ~1/2 pages, describing when the activities will happen, including a table organized by Aim, and connecting the work to your future long-term agenda.</li>
</ol>
<h3 id="some-other-tips">Some other tips:</h3>
<ol>
<li>Follow my blog post on <a href="/2018/10/14/words.html">words</a></li>
<li>Follow my blog post on <a href="/2018/10/14/paragraphs.html">paragraph</a></li>
<li>Follow my blog post on <a href="/2018/09/08/figures.html">figures</a></li>
<li>The “name” of each aim/task should be an <a href="http://www.quickslide-powerpoint.com/en/blog/action-titles-providing-orientation-well-thought-out-slide-titles">“action title”</a></li>
<li>Each aim/task should follow OCAR (Opening, Challenge, Action, Resolution)</li>
<li>For each sub-aim/task, include a <strong>bold</strong> sentence precisely and concisely stating its objective (the <em>action</em> part)</li>
<li>Make sure to distinguish your own work from others explicitly every time your work is cited</li>
<li>Check that formatting is consistent across <em>all</em> documents (both within a document type, e.g., biosketches, and across types).</li>
<li>Use <a href="https://www.google.com/docs/about/">google docs</a></li>
<li>Use <a href="https://paperpile.com/">paperpile</a> for references (free version for 2 weeks)</li>
<li>Use <a href="https://chrome.google.com/webstore/detail/auto-latex-equations/iaainhiejkciadlhlodaajgbffkebdog?hl=en-US">auto-latex</a> for equations</li>
<li>Keep figures at the very end of the document until the last opportunity</li>
</ol>
<h4 id="nsf-specific">NSF specific</h4>
<ol>
<li>For summary, focus on gap and impact</li>
<li>For intellectual merit, use as much language from <a href="http://www.sciencemag.org/sites/default/files/documents/Big%20Ideas%20compiled.pdf">NSF Big Ideas</a> as possible</li>
<li>Broader impacts is about societal, not intellectual benefit, focusing on STEM education, minorities, disabilities, open source, etc.</li>
</ol>
Joshua Vogelstein
Taboo Words/Phrases in Technical Writing
2018-10-14T18:27:57+00:00
https://bitsandbrains.io/2018/10/14/words
<p>Here is a list of tips I check whenever I write anything:</p>
<ol>
<li>check for any misspelled words using spellcheck</li>
<li>replace contractions with complete words (eg, don’t –> do not)</li>
<li>replace abbreviations with complete words (eg, “e.g.” –> for example)</li>
<li>replace colloquialisms with more formal words (eg, nowadays –> recently)</li>
<li>i, we, our, us, you, your –> rewrite sentence (almost always)</li>
<li>in order to/for –> to/for</li>
<li>clearly, obviously –> remove, might not be so clear/obvious to everyone</li>
<li>this, they, it –> be specific, which noun this/they/it is referring to is often vague</li>
<li>very –> use a stronger adjective/adverb</li>
<li>data is –> data are</li>
<li>novel –> remove, novelty should be implied by context. If it is not clear by context, update context</li>
<li>most/least/best/worst/better/worse/optimal/*est: requires a citation as it is an empirical claim, or evidence, and a dimension along which the comparison is made</li>
<li>usually –> same deal, either provide a citation/evidence, or don’t say it, replace with “frequently”</li>
<li>no reason / essential/necessary / no way / impossible –> remove, these are all too strong, just because you haven’t thought of a reason, or a counter example, or another way, does not mean that nobody else has/can.</li>
<li>done –> completed.</li>
<li>utilized –> used</li>
<li>firstly –> first, and similarly for second, third, etc.</li>
<li>& –> and</li>
<li>arguably –> possibly, likely, perhaps (why argue with your reader?)</li>
<li>as such –> ok sparingly, but often overused, check and revise sentences.</li>
<li>numbers have space before unit, eg 1GB –> 1 GB.</li>
<li>first –> somebody reading this will think they did it first, and they are at least partially correct. firstness should be implied by context.</li>
<li>in this manuscript –> nothing, it is implied.</li>
<li>a priori –> should be italics</li>
<li>is used –> rewrite sentence, avoid passive voice whenever possible</li>
<li>can be seen –> typically just remove, sometimes replace with “shows”, or reword sentences</li>
<li>we want to –> we (though should be reworded to avoid “we” entirely)</li>
<li>Fig, Fig., fig –> Figure (or at least be consistent)</li>
<li>we chose appropriate –> we chose (let them decide whether it was appropriate)</li>
<li>“note that” or “we note that” or “we highlight” or “we highlight that” –> simply remove.</li>
</ol>
Joshua Vogelstein
The Varieties of Hypothesis Tests
2018-09-29T18:27:57+00:00
https://bitsandbrains.io/2018/09/29/categories-of-testing
<p>Hypothesis testing is a procedure for evaluating whether a given dataset is consistent with a particular “null” model. Formally, all hypothesis testing scenarios can be described as follows.</p>
<p>Let <script type="math/tex">X_i \sim F</script>, for <script type="math/tex">i \in [n]=\{1,\ldots,n\}</script>, be random variables, where each <script type="math/tex">X_i</script> is sampled independently and identically according to some distribution <script type="math/tex">F</script>. Realizations of <script type="math/tex">X_i</script> are <script type="math/tex">x_i \in \mathcal{X}</script>. We further assume that <script type="math/tex">F \in \mathcal{F}</script>, that is, <script type="math/tex">F</script> is a distribution that lives in a model <script type="math/tex">\mathcal{F}</script>, a set of candidate distributions. Moreover, we partition <script type="math/tex">\mathcal{F}</script> into two complementary sets, <script type="math/tex">\mathcal{F}_0</script> and <script type="math/tex">\mathcal{F}_A</script>, such that <script type="math/tex">\mathcal{F}_0 \cup \mathcal{F}_A = \mathcal{F}</script> and <script type="math/tex">\mathcal{F}_0 \cap \mathcal{F}_A = \emptyset</script>. Given these definitions, all hypothesis tests can be written as:</p>
<p>\begin{align}
H_0: F \in \mathcal{F}_0, \qquad
H_A: F \in \mathcal{F}_A.
\end{align}</p>
<!-- For example, $$\mathcal{F}$$ could be the set of all Gaussians, and therefore $$x_i \in \mathbb{R}$$, and $$\mathcal{F}_0$$ could be $$\mathcal{N}(0,1)$$, and $$\mathcal{F}_A=\mathcal{N}(1,1)$$. -->
<p>Note that a hypothesis can be either <em>simple</em> or <em>composite</em>: according to <a href="https://en.wikipedia.org/wiki/Null_hypothesis">wikipedia</a>, a simple hypothesis is
“any hypothesis which specifies the population distribution completely,” whereas a composite hypothesis is “any hypothesis which does not specify the population distribution completely.” Given this, we consider four different kinds of hypothesis tests:</p>
<ol>
<li>
<p>Simple-Null, Simple-Alternate (S/S)
\begin{align}
H_0: F = F_0, \qquad
H_A: F = F_A.
\end{align}</p>
</li>
<li>
<p>Simple-Null, Composite-Alternate (S/C)
\begin{align}
H_0: F = F_0, \qquad
H_A: F \in \mathcal{F}_A.
\end{align}</p>
</li>
<li>
<p>Composite-Null, Simple-Alternate (C/S)
\begin{align}
H_0: F \in \mathcal{F}_0, \qquad
H_A: F = F_A.
\end{align}</p>
</li>
<li>
<p>Composite-Null, Composite-Alternate (C/C)
\begin{align}
H_0: F \in \mathcal{F}_0, \qquad
H_A: F \in \mathcal{F}_A.
\end{align}</p>
</li>
</ol>
<p>Each of the above can be re-written in terms of parameters. Specifically, we can make the following substitutions:</p>
<p>\begin{align}
F = F_j \Rightarrow \theta(F) = \theta(F_j), \qquad
F \in \mathcal{F}_j \Rightarrow \theta(F) \in \Theta_j,
\end{align}
where <script type="math/tex">\Theta_j = \{ \theta(F) : F \in \mathcal{F}_j \}</script>.</p>
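<p>As a concrete illustration (an assumed example, chosen only for simplicity), let <script type="math/tex">\mathcal{F}</script> be the unit-variance Gaussians, <script type="math/tex">F = \mathcal{N}(\mu,1)</script>, so that <script type="math/tex">\theta(F) = \mu</script>. With <script type="math/tex">F_0 = \mathcal{N}(0,1)</script> and <script type="math/tex">F_A = \mathcal{N}(1,1)</script>, the S/S test becomes
\begin{align}
H_0: \mu = 0, \qquad
H_A: \mu = 1,
\end{align}
whereas taking <script type="math/tex">\Theta_A = \mathbb{R} \backslash \{0\}</script> gives the S/C test
\begin{align}
H_0: \mu = 0, \qquad
H_A: \mu \neq 0.
\end{align}</p>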
<p>Note that composite tests (those where the null or alternate is composite) can be either one-sided or two-sided. In particular, the simple null hypothesis can be restated: <script type="math/tex">H_0: F-F_0 = 0</script>. Given that, a one-sided simple/composite test is:
\begin{align}
H_0: F - F_0 = 0, \qquad
H_A: F - F_0 > 0.
\end{align}
Of course, we could just as easily replace the <script type="math/tex">></script> in the alternate with <script type="math/tex">% <![CDATA[
< %]]></script>.<br />
The two-sided composite test can be written:
\begin{align}
H_0: F - F_0 = 0, \qquad
H_A: F - F_0 \neq 0,
\end{align}
or equivalently,
\begin{align}
H_0: F = F_0, \qquad
H_A: F \neq F_0.
\end{align}</p>
<p>In other words, the simple null, two-sided composite is equivalent to a test for equality (see below).</p>
<p>Thus, there are essentially three kinds of composite hypotheses:</p>
<ol>
<li>one-sided,</li>
<li>two-sided, or</li>
<li>neither (e.g., the alternative is a more complex set).</li>
</ol>
<h3 id="examples">Examples</h3>
<h4 id="independence-testing">Independence Testing</h4>
<p>An independence test is one of the fundamental tests in statistics. In this case, let <script type="math/tex">Z_i = (X_i, Y_i) \sim F_{X,Y}</script>. Now we test:
\begin{align}
H_0: F_{X,Y} = F_{X} F_{Y} \qquad
H_A: F_{X,Y} \neq F_{X} F_{Y}.
\end{align}</p>
<p>In other words, we define <script type="math/tex">\mathcal{F}_0</script> as the set of joint distributions on <script type="math/tex">(X,Y)</script> such that the joint equals the product of the marginals: <script type="math/tex">\mathcal{F}_0 = \{ F_{X,Y} : F_{X,Y} = F_{X}F_{Y} \}</script>, and we define <script type="math/tex">\mathcal{F}_A</script> as the complement of that set, <script type="math/tex">\mathcal{F}_0^c = \mathcal{F} \backslash \mathcal{F}_0 = \mathcal{F}_A</script>. Independence tests are special cases of C/C tests.</p>
<h4 id="two-sample-testing">Two-Sample Testing</h4>
<p>Let <script type="math/tex">X_i \sim F_1</script>, for <script type="math/tex">i \in [n]</script> be a random variable, where each <script type="math/tex">X_i</script> is sampled independently and identically according to some distribution <script type="math/tex">F_1</script>. Realizations of <script type="math/tex">X_i</script> are <script type="math/tex">x_i \in \mathcal{X}</script>.</p>
<p>Additionally, let <script type="math/tex">Y_i \sim F_2</script>, for <script type="math/tex">i \in [m]</script> be a random variable, where each <script type="math/tex">Y_i</script> is sampled independently and identically according to some distribution <script type="math/tex">F_2</script>. Realizations of <script type="math/tex">Y_i</script> are <script type="math/tex">y_i \in \mathcal{Y}</script>.</p>
<p>Given these definitions, all two-sample testing can be written as:</p>
<p>\begin{align}
H_0: F_1 = F_2 \qquad
H_A: F_1 \neq F_2.
\end{align}</p>
<p>This can be written as an independence test. First, define the mixture distribution <script type="math/tex">F = \pi_1 F_1 + \pi_2 F_2</script>, where <script type="math/tex">\pi_1,\pi_2 \geq 0</script> and <script type="math/tex">\pi_1+\pi_2=1</script>. Now, sample <script type="math/tex">U_i \sim F</script> a total of <script type="math/tex">n+m</script> times. To make it exactly equal to the above, set <script type="math/tex">\pi_1 = n/(n+m)</script>. Moreover, define <script type="math/tex">V_i</script> to be the latent “class label”, that is, <script type="math/tex">V_i=1</script> if <script type="math/tex">U_i \sim F_1</script> and <script type="math/tex">V_i=2</script> if <script type="math/tex">U_i \sim F_2</script>. Now, we can form the independence test:</p>
<p>\begin{align}
H_0: F_{UV} = F_U F_V \qquad
H_A: F_{UV} \neq F_U F_V.
\end{align}</p>
<p>Moreover, two-sample tests can also be written as simple goodness-of-fit tests, which is readily apparent, as described below. Thus, simple goodness-of-fit tests are also independence tests.</p>
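<p>The reduction from a two-sample test to an independence test can be sketched with a permutation test. This is only a minimal sketch under stated assumptions: the difference in means is an assumed test statistic (it only detects mean shifts), the simulated Gaussian samples are made up for illustration, and the function names are hypothetical.</p>

```python
import random

def mean_difference(values, labels):
    """Absolute difference between the means of the two labeled groups."""
    group1 = [u for u, v in zip(values, labels) if v == 1]
    group2 = [u for u, v in zip(values, labels) if v == 2]
    return abs(sum(group1) / len(group1) - sum(group2) / len(group2))

def permutation_two_sample_test(x, y, n_permutations=2000, seed=0):
    """Two-sample test phrased as an independence test between U and V.

    U pools both samples; V records which sample each value came from.
    Under H_0 (U independent of V), shuffling V leaves the distribution
    of the statistic unchanged, so the shuffles give a null distribution.
    """
    rng = random.Random(seed)
    u = list(x) + list(y)
    v = [1] * len(x) + [2] * len(y)
    observed = mean_difference(u, v)
    exceed = 0
    for _ in range(n_permutations):
        shuffled = v[:]
        rng.shuffle(shuffled)
        if mean_difference(u, shuffled) >= observed:
            exceed += 1
    # Add-one correction keeps the p-value strictly positive.
    return (exceed + 1) / (n_permutations + 1)

# Illustrative data (assumed): one pair with equal means, one with a shift.
rng = random.Random(1)
x = [rng.gauss(0.0, 1.0) for _ in range(50)]
y_same = [rng.gauss(0.0, 1.0) for _ in range(60)]
y_shifted = [rng.gauss(1.5, 1.0) for _ in range(60)]
p_same = permutation_two_sample_test(x, y_same)
p_shifted = permutation_two_sample_test(x, y_shifted)
```

<p>Shuffling the labels <script type="math/tex">V_i</script> is what encodes the null of independence; any other statistic that measures dependence between <script type="math/tex">U</script> and <script type="math/tex">V</script> could be swapped in for the mean difference.</p>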
<h4 id="goodness-of-fit-testing">Goodness-of-Fit Testing</h4>
<p>The most general kind of a goodness-of-fit test is a C/C test:
\begin{align}
H_0: F \in \mathcal{F}_0 \qquad
H_A: F \notin \mathcal{F}_0.
\end{align}
In other words, <script type="math/tex">\mathcal{F}_A = \mathcal{F}_0^c = \mathcal{F} \backslash \mathcal{F}_0</script>.</p>
<p>The S/C goodness-of-fit test is a special case:
\begin{align}
H_0: F = F_0 \qquad
H_A: F \neq F_0.
\end{align}</p>
<p>This special case is clearly an instance of a two-sample test; the only difference is that the second distribution is not estimated from the data, but rather, is provided by the null.</p>
<h4 id="k-sample-tests">K-Sample Tests</h4>
<p>More generally, assume there exist <script type="math/tex">K</script> different distributions, <script type="math/tex">F_1, \ldots, F_K</script>. The <script type="math/tex">K</script>-sample test is:</p>
<p>\begin{align}
H_0: F_1 = F_2 = \cdots = F_K \qquad
H_A: \text{any not equal}.
\end{align}</p>
<p>We can do the same trick as above, defining <script type="math/tex">F</script> to be a mixture of <script type="math/tex">K</script> distributions, and letting <script type="math/tex">V_i</script> denote the component from which sample <script type="math/tex">i</script> was drawn. Thus, <script type="math/tex">K</script>-sample tests are also independence tests.</p>
<h4 id="paired-k-sample-testing">Paired K-Sample Testing</h4>
<p>Let <script type="math/tex">X_i</script> and <script type="math/tex">Y_i</script> be defined as above, except now we sample matched pairs, <script type="math/tex">(X_i,Y_i) \sim F_{X,Y}</script>, with marginals <script type="math/tex">F_X</script> and <script type="math/tex">F_Y</script>. The paired two-sample test is still:
\begin{align}
H_0: F_X = F_Y \qquad
H_A: F_X \neq F_Y,
\end{align}
but there exist more powerful test statistics that consider the “matchedness”. Of course, this also extends to the <script type="math/tex">K</script>-sample setting.</p>
<p>Note that this is a special case of <script type="math/tex">K</script>-sample testing, and thus is also an independence test.</p>
<h4 id="test-for-symmetry">Test for Symmetry</h4>
<p>Assume we desire to determine whether the distribution of <script type="math/tex">X</script> is symmetric about zero. Define <script type="math/tex">F_+</script> as the half of the distribution that is positive, and <script type="math/tex">F_-</script> as the half that is negative. Then, we simply have a typical two-sample test:
\begin{align}
H_0: F_+ = -F_- \qquad
H_A: F_+ \neq -F_-,
\end{align}
which is an independence test.</p>
<h4 id="multinomial-test">Multinomial Test</h4>
<p>Assume <script type="math/tex">X \sim</script>Multinomial<script type="math/tex">(p_1, p_2, p_3; n)</script>, and let:
\begin{align}
H_0: p_1 = p_2 = p_3 \qquad
H_A: \text{not all equal}.
\end{align}</p>
<p>This is again a <script type="math/tex">K</script>-sample test, which is an independence test.</p>
<p>Considering a slightly different case:
\begin{align}
H_0: p_1 = p_2 \qquad
H_A: p_1 \neq p_2.
\end{align}
This is a slight generalization of the above, but can still be written as a <script type="math/tex">K</script>-sample test, and is therefore an independence test.</p>
<h4 id="one-sided-test">One Sided Test</h4>
<p>Assume <script type="math/tex">X \sim \mathcal{N}(\mu,1)</script>, and consider the following test:
\begin{align}
H_0: \mu = 0 \qquad
H_A: \mu \neq 0.
\end{align}</p>
<p>This is a two-sample test, and therefore, an independence test. A slightly more complicated variant is:
\begin{align}
H_0: \mu < 0 \qquad
H_A: \mu \geq 0.
\end{align}</p>
<p>Viewing this as an independence test is a bit more complicated. In particular, if we set it up as the above two-sample test, and we reject, it could be because <script type="math/tex">% <![CDATA[
\mu < 0 %]]></script> or because <script type="math/tex">\mu > 0</script>. However, in practice, this is not difficult to address. Letting our test statistic be <script type="math/tex">\bar{x}= \frac{1}{n} \sum_i x_i</script>, we consider four scenarios:</p>
<ol>
<li>If <script type="math/tex">% <![CDATA[
\bar{x} < 0 %]]></script>, and we reject, then we know that <script type="math/tex">\bar{x}</script> is significantly below <script type="math/tex">0</script>, and therefore, we should <em>not</em> reject the real null.</li>
<li>If <script type="math/tex">% <![CDATA[
\bar{x} < 0 %]]></script> and we fail to reject, then we have failed to reject our original hypothesis, and we are ok.</li>
<li>If <script type="math/tex">\bar{x} > 0</script> and we reject, then <script type="math/tex">\bar{x}</script> is significantly above <script type="math/tex">0</script>, and therefore we are safe to reject the null.</li>
<li>If <script type="math/tex">\bar{x} > 0</script>, and we fail to reject, that means <script type="math/tex">\bar{x}</script> is not significantly bigger than zero, but we cannot reject the original null.</li>
</ol>
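<p>The four scenarios above can be sketched in code. This is a minimal sketch, assuming unit variance (so that a z-test applies) and a significance level of 0.05; the function names <code>two_sided_p_value</code> and <code>reject_one_sided_null</code> are hypothetical.</p>

```python
import math

def two_sided_p_value(xs):
    """p-value of the two-sided z-test of H_0: mu = 0, assuming unit variance."""
    n = len(xs)
    z = (sum(xs) / n) * math.sqrt(n)
    # For Z ~ N(0, 1), P(|Z| >= |z|) = erfc(|z| / sqrt(2)).
    return math.erfc(abs(z) / math.sqrt(2))

def reject_one_sided_null(xs, alpha=0.05):
    """Reject H_0: mu < 0 only if the two-sided test rejects and x-bar > 0.

    Scenarios 1 and 2 (x-bar < 0) and scenario 4 (x-bar > 0, no rejection)
    all return False; only scenario 3 returns True.
    """
    x_bar = sum(xs) / len(xs)
    return two_sided_p_value(xs) < alpha and x_bar > 0

# Illustrative inputs (assumed): clearly positive, clearly negative, near zero.
clearly_positive = reject_one_sided_null([1.0] * 25)
clearly_negative = reject_one_sided_null([-1.0] * 25)
near_zero = reject_one_sided_null([0.1, -0.1, 0.2, -0.2, 0.05])
```

<p>Because rejection requires both a significant two-sided result and a positive sample mean, the probability of wrongly rejecting when <script type="math/tex">\mu</script> is negative is at most half of <code>alpha</code>, which is one way to see where the power is lost.</p>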
<p>Thus, we can use the test for whether <script type="math/tex">\mu=0</script> and still maintain a valid test, though probably at a fairly severe reduction in power.</p>Joshua Vogelstein10 Methods for Linear Dimensionality Reduction2018-09-25T18:27:57+00:002018-09-25T18:27:57+00:00https://bitsandbrains.io/2018/09/25/linear-dimensionality-reduction<p>Colleagues and trainees have asked for a basic summary of different linear dimensionality reduction methods. Here is my take on them.</p>
<ol>
<li>
<p>PCA - finds the directions of maximal variance in your data matrix. It <strong>actually solves for the right answer</strong>, that is, finds the true optimal dimensions. They are optimal in the sense that they are the best dimensions for finding a low-dimensional representation that minimizes squared error loss. This means it cannot be beat by fancy things like manifold learning, etc., <strong>for squared error loss</strong>. In practice, squared error loss is typically not what we really want to optimize.</p>
</li>
<li>
<p>ICA - As it turns out, PCA finds directions that are orthogonal (perpendicular to one another). If one simply relaxes this constraint, one obtains ICA. Unlike PCA however, <strong>there is no guarantee that the algorithm finds the optimal answer</strong>. The objective function is non-convex, so the initialization matters. I am not familiar with any theory guaranteeing the performance of any particular approach, meaning that one can never say that, with high probability, the answer is very good (when running ICA). It is commonly used for blind signal separation in some disciplines, and is often initialized by PCA. There are many extensions/generalizations of ICA.</p>
</li>
<li>
<p>NMF - Related to ICA in certain ways, in that it changes the constraints of PCA. For this, it helps me to think of PCA as a matrix factorization problem, where there are loading vectors and principal components. The PCs are constrained to be orthonormal, meaning mutually orthogonal and each with unit norm. In NMF, <strong>one or both of the matrices are constrained to be non-negative</strong>. This is particularly useful when you have domain knowledge that the “truth” is non-negative, such as images. However, in practice, simply running PCA and then truncating things below zero to equal zero often works as well with much less hassle.</p>
</li>
<li>
<p>LDA - Here, rather than merely a “data matrix”, we also have a target variable that is categorical (eg, binary). Whereas PCA tries to find the directions that maximize the variance, <strong>LDA finds the 1 dimension that maximizes the “separability” of the two classes</strong>, under a gaussian assumption. It is consistent under the gaussian assumption, meaning if you have enough data, it will eventually achieve the optimal result.</p>
</li>
<li>
<p>Linear regression - Same as LDA, but here the target variable is continuous, rather than categorical. <strong>So, linear regression finds the 1 dimension that “is closest” to the target variable</strong>. As it turns out, in a very real sense, linear regression is a strict generalization of LDA, meaning that if one runs linear regression on a two-class classification problem, one will recover the same vector that LDA would provide. Like LDA, it is optimal under a gaussian assumption if you have enough data.</p>
</li>
<li>
<p>CCA - Is a generalization of linear regression and LDA, which applies when you have multiple target variables (e.g., height and weight). CCA finds the dimensions of both the “data matrix” and the “target matrix” that <strong>jointly maximize their correlation</strong>. If the target matrix is one-dimensional, it reduces to linear regression. If it is one-dimensional and categorical, it reduces to LDA. Again, it achieves optimality under linear gaussian assumptions with enough data.</p>
</li>
<li>
<p>PLS - Like CCA, but rather than maximizing correlation, it <strong>maximizes covariance</strong> between the learned dimensions. Often this works better than CCA and other stuff. The geometry and theory underlying PLS are much less clear, however, and efficient implementations are less readily available.</p>
</li>
<li>
<p>JIVE - Also like CCA, in the sense that there are multiple modalities, but now there are also multiple subjects, and each subject has matrix valued data (and each matrix is the same size/shape/features). JIVE explicitly tries to model the variance that is <strong>shared across individuals</strong>, and separate it out from the subject specific signal, which is also separated from subject specific noise. It does so by imposing several constraints, specifically that features are orthogonal to one another.</p>
</li>
<li>
<p>LOL - We made this one up; the arxiv version is <a href="https://arxiv.org/abs/1709.01233">here</a>. It is designed to find a low-dimensional representation <strong>when dimensionality is much larger than sample size</strong>. We proved that, under gaussian assumptions, it works better than PCA and other linear methods for subsequent classification.</p>
</li>
<li>
<p>Random Projections - These have support from an elegant theory, the <a href="https://en.wikipedia.org/wiki/Johnson%E2%80%93Lindenstrauss_lemma">Johnson–Lindenstrauss lemma</a>, which states that “random” projections approximately preserve pairwise distances with high probability. There are many possible ways to generate these random projections; my favorite is <a href="https://dl.acm.org/citation.cfm?id=1150436">very sparse random projections</a>, in which only a few entries are non-zero, each +1 or -1 (up to scaling), and the rest are zero. We use this result to accelerate LOL.</p>
</li>
</ol>
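<p>The claim in item 1, that PCA is optimal for squared error loss among linear projections, can be checked numerically in a toy case. This is a minimal sketch under stated assumptions: the 2-D data are made up, power iteration is just one assumed way to find the top principal direction, and all function names are hypothetical.</p>

```python
import math

def center(points):
    """Subtract the per-coordinate mean from every 2-D point."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    return [(p[0] - mx, p[1] - my) for p in points]

def reconstruction_error(points, direction):
    """Sum of squared distances from centered points to the line along a unit direction."""
    dx, dy = direction
    total = 0.0
    for x, y in points:
        t = x * dx + y * dy  # coordinate of the point along the direction
        total += (x - t * dx) ** 2 + (y - t * dy) ** 2
    return total

def top_principal_direction(points, n_iter=200):
    """Power iteration on the 2x2 scatter matrix of the centered points."""
    sxx = sum(x * x for x, _ in points)
    sxy = sum(x * y for x, y in points)
    syy = sum(y * y for _, y in points)
    v = (1.0, 0.5)  # arbitrary non-zero starting vector
    for _ in range(n_iter):
        w = (sxx * v[0] + sxy * v[1], sxy * v[0] + syy * v[1])
        norm = math.hypot(w[0], w[1])
        v = (w[0] / norm, w[1] / norm)
    return v

# Toy data (assumed): points scattered around the line y = x.
data = center([(0.0, 0.1), (1.0, 0.9), (2.0, 2.2), (3.0, 2.8), (4.0, 4.1), (5.0, 4.9)])
pc1 = top_principal_direction(data)
pca_error = reconstruction_error(data, pc1)

# Compare against every direction on a one-degree grid.
grid_errors = [
    reconstruction_error(data, (math.cos(a), math.sin(a)))
    for a in (k * math.pi / 180 for k in range(180))
]
```

<p>No direction on the grid achieves a lower squared reconstruction error than the first principal component, which is the one-dimensional version of the optimality statement.</p>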
<p>Others have described and compared the various linear methods in greater detail. One example is <a href="https://stat.columbia.edu/~cunningham/pdf/CunninghamJMLR2015.pdf">here</a>, which addresses many of the methods I discussed above, as well as some others (but misses PLS and JIVE).</p>Joshua Vogelstein10 Simple Rules for Designing Learning Machines2018-09-24T18:27:57+00:002018-09-24T18:27:57+00:00https://bitsandbrains.io/2018/09/24/modeling-desiderata<p>When designing an estimator (or a learner, as machine learning people say), there are a number of desiderata to consider. I believe the following is a nearly <a href="https://en.wikipedia.org/wiki/MECE_principle">MECE list</a> (mutually exclusive and collectively exhaustive). However, if you have other desiderata that I have missed, please let me know in the comments. Recall that an estimator takes data as input, and outputs an <em>estimate</em>. Below is my list:</p>
<ol>
<li>
<p><strong>Sufficiency</strong>: The estimate must be sufficient to answer the underlying/motivating question. For example, if the question is about the average height of a population, the estimate must be a positive number, not a vector, not an image, etc. This is related to <a href="https://en.wikipedia.org/wiki/Validity_(statistics)">validity</a>, as well as the classical definition of <a href="https://en.wikipedia.org/wiki/Sufficient_statistic">statistical sufficiency</a>, in that we say that if an estimator is estimating a sufficient statistic, then it is a sufficient estimator. Note, however, that sufficient estimators do not have to estimate sufficient statistics. To evaluate the degree of <em>sufficiency</em> of an estimator for a given problem one determines how “close” the measurement and action spaces of the estimator are to the desired measurement and action space. For example, the action space may be real numbers, but an estimator may only provide integers, which means that it is not quite perfectly sufficient, but it may still do the trick.</p>
</li>
<li>
<p><strong>Consistency</strong>: We prefer, under the model assumptions, for any finite sample size, that the expected estimate is equal to the assumed “truth”, i.e., is <em>unbiased</em>. If this unbiased property holds asymptotically, the estimator is said to be <em>consistent</em>. For random variables (estimates are random variables since they are functions of random variables), there are many different but related notions of <a href="https://en.wikipedia.org/wiki/Convergence_of_random_variables">convergence</a>, so the extent to which an estimator is consistent is determined by the sense in which it converges to an unbiased result.</p>
</li>
<li>
<p><strong>Efficiency</strong>: Assuming that the estimator converges to <em>something</em>, all else being equal, we desire that it converges “quickly”. The typical metric of statistical efficiency is the variance (for a scalar-valued estimate), or the Fisher information more generally. Ideally the estimator is <em>efficient</em>, meaning, under the model assumptions, it achieves the minimal possible variance. To a first order approximation, we often simply indicate whether the convergence rate is polynomial, polylog, or exponential. Note that efficiency can be computed with respect to the sample size, as well as the dimensionality of the data (and both the “intrinsic” and “extrinsic” dimensionality). Note also that there are many different notions of <a href="https://en.wikipedia.org/wiki/Convergence_of_random_variables">convergence of random variables</a>, and the estimators are indeed random variables (because their inputs are random variables). Because perfect efficiency is available only under the simplest models, <em>relative efficiency</em> is often more important in practice. Probably approximately correct (PAC) learning theory, as described in <a href="http://a.co/d/bYJlTWA">Foundations of Machine Learning</a>, is largely focused on finding estimators that efficiently (i.e., using a small number of samples) obtain low errors with high probability.</p>
</li>
<li>
<p><strong>Uncertainty</strong>: Point estimates (for example, the maximum likelihood estimate), are often of primary concern, but uncertainty intervals around said point estimates are often required for effectively using the estimates. Point estimates of confidence intervals, and densities, are also technically point estimates. However, such point estimates fundamentally incorporate a notion of uncertainty, so they satisfy this desideratum. We note, however, that while some estimators have uncertainty estimates with theoretical guarantees, those guarantees depend strongly on the model assumptions. To evaluate an estimator in terms of <em>uncertainty</em>, one can determine the strength of the theoretical claims associated with its estimates of uncertainty. For example, certain manifold learning and deep learning techniques lack any theoretical guarantees around uncertainty. Certain methods of Bayesian inference (such as Markov Chain Monte Carlo) have strong theoretical guarantees, but those guarantees require infinitely many samples, and provide limited (if any) guarantees given a particular number of samples for all but the simplest problems. We desire estimators with strong guarantees for any finite number of computational operations.</p>
</li>
<li>
<p><strong>Complexity</strong>: The “optimal” estimate, given a dataset, might be quite simple, or quite complex. We desire estimators that can flexibly adapt to the “best” complexity, given the context. <a href="https://www.amazon.com/Efficient-Adaptive-Estimation-Semiparametric-Models/dp/0387984739/ref=sr_1_6?s=books&ie=UTF8&qid=1537811338&sr=1-6&keywords=semiparametric"><em>Semiparametric</em></a> estimators can be arbitrarily complex, but achieve parametric rates of convergence.</p>
</li>
<li><strong>Stability</strong>: Statisticians often quote George Box’s quip “all models are wrong”, which means that any assumptions that we make about our data (such as independence), are not exactly true in the real world. Given that any theoretical claim one can make is under a set of assumptions, it is desirable to be able to show that an estimator’s performance is <em>robust</em> to “perturbations” of those assumptions. There are a number of <a href="https://en.wikipedia.org/wiki/Robust_statistics#Measures_of_robustness">measures of robustness</a>, including breakdown point and influence function. Another simple measure of robustness is: what kinds of perturbations is the estimator invariant with respect to? For example, is it invariant to translation, scale, shear, affine, or monotonic transformations? Nearly everything is invariant to translation, but fewer things are invariant to other kinds of transformations. Some additional example perturbations that one may desire robustness against include:
<ul>
<li>data were assumed Gaussian, but the true distribution has fat tails,</li>
<li>there are a set of outliers, sampled from some other distribution,</li>
<li>the model assumed two classes, but there were actually three, and,</li>
<li>data were not sampled independently.</li>
</ul>
</li>
<li>
<p><strong>Scalability</strong>: If an estimator requires an exponential amount of storage or computation, as a function of sample size or dimensionality, then it can typically only be applied to extremely small datasets. Similarly, if the data are relatively large (meaning that it takes “a while” to estimate), then estimation can often be sped up by a parallel implementation. However, the “parallelizability” of estimators can vary greatly. The theoretical space and time complexity of an estimator is typically written as a function of sample size, dimensionality (either intrinsic or extrinsic), and number of parallel threads.</p>
</li>
<li>
<p><strong>Explainability</strong>: In many applications, including scientific inquiries, medical practice, and law, explainability and/or interpretability are crucial. While neither of these terms have accepted definitions, certain general guidelines have been established. First, the fewer features the estimator uses, the more interpretable it is. Second, passing all the features through a complex data transformation, such as a kernel, is relatively not interpretable. Third, when using decision trees, shallower trees are more interpretable, and when using decision forests, fewer trees are more interpretable. In that sense, in certain contexts, relative explainability can be explicitly quantified.</p>
</li>
<li>
<p><strong>Automaticity</strong>: In most complex settings, estimators have a number of parameters themselves. For example, when running k-means, one must choose the number of clusters, and when running PCA, one must choose the number of dimensions. The ease and expediency with which one can choose/learn “good” parameter values for a given setting is crucial in real data applications. When the most computationally expensive “process” of an estimator can be nested, leveraging the answer from past values, this can greatly accelerate the acquisition of “good” results. For example, in PCA, by projecting into <script type="math/tex">D</script> dimensions, larger than you think you’ll need, you have also projected into <script type="math/tex">1, \ldots, D-1</script> dimensions, and therefore do not need to explicitly compute each of those projections again. This is in contrast to non-negative matrix factorization, where the resulting matrices are not nested. This desideratum is the least formal of the bunch, and most difficult to quantify.</p>
</li>
<li><strong>Simplicity</strong>: This desideratum is a bit “meta”. Simplicity is an over-arching goal: the estimator should have a simple geometric intuition (which admits theoretical analysis and generalizability), a simple algorithm with few algorithmic parameters (or tuning knobs), and a straightforward implementation.</li>
</ol>Joshua Vogelstein3 Additional Constituents for Decisions Under Uncertainty2018-09-23T18:27:57+00:002018-09-23T18:27:57+00:00https://bitsandbrains.io/2018/09/23/probabilistic-decisions<p>In a <a href="/2018/09/22/deciding-stuff.html">previous post</a>, we enumerated the four necessary constituents for deciding anything: (1) measurements, (2) potential actions, (3) a decision rule, and (4) a loss function. Many great things have come from this stylized setting, including essentially everything in the fields of “pattern recognition” and “data mining”. One can often obtain stronger theoretical results, and better empirical performance, by adopting additional assumptions/structure to the decision process. Here are three additional constituents that are essentially required when incorporating uncertainty.</p>
<ol>
<li>
<p><strong>Probabilistic Generative Model</strong>: This is a set of probability distributions under consideration. For example, “Gaussian” can be a probabilistic generative model (or “model”, for short), where all Gaussian distributions, each characterized by a particular mean and variance, are elements of the model. Note that if there are multiple measurements being used to make a decision, some assumptions are required to link those measurements. The simplest, and probably most commonly used assumption, is that each measurement is sampled independently and identically from a true but unknown distribution in the model. Formally, <script type="math/tex">\mathcal{M} = \{F_\theta : \theta \in \Theta \}</script>, where <script type="math/tex">\Theta</script> is the set of admissible parameter values.</p>
</li>
<li>
<p><strong>Estimator</strong>: To quote wikipedia, who was quoting Tukey, “an <em>estimator</em> is a rule for calculating an estimate of a given quantity based on observed data: thus the rule (the estimator), the quantity of interest (the estimand) and its result (the estimate) are distinguished.” In this context, estimators estimate a decision rule, which chooses an action. For example, in linear discriminant analysis, the estimator <em>learns</em> the decision boundary, and the estimate <em>is</em> the decision boundary.<br />
Formally, an estimator is a sequence of functions, <script type="math/tex">\{ f_n \}</script>, each of which maps a set of <script type="math/tex">n</script> measurements <script type="math/tex">x_i</script>, each in some space <script type="math/tex">\mathcal{X}</script>, to some <em>action</em> <script type="math/tex">a</script>, which is one of the set of feasible actions, <script type="math/tex">\mathcal{A}</script>; that is, <script type="math/tex">f_n : \mathcal{X}^n \rightarrow \mathcal{A}</script>. Often, the action space is in fact a set of admissible decision rules.</p>
</li>
<li>
<p><strong>Risk Functional</strong>: Under uncertainty, a given decision rule will incur losses probabilistically depending on the particular realized measurements. Therefore, simply minimizing loss on the observed measurements may not be desirable, and in particular, may result in over-fitting. It is therefore more desirable to choose/learn a decision rule that minimizes some functional of the distribution of loss induced by the true but unknown distribution. For example, one may desire a decision rule that minimizes the expected loss. In other contexts, one may instead desire a decision rule that minimizes the expected loss subject to a constraint on the size of the expected variance. This comes up, for example, in financial portfolio optimization.</p>
</li>
</ol>
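<p>The difference between two risk functionals can be sketched with a toy example. This is a minimal sketch under stated assumptions: the two actions and their loss distributions are made up, and the variance constraint is replaced by an additive variance penalty (a common relaxation, with a hypothetical <code>penalty</code> weight) rather than a hard constraint.</p>

```python
def expected_loss(outcomes):
    """Expected loss for a list of (probability, loss) pairs."""
    return sum(p * loss for p, loss in outcomes)

def loss_variance(outcomes):
    """Variance of the loss for a list of (probability, loss) pairs."""
    mean = expected_loss(outcomes)
    return sum(p * (loss - mean) ** 2 for p, loss in outcomes)

def mean_variance_risk(outcomes, penalty=0.5):
    """Expected loss plus a penalty on the variance of the loss."""
    return expected_loss(outcomes) + penalty * loss_variance(outcomes)

# Assumed loss distributions induced by two candidate actions.
loss_distributions = {
    "safe": [(1.0, 1.0)],                # always lose 1
    "risky": [(0.9, 0.0), (0.1, 9.0)],   # usually lose 0, rarely lose 9
}

best_by_expected_loss = min(
    loss_distributions, key=lambda a: expected_loss(loss_distributions[a])
)
best_by_mean_variance = min(
    loss_distributions, key=lambda a: mean_variance_risk(loss_distributions[a])
)
```

<p>Here the expected loss favors the risky action (0.9 versus 1.0), while the variance-penalized functional favors the safe one, so the choice of risk functional changes which decision is preferred.</p>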
<p>With these three additional constituents, one can begin constructing estimators that have desirable properties, as will be described in the <a href="/2018/09/24/modeling-desiderata.html">next post</a>.</p>
<!-- 3. **Estimator** (or **learner**): Something, or someone, must take as input the measurements, and output an action. A human can do this, or a mechanical devise, or some combination of the two, e.g., a human operating a computer. Note, however, that mechanical devises, on their own (at least for now), cannot estimate/learn without human intervention. -->Joshua Vogelstein