Data Scientists, Beware Your Own Arrogance

Aaron Beach, January 6, 2014

Before joining SendGrid as the Senior Data Scientist, I had a chance to witness a lot of big data research, in both academia and industry. I have observed a number of pitfalls, which I hope to avoid myself, and would like to share them with you. This post is about arrogance, an easily-manifested trap for any data scientist.

Fundamentally, the techniques and technologies employed by data scientists are not new. What’s new is the confluence of understanding techniques and the ability to implement them in software quickly, allowing algorithms and data structures to be applied to new domains in a matter of weeks or months as opposed to years. A critical mass of public opinion was reached around 2010-2011, convincing CEOs and managers in tech companies that something of substance was emerging and that these new “data scientists” could do things that current domain experts could not. In a few cases and for a few gifted individuals this was true.

However, the emergence of “data scientists” has not changed the fact that trying out new ideas requires the humility to “fail early and often,” nor does it eliminate the hard work and risk involved with maturing the idea into a product or service. The failure to recognize that data science is just one piece of the puzzle, with its own limitations, leads to the first sin of data science: arrogance. This sin manifests itself in various forms. I’ll discuss how I’ve seen it function among CEOs or founders of tech startups, among data scientists themselves, and among engineers.

The Confident CEO or Founder

The first form of arrogance exists at the top of many tech companies. All CEOs and successful founders of tech startups have confidence. However, many CEOs lack technical knowledge in their field, and founders of “tech startups” often lack deep skills of any kind (including business). Still, you can’t argue with success, least of all your own.
Their own success, or sometimes just the success of others, may convince them that adopting the right concepts like “data science” or “big data” will automatically produce success. They watch for “mega” trends. They call data the new “oil,” implying untapped potential, accessible using proven recipes (the right concepts/process).

So how does this usually play out? Start with the following strategy: take your self-inflated vision, hire an elite dev team and a few data scientists, and throw it at the wall to see what sticks. In reality, developing a product from data is a messy, iterative process and there is no way around the engineering. More often than not, the data will quickly tell you the vision is flawed. If you began by assuming data science was effectively a tool for turning abstract ideas into value via data, you may reject the first thing the data tells you: that your idea sucks. Blame failure on something else [wash, rinse, repeat].

The Academic in Industry

Another type of arrogance that may go undetected by even the most humble and honest CEO is the academic variety, because this sin comes veiled as a virtue. Once, when I was telling my family about a recent conference at which I presented, I referred to “how things were done in academia.” My grandmother, curious about why “they” did things so strangely in this place, asked “Where is Academia?” Good question! With recognition confined to their own specific field and compensation spread thinner across the growing educated population, one need not wonder why academics tend to romanticize their belonging to this mystical brotherhood. “In academia,” you see, novelty and rigor are king and there are no points for second place.
And since humility is not nearly as common as humiliation, research tends to be jaded and vengeful, producing a unique form of arrogance “veiled as a virtue.” Once informed that second-rate academic skills can be re-spun as first-rate data science skills, many accept demotion to the derogatory “industry.”

So how does this usually play out when they arrive? After getting over the awkward introductions and identifying a “fundamental” problem that fits their style of analysis, they go to work. As they learned to do in grad school, they tend to spend months refining their analysis before sharing results. Once “finished,” the results come in the form of equation(s), proofs of those equation(s), and a PowerPoint slide, maybe with colorful graphs. They may even have hacked together “an app” for viewing the graphs. One of two things results: either the graphs and PowerPoint are salvaged as show-n-tell, or they are handed off to interested (but bewildered) engineers who are rightfully concerned with more pressing issues. This is a case of “The Emperor’s New Clothes”: the academic arrogance of the data scientist has led them to see their analysis as success-unto-itself, and often no one else has the incentive or political clout to call shenanigans.

The Over-Engineering Engineer

And finally, an old sin in new clothes: that of engineering for engineering’s sake. Engineers often find themselves at odds with the business side of business, preferring “good engineering” to making money. In the case of big data or data science, engineering arrogance may manifest itself again as the belief that one’s particular skill set is a virtue unto itself. Spellbound by new “big data” technologies and how they remind them of their good ol’ days at tech school, the engineers will cite anecdotes of the cost of bad engineering and complain until you let them tinker with these technologies indefinitely.
Luckily, this sin is nothing new and will probably be called out much sooner than if the engineer had “data scientist” in their title.

Even the most well-meaning data scientist can fall into the arrogance trap. The above are three of the most common examples I have seen. Look out for these in yourself and others to help put better science into data science.