The road to becoming a data scientist is not always an easy one. The best advice is to begin preparing for your journey now. But where to begin? There are far too many resources out there. How do you decide the starting point? Did you miss out on topics you should have studied? Which are the best resources to learn from?
I recently finished The Data Science Handbook by Carl Shan, William Chen, Henry Wang, and Max Song. It is a compilation of in-depth interviews with 25 remarkable data scientists, where they share their insights, stories, and advice. As a data scientist, I really enjoyed these interviews that cover data science career paths, application, and building data-driven cultures/teams in a very relevant fashion. I think that newcomers and experienced practitioners alike will benefit from reading.
The book answers questions such as:
- Why is data science so important in today’s world and economy?
- How does one master the triple disciplines of programming, statistics and domain expertise to become an effective data scientist?
- How do you transition from academia, or other fields, to a position in data science?
- What separates the work of a data scientist from a statistician, and a software engineer? How can they work together?
- What should you look for when evaluating data science roles at companies?
- What does it take to build an effective data science team?
- What mindsets, techniques and skills distinguish a great data scientist from the merely good?
- What lies in the future of data science?
In this post, I want to share the best questions and answers I gathered from the book, from all 25 interviewees themselves.
1 - DJ Patil (VP of Products @ RelateIQ, Former US Chief Data Scientist, Former Director of Data Science @ LinkedIn)
As a data scientist, what type of skills should one be building to expand or broaden their versatility?
I think what data gives you is a unique excuse to interact with many different functions of a business. As a result, you tend to be more in the center and that means you get to understand what lots of different functions are, what other people do, how you can interact with them. In other words, you’re constantly in the fight rather than being relegated to the bench. So you get a lot of time on the field. That’s what changes things.
One of the first things I tell new data scientists when they get into the organization is that they better be the first ones in the building and the last ones out. If that means four hours of sleep, get used to it. It’s going to be that way for the first six months, probably a year plus.
That’s how you accelerate on the learning curve. Once you get in there, you’re in the conversations. You want to be in those conversations where people are suffering at two in the morning. You’re worn down. They are worn down. All your emotional barriers come down and now you’re really bonding. There’s a reason they put Navy Seals through training hell. They don’t put them in hell during their first firefight. You go into a firefight completely unprepared and you die. You make them bond before the firefight so you can rely on each other and increase their probability of survival in the firefight. It’s not about bonding during the firefight, it’s about bonding before.
That’s what I would say about the people you talked to at any of the good data places. They’ve been working 10x harder than most places because it is either do or die. As a result, they have learned through many iterations. That’s what makes them good.
2 - Hilary Mason (Data Scientist in Residence @ Accel, General Manager for Machine Learning @ Cloudera, Founder @ FastForwardLabs)
What advice would you give to aspiring data scientists?
A lot of people are afraid to get started because they’re afraid they’re going to do something stupid and people will make fun of them. Yes, you will do something stupid, but people are actually nicer than you think and the ones who make fun of you don’t matter.
Try to do a project that plays to your strengths. In general, I divide the work of a data scientist into three buckets: Stats, Code, and Storytelling/Visualization. Whichever one of those you’re best at, do a project that highlights that strength. Then, do a project using whichever one of those you’re worst at. This helps you grow, learn something new, and figure out what you need to learn next. Keep going from there.
This has a bunch of advantages. For one thing, you know what data science is actually like. A lot of data scientists spend their time cleaning data and writing Hadoop scripts. It’s not all fun – you should experience that.
Second, it gives you something to show people. You can tell people what cool things you’re trying out – people get really excited about that. They’re not going to say you tried and you suck, they’re going to say, “Wow, you actually did something. That’s cool!” This can help you get a job.
3 - Pete Skomoroch (Co-Founder & CEO @ SkipFlag, Former Principal Data Scientist @ LinkedIn)
What’s your perspective regarding the growing importance of data science within companies?
I think data teams are building really important things. They are actually going about it in a very deliberate way and they’re using reason, theory and evidence. People from science backgrounds are well suited for this, because you’re building up a theory of what you think will happen if you were to make certain changes to the product. I think that that is really at the core of the skillset that you want in engineering, product development and data science to make informed decisions.
I think that data science is going to become this discipline that drives decision-making and product development. In order for data to have the biggest impact, it needs to be in the early phases of product development rather than just added as an afterthought.
It also involves giving feedback to the product and engineering team about the quality, type and quantity of data that will be collected and affected given certain product decisions. It’s incredibly important to have someone sitting in the room and advocating for the data team every time a new product feature is proposed. That may be easier if data science itself rolls up within the engineering or product organization, or has an advocate reporting to the CEO like a Chief Scientist or Chief Data Officer.
4 - Mike Dewar (Data Scientist @ The New York Times R&D Lab, Data Ambassador @ DataKind)
What advice do you have for people in academia transitioning to data science?
Code in public, that’s number one. If you’re going to be a data scientist, you’re probably going to have to be able to program in one way or another. There are lots of different options, but you’re probably going to have to be quantitative and be able to write non-trivial programs on the computer. As you code, as you practice, as you go to hackathons, as you code for your post doctorate or for your PhD or for your graduate degree, make sure you do it in public. Put it on GitHub. To a certain extent I’m on the other side of it now where I put everything I think of on GitHub, so it’s a bit of a mess.
Especially with PhDs, one of the problems we see is that although they come from impressive universities, have impressive resumes, and have written these nice papers, we still have no idea if they can actually write code. That makes them more difficult to hire.
The other thing is networking. It’s more or less the same thing, but it’s important. In major cities it’s very easy for you to get out of your office or house and visit meetups and user groups to give a talk. Giving talks about your academic work to lay people is an incredibly interesting and enlightening experience, one that you should go through. It also exposes you to the business communities and the various kinds of people that you might want to get jobs from in the future. It also shows you what other people are up to; it knocks your academic naivety very quickly, which is great.
5 - Riley Newman (General Partner @ Wave Capital, Former Head of Data Science @ Airbnb)
What would you say are some of the most valuable or relevant skills that someone in academia should build right now?
Many people coming into data science from academia have honed their ability to think mathematically or statistically and, to some extent, work with data. The big division that I see is the ability to lend those skills towards problems that will result in an actionable solution. In other words, the types of questions they ask are as important, or more, than the methodology behind solving them. In their research, they focus on why something is the way it is or how it works; in industry we’re more interested in what we should do. If the how or why lends itself to answering this, great. But if nothing changes as a result of your work, then it wasn’t that valuable.
When we ask other data scientists that question, we hear about technical skills like Python and programming. We don’t hear as much about extracting actionable insights.
I’m not saying that those aren’t relevant; I’m presupposing that anyone hoping to generate actionable insights from data has the ability to work with the tools of the trade. At Airbnb, we mostly use Hive, R, Python, and Excel.
When we interview people, our process is very transparent (see the Quora post on this). We give candidates a day to solve a problem similar to something we’ve faced, using real (but anonymized) data. They spend the day seated with the team and are treated like anyone else, meaning they can collaborate with anyone. At the end of the day, we have them walk us through what they found and tell us what we should be doing differently as a result. This is too tight of a timeframe for someone to learn a tool while trying to use it to solve the problem. Their time needs to be completely focused on getting to that actionable insight.
6 - Clare Corthell (Founder @ Luminant Data, Former Data Product Manager @ Clover Health, Former Data Scientist @ Mattermark, Author of Open Source Data Science Master)
What could someone in school, or otherwise without too much background in industry, learn from your experience?
The ability to evolve my own career with a self-designed curriculum begins to outline the immense cracks in the foundation of higher education*. The deconstruction of this system was very long in coming, but it’s happening now. The lesson is the following: if you take initiative and acquire skills that increment your value, the market is able and willing to reward you.
Though people continue to believe and espouse old patterns of education and success, these patterns do not represent requirements or insurance. The lack of any stamp of approval is a false barrier. There are no rules.
It’s important to understand the behavior of the market and institutions with regard to your career. When breaking out of the patterns of success, know that people will judge you differently than others who have followed the rules.
There are two very discrete things that I learned: The market is requiring people to perform tryouts for jobs instead of interviews, and most companies don’t hire for your potential future value.
Tryouts as Interviews: The economy has set a very high bar for people coming into a new profession. Job descriptions always describe a requirement for previous experience, which is paradoxical because you need experience to get it. Don’t let that scare you, not for a minute. Pull on your bootstraps and get in the door by giving yourself that experience - design and execute on a project that demonstrates your ability to self-lead. Demonstrate that you can take an undefined problem and design a solution. It will give you the confidence, the skills, and the background to merit everything from the first interview to the salary you negotiate.
Even more concretely, work with a non-profit organization (or another organization that doesn’t have the economic power to hire programmers or data scientists) to create a project that is meaningful for the organization and also shows off your skills. It’s a great way to do demonstrative and meaningful work while also aiding an organization that could use your help, and likely has problems people are paying attention to solving. Win-win.
Current Value vs Potential: Look for companies that will hire you for your potential. It’s important to be upfront about your grit, self-sufficiency, and ability to hit the ground running. Luckily, with disciplines like data science, the market is on your side. Sometimes companies can spring for a Junior Data Scientist and invest in your growth, which is really what you wanted from the beginning.
Everyone will tell you this, but I work on product so I’ll underline it even more strongly: Learn to write production-level code. The more technical you are, the more valuable you are. Being able to write production code makes you eminently hirable and trainable.
7 - Drew Conway (Founder & CEO @ Alluvium, Co-Founder @ DataKind, Co-Author of Machine Learning for Hackers, Known for his Data Science Venn Diagram)
What advice do you have for people who have both a social science and a computer science background and who want to go into data science?
The piece of advice that I would have would be to continue following this track. You’re a social scientist and you care about human problems and the specific genre of those problems that triggers your interest. If you have a desire to solve a problem from the world of social science using the skills of your computer science, you need to dive pretty deep into whatever the technical tool is that you care about. I talk to a lot of social scientists who are thinking about learning Python or R and they’re not sure which one to pick up, but just dive deeply into one of them.
It doesn’t make any difference. Just pick one, use it and learn from your mistakes, but make sure you’re asking intelligent questions.
You’re either trying to learn something new or you have an interview or a question that you ask that you don’t know the answer to and you can say, “I tried this, but I wasn’t quite sure so I went back and tried something different.”
A piece of motivation I would give people is that sometimes others say to me, “You’re so unique, no one else can make the transition from social science to data science today.”
That’s absolutely wrong.
The problems that you care about, people will pay you lots of money to work on. Every way that an internet company makes money is by humans making choices; the choice to buy something, the choice to click on something, to share something, to connect with someone.
All those things are questions that are fundamental to the social sciences. So you already have all of the training necessary to identify the problems that are out there in the real world. Now all you have to do is figure out how to solve them using the tools from an industry.
Don’t think you can’t do it, because the reality is that you’re already way ahead of the game. Now you have to learn the easy stuff. The hard stuff you already know. Go learn these things, and then get better at it.
8 - Kevin Novak (Chief Data Officer @ Tala, Former Head of Data Science @ Uber)
For somebody who gets into data science and realizes that it’s not for them, what can these people transition into?
A solid understanding of what’s not working will inform the direction of transition better. Data science is at the confluence of computer programming, mathematics and communication as part of the work structure.
If you don’t like mathematics, a better and more obvious role is to get into business development or product.
On the other hand, if you like the mathematics but don’t like programming, an analyst position may be more suitable. Some people are arguing that a data scientist is an evolution of the analyst, but I believe these two roles are on fundamentally divergent paths. An analyst is someone who is answering more financial or quantitative information using an existing toolkit. A data scientist is more of a mix of software carpentry, engineering and product.
If you are good at mathematics and engineering, but not good at communication, I would recommend becoming an engineer. There are a lot of organizational charts in a lot of companies where the engineers are isolated from the other departments. A lot of companies can offer that sort of environment where an engineer can just focus on the problem at hand.
9 - Chris Moody (Manager of Applied AI @ Stitch Fix)
What do you feel are the defining qualities of a top-notch data scientist, compared with someone who is merely good?
I think it deals with communication. I think that’s the difference between the good scientists and the great. Both are going to know a lot about statistics, the techniques they can use, and how to design, implement, and execute an experiment. Those things are all important. The biggest thing, though, is that you need to be able to communicate those results. That’s a lot harder than it looks.
I think the easiest thing for a graduate student to do, coming into this field, is to gloss over it, but that’s the single most important thing. Most people complain that graduate students don’t have a great programming background. All of their other intuitions, well designed experiments, caveated results, are sound. But I think that a lot of people believe that a programming background is not necessary.
So, maybe it is programming for a lot of people, but if you’re already pretty good, then you’re probably already a good programmer. The last step is just communication. People need to sense the passion inside of you. This defines the most successful people. It’s the realization that you are working with other people, and for a lot of scientists, I think that’s quite a shock. It really goes against this notion of romantic science.
Isaac Newton spent three years in a shack during the plague. He didn’t want to get the plague and he hated talking to everyone. Granted, he was possibly autistic in some ways, but I think a lot of people follow that archetype of going back and living by themselves, and then they emerge with all of their findings. But in reality, it needs to be a much more continuous process. It needs to be a much smoother process than just coming back and reeling off a list of accomplishments. So it’s always communication, but that’s the easiest part to skip over.
10 - Erich Owens (Engineering Manager @ Facebook Artificial Intelligence Research)
What would you say are some of the qualities that separate the best data scientists from the rest?
The brilliant ones I’ve seen at the few companies I’ve worked at were the ones who could read papers, prototype and then turn it into a scalable system. I’ve met quite a few people who would have a great idea, but would then take forever to implement it even in Matlab.
So I think strong programming skills coupled with systems-level thinking is very important. Building scalable systems may limit your ideas, but it makes them that much more powerful in terms of impact. At Quid, for instance, there were engineers who could build systems on their own and think theoretically. In my opinion, the combination of strong theory and the ability to implement that in a scalable manner is what makes a data scientist stand out.
11 - Eithon Cadag (Senior Data Science Manager @ Microsoft)
How do you make sense of what people are doing with the term “data science” today?
I didn’t even know this term existed until I got this position. I didn’t know data was a discipline of science. I thought it was a prerequisite for science, not a study unto itself. I’ve heard the definition, “It’s someone who’s better at coding than a statistician, and someone who’s better at statistics than a programmer.” In some ways you can turn it on its head: it’s someone who’s worse at coding than a software engineer but is worse at statistics than a statistician! I’m joking of course, but that’s how I feel about it sometimes since I’m well aware of my own shortcomings.
A lot of people that do this role have very interesting backgrounds. You don’t have a huge majority of people coming from a specific discipline; it’s mixed. When you look at something like computational biology, we’re used to dealing with messy, noisy, ill-formatted data. There are quite a few people who come from a biology background that do data analytics and data science. Maybe they picked up data wrangling skills along the way to do extract-transform-load.
The other component that is also pretty critical is some kind of statistical training. At the end of the day, the term data science means you’re a scientist, and you have an obligation to deliver results correctly. If you’re not happy with it you go back to the drawing board. There’s an important ability to understand and be able to evaluate whether or not what you’ve done makes sense from a statistical standpoint.
Then there is the domain expertise aspect. In many cases, we’re tackling problems that are fairly difficult and that require a lot of knowledge of a particular area. Moreover, being able to go to subject matter experts and speak the same language goes a long way to gaining credibility and trust from the person with whom you’re working.
I think many of the applied science areas of study, and certainly things that involve experimentation, are where many people get a lot of broad experience. Graduate school is great for giving you that deep domain knowledge and then hopefully along the way you’ve picked up sufficient amounts of statistics or mathematics to speak coherently about what you’ve generated, as well as the technical chops to execute.
Sometimes it’s just practice. For example, maybe you won’t know having certain data in a specific way is a problem, unless you’ve seen it before and have done the repetitions to deal with it in a very fast way. If you do this enough, even a massive data set can be turned over very quickly because you’ve seen it before and you know exactly what to do. In some sense a lot of it is as much pure practice as it is science.
12 - George Roumeliotis (Data Science Manager @ Airbnb, Former Principal Data Scientist @ Walmart & Intuit)
What are some of the mistakes often made by younger hires?
First, you have to proactively build relationships with your non-technical colleagues. Data Scientists are often by temperament introverts, but if you want to be effective and successful, you need to step outside that comfort zone. Email a non-technical colleague you’ve never met, and ask them to lunch. Make it your responsibility to form such relationships before you need them.
Next, practice viewing the world in terms of business processes. What’s a business process? It’s a foreign concept to many new Data Scientists coming directly from academia. A business process encompasses the people, systems and steps involved in a business activity. Generally speaking, a Data Science project has the goal of improving some existing business process. The truth is, it’s really difficult to change a business process.
For example, it took me a long time to grasp that improving the efficiency of a business process might actually be perceived as threatening to someone’s job, and the natural reaction of that person might be to consciously or unconsciously undermine any progress. So you have to develop deep empathy for the people involved in business processes, and create solutions that help those people transition to higher-value work. That sounds like a lot of responsibility for a Data Scientist, but if you don’t think about things like that, your ideas might never be implemented in the real world.
13 - Diane Wu (Co-Founder & CEO @ Trace Genomics, Former Data Scientist @ MetaMind & Palantir)
What distinguishes the best data scientists from the rest?
There are statisticians and there are computer scientists and designers. And then, there are people who are very good at all of these things. The reason why this role–data scientist–was created, and the reason why it’s a little bit undefined, is that it requires that you’re good at many different things. You have to think about problems, both as an engineer and also as a statistician. You have to know what tests are right, how to approach the problem, how to engineer the solution and how to sift through large datasets.
And then afterwards, you have to present your findings in a clear way. This might require you to create visualizations. Having an understanding of graphic theory and the language of visualization is useful. This ties into communication because as a data scientist you’re communicating with someone who doesn’t have a ton of time to analyze data. They look at the figure and want to be able to extract meaning from it in a few minutes.
Finding someone who’s a good engineer and a good communicator is incredibly difficult. You don’t need to be the best at everything, but some people who are great communicators need to learn how to be great engineers and vice versa.
14 - Jace Kohlmeier (Former Dean of Data Science @ Khan Academy)
For aspiring data scientists who come from a strong quantitative research field, some might not have spent so much time with software engineering. What are some ways for them to increase their programming skills?
In my opinion, to be a great data scientist, you must be a great (or at least a very productive) programmer. That doesn’t mean that you have to be a savant in computer science, it just means that you have to be fluent with code and experienced in building real systems.
What I would suggest for someone who’s looking to build skills in those areas is, number one, you just have to write code and you have to write a lot of it. There will always be differences between a first year programmer, a fifth year programmer, and a tenth year programmer, at least for people who spent those years practicing the right way. The hack to get better faster is to get lots of good feedback. And the best way to get feedback is to find great developers to work with who will give you code reviews.
The great thing today – which wasn’t available in my day – is you can get involved with open source projects and get very specific feedback from great developers. This is a tremendous resource and opportunity for people who want to improve their programming skills. So write a lot of code, and make sure you’re getting code reviews from quality programmers.
15 - Joe Blitzstein (Statistics Professor @ Harvard University)
What’s the best way to keep on learning after university?
I noticed that’s a trap that people fall into, thinking, “I’m perpetually feeling unprepared.” It’s a dangerous way of thinking - that until you know X, Y, Z and W, you’re not going to be able to do data science. Once you start learning this thing, you realize there are four other things you need to learn. Then, you try to learn those things, and you realize you don’t have this, this, and this.
You do need some basic foundation in statistics and CS skills, but both statistics and computer science are enormous fields that are also rapidly evolving. So, you need durable concepts. Right now, for people that want to do data science, I highly recommend learning R and Python. But in 10 or 20 years, who knows what the main languages will be?
It’s a mistake to think, “why am I learning R now? R won’t be used in 20 years.” Well, first of all, R might still be used in 20 years, but even if it isn’t, there’s going to be a need for the thinking that produced R. The people who create the successors to R will have probably grown up using R. So, they’re still going to have that frame of reference.
You want the skills that are language-independent. You need fundamental ways of thinking about uncertainty and communicating those thoughts in a way that is not that dependent on any particular programming language. It’s definitely important to have that kind of foundation, but keep in mind that it’s hopeless for anyone to actually know all the relevant parts of statistics and CS, even for some small portion of data science. It’s not feasible for anyone, but it doesn’t mean that you can’t make useful contributions.
In fact, I think it’s a good idea to continue learning something new every day. The way you can learn something, and really remember it, is by using it in your work. Instead of saying, “I need to study these five books so that I will know enough to become a data scientist,” it should be about getting a basic level and foundation. Then, start immersing yourself in a real, applied problem. You will realize what types of methods you need. Then, go and study the books and papers that are relevant for that. You will understand them so much better because they’re in the context of a problem that you care about.
You have to be energetic and work really hard, but not get discouraged just because you don’t know everything. And just because you don’t know everything, it doesn’t mean you can’t contribute useful things while gradually expanding your understanding and knowledge.
16 - John Foreman (VP of Product @ MailChimp)
Talking more about unconventional thinking, you’ve written in the past that “Your model is not the goal; your job is not a Kaggle competition.” Can you talk about why you don’t think Kaggle is where data scientists should be spending their time?
There’s nothing wrong with Kaggle. I think it’s a great idea. If a company’s at that point where they want a model that’s that good and they’re getting a lot of revenue and want to push like Netflix, go for it.
My one criticism is that the way journalists write about it gives a skewed view of what data science is. There was an article on GigaOM where the author said, and I’m paraphrasing, “The main thing data scientists do is build predictive models. That’s how they spend most of their time.” This is a myth that something like Kaggle will perpetuate.
Before you build a model, you need to know what data sources are available to you within the company, what techniques are available to you, what technologies are available, you have to define the problem appropriately and engineer the features. Usually, when you grab data from Kaggle, all of these steps are done for you. You don’t have to go around looking for data. You can’t say something like, “Maybe they left some data behind. Can I come into your company and look around?”
I feel that there’s so many steps before you get to modeling that are crucial. Can I ever ask a Kaggle competition, “Is this the competition this company should actually be having?”
Think about the Netflix prize. They were trying to predict what star rating viewers would give a movie given past data, but I think they backed off that a little bit because they noticed it’s not all about five-star movies. For example, I watch garbage. I will give it two stars, and I will watch it anyway. It’s more about moods. A lot of things drive viewership, such as what my friends are watching on Facebook. That’s something Netflix is doing now – and it’s made their original modeling endeavor somewhat irrelevant.
So there’s this notion in data science about whether or not a project should be tackled in the first place that is a priori ignored by Kaggle. And I think a big component of data science is questioning why you’re doing what you’re doing – choosing problems to solve while rejecting other problems that are irrelevant to the business. With Kaggle, for better or for worse, that job is done for you. Kaggle is just an exercise in using a data scientist as a model-building machine.
I still think that Kaggle competitions are awesome, and I will never match the intellectual ability of some of the competitors on that platform. I just like to emphasize the other fundamentals of operating in a data science role at a company. I wish there was more focus on them, but those aren’t really sexy to talk about in the media.
17 - Josh Wills (Software Engineer in Search, Learning, Intelligence @ Slack, Former Director of Data Science @ Cloudera)
What did you end up learning through this experience of debugging black box systems?
I don’t think there’s any secret to it: I’m obsessive. I was one of those kids that played with Legos for five or six hours straight. I’m still pretty much like that. I was born in 1979, so I’m borderline millennial. It is unacceptable to me for a computer system to not do what I want it to do. I was willing to beat on the black box hardware for whatever amount of time was required to make it do what I wanted.
I’ve had a few instances in my life where I have worked on a very satisfying problem. A satisfying problem is one where your technical skills are good, but the problem is just a little bit too hard for you. You’re trying to do something slightly more difficult than what you already know how to do, and that is a great, great feeling. I can lose myself in those kinds of problems. That’s typically when my personal relationships tend to fall apart, because I’m not really paying attention to anything else.
There was this trend for a while in data science job interviews to have candidates analyze real datasets during the interview. I’m a huge fan of this practice. I had one job interview where they gave me a problem and a dataset and two whole hours of quiet time to just sit and do data analysis. It was maybe the happiest two hours of my entire year. I should do more job interviews just so I can do that.
18 - Bradley Voytek (Associate Professor of Cognitive Science, Neuroscience, Data Science @ UCSD, Former Data Evangelist @ Uber)
You teach different people about many different concepts, and I see that as being a very important package of being an effective professor or data scientist. Can you talk more about this missing aspect of data science that isn’t as heralded, which is the aspect of being able to communicate effectively?
Yes. I always think back to the movie Office Space, which was making fun of the first dotcom industry. In the movie, there’s a great line where they’re trying to figure out who to keep and who to fire in this tech company. They’re talking to this guy who is a product manager. But since he’s a product manager, he’s not a manager per se, so these guys that came in to interview him are asking, “What do you do?” He replied, “I talk to the engineers, and I learn what they’re doing. Then, I relay the information clearly up to the management.” They said, “Why can’t we just have engineers talk to management?” And he says, “They need a people person.”
When I started working with Uber, I was thinking about how the data can be used to tell an interesting story. Just like writing code, telling a story effectively takes a lot of practice. That’s a part of the reason why I do a lot of writing on Quora. I teach, and I do a lot of public speaking at elementary schools, junior high schools, high schools, or at a bar to a bunch of drunken aficionados. It’s practice. Just like I have to sit down and practice writing code, I also have to sit down and learn how to communicate the idea.
My wife is actually a very good sounding board. Whenever I write something, I always pass it by her because she’ll read something and say, “You’re making this more complicated than it needs to be. You can explain this in fewer words. You didn’t connect from A to C. You skipped over B.”
I remember the first time I took a programming class. It was an algorithms course. The homework was to write an algorithm for making a sandwich in which you had to explain every step you took to make a sandwich. You realize so many parts you skip over that you think are obvious, but if you had to program a robot to do it, simple things like pulling the knife out of the drawer must be explained. You have to explain exactly how you pull the knife out of the drawer to spread the mayonnaise.
We skip over a lot of stuff that seems obvious, but it’s not always obvious if you’re not the person staring at that data all day long. It’s a good point of practice to try to remember how to be very explicit about every step that you take and connect the dots for people.
19 - Luis Sanchez (Data Strategist @ SGX Analytics, Former Founder & Data Scientist @ ttwich)
What is data science to you?
To me, data science is the art and science of extracting actionable intelligence from sets of data, big or small.
I call it “art” because there is not really a one-size-fits-all technique that can help you answer the questions you want to ask your data. You need to be creative and have imagination to see what others don’t see in the data. If you are anything like me, the best solution to your most challenging problems has come to you when you least expected it, in the form of an inspiration. When that happens, I get in the zone and the solution just comes to me, and I can’t focus on anything else.
I call it “science” because you need to know the theory behind what you do and put in your 10,000 hours of problem-solving so you develop “muscle memory” so to speak, and you acquire the right foundation to become a good data scientist.
One thing I believe but don’t know if other people would agree with: Good data science can’t be 100% theoretical or 100% practical. There has to be a mix.
20 - Michelangelo D’Agostino (Senior Director of Data Science @ ShopRunner, Former Director of Data Science @ Civis Analytics, Former Lead Data Scientist @ Braintree, Former Senior Analyst @ Obama For America)
You mentioned that some of the most useful things you did during your time as a PhD were working on hackathons or working on Kaggle or data sets and working with people. Do you have more to add to that? What was the most useful part of being a post-doc and PhD student for your later data analysis/data science career?
I always tell students that I think the most useful skill you learn in grad school is how to teach yourself stuff and how to figure out things that you don’t know. That’s one thing. The second thing is to be stubborn and beat your head on a problem until you make progress. It’s really those two things.
I feel like grad school gave me confidence. Physicists tend to be a pretty arrogant bunch. They think they can learn anything, but that was the lesson I learned in grad school. I don’t know every programming language in the world, but I’m confident that if I spend a few months, I could pick up a new programming language or pick up some new infrastructure tool or modeling technique. I can teach myself those things. I can go out there and read academic papers, read software manuals, and teach myself the tools I need to get the job done. I think that’s pretty common across grad school fields. Most of the things you learn you don’t learn in the classroom. You learn by completing a project and teaching yourself things. In data science, that’s a crucial skill because it’s a quickly growing field and it encompasses a ton of things. You can’t finish a degree and know all the things you need to know to be a data scientist. You have to be willing to constantly teach yourself new techniques.
That was one of the things I learned in grad school. The other is the ability to work on a hard problem for a long time and figure out how to push through and not be frustrated when something doesn’t work, because things just don’t work most of the time. You just have to keep trying and keep having faith that you can get a project to work in the end. Even if you try many, many things that don’t work, you can find all the bugs, all the mistakes in your reasoning and logic and push through to a working solution in the end.
Having confidence in yourself is another thing. I think that working on a really hard problem like in grad school can help you learn that. And then there are just the technical things like learning how to program, running on large computer clusters. Those are the things that I think are really helpful from grad school, but the advice I give to grad students is: if you feel like you want to leave grad school and do something else, keep that in mind when picking which tools and techniques you use for a dissertation. If you can write your dissertation in Python rather than some obscure language like FORTRAN, it’s probably going to be better for you. Try to be as marketable as possible with the things you learn when you’re doing your PhD.
And the final thing is that it really helps to have experience working with data. The only way to learn how to work with data is to actually work with data. You can read about it, and people can teach you techniques, but until you’ve actually dealt with a nasty data set that has a formatting issue or other problems, you don’t really appreciate what it’s like when you have to merge a bunch of data sets together or make a bunch of graphs to sanity check something and all of a sudden nothing makes sense in your distributions and you have to figure out what’s going on. Having that experience makes you a better data analyst.
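As an editorial aside, here is a minimal sketch of the kind of sanity check described above when merging data sets. It is not from the interview: the function `merge_with_checks`, the field names, and the toy records are all hypothetical, and a real project would more likely use a data-frame library. The idea is simply to report which keys were lost or duplicated instead of trusting the merge silently.

```python
from collections import Counter


def merge_with_checks(left, right, key):
    """Inner-join two lists of dict records on `key` and report what went wrong.

    Hypothetical helper for illustration only.
    """
    # Keys that appear more than once on the right side are a common surprise.
    right_counts = Counter(record[key] for record in right)
    duplicate_keys = sorted(k for k, n in right_counts.items() if n > 1)

    # Index the right side by key; for duplicate keys the last record wins.
    right_by_key = {record[key]: record for record in right}

    # Keep only left records that have a match on the right.
    merged = [
        {**record, **right_by_key[record[key]]}
        for record in left
        if record[key] in right_by_key
    ]

    left_keys = {record[key] for record in left}
    right_keys = set(right_by_key)
    report = {
        "left_only": sorted(left_keys - right_keys),
        "right_only": sorted(right_keys - left_keys),
        "duplicate_right_keys": duplicate_keys,
    }
    return merged, report


# Toy data (made up): three users, orders for only one of them plus a stray id.
users = [
    {"user_id": 1, "name": "a"},
    {"user_id": 2, "name": "b"},
    {"user_id": 3, "name": "c"},
]
orders = [
    {"user_id": 1, "total": 10.0},
    {"user_id": 1, "total": 5.0},
    {"user_id": 4, "total": 7.0},
]

merged, report = merge_with_checks(users, orders, "user_id")
print(merged)
print(report)
```

A report with non-empty `left_only`, `right_only`, or `duplicate_right_keys` lists is the cue to stop and figure out what is going on before plotting anything.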
21 - Michael Höchster (Director of Data Science @ Stitch Fix, Former Head of Research @ Pandora, Former Director of Data Science @ LinkedIn)
When you are trying to hire people who are coming out of school right now, what are the features you’re looking for?
When I think about hiring, I want people who can code somewhat, although I myself am not a particularly good coder. But we’re more focused on analysis. So when I’m looking for people, the most important qualities that a data scientist should have include having a feeling of how to take a data set and answer a question with it. Figuring out what should I compare? What’s the control? What’s the way of transforming the things that I have available to me to make it reasonable? What am I missing here that I need to go and collect?
This isn’t stuff you learn in school. Some people have it; it comes from experience too. It definitely comes from working with data. So I look for people who have real experience with data – whether it’s in a hard science, a social science, computer science, or statistics. Just understanding theory isn’t enough. You need data sense.
I’m also really looking for the ability to communicate well about things you’ve done, and good judgment. Understanding that when you’re working through a problem you have a series of choices to make and being very aware of the choices you’re making and why you’re making them at every stage. That’s part of data sense too. So these are somewhat intangible factors.
I also look for some facility with coding – you need to be able to get your data and manipulate it. Therefore coding is required. I myself don’t look for super-heavy coding skills because I feel like a lot of things in my world have to be picked up. Also, I can’t evaluate it myself when I talk to people.
Then, the more formal statistical inference is the last thing that I look for. Not that it’s unimportant, but it is probably last.
22 - Kunal Punera (Staff Engineer @ Google, Former Co-Founder & CTO @ Bento Labs, Former Data Engineer @ RelateIQ, Former Research Scientist @ Yahoo)
Being able to learn new things really quickly is one of the things we need today more than ever, but there’s an art to doing that. You need to have some sort of foundation, core programming skills, core modeling skills. If you were to decompose those down to the principal skills, what do you feel is most important?
In terms of programming skills, I’m not sure what the curriculum nowadays looks like, but in my undergraduate days, I started by learning C. Actually, I learnt Pascal first. Then, I learnt C. These are pretty low-level languages with few rules and close interaction with the machine. So, I learnt from the very beginning how programming languages manage memory, what pointers are, what an execution stack looks like, etc. I think that experience was useful because now, if I have to learn new concepts, it’s easy for me to go back and reconstruct them from the first principles in my head.
In terms of data modeling, I think I was lucky that I took some good statistics courses. It’s useful to understand the underlying concepts of algorithms. I think a graduate-level optimization course is important, as well.
One of the obstacles I sometimes see engineers running into is confusing the core problem that needs to be solved and the one particular solution to that problem. Sometimes people have one way of solving the problem already in their head, and they might not see that the core problem is not the same thing as their solution. As much as possible, I would encourage people to constantly ask the question “What am I optimizing?” For example, if you want to obtain a clustering of data, it’s useful to first try to determine what properties you would want in a good solution, and then attempt to encode these criteria into a loss function. If one is not careful it is easy to think of clustering data in terms of steps the algorithm should take, or a series of methods that must be implemented. This can sometimes lead the engineer astray in that the preconceived solution might never end up obtaining clusters with the desired properties. Of course, the time constraints in a startup don’t leave data scientists the luxury of carefully thinking of every problem. In these situations, experience helps.
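To make the "What am I optimizing?" question concrete, here is a small editorial sketch, not taken from the interview. It assumes one possible criterion for a good clustering (points should be close to the centroid of their own cluster) and encodes it as a loss, so that any candidate clustering can be scored no matter which algorithm produced it. The function name `within_cluster_loss` and the toy points are hypothetical.

```python
def within_cluster_loss(points, labels):
    """Sum of squared distances from each 2-D point to its cluster's centroid.

    Lower is better under the assumed criterion "clusters should be tight".
    """
    # Group the points by their assigned cluster label.
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)

    total = 0.0
    for members in clusters.values():
        # Centroid of this cluster.
        cx = sum(x for x, _ in members) / len(members)
        cy = sum(y for _, y in members) / len(members)
        # Add each member's squared distance to the centroid.
        total += sum((x - cx) ** 2 + (y - cy) ** 2 for x, y in members)
    return total


# Two obvious groups: one near x = 0, one near x = 10.
points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]

sensible = [0, 0, 1, 1]  # groups nearby points together
poor = [0, 1, 0, 1]      # pairs each point with a distant one

print(within_cluster_loss(points, sensible))  # 1.0
print(within_cluster_loss(points, poor))      # 100.0
```

Writing the loss first separates the problem from any one solution: if tightness is not the property that matters for the business question, the loss is what should change, not just the algorithm.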
23 - Sean Gourley (CEO & Founder @ Primer AI, Former Co-Founder & CTO @ Quid, Former Research Fellow @ University of Oxford)
Now at this point in your life, I have to say that you have a lot of credibility in doing things you’re unqualified for.
That’s how you learn about your strength. My strength is consistently doing things I’m not very good at and quickly becoming reasonably competent. My hope for grad students coming out of their PhD is to pick whatever job you want in a way that takes advantage of the skills you have as a graduate student. In data science, if you feel you have to conform too much in a box and don’t get the freedom, that job is not for you. Find a place where you can shape a little bit of the world and keep doing that, which will probably involve doing a lot of things that you’re unqualified for and not very good at.
I think we’ve got a very narrow view of what data science is, which has largely been shaped by data analysts working in the big social networks, like Facebook and LinkedIn. So for many of us, data science seems concerned with things like A/B testing to optimize personalized ad recommendations. But data science can be so much more than this. We must recognize what data can do and what it can’t do, recognize that it’s messy and biased, and understand that it needs a human layer - that it needs stories. We must recognize that it can solve a certain degree of complexity, but it can’t solve any further. Remember that humans have biases and they absolutely need data. We don’t want to move towards naive empiricism – that’s not what data science should be. It’s not what science teaches us, but, at the same time, we don’t need to throw data out the window just because it can’t push a button to solve an equation.
I think a lot of that will come together. Data science will evolve. I think the second piece is data scientists have an obligation to do good things in this world with that data. It’s not enough to just not be evil; it can fundamentally be good. If you come out of science, you are contributing to the world’s knowledge. When you come to the business world, you should also be contributing towards building the tools that help us to live and function in society.
This is an imperative that should be fought for very hard. This should be one of the decisions you make in the jobs that you do. We have a set of technologies that can and will shape our world in ways that are positive and potentially very negative. It ultimately comes down to the mentality of the people that are building the technologies. We can wash our hands and say, “I can’t do anything about it. It’s not in my control.” But we really need to challenge this assumption. You can do something about it! You’re making this or your company is making this stuff. You’re the data scientists that are building this stuff. Of course you can do something about it, and of course you have that responsibility.
If you choose to do it in a different way, you are the one shaping the world. We are the people who are creating this technology, so you can’t just wash your hands and say you’re not part of it. We saw what happened when a bunch of quants on Wall Street, with little regard for the consequences, said, “I’m just going to use these algorithms to make money.” This is not good enough. You can’t arbitrage a system and make money for yourself without also having the responsibility to make it better.
So much of data science has been concerned with equations for the optimization of an existing world. But we need to use data science to build and engineer a better world, and that’s where it starts to move beyond black box predictions and basic statistical tools and moves into design. One of the big things for data scientists is to understand that their role is also one of design. If you create algorithms, you shape the behavior of people who interact with these algorithms. So what kind of behavior are you designing?
I think data science is really going to become more of a product design process; actually an algorithm design process. Algorithms take information and direct us; whether it’s the information we read, the music we listen to, the places we drink coffee, the friends we meet, or the updates in our lives.
You are designing algorithms that fundamentally shape humanity, and we do it on a population scale in the billions. So how we choose to shape this world certainly has a lot of challenges. We can’t just hide behind the imperative to optimize an algorithm for maximum revenue. You designed an algorithm that created a certain kind of behavior – for better or worse – and now this algorithm is potentially impacting the lives of billions of people you have never met. What kind of behavior do we want? I think you need to fall on the line of making humans more human, making them see further, making them see deeper, making them understand and appreciate the nuance. Don’t try to hide the complexity from them, but instead, make them more conscious. Make them smarter. Help make them smarter. I think that’s what you design for. That’s what you use data science for.
24 - Jonathan Goldman (VP of Data @ Chan Zuckerberg Initiative, Former Director of Data Science & Analytics @ Intuit, Former Principal Scientist @ LinkedIn)
Given your own experiences in a PhD program, what advice do you have for our readers who are in a PhD, or just recently finished one, and are looking to start their career in data science?
Find the companies that are aligned with your values, where you get to work on things that are impactful and making a dent in the universe. There’s never going to be a shortage of interesting problems to work on that are massive and impactful. When you’re at that kind of company, it’s easier to take that data and turn the data into transformational business impact.
I think one of the most important things is to learn to be curious. You see something that might spark new questions for future projects. Once you’re curious about something with the data, you’ll figure out how to go solve and answer those questions, regardless of the technique. You need to be able to go back and forth in an iterative manner as businesses don’t always have well-defined problems.
25 - William Chen (Data Science Manager @ Quora)
How do you see data science in terms of it being the intersection of math, statistics and computer science? What weight would you give each in terms of importance?
I would say that the programming and software engineering part is very important because you may be expected to implement models, write dashboards, and pull out data in creative ways. You’ll be the one in charge of hauling your own data. You’ll be the one who owns the end-to-end and the full execution, from pulling out the data to presenting it to the company.
The Pareto principle is in full effect here. Eighty percent of the time is spent pulling the data, cleaning the data, and writing the code for your analysis. I found this true during my internships (especially because I was new to everything). A good coding background is particularly important here, and can save you a lot of time and frustration.
To emphasize: pulling the data and figuring out what to do with it takes an enormous amount of time, and often doesn’t require any statistics knowledge. A lot of this is software engineering and writing efficient queries or efficient ways to move around and analyze your data. Programming is important here.
One interesting thing to note is that the statistics used day-to-day in data science is really different than the kind of statistics you’d read about in a recent research paper. There’s a bias towards methods that are fast, interpretable, and reliable instead of theoretically perfect.
While the statistics and math may not be that complicated, a strong background in math and statistics is still important to gather the intuition you need to distinguish real insights from fake insights. Also, a strong background and experience will give you better intuition on how to solve some of your company’s harder problems. You may have a better intuition on why a certain metric might be falling or why people are suddenly more engaged in your product.
Another benefit of a strong statistics and math background is the contribution to communication. The better you understand the theoretical bases around a certain idea or concept, the better you can articulate what you’re doing and communicate it with the rest of your team. As a data scientist, a large portion of your work is presenting an action that you feel would have an impact. Communication is very important to make that happen.
Some data science roles require a very strong statistical or machine learning background. You might be working on a feed or recommendation engine. Or dealing with problems where you need to know time series analysis, basic machine learning techniques, linear regressions, and causal inference. There are lots of kinds of data for which you’d need a more advanced statistics background to be able to analyze.
Figuring out the balance between computer science, statistics and math will really depend on the role you take, so these are just some of my general observations.
I highly recommend reading The Data Science Handbook. The data scientists in the book have helped create the very industry that is now having such a tremendous impact on the world. They discuss the mindset that allowed them to create this industry, address misconceptions about the field, share stories of specific challenges and victories, and talk about what they look for when building their teams.
Data is being generated exponentially and those who can understand that data and extract value from it are needed now more than ever. The hard-earned lessons and joy about data and career navigation from these thoughtful leaders would be tremendously useful if you aspire to join the next generation of data scientists.
You can find my own code on GitHub and more of my writing and projects at https://jameskle.com/.