Datacast Episode 69: DataPrepOps, Active Learning, and Team Management with Jennifer Prendki
The 69th episode of Datacast is my conversation with Jennifer Prendki, the founder and CEO of Alectio, the first startup fully focused on DataPrepOps.
Our wide-ranging conversation touches on her educational background in Physics in France; her transition to data science in Silicon Valley; organizational and operational challenges with enterprise ML; Agile for Data Science teams; her current journey with Alectio to tackle the underinvested Data Preparation sector; her advice for women entering the industry, and much more.
Please enjoy my conversation with Jennifer!
Listen to the show on (1) Spotify, (2) Apple Podcasts, (3) Google Podcasts, (4) TuneIn, (5) RadioPublic, (6) Stitcher, and (7) Breaker.
Key Takeaways
Here are highlights from my conversation with Jennifer:
On Studying Physics
As far back as I can remember, I wanted to be a physicist. The great physicists of the past widely inspired me, and my entire childhood was geared toward that goal. I eventually earned a Ph.D. in Astrophysics and Particle Physics at Sorbonne University. During my Ph.D., I also studied matter and antimatter at the Stanford Linear Accelerator Center. I started realizing that there were far better research opportunities in the US than in Europe.
Unfortunately, I graduated in 2009, at the beginning of the economic recession. I was very particular about the type of physics I wanted to work on. Another way of studying the principles of matter and antimatter was via neutrino physics, so I did my postdoc on neutrino physics at Duke University. However, a few months after I joined, restrictions on academic grants tightened, and we received less funding than expected. It became increasingly clear that I wouldn't be able to work on the research I felt passionate about. It was heartbreaking for me, but in hindsight, I wouldn't be where I am today if that hadn't happened.
On Transitioning From Academia to Industry
Even though I was trained as a physicist, I already had a lot of experience collecting data, analyzing data, and building statistical models for my physics experiments. It was natural for me to recycle my skills as a physicist by going into the industry (typically in finance back then).
To be completely honest, I didn't like my job as a Quantitative Research Scientist at Quantlab Financial very much. Fortunately, I was still able to gain relevant technical skills. During my second year, I worked on an NLP model to predict stock market movements by analyzing the news. After that, I started interviewing for machine learning roles in the industry.
On Getting Into Data Science
In finance, I often worked with the Black-Scholes equation or time series models. I felt the urge to work on more sophisticated modeling methods. Around 2014, data science started to become popular, and the big tech companies started to put more emphasis on their data initiatives.
After my time at Quantlab, I tried to move to places where I could build things from the beginning. I decided to go to YuMe because they had recently acquired a company whose CEO was John Wainwright, the first customer of Amazon. John also pioneered pure object-based programming languages, which serve as the core of the game development and 3D animation technologies used today. For me, the opportunity to work with John mattered more than any potential salary or benefits. However, by the time I joined YuMe, John had resigned.
I ended up in a weird situation where I didn’t know exactly what the company wanted me to work on as a data scientist. I was forced into (almost) a managerial position making high-stake decisions for the company when I barely knew about data science myself. I quickly realized that I enjoyed this a lot; for instance, I was very efficient in communicating with engineering teams. I figured that what I was really good at was more data strategy than model development.
YuMe gave me many of the opportunities that a manager has, but they didn't give me the budget. It took a bit of time for somebody to trust me completely. I learned back in those days that if it's not the right place for you, you shouldn't be afraid to move. For me, it's crucial never to lose sight of what I need for my career.
“If I’m not with the right people or in the right environment, I shouldn’t be afraid to make the decision that enables me to grow.”
On Measuring Data Science ROI at Walmart Labs
When I joined Walmart Labs, I started in an individual contributor role as a principal data scientist. Later on, the muscle I wanted to exercise was working on large-scale initiatives (which would be easier in a big company).
During this period, a typical data science team was composed of people with Ph.D. degrees researching model development. However, many companies were struggling to convert those models into something that actually made money. In such a large company (like Walmart), there was no communication between the business and research sides. The data scientists were completely cut off from the business goals.
“To become an effective data scientist, you need to understand what the C-suite people want to achieve.”
Inside the Metrics-Measurements-Insights team that I managed, we talked to all the stakeholders in different teams, identified ways to measure success for them, found proxies for different measurements, and communicated whether we were moving the needle for them. You have to keep in mind that a company that started a data science team wants to see the ROI. If you are here building models that do not help people, there’s a chance that you would get laid off. Thus, companies must have this sort of initiative.
For me, it’s never too early for anyone who wants to enter the data space to understand that: Data is at the service of the business. Companies invest in big data initiatives to sell more products, attract more customers, make things easier, etc. If you don’t keep in mind that there is a business goal at the end of the day, you’re going to fail.
On Giving Industry Talks
I came from academia, where I was speaking all the time. I considered public speaking an opportunity to grow, network, and influence the market. When I went into industry, one heartbreak was: “Will I ever have the same opportunity again to be a thought leader?” That was what was missing during my prior positions at YuMe and Ayasdi.
While at Walmart Labs, on an accidental occasion, my boss was looking for somebody to deliver a talk at MLconf (one of the largest ML conferences in the US). I got my lucky break and delivered my first industry talk there (“Review Analysis: An Approach to Leveraging User-Generated Content in the Context of Retail”). Given my previous experience as a speaker, the talk went relatively well. From that point on, my organization started trusting me to give more talks.
“Giving talks is a unique way to evangelize and convince people to change things.”
My sense back then was that the industry was doing data science inefficiently and would face the next AI winter if nobody did anything about it. I wanted to express my own view in front of the market for that exact reason.
On Active Learning
The traditional way of building an ML model is supervised learning: given an acquired dataset, you annotate it and use it for your model. Active learning is basically a specialized way of doing semi-supervised learning. In semi-supervised learning, you work with some annotated data and some un-annotated data. In active learning, you prioritize strategic pieces of data by going back and forth between training and inference. You take a small set of data, annotate this data, train your model with this data, see how well the model performs, and think about what piece of data to focus on next.
A popular way of doing active learning is to look at some measurement of model uncertainty. You train your model with a little bit of data and perform inference on the rest of the dataset (which is not labeled yet). You can say: “It seems that the model is relatively sure that it makes the right predictions on these classes of data, so I won’t need to focus on them.”
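The loop described above can be sketched in a few lines. This is a minimal, illustrative uncertainty-sampling example, not Alectio's actual method: it uses synthetic two-blob data, scikit-learn's logistic regression, and arbitrary seed/batch sizes of my choosing.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Synthetic pool: two Gaussian blobs standing in for a large unlabeled dataset.
X = np.vstack([rng.normal(-2, 1, (500, 2)), rng.normal(2, 1, (500, 2))])
y = np.array([0] * 500 + [1] * 500)  # oracle labels, revealed only on request

labeled = list(rng.choice(len(X), size=20, replace=False))  # small seed set
unlabeled = [i for i in range(len(X)) if i not in labeled]

for _ in range(5):  # five active learning loops
    # Train on everything annotated so far.
    model = LogisticRegression().fit(X[labeled], y[labeled])
    # Run inference on the unlabeled pool and measure model uncertainty.
    proba = model.predict_proba(X[unlabeled])
    uncertainty = 1.0 - proba.max(axis=1)  # least-confident sampling
    # Query the 10 examples the model is least sure about.
    picks = np.argsort(uncertainty)[-10:]
    newly_labeled = [unlabeled[i] for i in picks]
    labeled.extend(newly_labeled)  # "annotate" them by asking the oracle
    unlabeled = [i for i in unlabeled if i not in newly_labeled]

print(len(labeled))  # 70 labels queried, out of a pool of 1,000
print(model.score(X, y))
```

The querying strategy (here, least-confidence) and the batch schedule are exactly the knobs Jennifer describes as hard to tune: change either one and the final model can land far from where full-dataset training would.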
I started using active learning at Walmart Labs because of our ridiculously small labeling budget. Furthermore, I realized that a big problem with regular active learning is that it needs to be tuned. Active learning is built on the principle of doing things in batches (or loops), but practitioners still don't quite know how to pick the right number of batches smartly.
I saw an analogy between active learning today with deep learning 10 years ago.
When I started my career, deep learning was becoming popular but was not quite at its peak yet. Many people gave it a try and concluded that deep learning did not work. It requires a lot of expertise to choose the right number of epochs, number of neurons, batch size, etc. If you don't do that accurately, you can fail miserably.
The same happens with active learning. If you have the wrong querying strategy (how you pick the next batch), you can end up in a situation where your model is way worse than it could have been with the entire dataset.
Another challenge with active learning is that it is compute-greedy. Although it saves you on labeling costs, you have to retrain your model regularly. Every time you restart from scratch, the relationship between the amount of data and the compute required scales as N². Given this tradeoff, you have to see whether it makes sense to spend compute for the sake of saving labels.
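A back-of-envelope sketch makes the quadratic blowup concrete. Assuming each loop retrains from scratch on all data labeled so far, with K equal batches the total samples processed is b(1 + 2 + … + K) = bK(K+1)/2, which grows quadratically in the final dataset size. The function and numbers below are purely illustrative:

```python
def retraining_cost(n_total, n_batches):
    """Total samples processed when every loop retrains from
    scratch on the cumulative labeled set (equal batch sizes)."""
    batch = n_total // n_batches
    return sum(batch * k for k in range(1, n_batches + 1))

# One-shot supervised training touches each sample once:
print(retraining_cost(100_000, 1))   # 100000
# 20 active learning loops with from-scratch retraining:
print(retraining_cost(100_000, 20))  # 1050000 -- 10.5x the compute
```

This is the tradeoff in the paragraph above: the labeling bill shrinks, but the training bill grows, so the loop count has to be chosen with both costs in mind.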
“At Alectio, we are building cost-effective active learning strategies.”
On Scaling The Search & Smarts Team at Atlassian
You can imagine that building a team like this is similar to an intrapreneurship endeavor. The profile of people who can help you is basically the profile of people you would want to hire in a startup. Obviously, you want people with the right type of diploma and relevant skill sets.
“But building an ML team is also not just about hiring the people who can build the models.”
We realized early on that we didn't have a centralized place to collect the data. Within the Atlassian landscape, we were part of the Platform department, whose goal is to enable whatever algorithms we build to be used in a product-agnostic way.
I followed the same strategy that I used at Walmart: talking with different stakeholders to understand their needs and their data.
I also invested heavily in data engineering. Specifically, I looked for somebody who can prepare the data, work with different databases, build a data lake, etc.
There’s an entire ecosystem around ML, so it takes a special project manager who can manage ML projects. ML is a hybrid of engineering and research, so both the schedule and the scale are different.
Eventually, more than half of my team were engineers rather than researchers.
On Organizational and Operational Challenges with Enterprise ML
What does it take to be successful with ML as an organization?
There are 3 components: the technology, the organization, and the operation. Today, the industry is good at technology. However, it’s evident everywhere I’ve been that we suck at the organizational and operational aspects. I’ve advised a lot of small and large companies alike. Nobody got it right.
The organizational part entails understanding how ML people interact with their colleagues and setting up a data culture. Unfortunately, we still live in a world where the C-suite execs don’t know how and what to do with the data collected.
The operational part mostly consists of brute-force approaches to get MVP solutions everywhere. MVPs are cool, but they are often inefficient and too expensive to prototype. I believe we can't win with ML if we don't get MLOps right. The good news is that over the last 18 months, many companies have popped up to help practitioners with various tasks of the ML lifecycle.
When we think about the ML lifecycle, we have data preparation (getting data into the right shape), model development (by default, what people think of when it comes to ML), and model deployment (putting models into the real world). There have been many investments from the VC community in model development and model deployment tools.
“The one thing that has not been operationalized properly thus far is the data preparation piece.”
Ask any data scientist out there how they spend their time, and they will say 75–80% of it goes to feature engineering, data labeling, data cleaning, etc. I believe there is not enough investment in, and not enough companies focused on, data preparation. It's also worth noting that data preparation goes beyond data labeling and data storage.
On Agile for Data Science Teams
Agile is already a somewhat old concept, created as a response to the waterfall methodology. The core idea behind agile is responding to the unexpected: the end goal stays the same, while the path to the end goal might change. Agile is a brilliant idea for engineering as a mechanism for responding to change. Atlassian essentially required every team to adopt the agile methodology, including the ML teams.
“While everybody understands agile for engineering, people in ML research will tell you that they don’t know how long it’ll take to build their models.”
For me, selfishly as a manager, I did think that machine learning needs something similar to agile. By helping researchers break down their tasks into small pieces, we could incorporate more predictability into their workflow. The ML scientists had to think about data collection, model validation, a list of models that might work for the current tasks, etc.
Another major finding for me was that a combination of different agile frameworks is still agile. The ML team consisted of both engineers and scientists, so I used separate frameworks for each. When a scientist comes up with the right model, they assist the engineer with implementing that model at scale. This approach enabled us to meet the agile requirements and made a huge difference in efficiency. I wish more people would talk about the necessity of project planning for researchers in general.
On Joining Figure Eight
Figure Eight used to be called CrowdFlower, the first enterprise-grade labeling company. Going back to my previous job at Walmart, I suffered a lot from not having enough budget for data labeling. I felt very passionate about that problem, which led me to look at research in active learning. During my time at Atlassian, I spent a lot of time getting the organizational part right, but not the operational part. I enjoyed my time at Atlassian, but back then, they weren't ready yet for large-scale data projects. I'm not a very patient person, so that experience was frustrating.
Figure Eight reached out to me to discuss smarter ways of doing labeling. Even today, most labeling is still done manually. With datasets becoming larger, ML itself becomes one potential way to automate (at least partially) the labeling process. I believe that in many circumstances, we need less data than we have.
“One way to help customers do things more efficiently is not to scale up labeling but to scale down data.”
The executive team at Figure Eight was very interested in this idea, but the board was not on board. As a labeling company, Figure Eight made money based on the volume of annotated data, so this idea implied a change in the business model. Figure Eight was already 9 years old when I joined, and they received an acquisition offer quickly, which meant it was way too late for me to pivot the company from a hard-core data labeling company into a data curation company.
Regardless, I enjoyed my time at Figure Eight, as I learned a ton about the labeling space. I strongly believe that sometimes you need to take jobs that don't seem to make sense for your career, because serendipitous opportunities come out of them.
On Founding Alectio
I consider myself a reluctant entrepreneur. My original thesis was that we should stop believing big data is the only solution for building better ML systems. I tried to evangelize this thesis at all of my previous employers. Eventually, I realized that nobody was tackling this huge problem. The concept of “Less Is More” is popular in our society, but not so much in big data. There's no doubt that big data unleashes real opportunities for ML. However, we are currently facing the reverse problem: we build bigger data centers and faster machines to deal with the massive amount of data. To me, this is not the wise approach. From an economic perspective, you can easily understand why some large companies have the incentive to make everybody believe that big data is necessary — the more data, the more money they will make.
I think about the sustainability of ML in two ways: (1) sustainable environments (fewer data centers and less electricity used for servers) and (2) sustainable initiatives from large organizations. Many problems have come from the scale of data that we need to tame. Any dataset is made up of useful, useless, and harmful data:
The useful data is what you want. It contains the information that helps your models.
The useless data is neither good nor bad. It wastes your time and money (and typically represents a big chunk of your dataset).
The harmful data is the worst kind. You store it and train models on it, only to get worse performance.
“For me, ML 2.0 is about demanding higher quality from the data.”
However, there is a distinction between the quality and the value of the data. Value depends on the use case. Data that is useful for model A won't necessarily be useful for model B. Thus, you need to perform data management in the context of what you are trying to achieve with the data. None of the data management companies are doing this today.
Alectio's mission is to urge people to tame their data and, to some extent, come back to sanity.
On Responsible AI
We want AI to be fair to the consumers. Delivering fair AI means that everybody has access to the same technology and benefits from the technology in the same way.
“We want AI to be the solution to the unfair society that we live in.”
One thing that scares me about the progress in AI is the disappearance of blue-collar jobs. That is not a bad thing in itself: we want people to move on to different jobs that are not dangerous. However, if we continue on our current path, the rich get richer and the poor get poorer. An incredible example of this is data labeling:
When a rich startup or a large company in Silicon Valley wants to build ML models, they need to annotate their data. They will rely on annotation/labeling companies.
These companies rely on actual people to do that job. Most of these labelers/annotators are based in developing nations like Kenya, Madagascar, Indonesia, the Philippines, etc.
There are instances where these companies took a large chunk of the money and did not compensate the labelers fairly. In some cases, the labelers are put in slavery-like conditions, forced to annotate without a proper salary.
There’s a huge opportunity to have poor people in those countries benefit from the AI economy via data labeling. But we need to ensure that we do not increase social disparity because of AI. I think there must be regulations in terms of how much they get paid.
On Finding Customers
When starting Alectio, I went to trade shows on my own. When I told people, “Did you know that you could build the same model with fewer data?”, they would laugh in my face. We were living in an age where people had been taught by academics, bosses, and friends that more is better. For me, in particular, there was a lot of educating to do. There were also many challenges with people not trusting an early-stage company.
Furthermore, people sometimes told me: “If this could be done, somebody would have done it already.” Why is nobody else doing it? Or does that mean I (specifically as a woman) can't do it? It was fun, at the beginning, to push my limits and respond to such objections.
On Hiring
Alectio is building an engine that identifies the useful data in large datasets in the context of a given model. This is a meta-learning problem; essentially, Alectio is a meta-learning platform. That makes it incredibly easy to attract talent. I would even say that we are the one true data science company, giving people the opportunity to build models and the tools to diagnose their learning mechanisms. As far as I'm concerned, this is the holy grail of ML.
Oftentimes, the people best suited for the job aren't the ones you would think. I have often interviewed people with Ivy League degrees and impressive resumes and ended up hiring people with less impressive credentials. Pushing yourself out of your comfort zone is one of Alectio's values.
“I hire mostly for the ability to push oneself and the ability to learn new things.”
This is true for any technology job. There were countless situations where I interviewed people with Ph.D.s and postdocs, and they did worse than people with Master's and Bachelor's degrees.
On Taking Advice
Advice is hard to generalize.
You should only take advice that is specific to your situation and with a grain of salt.
Don't act on any advice before thinking it through.
It’s important to find the right balance between accepting feedback and not taking any.
Take notes on any advice, but don't take it for granted by default.
On Navigating Tech As a Woman
A mistake that I have made over and over again is to behave like a man. When stepping into a manager position, you are likely to take your boss as a role model. It’s important to be yourself. This is true for everyone, but particularly for women.
Keep learning things. Never shy away from doing something that you are not good at. Go outside of your comfort zone.
Show Notes
(01:46) Jennifer shared her formative experiences growing up in France and wanting to be a physicist.
(03:04) Jennifer unpacked the evolution of her academic journey in France — getting Physics degrees at Louis Pasteur University, Paris-Sud University, and Sorbonne University.
(06:44) Jennifer mentioned her time as a Postdoctoral Researcher in Neutrino Physics at Duke University, where her research group lacked the funding to carry on scientific projects.
(09:35) Jennifer discussed her transition from academia to industry, working as a Quantitative Research Scientist at Quantlab Financial in Houston.
(13:31) Jennifer went over her move to the Bay Area, working for YuMe and Ayasdi — growing and managing early-stage data science teams at both places.
(19:19) Jennifer recalled her foray into becoming a Senior Data Science Manager of the Search team at Walmart Labs. She managed the Metrics-Measurements-Insights team and the Store-Search team.
(23:59) Jennifer shared the business anecdote that made her obsessed with measuring the ROI of data science.
(28:46) Jennifer reflected on the opportunity to give conference talks and become a thought leader in the data science community (watch her first industry talk, “Review Analysis: An Approach to Leveraging User-Generated Content in the Context of Retail” at MLconf 2016).
(31:10) Jennifer unpacked her interest in active learning and outlined existing challenges of making active learning performant in real-world ML systems.
(36:58) After 1.5 years with Walmart Labs, Jennifer became the Chief Data Scientist at Atlassian. She shared the tactics to grow the Search & Smarts team of scientists and engineers from 3 to 17 people in less than 6 months across 3 locations.
(40:31) Jennifer discussed the organizational and operational challenges with making ML useful in enterprises and the importance of data preparation in the modern ML stack.
(47:24) Jennifer elaborated on “Agile for Data Science Teams,” arguing that organizations that invest in ML but do not get the organizational side of things right will fail.
(53:09) Jennifer went over her decision to accept a VP of Machine Learning role at Figure Eight, then a frontier startup that offers enterprise-grade labeling solutions to ML teams.
(57:56) Jennifer went over the inception of her startup Alectio, whose mission is to help companies do ML more efficiently with fewer data and help the world do ML more sustainably by reducing the industry’s carbon footprint.
(01:04:32) Jennifer unpacked her 4-part blog series about responsible AI that calls out the need to fight bias, increase accessibility, and create more opportunities in AI.
(01:09:06) Jennifer discussed the hurdles she had to jump through to find early adopters of Alectio.
(01:11:03) Jennifer emphasized the valuable lessons learned to attract the right people who are excited about Alectio’s mission.
(01:14:38) Jennifer cautioned the danger of taking advice without thinking through how it can be applied to one’s career.
(01:17:09) Jennifer condensed her decade of experience navigating the tech industry as a woman into concrete advice.
(01:19:19) Closing segment.
Jennifer’s Contact Info
Alectio’s Resources
What Is Alectio? (Video)
Is Big Data Dragging Us Towards Another AI Winter? (Article)
Mentioned Content
Talks
The Day Big Data Died (Oct 2020 @ Interop Digital)
The Importance of Ethics in Data Science (Keynote @ Women in Analytics Conference 2019)
Introduction to Active Learning (ODSC London 2018)
Agile for Data Science Teams (Strata Data Conf — New York 2018)
Big Data and the Advent of Data Mixology (Interop ITX — The Future of Data Summit 2017)
The Limitations of Big Data In Predictive Analytics (DataEngConf SF 2017)
Review Analysis: An Approach to Leveraging User-Generated Content in the Context of Retail (MLconf 2016)
Articles
1 — Women vs. The Workplace Series
Gender Discrimination (Oct 2015)
Why Leading By Example Matters (Jan 2017)
Data Scientist: the SexISTiest Job of the 21st Century? (Feb 2017)
The Role of Motherhood in Gender Discrimination (March 2017)
The Biggest Challenges of the Female Manager (May 2017)
Parity in the Workplace: Why We Are Not There Yet (Dec 2017)
The Pyramid of Needs of Professional Women (Dec 2017)
2 — Management Series
The Secrets to Successfully Managing an Underperformer (June 2017)
The Top Secrets to Managing a Rockstar (July 2017)
The Real Cost of Hiring Over-Qualified Candidates in Technology (March 2018)
Team Culture (May 2018)
3 — Responsible AI Series
How We Got Responsible AI All Wrong (Part 1)
Increasing Accessibility to AI (Part 3)
Creating More Opportunities in AI (Part 4)
Book
“Managing Up” (by Rosanne Badowski and Roger Gittines)
Notes
Jennifer told me that Alectio is about to launch a community version this fall, in which people will be able to compete to build the best model with the minimum amount of data. Be sure to check out their blog and follow them on LinkedIn.
About the show
Datacast features long-form, in-depth conversations with practitioners and researchers in the data community to walk through their professional journeys and unpack the lessons learned along the way. I invite guests coming from a wide range of career paths — from scientists and analysts to founders and investors — to analyze the case for using data in the real world and extract their mental models (“the WHY and the HOW”) behind their pursuits. Hopefully, these conversations can serve as valuable tools for early-stage data professionals as they navigate their own careers in the exciting data universe.
Datacast is produced and edited by James Le. Get in touch with feedback or guest suggestions by emailing khanhle.1013@gmail.com.
Subscribe by searching for Datacast wherever you get podcasts or click one of the links below:
If you’re new, see the podcast homepage for the most recent episodes to listen to, or browse the full guest list.