Datacast Episode 41: Effective Data Science with Eugene Yan
The 41st episode of Datacast is my conversation with Eugene Yan — an Applied Scientist at Amazon. Give it a listen to hear about his educational background in psychology and organizational behavior, his transition to Data Science at IBM, his experience moving up the data science ranks at Lazada Group, his online education via Georgia Tech, his work on machine learning at uCare.ai, his thoughts on agile/scrum development, culture, writing, and much more.
Listen to the show on: (1) Spotify, (2) Apple Podcasts, (3) Google Podcasts, (4) Stitcher, (5) Overcast, and (6) iHeart Radio!
Key Takeaways
Below are highlights from my conversation with Eugene.
On Working at IBM
I joined IBM as a Data Analyst and moved across projects. I first built dashboards and insights for IBM’s supply-chain processes. I then worked on social media analytics, figuring out the social voice for a couple of IBM’s clients. I was also involved with anti-money laundering, helping the big banks detect those patterns and come up with strategies to combat them.
I transitioned to an internal Data Scientist role within IBM’s Workforce Analytics group (thanks to my background in Psychology). I forecasted which jobs would be in demand and then used those predictions to build an internal job recommendation engine for IBM.
On Working at Lazada Group
I presented my Kaggle project on product classification at a meetup in Singapore. Coincidentally, people at Lazada saw my talk and were working on a similar problem. They gave me a chance to join as a data scientist.
I can’t tell you how many A/B tests I conducted that lost so much money for Lazada, but my colleagues gave me opportunities to fix them. We ended up with a better customer experience and product ranking system after those failed A/B tests.
As a senior data scientist, I stopped putting too much investment into the technical details. Instead, I helped scale the team, increase the outputs, design the right practices, and mentor team members. I had many one-on-one meetings to understand people’s career aspirations and find the right projects that align with their motivations. I found pleasure looking at team morale, customer satisfaction, and employee retention.
As a Vice President of Data Science, I was given 9 months to facilitate the migration of Lazada’s tech stack into Alibaba Group’s platform (due to the acquisition). We faced a language barrier, as the Alibaba team mostly conversed in Mandarin and most of my team members didn’t speak Mandarin. We worked the 9–9–6 work routine and flew to China twice a week to meet the deadline.
I interviewed many people on the data science team to understand how to intentionally design a strong culture. I looked up to Amazon and Netflix as examples. The 5 cultural practices that I emphasized were ownership, collaboration, communication, research and innovation, and impact.
On Pursuing an MS Degree at Georgia Tech
I suffered from chronic imposter syndrome. While I learned a lot about data science and programming on my own via Coursera, I felt like I was lacking the fundamentals and the structured format of schooling. Georgia Tech is really cheap, the professors are world-class, and the program could be completed part-time.
In order to study and work simultaneously, I allocated 20–40 hours per week (mostly on weekday nights and weekends) for the lectures and assignments. The video lectures were on-demand, so I could watch them asynchronously. Class participation happened in Slack and forums, which I found to be even better than in-person classrooms.
On Working at uCare.AI
In healthcare, if you use data science to recommend a treatment, suggest a diagnosis, or estimate a hospital bill, your customers want to know the reasons leading to that output. There is a higher bar for model explainability and for avoiding false-positive outcomes.
The financial incentive for data science in healthcare is not yet mature. It is very difficult to change patient behavior and reduce the financial burden for the healthcare system.
On Working at Amazon
In 2019, I wanted to get out of my comfort zone and actively applied for overseas roles. In preparing for interviews, I struggled with the live coding component, so I used Leetcode and HackerRank for practice. I had a lot of interviews at odd hours back in Singapore and had to fly overseas to do onsite interviews.
At Amazon, I am working on Kindle to help people find books more easily via recommendation engines and to understand user intent via natural language processing.
On Data Science + Agile Development
It’s important to iterate fast and time-box projects. I would rather build a product in two months that may or may not work than spend two years on it.
It’s important to prioritize the roadmap based on business needs. Don’t build something sexy and hope that customers will like it. Start from the problem, then find the solution.
Demos are super fun. People work fast, are eager to share the results, and appreciate feedback for their work.
The retrospective is a process that includes what went well, what didn’t go well, and what was puzzling.
On Maintaining Models In Production
In my experience, it’s not difficult to deploy models in production, thanks to the available tooling. I’m not seeing much discussion on how to maintain models after deployment and minimize the operational burden.
Check the basic statistics of input and output distributions. Make sure to validate model performance using a test set. Set up a UI-friendly tool for the data scientists to interact with.
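As a rough illustration of checking basic statistics on input distributions, here is a minimal Python sketch. It is not from the episode: the function names (summarize, drifted), the sample ages, and the three-standard-error threshold are all hypothetical choices made for the example, and a real monitoring setup would likely track many more features and statistics.

```python
from statistics import mean, stdev


def summarize(values):
    """Return basic statistics for a list of numeric values."""
    return {
        "mean": mean(values),
        "std": stdev(values),
        "min": min(values),
        "max": max(values),
    }


def drifted(reference, live, max_shift=3.0):
    """Flag drift when the live mean is more than `max_shift` standard
    errors away from the reference mean (an assumed rule of thumb)."""
    ref = summarize(reference)
    # Standard error of a batch mean, using the reference spread.
    standard_error = ref["std"] / len(live) ** 0.5
    if standard_error == 0:
        return mean(live) != ref["mean"]
    return abs(mean(live) - ref["mean"]) / standard_error > max_shift


# Hypothetical feature values: patient age at training time vs. in production.
training_ages = [31, 35, 29, 42, 38, 33, 36, 40, 30, 34]
stable_batch = [32, 36, 30, 41, 37, 34, 35, 39, 31, 33]
shifted_batch = [58, 61, 55, 63, 60, 57, 62, 59, 56, 64]

print(drifted(training_ages, stable_batch))
print(drifted(training_ages, shifted_batch))
```

The same comparison could be run on model outputs (for example, predicted bill amounts) to catch shifts in the output distribution as well.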
On Writing
From my conversations with other data scientists, communication is the most important attribute of effective data scientists.
Writing doesn’t actually start when you sit down to write; it starts during the process of consuming information and reading materials. Writing is about collecting notes and tidying up the information to turn it into a blog post.
I use a tool called Roam to take literature notes. Each note includes relevant tags. When I start writing, I collect those notes and use them to craft the piece.
Show Notes
(2:19) Eugene got his Bachelor’s degree in Psychology and Organizational Behavior from Singapore Management University, in which he did a senior thesis titled “Competition Improves Performance.”
(3:29) Eugene’s first role out of school was an Investment Analyst position at Singapore’s Ministry of Trade & Industry.
(4:18) Eugene then moved to a Data Analyst role at IBM, working on projects such as supply-chain dashboards, social media analytics, and anti-money laundering detection.
(5:55) Eugene transitioned to an internal Data Scientist role at IBM, working on job forecasting and job recommendations.
(9:03) Eugene shared the story of how he became a Data Scientist at Lazada Group, which was a small e-commerce startup back in 2015.
(12:08) Eugene explained his decision to go back to school and pursue an online Master’s degree in Computer Science at Georgia Tech.
(19:14) Eugene shared his career milestones, as displayed in his blog post reflecting on his journey from getting a degree in Psychology to leading data science at Lazada.
(22:17) Eugene discussed the unique data science challenges while working at uCare.ai — a startup that aims to make healthcare more efficient and reduce costs.
(25:29) Eugene revealed three useful tips to deliver great data science talks (read his blog post “How to Give a Kick-Ass Data Science Talk” for the details).
(28:29) Eugene talked about his transition to become an Applied Scientist at Amazon — working on Amazon Kindle.
(30:43) Eugene unpacked his post “Commando, Soldier, Police, and Your Career Choices” that provides an interesting metaphor to help guide career decisions.
(33:43) Eugene went meta on his writing process (read here) and note-taking strategy (read here).
(39:01) Eugene shared the lessons learned from taking on responsibilities in hiring, mentoring, and stakeholder engagement in his second year at Lazada (read his blog post on the first 100 days as a Data Science Lead).
(44:20) Eugene went in-depth into the engineering and cultural challenges throughout Alibaba Group’s acquisition of Lazada Group.
(47:51) Eugene explained Alibaba’s playbook for the technical integration of their acquisitions and the super-apps phenomenon in Asia (check out a summary of his talk on Asia’s Tech Giants).
(53:52) Eugene unpacked the values and essential aspects of Lazada’s data science team culture, as detailed in his post “Building a Strong Data Science Team Culture.”
(57:44) Eugene summarized his thoughts on the topic of data science and agile/scrum development (Read his 3-part blog series: Part 1, Part 2, and Part 3).
(01:03:18) Eugene was heavily involved with the development of product ranking, product recommendations, and product classification models in his first year at Lazada (check out slides to his talk “How Lazada Ranks Products”).
(01:09:08) Eugene helped mentor and empower teams on multiple machine learning systems while acting as VP of Data Science at Lazada (check out slides to his talk “Data Science Challenges at Lazada”).
(01:12:07) Eugene shared the case study of how uCare.ai developed a machine learning system for Southeast Asia’s largest healthcare group that estimates a patient’s total bill at the point of pre-admission.
(01:14:06) Eugene summarized his 2-part series that exposes the challenges after model deployment and yields a practical guide to maintaining models in production.
(01:19:04) Eugene discussed his early-career Product Classification project that uses a public Amazon dataset and builds two APIs for image classification & image search.
(01:22:29) Eugene discussed his 2-part series that implements seven different models on the same Amazon dataset, from matrix factorization to graphs and NLP.
(01:24:42) Closing segment.
His Contact Info
His Recommended Resources
Niklas Luhmann (well-known German sociologist)
Roam Research (note-taking application)
MLflow (A platform for ML lifecycle management)
Amazon Product Review Dataset (big data in JSON format)
Andrej Karpathy (Read “The Unreasonable Effectiveness of RNNs” and “A Recipe For Training Neural Networks”)
Jeremy Howard (Read the “Universal Language Model Fine-tuning for Text Classification” paper)
Hamel Husain (Check out GitHub Actions and fastpages)
“Introduction to Statistical Learning” (by Trevor Hastie and Rob Tibshirani)
“The Pragmatic Programmer” (by Andy Hunt and Dave Thomas)
applied-ml repository
ml-survey repository