James Le

Datacast Episode 111: Astrophysics, Visualization Recommendation, and Scalable Data Science with Doris Lee

The 111th episode of Datacast is my conversation with Doris Lee, the CEO and Co-Founder of Ponder, a startup delivering scalable, enterprise-ready Pandas that improves the productivity of data teams.

Our wide-ranging conversation touches on her early interest in astrophysics at UC Berkeley, her Ph.D. work developing data science tools to accelerate insight discovery, her current journey with Ponder building a scalable and enterprise-ready version of Pandas, lessons learned from building an open-source project and hiring early-stage startup employees, and much more.

Please enjoy my conversation with Doris!

Listen to the show on (1) Spotify, (2) Apple, (3) Google, (4) Stitcher, (5) RadioPublic, and (6) iHeartRadio

Key Takeaways

Here are the highlights from my conversation with Doris:

On Her Academic Interests in Astrophysics

I began my career in physics and astrophysics, where I encountered some of the challenges scientists face when working with data. This was around 2014-2016 when the data tooling landscape looked quite different from today. At that time, data science was still nascent, but I was interested in applying emerging data science techniques to my astronomy research.

It was then that I realized how difficult the job of a data scientist can be. Not only do they need to be experts in their field, but they also need to know statistics, machine learning, programming, and how to work with large datasets. Even the most basic task with big data can take weeks, and this challenge fueled some of my initial inspirations for further research.

Source: https://physics.berkeley.edu/research-faculty/astrophysics

Some of the data I worked with came in the form of survey data from telescopes, meaning astronomical images captured by the telescope and simulation data. In theoretical astrophysics, researchers often simulate young stars or black holes, which was one of the topics I worked on. Working with this data, which is on an astronomical scale, requires applying cutting-edge statistical techniques and machine learning methods to extract insights from it.

As an undergraduate, I was interested in learning more about astronomy and found that research was the best way to do so while leveraging cutting-edge tools and working on challenging problems. I enjoyed my time doing research, starting with my first summer and continuing throughout my time at Berkeley.

On Her Ph.D. Pursuit at UIUC

I mentioned earlier how difficult it is for scientists to work with data and gain insights. As a result, I entered the Ph.D. program at UIUC, intending to build tools and systems to make it easier for people to visualize and explore their data. I focused on non-programmers, like myself, who do not have a computer science background. At a high level, I was excited about the potential of unlocking knowledge as a society and driving better decision-making through data accessibility. This theme carried across both my research during the Ph.D. and some of the work I am doing now.

My research interests aligned with the professors and faculty there at the time, which allowed me to pursue this specific line of research. There was a supercomputing center across from the Computer Science division, which was a great way to engage with domain scientists as I built these tools. I worked on this during the first couple of years of my Ph.D.

On Visualization Recommendation

Source: http://zenvisage.github.io/

At the beginning of my Ph.D., I focused on the process of exploratory data analysis. Specifically, many people use visualization to explore their data, resulting in the need to generate many visualizations to find insights. For example, in the case of ZenVisage, we addressed the problem of pattern search for line charts. If an astronomer is interested in a line chart with a specific trend, they might have to sift through thousands of visualizations to find one that matches. This is a tedious and error-prone process that requires generating a great deal of analysis.
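To make the pattern-search idea concrete, here is a minimal sketch in plain Python. It is not ZenVisage's actual algorithm; it assumes that "matching a trend" can be approximated by z-normalizing each series and ranking by Euclidean distance to a query shape. The function names and the light-curve data are hypothetical.

```python
from math import sqrt
from statistics import mean, pstdev


def z_normalize(series):
    """Rescale a series to zero mean and unit variance so only its shape matters."""
    mu, sigma = mean(series), pstdev(series)
    if sigma == 0:
        # A flat series has no shape; map it to all zeros.
        return [0.0 for _ in series]
    return [(x - mu) / sigma for x in series]


def shape_distance(a, b):
    """Euclidean distance between two equal-length series after z-normalization."""
    za, zb = z_normalize(a), z_normalize(b)
    return sqrt(sum((x - y) ** 2 for x, y in zip(za, zb)))


def rank_by_pattern(query, candidates):
    """Return candidate names ordered from most to least similar to the query shape."""
    return sorted(candidates, key=lambda name: shape_distance(query, candidates[name]))


# Hypothetical light curves (brightness over time) for three objects.
light_curves = {
    "star_a": [1, 2, 3, 4, 5],        # steadily rising
    "star_b": [5, 4, 3, 2, 1],        # steadily falling
    "star_c": [10, 21, 29, 41, 50],   # rising, on a different scale
}
rising_query = [0, 1, 2, 3, 4]  # the trend the analyst is looking for

ranking = rank_by_pattern(rising_query, light_curves)
print(ranking)
```

Normalizing first means a chart is matched on its trend rather than its absolute values, so the analyst can sketch a rough shape instead of scanning every chart by hand.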

To address this problem, I worked on building visualization recommendation systems. These systems automatically recommend interesting insights based on statistical analysis or properties the system discovers, helping accelerate users to insights. These systems often address questions like what interesting filters to apply to data and where the differences are meaningful.
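As a rough illustration of how such a system might decide which filters are "interesting," the sketch below scores each candidate filter by how far the filtered mean departs from the overall mean. This scoring rule is an assumption chosen for simplicity, not the method used in any of the systems mentioned here, and the records and function names are hypothetical.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical records: each row is one observation.
rows = [
    {"region": "north", "instrument": "A", "brightness": 10.0},
    {"region": "north", "instrument": "B", "brightness": 11.0},
    {"region": "south", "instrument": "A", "brightness": 30.0},
    {"region": "south", "instrument": "B", "brightness": 29.0},
]


def filter_scores(rows, measure, dimensions):
    """Score each (dimension, value) filter by |filtered mean - overall mean|."""
    overall = mean(r[measure] for r in rows)
    groups = defaultdict(list)
    for r in rows:
        for dim in dimensions:
            groups[(dim, r[dim])].append(r[measure])
    return {key: abs(mean(vals) - overall) for key, vals in groups.items()}


def recommend_filters(rows, measure, dimensions, top_k=2):
    """Return the top_k filters whose subsets deviate most from the whole dataset."""
    scores = filter_scores(rows, measure, dimensions)
    return sorted(scores, key=scores.get, reverse=True)[:top_k]


scores = filter_scores(rows, "brightness", ["region", "instrument"])
top = recommend_filters(rows, "brightness", ["region", "instrument"])
print(top)
```

In this toy data, filtering by region changes the average brightness while filtering by instrument does not, so the region filters are the ones surfaced to the user.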

While there was a lot of research work in visualization recommendation systems, few of these systems made it into commercial products or scientific tools. Therefore, one of the themes of my research was to apply these principles to tools that people can adopt in their day-to-day work, making them more productive in their analysis.

I do not think there was much commercial adoption initially. At the time, I was working with astronomers, material scientists, and other domain scientists who were struggling with the tools available to them. My focus was on making these tools more useful for these domain scientists. The commercial aspect came later after I moved to Berkeley.

On The RISE Lab and I School at Berkeley

Source: https://rise.cs.berkeley.edu/

I am fortunate to have been a part of both the RISE Lab and the I School (the School of Information) at Berkeley. These research institutions provided very different perspectives that complemented each other well. The School of Information at Berkeley is known for cutting-edge research in highly interdisciplinary fields, while the RISE Lab is focused on creating impactful open-source tools.

My Ph.D. work centered around human-computer interaction, data management, and data science. It was driven by user adoption rather than just creating novel tools. The RISE Lab was unique because it incubated successful open-source startups and provided tools to the public, creating a synergy between industry and academia.

Working at the I School with outstanding researchers like Marti Hearst helped me understand that there were different user-driven evaluation methods beyond just building systems. This allowed me to make the evaluation and analysis of systems and research more holistic.

The RISE Lab was an excellent place for cross-disciplinary research, where everyone brought their strengths and skills. Multiple students were interested in data science, and there was a culture of collaboration and working together to solve problems from different angles. This collaboration led to many successful projects from the RISE Lab, which I thoroughly enjoyed being a part of.

On Her Ph.D. Dissertation

https://medium.com/multiple-views-visualization-research-explained/insight-machines-the-past-present-and-future-of-visualization-recommendation-2185c33a09aa

The focus of my Ph.D. dissertation is to make data exploration and visualization easier and more accessible through automation. Many previous studies focused on obtaining insights more quickly and presenting more interesting insights to users. However, my dissertation delved deeper into the question of adoption. Specifically, how do these tools fit into a user's typical analysis workflow? In collaboration with Tableau, we wrote a paper examining how people ask questions and the subsequent steps they take in their analysis. We then used this information to build a recommendation system.

Other questions I explored included: How can we design these recommendation tools to be embedded in the user's workflow without being intrusive? How can they be helpful along the exploration path? The answers to these adoption-driven questions, and the ways we design these tools, form the basis of our approach at Ponder, which prioritizes user-driven adoption and tool design.

On Developing Lux

Source: https://github.com/lux-org/lux

Visually exploring data with data frames can be a tedious process. In visual data exploration, many decisions are needed to gain insights. Working with data frames adds an extra challenge, especially in a Jupyter notebook environment where people are iterating on their data and trying different transformation and cleaning methods. Even though people can try out many different things in a notebook context, it is still difficult to get insights due to the reasons mentioned earlier. Additionally, when working with data frames, people must write a lot of code to generate a single visualization, which can be time-consuming and lead to many throwaway iterations. This friction in the process hinders how people do their analysis and sometimes introduces challenges.

To lower this friction of exploration, we developed Lux, a visualization tool built on top of the Pandas data frame. Essentially, Lux searches for visual insights to display automatically to the user. When working with a Pandas data frame, users typically see rows and columns of data when they print it out. This raw table is useful but limited. With Lux, users get the table view and a dashboard of visualizations and insights that are automatically surfaced. Users can toggle between the visual and tabular views with a single button. The visualizations are always on and displayed at every point in the analysis without requiring users to change their code or analysis methods. The design principle behind Lux is to give users the superpower of looking at more visualizations and doing their analysis more quickly. This is part of the core inspiration of our work at Ponder, which aims to give data scientists and teams the superpower of running their analyses more effectively.

On Open-Source Engagement For Lux

Working with the open-source community and seeing how users utilize Lux has been one of the most rewarding parts of my Ph.D. As mentioned earlier, Lux underwent many design iterations before reaching its current state. When we initially released it, there was little traction and few downloads because the API was cumbersome and required a lot of code to print out the visualization. It was not until we reached a very minimal design, where you print out the data frame and automatically get all visualizations displayed, that Lux gained popularity. Throughout the development process, we addressed scalability issues and ensured the tool could work with various data types.

Unlike my experiences with research prototypes, working on Lux involved addressing thousands of bugs and edge cases that users can encounter in analysis situations. I spent six to twelve months ensuring that all edge cases and bugs were covered, resulting in a polished and user-friendly experience. The team I worked with, including faculty and students, contributed significantly to the open-source project's success, making it a team effort.

Regarding prioritizing features and bug fixes, we considered user requests and stack-ranked them based on their frequency of use and impact on the user experience. Issues related to getting data into Lux and ensuring compatibility with Pandas were a top priority, while obscure Pandas errors were considered a lower priority. Overall, Lux's success was due to the engineering work that went into its adoption and the team effort that made it possible.

On The Founding Story of Ponder

Source: https://www.ischool.berkeley.edu/news/2022/phd-alum-doris-lee-wants-democratize-data-science-tools

My co-founders and I had been working on open-source projects at the RISE Lab for a while: Devin led the Modin project while I led the Lux project. In all of our work, we focused on the experience of people using the Pandas data frame. As we talked to users in our open-source community, we repeatedly heard from data scientists and data teams about the difficulty of using Pandas on large datasets. For those unfamiliar with Pandas, it is a Python data analysis library often considered the most important tool in data science. Pandas is amazing because it is flexible and expressive, allowing you to do many things with your data, such as feature engineering, data cleaning, data analysis, and even visualizations. With the growth of AI and machine learning, there is a need to go back to the basics and clean, wrangle, and transform your data before feeding it into a model. This is where we see the strongest pull for Pandas. However, while Pandas is great for operating on small amounts of data in a Jupyter notebook, it does not work well on large datasets.

Inspired by the stories we heard, the three of us built Ponder with the idea that data teams should be able to operate on data at scale using the tools they know and love, which in this case, means working with Pandas. One of the key value propositions of both Lux and Modin is that users do not have to change a single line of code to get scalability and usability benefits. This became the core of what we worked on with Ponder.

Source: https://ponder.io/

We started working on this idea towards the end of our Ph.D.s, a semester before we finished our program. At that time, Devin was working on Modin, and I was working on Lux. We collaborated on various research projects and started talking about scalability issues with Pandas. Even in Lux, we saw many scalability concerns and issues. There was a lot of synergy across the two projects we worked on, and we started thinking about how to make the most impact with the existing traction we had built within the open-source community after we graduated. This sowed the seeds for starting a company around it.

The name "Ponder" was chosen because it reflects the focus on thinking about the data during the exploratory and analysis phases of data analysis. Our tools help people ponder their data. Another interesting play on words is that "Ponder" and "Pandas" can both be abbreviated as "PD," which fits our tagline: "Ponder: making enterprise-ready Pandas a reality." Our logo is a "P" and a "D" combined into a square, which you can check out on our website.

On The Fragmentation Challenge Across The Data Stack

Source: https://ponder.io/technology/

Pandas is an extremely popular tool used in virtually every data organization that utilizes Python for data science and analysis. However, Pandas is unfortunately not very efficient and does not scale well. As a result, there is often a divide between the small-scale prototyping and experimentation done with data and the large-scale deployment and production processes. This is primarily due to the challenges of working with large amounts of data that need to be processed. This creates a huge gap between small-scale and large-scale workflows, and as a result, users are often forced to undertake the costly process of translating their Jupyter notebooks with Pandas code into a big data framework such as Spark or SQL. This process can take up a significant amount of time and engineering resources, and it is not an efficient use of the data team's time, as the workload is essentially the same, just with the ability to run on a cluster or handle large amounts of data. It is a matter of rewriting the entire workflow.
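The rewrite burden described above can be seen even on a toy workload. The sketch below expresses one aggregation twice: once in Pandas, as it might appear in a notebook, and once in SQL. It uses Python's built-in `sqlite3` purely as a stand-in for a production database or big data engine; the table, columns, and values are hypothetical.

```python
import sqlite3

import pandas as pd

# Hypothetical records shared by both versions of the workload.
records = [("north", 10.0), ("north", 14.0), ("south", 30.0)]

# Prototype: the aggregation as it might be written in a notebook.
df = pd.DataFrame(records, columns=["region", "sales"])
pandas_result = df.groupby("region")["sales"].mean().to_dict()

# Rewrite: the same aggregation expressed again in SQL for a database engine.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, sales REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?)", records)
sql_result = dict(
    conn.execute("SELECT region, AVG(sales) FROM sales GROUP BY region")
)
conn.close()

# Both versions compute the same answer; only the expression of the logic differs.
print(pandas_result, sql_result)
```

A single `groupby` translates easily, but a real notebook contains many chained transformations, and each one has to be re-expressed and re-validated in the target framework.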

Anecdotally, one data team we spoke to recently went through a six-month process of rewriting one of their existing Pandas workloads into Spark. This underscores the inefficiency of the process and the need for a better solution. Virtually all the data teams we spoke to using Pandas have experienced this pain point at some point in their data analysis journey. It is a fundamental issue, as using Pandas is a basic part of the data science process. Unfortunately, it remains an open problem.

On Modin

Source: https://ponder.io/how-do-we-parallelized-600-pandas-functions-with-modin/

At Ponder, we believe that data scientists should be able to focus on their analysis and get insights without worrying about scalability challenges that can slow them down and hinder their work. That is why we developed an open-source tool called Modin at the RISE Lab.

Modin is a faster and more scalable version of Pandas, acting as a drop-in replacement. To use Modin, you only need to change a single line of import: import modin.pandas as pd. With this change, you can continue to use all the same Pandas code and API but operate on much larger amounts of data without running into typical issues you would encounter using Pandas alone. One of the reasons for this is that Pandas is inherently single-threaded and not parallelized, whereas Modin parallelizes over 600 of the Pandas API functions.
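As a small illustration of the drop-in idea, the snippet below runs with plain Pandas; under the assumption that Modin and one of its execution engines are installed, swapping the import line is the only change the code would need. The data frame contents are hypothetical.

```python
import pandas as pd
# With Modin installed, the line above would be replaced by:
# import modin.pandas as pd

# Hypothetical job runtimes; everything below is ordinary Pandas API code.
df = pd.DataFrame(
    {
        "team": ["a", "a", "b", "b"],
        "runtime_s": [1.0, 3.0, 2.0, 6.0],
    }
)

# Average runtime per team.
mean_runtime = df.groupby("team")["runtime_s"].mean()
print(mean_runtime.to_dict())
```

On a four-row frame there is nothing to gain from parallelism; the point is only that the analysis code itself stays untouched when the import changes.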

Modin's strong community of contributors and users has helped us prioritize and focus on important features. For example, Modin's ability to cover a large percentage of the Pandas API allows users to scale up their workflows without changing anything about their Pandas code. This is achieved by optimizing a small set of core algebra based on principles from relational databases and distributed systems.

Empowering data scientists to scale up their workflows with zero code changes is a complex research and engineering challenge, but we are proud to have accomplished it with Modin.

On Go-To-Market

Source: https://discuss.modin.org/

At Ponder, we invest heavily in the open-source community. We believe in building and serving our community, which is at the core of our mission. Currently, our two open-source projects have over six million downloads and over 13,000 stars. We have seen many companies and data teams adopting these tools from the enterprise and customer side. In fact, Modin and Lux are being used in 10% of the Fortune 100 companies. This emphasizes our core value proposition: that data scientists should not have to change anything about their code or the way they use Pandas to get all the enhancements and superpowers in how they work with data. These projects address a critical pain point that all data scientists using Pandas experience.

As part of our go-to-market strategy, we are working to support our community in using these tools in their data science and data engineering teams. We want to make our users more productive with their work by building the tools that they need.

Our engineers are very active in the open-source community. We pay close attention to the issues raised by our community to ensure that we are always thinking about our engineering roadmap ahead. When support issues arise, it gives us an excellent opportunity to learn about our users' needs and challenges. This helps us prioritize which bugs to address first.

On Finding Customers

Source: https://ponder.io/

A strong and growing customer base organically comes to us from the open-source community. We see the most interest in using these tools from sectors including healthcare and pharmaceutical companies, tech companies, AI/machine learning, and finance. It is an interesting mix of customers and users, and we see various use cases as we work closely with these early customers and design partners.

We are helping them put Modin into production, and in many cases, we have seen success stories in which Modin delivered significant speedups and performance improvements for our early customers. It has been a rewarding journey, and we have learned a lot from working with these early adopters. The traffic has largely been organic, as we were in stealth mode until March last year. People are coming to us because Modin is the only solution that solves their problem, and they appreciate the ability to work with it without having to change a single line of code.

Our engineering team is working hard on both the open source and product side, with our product roadmap being driven by insights from early design partnerships and customer needs. We have a good feedback loop between our customer base, open-source users, and the enterprise side, which drives our product development and design efforts. We are excited to see what is new over the next couple of months.

On Hiring

Source: https://ponder.io/my-ponder-internship-experience-bill-wang/

When it comes to hiring, my co-founder and I both came from non-traditional paths into tech. As a result, we share a vision of building a company that values the diversity of opinions and ensures that all voices are heard. We believe that people bring great ideas and that there is tremendous potential for what a team can achieve when they listen to and try to understand each other's perspectives. This culture of healthy debate and constructive feedback that we have built is evident to all our hiring candidates and anyone considering joining our team at Ponder. It has been vital to how we have attracted great talent and people who are excited about our mission.

Building a company and thinking about hiring requires a mix of book knowledge and first principles thinking to make the right decisions for the company. In our case, the three co-founders leveraged our past experiences and sat down to define our team's goals and the culture we wanted to build.

During our first wave of hiring, we were mostly in stealth mode, so we hired people from our network. These folks included individuals who had contributed to our open-source communities, people with whom we had worked before, and those from academia or Berkeley.

Now that we are out of stealth mode, we are hiring for engineering positions across the board. We are excited to see the diverse team we have been able to build, and we remain committed to maintaining a culture that supports growth, mentorship, ownership, and mutual support as we continue to grow our company.

On Fundraising

Source: https://ponder.io/company/#investors

The advice I am about to give is often overlooked in the entrepreneurship circle. Founders tend to focus heavily on the first part of fundraising, which involves finding investors, pitching, and creating an impressive slide deck. My advice, however, pertains more to the second aspect of fundraising, which is frequently neglected: aligning values with investors.

It is important to seek out investors who share your values and vision for the company, as they will become long-term partners. This alignment is the foundation for everyone's operations. We have been fortunate to have investors from Lightspeed, Intel Capital, 8VC, and the House Fund who share our values and have been immensely supportive in helping us execute and pursue our mission. I am excited to see what the future holds for our team.

On Being A Researcher vs. Being A Founder

As a researcher, I learned when it was necessary to leverage and build on existing work. However, it was also crucial to think from first principles and develop something new, and this applies to both the company and research. This mindset is valuable in every aspect of company building, including hiring, fundraising, sales processes, and customer interactions. As a team, we are mission-driven, and this values-driven approach informs the everyday decisions we make to serve our community and customers. Although entrepreneurship and research are often seen as opposite ends of the spectrum, there are many similarities between the two in these aspects.

Show Notes

  • (01:30) Doris walked through her time doing research in physics and astrophysics at UC Berkeley and getting involved with data science.

  • (04:11) Doris reflected on her decision to pursue the Ph.D. program in computer science at the University of Illinois, Urbana-Champaign.

  • (05:53) Doris discussed her development of no-code, interactive visualization interfaces accelerating users toward data insight discovery.

  • (10:37) Doris explained how the RISE Lab and I School at UC Berkeley helped shape her thinking around working with end-users and building something to serve the data science community.

  • (16:05) Doris unpacked the focus of her Ph.D. dissertation - which is to make data exploration and visualization easier and more accessible through automation.

  • (17:27) Doris shared the motivation and high-level design behind the development of Lux, a general-purpose visual exploration assistant situated within a computational notebook.

  • (21:25) Doris revealed the recipe for open-source community engagement and roadmap prioritization with Lux.

  • (26:17) Doris shared the founding story of Ponder, whose mission is to improve data science productivity by empowering users to do data science at all scales.

  • (31:02) Doris explained how Ponder helps solve the fragmentation challenges across the data stack.

  • (34:27) Doris provided a brief overview of Modin, which improves the scalability of data frames.

  • (38:41) Doris discussed Ponder's go-to-market strategy to drive more enterprise interest toward the product.

  • (41:23) Doris discussed her team's challenges in finding early design partners across various industries.

  • (44:16) Doris shared valuable hiring lessons to attract the right people who are excited about Ponder's mission.

  • (47:42) Doris shared fundraising advice to founders who are seeking the right investors for their startups.

  • (49:33) Doris highlighted the difference between being a researcher and a founder.

  • (51:06) Closing segment.

Doris' Contact Info

Ponder's Resources

Mentioned Content

Publications

Blog Posts

People

  1. Chip Huyen

  2. Shreya Shankar

  3. Parul Pandey

Notes

My conversation with Doris was recorded back in May 2022. Earlier this year, Ponder developed the first-of-its-kind technology that allows anyone to run their pandas code directly in their data warehouse, be it Snowflake, BigQuery, or Redshift. With Ponder, you get the same pandas-native experience that you love, but with the power and scalability of cloud-native data warehouses. More details are in this blog post.

Additionally, you can run NumPy commands on your data warehouse as well. This means you can work with the NumPy API to build data and ML pipelines, and let Snowflake / BigQuery / Redshift take care of scaling, security, and compliance. More details are in this blog post.

If you are interested in trying these new capabilities out, sign up here!

About the show

Datacast features long-form, in-depth conversations with practitioners and researchers in the data community to walk through their professional journeys and unpack the lessons learned along the way. I invite guests coming from a wide range of career paths — from scientists and analysts to founders and investors — to analyze the case for using data in the real world and extract their mental models (“the WHY and the HOW”) behind their pursuits. Hopefully, these conversations can serve as valuable tools for early-stage data professionals as they navigate their own careers in the exciting data universe.

Datacast is produced and edited by James Le. For inquiries about sponsoring the podcast, email khanhle.1013@gmail.com.

Subscribe by searching for Datacast wherever you get podcasts.

If you’re new, see the podcast homepage for the most recent episodes to listen to, or browse the full guest list.
