Datacast Episode 76: Modern Data Collaboration and Social Entrepreneurship with Prukalpa Sankar
The 76th episode of Datacast is my conversation with Prukalpa Sankar, the co-founder of Atlan, a modern data collaboration workspace (like GitHub for engineering or Figma for design). By acting as a virtual hub for data assets ranging from tables and dashboards to models and code, Atlan enables teams to create a single source of truth for all their data assets and collaborate across the modern data stack through deep integrations with Slack, BI tools, data science tools, and more.
Our wide-ranging conversation touches on her education in Singapore; her entrepreneurship story founding SocialCops; the future of data-for-good; her current journey with Atlan solving data collaboration; the DataOps culture code; data catalog, data quality, and data governance; lessons learned from fundraising, engaging with the data community, creating a people moat, building an execution machine; and much more.
Please enjoy my conversation with Prukalpa!
Listen to the show on (1) Spotify, (2) Apple Podcasts, (3) Google Podcasts, (4) TuneIn, (5) iHeart Radio, (6) Stitcher, and (7) Breaker.
Key Takeaways
Here are highlights from my conversation with Prukalpa:
On Her College Experience
I grew up in India and didn’t have a hometown because my dad had a transferable job. So we lived in a bunch of different cities growing up. After high school, I got a scholarship to study in Singapore, where I studied engineering and entrepreneurship. Starting in university, I followed a very traditional path and ended up interning at Goldman Sachs. I got paid very well, but my ambitious colleagues did not seem to love their work.
By chance, I got involved in the startup ecosystem in Singapore and built something called the Singapore Entrepreneurship Challenge, which helped me meet and interact with about 200 entrepreneurs. I enjoyed building things from the ground up. Around this time, I met and collaborated with a classmate named Varun on a couple of side projects during hackathons, some of which later evolved into SocialCops.
On Founding SocialCops
SocialCops started as a random idea during a midnight brainstorming session between Varun and me. We decided that we were going to try a startup in our final year of university. The startup was never supposed to be SocialCops; it was supposed to be something else. The initial idea of SocialCops was to use crowdsourced data to drive better decision-making in the social and civic space. For instance, how could we get crowdsourced information about broken streetlights and use it to help drive city budgets?
It’s important to think about the magnitude of this decision because we were scholarship students in Singapore, and moving back to India was never part of the plan. Seven or eight years ago, the Indian startup ecosystem was not as robust as it is today. Still, we were these kids who were passionate about this thing, so why not do a startup? Then reality struck: we didn’t have any money at that point.
We launched a crowdfunding campaign. This was when Kickstarter and similar platforms were starting to get popular. Overnight, we created a welcome video, put it live on Kickstarter, and fell asleep at 4 AM. We woke up the following day, and it had gone viral. Random Facebook posts connected us to different people. We did not end up raising a ton of money (about $600), but that experience enabled us to do customer discovery in a unique way.
Now we still had to figure out how to fund ourselves. Let’s use our student card! There were business plan competitions everywhere. So we said: “Let’s take part in all of them.” We basically created a Google Sheet with the name of every single competition worldwide and sent out applications to all of them. We ended up raising about $25,000–30,000 in prize money, which was enough seed capital for us to move back to India and spend one year figuring out SocialCops’ business model.
On The Business Model For SocialCops
We were the extended data team for our customers. Essentially, we tried to use data to solve the world’s biggest problems (national healthcare, poverty alleviation, education, etc.). We quickly realized that the best way to do that was to partner with the organizations that had the most massive impact in the space: whose decisions can they drive, and if they drive those decisions, what kind of impact could that create? So we started working with customers like the United Nations, the Gates Foundation, and several large governments. They didn’t have any data teams, so we started acting as their internal data teams, responsible for the end-to-end implementation of solutions to their problems. That’s where I learned about building and running data teams: how complex and chaotic they can be.
On Data-For-Good
The biggest challenge with data-for-good (or for most data practitioners) lies in the data. Outside of big tech companies, most data practitioners struggle and grapple with getting their data together in a way that lets them start driving insights. This becomes even harder in a data-for-good scenario because no single factor is responsible for the outcome. For instance, let’s say a kid dropped out of school. Maybe the kid did not drop out because of the education. Perhaps they needed to work in the agricultural field for money, or maybe there were no good toilets (which goes into sanitation).
At SocialCops, we drove some of the data-for-good initiatives. For example, with the United Nations, there was an initiative to bring together data about the Sustainable Development Goals from 100 different places, so that we could answer questions like: how do your education goals link to your economic empowerment goals? The question is how to do this at the world’s scale and at the pace the world needs.
On The Incubation of Atlan
The data team is the most complex team that exists in the organizational fabric. If you want a data project to be successful, you need an analyst, an engineer, a business consultant, a machine learning researcher, etc. All these diverse roles need to come together and collaborate effectively to make the project successful. Each of them has their own tooling preferences and skill sets. On a day-to-day basis, that meant chaos as soon as we hit some scale.
There was one quarter in which we went from analyzing 2 million people to suddenly 500 million people. Things broke left, right, and center. It took 8 hours and 4 people to figure out why a number on a dashboard was wrong. That’s a day in the life of a data practitioner today, right? We got to the breaking point where we realized we couldn’t continue like that. So we started building an internal tool for ourselves.
Our fundamental thesis was that, if you think about all of these problems, they weren’t technology problems as much as human collaboration problems. That’s the lens that we took. Over a couple of years, this tool made our data team six times more agile. We went on to build things like the India National Data Platform, which the Prime Minister uses himself.
What was cool about the project was that it was built by an 8-member team in 12 months from start to finish, probably one of the fastest of its kind. Out of the 8 members, 4 had never pushed a line of code to production. That’s when we realized: if we make this tool available to other data teams worldwide and enable them to be twice as fast, what would that do to the world? That’s how Atlan was born.
On Atlan’s DataOps Culture Code
The success of data teams comes down to creating a thriving culture in the team.
Collaboration Is Key: Data teams will always have a variety of roles, each with their own skills, favorite tools, and DNA. Embrace diversity, and create mechanisms for effective collaboration.
Create Systems of Trust: Make everyone’s work accessible and discoverable to break down ‘tool’ silos. Create transparency in data pipelines and lineage so everyone can see and troubleshoot issues. Set up monitoring and alerting systems to proactively know when things break.
Create A Plug-and-Play Data Stack: Embrace tools that are open and extensible. Leverage a strong metadata layer to tie diverse tooling together.
Optimize For Agility: Reduce dependencies between business, analysts, and engineers. Enable a documentation-first culture. Automate whatever is repetitive.
Treat Data, Code, Models, and Dashboards As Assets: Assets should be easily discoverable, reusable, and maintained.
User Experience Defines Adoption Velocity: Teams at Airbnb famously said that designing the interface and user experience of a data tool should not be an afterthought. Invest in simple and intuitive tools. Software shouldn’t need training programs.
On Data Catalog 3.0
Data Catalog 3.0s will not look and feel like their predecessors from previous generations. Instead, Data Catalog 3.0s will be built on the premise of embedded collaboration, which is key in today’s modern workplace, borrowing principles from GitHub, Figma, Slack, Notion, Superhuman, and other modern tools that are commonplace today.
Data assets > tables: Nowadays, BI dashboards, code snippets, SQL queries, models, features, and Jupyter notebooks are all data assets. The 3.0 generation of metadata management will need to be flexible enough to intelligently store and link all these different types of data assets in one place.
End-to-end data visibility, rather than piecemeal solutions: The Data Catalog 3.0 will help teams finally achieve the holy grail, a single source of truth about every data asset in the organization.
Built for a world where metadata itself is “big data”: Data Catalog 3.0 should be more than just metadata storage. It should fundamentally leverage metadata as a form of data that can be searched, analyzed, and maintained the same way as all other types of data. Today, the fundamental elasticity of the cloud makes this possible like never before. For example, query logs are just one kind of metadata available today. By parsing the SQL code in query logs from Snowflake, it’s possible to automatically create column-level lineage, assign a popularity score to every data asset, and even deduce the potential owners and experts for each asset (see the sketch after these points).
Embedded collaboration comes of age: Because of the fundamental diversity in data teams, data tools need to be designed to integrate seamlessly with teams’ daily workflow. This is where the idea of embedded collaboration comes alive. Embedded collaboration is about work happening where you are, with the least amount of friction. Embedded collaboration can unify dozens of micro-workflows that waste time, cause frustration, and lead to tool fatigue for data teams, and instead make these tasks delightful!
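To make the query-log idea above concrete, here is a minimal, hypothetical sketch in Python (not Atlan’s actual implementation) of treating query logs as metadata: a naive regex pulls table references out of each raw SQL statement, and from those references it derives a popularity score and a likely expert per table. The table names, users, and regex are illustrative assumptions; real column-level lineage requires a full SQL parser.

```python
import re
from collections import Counter, defaultdict

# Hypothetical query-log records, e.g. exported from a warehouse's query history.
QUERY_LOG = [
    {"user": "ana", "sql": "SELECT o.id, o.amount FROM sales.orders o JOIN sales.customers c ON o.cid = c.id"},
    {"user": "ben", "sql": "SELECT * FROM sales.orders WHERE created_at > '2021-01-01'"},
    {"user": "ana", "sql": "SELECT country, COUNT(*) FROM sales.customers GROUP BY 1"},
]

# Naive pattern: grab the identifier that follows FROM or JOIN.
TABLE_PATTERN = re.compile(r"\b(?:FROM|JOIN)\s+([\w.]+)", re.IGNORECASE)

def mine_query_logs(log):
    popularity = Counter()          # table -> number of queries referencing it
    experts = defaultdict(Counter)  # table -> user -> reference count
    for record in log:
        for table in set(TABLE_PATTERN.findall(record["sql"])):
            popularity[table] += 1
            experts[table][record["user"]] += 1
    return popularity, experts

if __name__ == "__main__":
    popularity, experts = mine_query_logs(QUERY_LOG)
    for table, score in popularity.most_common():
        likely_expert = experts[table].most_common(1)[0][0]
        print(f"{table}: popularity={score}, likely expert={likely_expert}")
```

The same parsed references could feed richer signals, such as table-to-table lineage, but that quickly calls for a real SQL parser rather than a regex.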
On Data Quality
There are three different steps to deal with data quality:
Detection: You figure out that you have a data quality issue in the first place, for example by using data profiling to detect missing values or spikes (see the sketch at the end of this section).
Prevention: This is where you see the integration between data quality and the pipelining/orchestration engines. You think about writing unit tests for your data pipelines.
Cure: You need to fix the problem. That’s where the data quality ecosystem plays with other ecosystems, such as data preparation. I don’t think there’s any material innovation in that space yet.
Those are the broad strokes of data quality. As with everything else, there’s depth in each of these steps. Take detection, for example: detecting missing values is pretty basic, while using ML algorithms for anomaly detection is more advanced. To answer your question about strong practices to ensure data quality, 80% of the problem can be solved with 20% of the work. I know that there’s a ton of exciting work happening in anomaly detection (and other techniques), but that’s not where 80% of the problem is for most businesses.
I’m excited about tools like great_expectations, which make it easy for you to write unit tests as part of your pipeline. At Atlan, we profile data quality and allow users to set up basic alerting and monitoring on top of that. Allowing a business user to write a business rule is not rocket science, and you don’t need deep learning to do that. You just need to apply that work to high-scale data. The critical thing is to start measuring your data quality and working towards improvement.
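As an illustration of the detection step above, here is a minimal sketch of two profiling checks, one for missing values and one for volume spikes, written in plain pandas in the spirit of (but not using) great_expectations. The example table, column names, and thresholds are illustrative assumptions; the same functions could also run inside a pipeline as lightweight unit tests for the prevention step.

```python
import pandas as pd

def check_missing_values(df: pd.DataFrame, column: str, max_null_rate: float = 0.01) -> bool:
    """Flag a column whose share of nulls exceeds an agreed threshold."""
    null_rate = df[column].isna().mean()
    print(f"{column}: {null_rate:.2%} missing (threshold {max_null_rate:.2%})")
    return null_rate <= max_null_rate

def check_daily_spike(df: pd.DataFrame, date_column: str, z_threshold: float = 3.0) -> bool:
    """Flag days whose row counts deviate sharply from the historical mean."""
    daily_counts = df.groupby(df[date_column].dt.date).size()
    z_scores = (daily_counts - daily_counts.mean()) / daily_counts.std()
    spikes = z_scores[z_scores.abs() > z_threshold]
    if not spikes.empty:
        print(f"Volume spike detected on: {list(spikes.index)}")
    return spikes.empty

if __name__ == "__main__":
    # Hypothetical orders table with a timestamp and an amount column.
    orders = pd.DataFrame({
        "order_ts": pd.to_datetime(["2021-04-01", "2021-04-01", "2021-04-02", "2021-04-03"]),
        "amount": [120.0, None, 87.5, 42.0],
    })
    missing_ok = check_missing_values(orders, "amount")
    spike_ok = check_daily_spike(orders, "order_ts")
    print("data quality checks passed" if missing_ok and spike_ok else "data quality checks failed")
```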
On Data Governance
Data governance is specific to the organization. There are two kinds of organizations in the world:
Relatively traditional organizations in which governance has existed as a concept. There’s a governance office, a governance council, and all the other things that governance was meant to be in the last decade. They struggle because even though they have excellent governance practices, no one is using the data.
Other organizations in which governance is not the starting point. Making sense of the data and democratizing it is the starting point. They realize that, as soon as they have a little bit of scale, they lose productivity because they do not trust their data without proper governance practices.
I think data governance itself is being redefined. When we think about the concept of governance, it sounds like risk and compliance, something that slows you down, right? But governance simply means having processes and a foundation in place that help your team go faster. That’s what governance should become. We are in a new age of governance, in which the starting point of governance is not compliance. The starting point of governance is making things more agile and helping companies become more data-driven. That’s what data governance 2.0 will look like. How can governance be an offense strategy rather than a defense strategy? That’s the type of question our industry still struggles with. The good thing is that no one has figured it out yet. If you’re thinking about this as a problem, that’s great. As an industry and a broader community, we all need to share best practices and work together to create standards that help the next set of organizations make sense of their data.
On Modern Data Platforms
In general, every building block of a modern data stack is still under-invested.
There’s more maturity in the data ingestion, cloud data warehouses, and BI space:
Companies need a warehouse in order to start the data stack. Companies need a BI tool to extract business insights.
The launch of Amazon Redshift in 2012 started the entire ecosystem in some way. Then the growth of Snowflake in 2016 led to more innovation in the data storage and processing layers.
The transformation layer is making progress with the rise of dbt.
Other layers such as data logging, data access governance, data privacy, and data science are still under-invested.
The data ecosystem will see the next stage of innovation: In some spaces, the first winner has already been created (Looker for BI or Snowflake for cloud data warehouses). But the overall ecosystem is still early and will be here for the next 30–40 years. Because of that, we are seeing a second set of vectors: BI notebooks to disrupt BI, new data exploration tools, next-gen data warehouses, etc. So there will be a second wave of tooling. For the under-invested layers, we are starting to see the first wave of innovation.
On Fundraising
Fundraising is not the goal, nor is it a milestone. In some cases, it is a necessary evil that you need to go through to build your company. Some companies don’t need to be built as venture-backed startups. If you can build a successful and profitable business, there’s a lot of respect and pride in that. Our industry talks way too much about how much money a company has raised and way too little about real milestones like customer love, customer retention, and tangible impact. As an entrepreneur, you should build a company to solve a problem, not to raise money. It’s hard to wear that hat because our society tends to hype up that kind of thing.
You are the one building something amazing in the world, burning the midnight oil, and working 80–90-hour weeks and weekends. Every person who works with you should feel lucky to work with you (even though it might not feel that way all the time). Therefore, pick who you work with very wisely. These people will work with you for a big chunk of your life, so pick people you trust and value. This means different things to different people. Founders need to understand what matters to them and reference-check extensively. Speak to as many references as you can to understand how an investor will behave during bad times. You want to surround yourself with investors who support you in both good times and bad times.
On Community Engagement
Atlan’s mission is to help the humans of data become more productive. Teams need the ability to learn from each other. Thus, we have the Humans of Data blog to share our learnings about data teams' structure, DNA, and culture. Open-sourcing these ideas fits into our broader mission. Many of these projects have been picked up by team members and pushed out to the broader data community. Community engagement is not our strategy; it just evolved from what we enjoy doing.
On Hiring
We realized that an equal part of building an amazing experience for our users is building an amazing team. I researched what it would take to build a great team and found that many of the companies that have endured were not the big tech companies. As I read more books about this topic, I stumbled on a fascinating book about McKinsey.
McKinsey is just a fascinating firm.
You see that some of the world’s leading CEOs came from this firm. You see people leaving McKinsey, going to new companies, and bringing more McKinsey people to their new companies.
McKinsey partnered with Harvard Business School back in the 1930s to drive the MBA curriculum. As a result, the best MBA graduates went to work at McKinsey, and the best companies hire from McKinsey/HBS for management roles.
I spent time thinking about whether we could create that kind of loop for a company. The best talent has an amazing journey and grows exponentially when they work with you. When they leave, they become alumni and still contribute to the overall ecosystem and success of the company. This means thinking about building talent the same way we think about building SaaS products. HubSpot talks about the concept of the customer flywheel, where everything centers around one unit of a customer. Why can’t we think about an employee the same way?
That laid the foundation for how Atlan thinks about talent. Honestly, we haven’t fully implemented this, given the constraints of a startup. But I constantly think about how to attract the best people to work with us, grow them while they are with the company, enable them to refer people to the company, etc. The more we can do that, the better our chances of building a people moat.
On Public Recognition
I am thankful for all these recognitions. The critical thing to recognize is that I did none of this myself. As a society, we tend to glorify entrepreneurs a little bit more than we need to. All of this came down to the ability to build an amazing team capable of doing amazing things.
Recognition to me is neither a milestone nor a goal. If you do good work, it happens to you. The one thing that I’ve begun to realize is that such recognition can inspire the next generation of companies and founders, showing that you can be a 21-year-old and make an impact on the world.
Timestamps
(02:13) Prukalpa discussed her upbringing in India and studying Engineering at Nanyang Technological University in Singapore.
(03:52) Prukalpa shared the key learnings from her summer internship as an Investment Banking Analyst at Goldman Sachs.
(05:37) Prukalpa went over the seed idea for SocialCops (Read her Quora answer on the fundraising story).
(11:27) Prukalpa gave a brief overview of the business model at SocialCops.
(12:45) Prukalpa unpacked her talk called “How Big Data Can Influence Decisions That Actually Matter” at TEDxGateway 2017 related to the data-for-good initiatives that SocialCops facilitated.
(15:23) Prukalpa shared her thoughts on the future of the Data-for-Good movement.
(17:49) Prukalpa discussed the challenges that SocialCops’ data teams faced and the founding story behind Atlan.
(21:38) Prukalpa went over the trust-based culture that enabled SocialCops’ 8-member data team to build out India’s National Data Platform.
(27:00) Prukalpa dissected the six principles of Atlan’s DataOps Culture Code.
(31:37) Prukalpa unpacked the notion of Data Catalog 3.0, which is a key value prop of the Atlan platform.
(36:01) Prukalpa provided the 3-level framework to ensure data quality (detect -> prevent -> cure) and strong practices to maintain high-quality data.
(40:19) Prukalpa revealed the challenges that organizations face when starting their data governance initiatives.
(45:35) Prukalpa talked about the under-invested building blocks of modern data platforms.
(49:24) Prukalpa raised the importance of integration for Atlan to work well with the rest of the modern data stack.
(50:39) Prukalpa recapped the trends that Chief Data Officers needed to watch out for in 2021.
(54:01) Prukalpa gave fundraising advice for founders currently seeking the right investors for their startups.
(58:42) Prukalpa discussed Atlan’s outreach initiatives to engage with the broader data community actively.
(01:01:03) Prukalpa went over Atlan’s hiring philosophy based on the concept of People-as-a-Moat to attract, engage, and grow top talent — as inspired by the McKinsey advantage.
(01:05:22) Prukalpa shared Atlan’s Go-To-Market initiatives in the US this year and emphasized the importance of building an execution machine.
(01:08:53) Prukalpa described the state of the data community in India.
(01:10:25) Prukalpa shared entrepreneurship books that have deeply impacted her startup journey.
(01:12:16) Prukalpa briefly mentioned what public recognition means to her in the pursuit of democratizing data for the world.
(01:14:23) Closing segment.
Prukalpa’s Contact Info
Mentioned Content
Atlan (Twitter | LinkedIn | Facebook | Instagram | YouTube | Documentation)
“Empowering Organizations to Become Masters of Their Data” (Video)
Atlan Labs (Open-Source Projects)
Humans of Data Interviews (Interviews)
The DataOps Culture Code (Document)
The Data Catalog Primer (EBook)
Blog Posts
Voices In The Head of a Middle-Class Aspiring Startup Founder (July 2013)
SocialCops: What We Actually Do (Oct 2016)
People-as-a-Moat: What Startups Can Learn From McKinsey About Building A Strong Company (Aug 2018)
Going from Great People to Greater Teams: How We Think About Growth at Atlan (August 2018)
Onwards and Upwards: Chapter 2 for SocialCops (July 2019)
What is data quality? (Jan 2021)
Top 5 Data Trends For CDOs to Watch Out For In 2021 (Feb 2021)
Data Catalog 3.0: Modern Metadata for the Modern Data Stack (Feb 2021)
We Failed To Setup a Data Catalog 3x. Here’s Why (March 2021)
The Building Blocks of a Modern Data Platform (March 2021)
Books
“The Hard Thing About Hard Things” (by Ben Horowitz)
“Hatching Twitter” (by Nick Bilton)
“The McKinsey Way” (by Ethan Rasiel)
“How Google Works” (by Eric Schmidt and Jonathan Rosenberg)
“The Mom Test” (by Rob Fitzpatrick)
“Disciplined Entrepreneurship” (by Bill Aulet)
“Big Data” (by Viktor Mayer-Schönberger and Kenneth Cukier)
Talks
Game of Life (TEDxIIMShillong — March 2014)
How Big Data Can Influence Decisions That Actually Matter (TEDx Gateway — April 2017)
Better Villages Through Big Data (TED Talks India — December 2017)
The power of data science to measure unmeasured parameters in Emerging Markets (PyData Delhi — Oct 2019)
The Girl Who Thinks In Numbers: Data Warrior Prukalpa Sankar (Feb 2020)
Notes
My conversation with Prukalpa was recorded back in April 2021. Since the podcast was recorded, a lot has happened at Atlan!
They raised a $16M Series A led by Insight Partners, with participation from Sequoia Capital, Waterbridge Ventures, and amazing angels such as the founding teams of Snowflake and Looker.
They got mentioned in Gartner’s Inaugural Market Guide for Active Metadata Management.
They announced a partnership with Snowflake.
Prukalpa has also continued to write more content since then, which is well worth checking out.
About the show
Datacast features long-form, in-depth conversations with practitioners and researchers in the data community to walk through their professional journeys and unpack the lessons learned along the way. I invite guests coming from a wide range of career paths — from scientists and analysts to founders and investors — to analyze the case for using data in the real world and extract their mental models (“the WHY and the HOW”) behind their pursuits. Hopefully, these conversations can serve as valuable tools for early-stage data professionals as they navigate their own careers in the exciting data universe.
Datacast is produced and edited by James Le. Get in touch with feedback or guest suggestions by emailing khanhle.1013@gmail.com.
Subscribe by searching for Datacast wherever you get podcasts or click one of the links below:
If you’re new, see the podcast homepage for the most recent episodes to listen to, or browse the full guest list.