The 62nd episode of Datacast is my conversation with Gordon Wong — a data modeling fanatic, data warehouse architect, and multi-hyper-growth startup veteran and team builder.

Our wide-ranging conversation touches on his foray into the database world, his interest in consulting, the evolution of data warehousing and business intelligence platforms, how to choose data tooling vendors, what it means to be data-driven, effective collaboration for data teams, data “hierarchy of needs”, data for social impact, and much more.

Listen to the show on (1) Spotify, (2) Apple Podcasts, (3) Google Podcasts, (4) TuneIn, (5) RadioPublic, (6) Stitcher, (7) Breaker, and (8) iHeart Radio.

Key Takeaways

Here are highlights from my conversation with Gordon:

On Getting Into Database

Frankly, I got fortunate. I had degrees in Psychology and Philosophy and not a lot of jobs required those degrees back then. I ended up getting a job at a Geographical Information Systems (GIS) lab with a modest beginning — spending the first year doing data entry. This job was very repetitive, and I wanted to go faster. One day, I borrowed the DBA password and taught myself SQL to improve my entry rate. Learning SQL snowballed my career from there.

There are practical and emotional challenges in dealing with impostor syndrome. With liberal arts degrees in a field that expects you to be technical, you can suffer the delusion that you do not know enough. For many years, I intended to go back to school and get that technical degree. But at some point, I realized that the technology was not what held me back; how I applied the technology was.

On Consulting

Consulting seemed challenging to me: how do I learn in a new environment, define a problem, and be successful in a short period of time?

Ab Initio Software was a self-funded company with a great customer base and some of the smartest, most talented people I have met. They had a maniacal focus on customer success and on solving problems at the fundamental level. Those are two things that appeal to me.

My core skill is building solutions that help analysts thrive inside and drive better decisions.

On Data Warehousing

Smarter Travel Media was one of the earlier companies to get into travel search. They had an easy-to-use portal, catered particularly to non-technical people, where you could enter your criteria for finding flights/cars/hotels and get back results. While the business was about attention arbitraging, the customer value was about helping people travel easily. As you can imagine, there was a lot of data behind that product, and I started learning about clicks, impressions, actions, etc.

This was my first opportunity to build a department from the bottom up. I got recruited by the founders of the company. They were not doing any deliberate analytics (all via organic spreadsheets).

I started with the business requirements: interviewing the department heads, asking them about their pain points/bottlenecks, and understanding what would make them happy in the short term. Then I tried to build a concrete plan moving forward from day one without any lofty vision.

By this point in my career, I had already seen 2-year data warehousing projects that were absolute failures. The way to build a data warehouse is to start with user needs, deliver value as soon as possible, and iterate on that. You REALLY don’t know the requirements upfront.

From a technical perspective, I got the opportunity to dive into Microsoft’s SQL Server stack:

Integration Services was an early ETL tool that separated data flow from task flow (a critical distinction).
Analysis Services could perform multi-dimensional queries.
Reporting Services delivered dashboards and reports.

Being a one-person shop building the whole thing from top to bottom taught me not just how to build the data warehouse but also support it in production and engage in continuous delivery. It’s not enough to just populate databases and write reports. Those reports have to be accurate, reliable, and usable. In fact, I built my own data quality system to test the data based on a simple binary test. I went to users directly to onboard them and made sure they successfully used our dashboards and reports.
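The "simple binary test" idea can be sketched as a set of checks that each either pass or fail against a table. The conversation does not describe how the original system was built, so everything below is an assumption made for illustration: the `clicks` table, its columns, the check names, and the use of SQLite.

```python
import sqlite3

# Hypothetical checks: each is a SQL query that counts offending rows.
# A check passes only when that count is zero, giving a binary result.
CHECKS = {
    "click_id is never null": "SELECT COUNT(*) FROM clicks WHERE click_id IS NULL",
    "click_id is unique": (
        "SELECT COUNT(*) FROM ("
        "SELECT click_id FROM clicks GROUP BY click_id HAVING COUNT(*) > 1)"
    ),
    "revenue is non-negative": "SELECT COUNT(*) FROM clicks WHERE revenue < 0",
}


def run_checks(conn, checks):
    """Return {check name: True if passed, False if failed}."""
    results = {}
    for name, sql in checks.items():
        bad_rows = conn.execute(sql).fetchone()[0]
        results[name] = bad_rows == 0
    return results


def build_demo_db():
    """Create an in-memory table with a duplicate id and a negative value."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE clicks (click_id INTEGER, revenue REAL)")
    conn.executemany(
        "INSERT INTO clicks VALUES (?, ?)",
        [(1, 0.25), (2, 0.10), (2, 0.40), (3, -1.00)],
    )
    return conn


if __name__ == "__main__":
    for name, passed in run_checks(build_demo_db(), CHECKS).items():
        print("PASS" if passed else "FAIL", name)
```

Counting offending rows keeps each check binary while still leaving a number to investigate when a check fails.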

On Columnar Databases

After Smarter Travel, I got recruited by an old boss from an earlier company to join ClickSquared and build a multi-tenant campaign management platform. This was the first time I got to see columnar databases and their advantages over traditional databases. I also wrestled with the problem of dealing with multiple tenants/clients in a single warehousing instance.

Before the columnar world, relational databases stored data in rows, writing each row's values next to each other on disk. In the analytics world, we frequently do not want details but insights. Columnar databases solve this by bringing back only the data needed to answer the query. They were successful because they worked very efficiently within those constraints and led to 100x speed improvements. At the end of the day, there is still compute, memory, and storage. It’s just how we organize the data to optimize for certain problems and remove the constraints.
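The difference between the two layouts can be illustrated with a toy in-memory model. This is only a sketch: real engines add compression, vectorized execution, and on-disk formats, and the field names and values below are hypothetical. It counts how many values each layout has to read to answer a single aggregate over one field.

```python
# Toy data: three records with three fields each (hypothetical values).
ROWS = [
    {"user_id": 1, "country": "US", "revenue": 10.0},
    {"user_id": 2, "country": "UK", "revenue": 12.5},
    {"user_id": 3, "country": "US", "revenue": 7.5},
]


def to_columns(rows):
    """Pivot a list of row dicts into one list per column."""
    return {name: [row[name] for row in rows] for name in rows[0]}


def sum_revenue_row_store(rows):
    """Row layout: every whole record is read to use a single field."""
    values_read = 0
    total = 0.0
    for row in rows:
        values_read += len(row)  # all fields of the row are touched
        total += row["revenue"]
    return total, values_read


def sum_revenue_column_store(columns):
    """Column layout: only the 'revenue' column is read."""
    revenue = columns["revenue"]
    return sum(revenue), len(revenue)


if __name__ == "__main__":
    print(sum_revenue_row_store(ROWS))
    print(sum_revenue_column_store(to_columns(ROWS)))
```

Both functions return the same total, but the row layout touches every field of every record, while the column layout touches only the one column the query asks about. The gap widens as tables get wider.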

On Choosing Data Tooling Vendors

This is an educated guess, but I always go back to the fundamentals: What outcomes am I trying to drive? Over what time frame? And what resources are required (or will hold me back)?

At Fitbit, we had the challenges of (1) understanding our customer actions to improve the product, (2) understanding our potential prospects to add more customers, and (3) understanding the devices themselves to improve the product from that direction. Given those questions, I understood that my users were in the marketing, firmware engineering, and product groups. Working backward, I thought through how their questions could be formulated, what data they needed, and how the data could be processed.

I realized fairly early that we would be constrained by hardware as we were struggling to write even basic SQL queries. Redshift (our warehouse at the time) was not able to scale in a cost-efficient manner. As luck would have it, a little startup called Snowflake came along and got my attention when they started talking about separating compute and storage. Fast forwarding, I was working with a great sales team over there and became impressed with the product the first time that I used it. Two of the founders came from Oracle and had developed a product that I used before, so there was an instant connection there.

I believe that Fitbit was Snowflake’s biggest customer in 2016 (as my friends in the sales department told me). We put a petabyte of data in there, which seems big even now. I had one table with a trillion rows. I had to encourage my team to start using scientific notation when talking about the size of the databases.

I look for vendors interested in solving my pains and addressing the constraints that my customers have in terms of taking action.

I often tell vendors upfront that I am looking for somebody to partner with in a long-term relationship. A lot of vendors appreciate that. We don’t have to have antagonistic relationships with them.

My father was in the restaurant business. He had respect for his vendors and understood that they had to make a living, but they needed to give him good products. I had that same feeling. I expect the vendors to treat me with respect and try to solve my problems. Then, I will treat them with respect and recognize their problems. It’s a better way to do business.

On Being Data-Driven

It’s never too early to start on analytics and measurement. Real improvements come from introspection: recognizing where you are and where you want to go. If I have a small startup, I need to know what my customers are doing and how they are using my product so that I can improve it in an empirical fashion.

Being data-driven, funny enough, is very threatening.

To be data-driven, you are making a commitment to being empirical, logical, and scientific. This means that you don’t get to guess as much, express yourself, and be the hero.

My recommendation is to commit to building a culture where you are constantly curious about your product and yourself to improve, as you will naturally seek data. Start small, think about the business problems, figure out the most important question, and answer that question to identify the constraints. Don’t launch yourself into a 1-year project to build a big data solution. You just don’t know what you need yet.

On Team Collaboration

At Fitbit, I managed the 3 teams of data engineering, data warehousing, and data analytics at once. I needed to figure out how these teams could work together. If you studied agile development, you are probably familiar with feature teams and component teams:

Feature teams go end-to-end on a product. They are made up of heterogeneous members who are not necessarily specialized.
Component teams are made up of members with similar skillsets, maybe different seniorities, who are good at one thing.

If my 3 teams are highly specialized, I will have a problem with communication, queuing, and the distance between engineering and business problems. The visibility between component teams and the business problems is really thin. Instead, I focus on building feature teams to put together multiple people with different skill sets and communication styles to solve user problems. I realized that I had to make things simple. That means giving these teams stability in vision and mission, as well as reasonable deadlines to move the needle forward.

Because I did not have a product management team, I created this notion of “fractional product managers”: engineers who volunteer to understand specific department needs and become their advocates. The people who engage with the users and solve their problems see their careers benefit from that.

On Interviewing Data Engineers

Data engineers gather the data and get them into the database. They need to interact with a variety of upstream sources programmatically. Thus, they need traditional software engineering skills to accomplish that. Furthermore, excellent engineers are professional about creating pull requests, code commits, and code reviews. In the database world, that was missing for a long time because a lot of practitioners were not traditional software engineers. Data engineers need to build solutions that are reliable, consistent, and maintainable.

One way that I interviewed data engineers sometimes is to take a piece of paper and draw a line in the middle. The idea is on one side, and the implementation is on the other side. I want to know about their ideas as well as their ability to implement them. I did not do the trivia hunt through resumes.

On “Data Hierarchy of Needs”

I came up with this notion to better explain to my customers how to evaluate the data analytics maturity curve. To define maturity, we need to define the fundamentals mentioned here. Maslow’s hierarchy of needs heavily influences this.

Source: https://blog.getdbt.com/data-testing-why-you-need-it-and-how-to-get-started/

If we try to build a predictive model or a mature dashboard, what are the fundamental things that enable such products?

The first requirement for any company working with data should be protecting their customers. Security has to be your first priority. By protecting your customers, you protect yourself.
The second pillar is data quality. How do I keep noise out of the signal? How do I protect myself from bad decisions? Real sustained growth and velocity come from avoiding disasters as much as going faster. Data quality is an investment in terms of risk mitigation by protecting yourself from bad decisions. I encouraged my teams to lean into test-driven development so that we can test for data quality right in the beginning.
The third pillar is reliability. Any good solution has to be dependable. This is DataOps/DevOps nuts-and-bolts: defining SLAs, understanding user needs, and measuring how well you perform from a reliability perspective.
The fourth pillar is usability. I believe that engaging in analytics is a creative enterprise. It’s about asking questions and using the answers (to those questions) to either take action or ask better questions. Whenever you engage in a creative endeavor, if you struggle with your tools, creativity is constrained. If the analysts struggle with formulating queries or defining objects, they won’t be able to come up with insightful questions. User experience becomes critical.
The last pillar is coverage. Do we have the information that describes the event we try to understand? If you don’t resolve the first four, there is no point in giving people data. You also want to keep the scale and scope under control. Constrain your scope, target the most valuable question to answer, and work backward from there.
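One way to read the ordering above is as a gate: work on the lowest unmet pillar before investing in anything above it. The sketch below is only an illustration of that reading, not a tool from the conversation; the function name and the self-assessment dictionary are hypothetical.

```python
# Pillars in priority order, as listed above.
PILLARS = ["security", "quality", "reliability", "usability", "coverage"]


def next_pillar_to_fix(status):
    """Return the lowest unmet pillar, or None when all five are met.

    `status` maps pillar name -> bool; missing pillars count as unmet.
    """
    for pillar in PILLARS:
        if not status.get(pillar, False):
            return pillar
    return None


if __name__ == "__main__":
    # Hypothetical self-assessment: rich coverage, but reliability is shaky.
    assessment = {
        "security": True,
        "quality": True,
        "reliability": False,
        "usability": True,
        "coverage": True,
    }
    print(next_pillar_to_fix(assessment))
```

Treating a missing entry as unmet is a deliberate choice here: a pillar nobody has assessed should not be assumed to be in good shape.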

On Data For Social Impact

We have a technology-focused culture in our society, where we constantly try to go faster and climb higher. Sometimes we forget people and how to drive our solutions sideways. If you don’t have the information to make an informed choice, you really don’t have democracy. So how can we use data to improve society and drive democracy?

By making information more accessible and driving insights in social areas at the local level, we can help people understand each other better, drive our empathy, and improve the lives of everyone. That sounds idealistic in some sense but actually is pragmatic. Think about the outcomes that you want. Think about what matters to you. Think about what is constraining you. And then find the answers to take better actions.

I’d like to see more efforts in enabling not just data scientists and experiment analysts but ordinary people to answer questions that matter to them and make better decisions for themselves.

On Team Dynamics

Empathy, honesty, and optimism are the three most important traits for a manager.

I definitely embrace servant leadership. I started with the mission of how to help my employees be happy and successful in their roles. And I can’t do that without those three traits.

Empathy: I need to care about their careers and understand challenges from their perspectives.
Honesty: I need to be able to give honest feedback and give the information they need to improve their performance.
Optimism: I have to believe that people can do better if they fall short of a goal.

Moreover, I think the team members have to have empathy, honesty, and optimism for each other. Clearly, cooperative teams are more performant than non-cooperative teams. Being cooperative comes from being empathetic.

On Snowflake

If you know SQL, you can use Snowflake. This is fantastic because SQL is a familiar paradigm for most people in the data world.

Snowflake scales according to your constraints. More specifically, your compute and storage can scale on demand, independently of each other. Users pay for utilization, so they don’t have to pre-buy a huge amount of capacity and anticipate what their customers might do in the future.

Source: https://www.snowflake.com/blog/beyond-modern-data-architecture/

From a compute perspective, Snowflake is great at parallelizing queries. Its performance tends to scale linearly with the size of the data. As engineers, we used to be frugal and parsimonious with computing resources. With Snowflake, we can bring more resources to bear to solve the problem that we have right now, essentially for free.

On The ETL Tooling Landscape

Data tools will become more horizontal.

At the moment, there are tools that focus on specific components of the data stack, such as data ingestion, data transformation, dashboard creation, etc. As we solve these problems well enough, we will move to a higher level of abstraction. Remember that the objective is to deliver insights by trials and actions. I believe practitioners will lean into DataOps and analytics engineering. We might even see terms such as “InsightOps” or “DecisionOps” to identify the constraints of delivering insights.

Getting data from different sources and bringing it into a data warehouse will be commoditized. Data integration vendors need to make data better known and understandable by mapping out the data terrains, creating data topography, and building a stable data foundation. Let’s start having mature conversations about the end-to-end knowledge graph within organizations.

New Pod🎙️#Datacast 62nd eps features Gordon Wong. We discuss:
-Columnar Databases
-Data Vendors
-Data Hierarchy of Needs
-Data For Social Impact
-Servant Leadership

Gordon brings his wisdom leading orgs through analytics transformations. Enjoy! https://t.co/3GKeh86yst
— James (@le_james94) April 29, 2021

Show Notes

(02:09) Gordon briefly talked about his undergraduate years studying Psychology and Philosophy at Rutgers University in the early 90s.
(03:24) Gordon reflected on the first decade of his career getting into database technologies.
(05:34) Gordon discussed his predilection towards consulting, specifically his role in the professional services team at Ab Initio Software in the early 2000s.
(08:02) Gordon recalled the challenges of leading data warehousing initiatives at Smarter Travel Media and ClickSquared in the 2000s.
(13:14) Gordon emphasized the advantage of a columnar database over a traditional relational database.
(18:30) Gordon recalled his one-year stint at Cervello, leading business intelligence implementations for their clients.
(21:59) Gordon elaborated on his projects during his 3 years as the director of business intelligence infrastructure at Fitbit.
(26:09) Gordon dived into his framework of choosing data tooling vendors while at Fitbit (and how he settled with a tiny startup called Snowflake back then).
(30:02) Gordon provided recommendations for startups to be data-driven.
(33:24) Gordon recalled practices to foster effective collaboration while managing the 3 teams of data engineering, data warehousing, and data analytics at Fitbit.
(36:44) Gordon went over his proudest accomplishment as the director of data engineering at ezCater, making substantial improvements to their data warehouse platform.
(38:59) Gordon shared his framework for interviewing data engineers.
(41:39) Gordon walked through his consulting engagements in analytics engineering for Zipcar and data warehousing for edX.
(46:17) Gordon reflected on his time as the Vice President of business intelligence at HubSpot.
(50:50) Gordon unpacked his notion of “Data Hierarchy of Needs,” which entails the five pillars — data security, data quality, system reliability, user experience, and data coverage.
(56:55) Gordon discussed current opportunities for driving better social outcomes and empowering democracy through data.
(59:48) Gordon shared the key criteria that enable healthy team dynamics from his hands-on experience building data teams.
(01:02:13) Gordon unpacked the central features and benefits of Snowflake for the uninitiated.
(01:06:25) Gordon gave his verdict for the ETL tooling landscape in the next few years.
(01:08:33) Gordon described the data community in Boston.
(01:09:52) Closing segment.

Gordon’s Contact Info

LinkedIn

Mentioned Content

People

Tristan Handy (co-founder of Fishtown Analytics and co-creator of dbt)
Michael Kaminsky (who coined the term “Analytics Engineering”)
Barr Moses (co-founder and CEO of Monte Carlo, who coined the term “Data Observability”)

Book

“Start With Why” (By Simon Sinek)

Related Episodes

Datacast Episode 132: Big Data Engineering, Data Culture from First Principles, and Reimagined Metadata with Suresh Srinivas
Datacast Episode 131: Data Infrastructure for Consumer Platforms, Algorithmic Governance, and Responsible AI with Krishna Gade
Datacast Episode 130: Towards Accessible Data Analysis with Emanuel Zgraggen
Datacast Episode 125: The Next Wave of Developer Platforms, Data Products, and Software Infrastructure with Sakib Dadi

About the show

Datacast features long-form conversations with practitioners and researchers in the data community to walk through their professional journeys and unpack the lessons learned along the way. I invite guests from a wide range of career paths, from scientists and analysts to founders and investors, to analyze the case for using data in the real world and extract the mental models (“the WHY”) behind their pursuits. Hopefully, these conversations can serve as valuable tools for early-stage data professionals as they navigate their own careers in the exciting data universe.

Datacast is produced and edited by James Le. Get in touch with feedback or guest suggestions by emailing khanhle.1013@gmail.com.

Subscribe by searching for Datacast wherever you get podcasts.

If you're new, see the podcast homepage for the most recent episodes to listen to, or browse the full guest list.