Datacast Episode 52: Graph Databases In Action with Dave Bechberger
The 52nd episode of Datacast is my conversation with Dave Bechberger — a Graph Database Subject Matter Expert currently working on Amazon Neptune at AWS. Give it a listen to hear about his 20+ years developing, managing, and consulting on software projects; his pragmatic approach to implementing large-scale distributed data architectures for big data analysis and data science workflows; his book “Graph Databases In Action”; and more.
Listen to the show on (1) Spotify, (2) Apple Podcasts, (3) Google Podcasts, (4) Stitcher, (5) iHeart Radio, (6) Radio Public, (7) Breaker, and (8) TuneIn
Key Takeaways
Here are the highlights from my conversation with Dave:
On Being A Tech Lead at Expero
Expero is a boutique consulting firm specializing in complex distributed data systems and UI/UX development. When I started there, they were mainly working with major oil and gas companies on bleeding-edge oil exploration. These projects were my first introduction to massively distributed big data applications. We are talking about massive data files spread across servers and sub-businesses within those organizations. I also learned effective patterns for collaborating with scientists and academics: the ability to translate complex technical ideas into more relatable concepts.
Expero has a strong focus on user experience, and I also worked with amazingly talented UX folks there. I learned how to approach problems from a user’s perspective and turn those problems into actionable technical outputs.
I also dipped my toes into the world of big data applications by learning Cassandra and Spark. In addition, I got my hands dirty with graph databases, which were considered novel technologies back in the early 2010s.
On Being The Chief Software Architect at Gene by Gene
There were certainly technical challenges in dealing with massive amounts of genetic data (in the form of consumer DNA tests). Moving the data to the cloud was prohibitively expensive. Furthermore, running computationally complex algorithms over the data in a scalable and efficient manner was challenging.
Another challenge came in the form of organizational change, which I initially underestimated. There was organizational resistance to moving to a new data platform. As a result, I spent a considerable amount of time providing input and training internal teams to transition from the old platform to the new one.
On Common Patterns of Using Graph Databases
You want to make sure that your problem is the right one for a graph database. It is not the right tool for every problem that you come across. When you want to extract the strength, quantity, or quality of the relationships between entities in your data, that’s a sweet spot for graph databases.
Many people think that graph databases are great at joining data. It’s more nuanced than that. You want to use those joins to navigate the data and find the connections.
You need to understand what you want to achieve from the data.
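To make the “navigate the relationships” point concrete, here is a small sketch (not from the episode) of the kind of relationship-centric query graph databases are built for, written against Apache TinkerPop’s gremlinpython client. The endpoint, the “person”/“knows” labels, and the “name” property are all hypothetical placeholders.

```python
# Hypothetical example: rank the people two "knows" hops away from Dave by how many
# distinct connection paths lead back to him (quantity/strength of relationships).
from gremlin_python.process.anonymous_traversal import traversal
from gremlin_python.driver.driver_remote_connection import DriverRemoteConnection

conn = DriverRemoteConnection("ws://localhost:8182/gremlin", "g")  # assumed local Gremlin Server
g = traversal().withRemote(conn)

ranked = (
    g.V().has("person", "name", "Dave")
     .out("knows").out("knows")   # navigate the relationships, two hops out
     .simplePath()                # drop paths that loop back through Dave or the same friend
     .groupCount().by("name")
     .toList()
)
print(ranked)
conn.close()
```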
On The 3 Buckets of Graph Technologies
A graph-computing engine handles data processing using graph algorithms, but not how the data is persisted to the database.
A graph database handles both the processing and the persistence of the graph data. There are two portions under the graph database bucket:
The RDF TripleStore is based on the concept of subject-predicate-object triples. When you store a triple, each piece can be uniquely referenced by a URI or IRI. These stores tend to have complex rules-based engines that allow inference and reasoning over the graph’s edges. The RDF TripleStore is optimized to infer new data and works well in industries where publicly available data is stored in that same format.
The Labeled Property Graph stores data as vertices, edges, and properties. Vertices represent the domain entities, edges represent the relationships between those entities, and properties represent the attributes associated with either a vertex or an edge.
Edges in most graph databases are first-class citizens. They are as important as the vertices and the properties. Additionally, they can have properties of their own.
To sum it up, graph-computing engines are built to work at a huge data scale. They are an optimal choice when you want to run graph algorithms that calculate global graph properties in batch processes. Graph databases are more suitable for real-time interactions: (1) the RDF database can store data in a defined format and infer new data, while (2) the property graph is well suited to traversing known relationships within your data.
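As a hedged illustration of the two data models described above (my sketch, not an excerpt from the episode), here is the same fact — “Dave wrote a book” — expressed first as RDF triples and then as a labeled property graph. The URIs, labels, and Gremlin endpoint are invented for the example.

```python
# RDF TripleStore view: everything is a subject-predicate-object triple, and each
# resource is identified by a URI/IRI. rdflib keeps this small graph in memory.
from rdflib import Graph, Namespace, Literal

EX = Namespace("http://example.org/")
rdf_graph = Graph()
rdf_graph.add((EX.Dave, EX.wrote, EX.GraphDatabasesInAction))
rdf_graph.add((EX.GraphDatabasesInAction, EX.title, Literal("Graph Databases in Action")))
for s, p, o in rdf_graph.triples((None, EX.wrote, None)):
    print(s, p, o)

# Labeled Property Graph view: vertices, edges, and properties, where the edge is a
# first-class citizen and can carry its own properties. Assumes a Gremlin Server at
# ws://localhost:8182/gremlin, which is a placeholder endpoint.
from gremlin_python.process.anonymous_traversal import traversal
from gremlin_python.driver.driver_remote_connection import DriverRemoteConnection

conn = DriverRemoteConnection("ws://localhost:8182/gremlin", "g")
g = traversal().withRemote(conn)
dave = g.addV("person").property("name", "Dave").next()
book = g.addV("book").property("title", "Graph Databases in Action").next()
g.addE("wrote").from_(dave).to(book).property("year", 2020).iterate()  # property on the edge itself
conn.close()
```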
On Being A Solutions Architect at DataStax
DataStax is the company behind the open-source Apache Cassandra project. They provide commercial support for Cassandra installations. They also have an enterprise product that uses Cassandra underneath, with search, analytics, and graph capabilities built on top, to serve big data use cases.
I came on board to work with Dr. Denise Gosnell as she started the Global Graph Practice team. We helped DataStax’s customers build and adopt graph technologies. These customers need to store and analyze data that cannot fit on a single machine. They also need their data stored and replicated automatically across multiple data centers for redundancy and high availability.
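To make the multi-data-center replication point concrete, here is a hedged sketch (not from the episode) of how such a keyspace is typically declared in Cassandra via the DataStax Python driver. The contact point, keyspace name, and data-center names are placeholders.

```python
# Hypothetical example: a keyspace replicated across two data centers using
# Cassandra's NetworkTopologyStrategy.
from cassandra.cluster import Cluster

cluster = Cluster(["127.0.0.1"])   # assumed local contact point
session = cluster.connect()

# Three replicas in each of two (made-up) data centers, so the data stays
# available even if an entire data center goes down.
session.execute("""
    CREATE KEYSPACE IF NOT EXISTS customer_graph
    WITH replication = {'class': 'NetworkTopologyStrategy', 'us_east': 3, 'eu_west': 3}
""")
cluster.shutdown()
```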
On Being A Senior Graph Architect at AWS
It’s fun to work at AWS, connecting different Lego pieces to build complex solutions to problems. Besides my core work on Amazon Neptune, I have also integrated it with other AWS AI services like Comprehend and Lex to build full-stack applications.
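For readers curious what talking to Neptune looks like in code, here is a minimal, hypothetical sketch of querying its Gremlin endpoint with gremlinpython. The cluster endpoint below is a placeholder, and real clusters may also require TLS settings and IAM request signing depending on their configuration.

```python
# Hypothetical example: connect to an Amazon Neptune cluster's Gremlin endpoint
# and peek at a few vertices and their properties.
from gremlin_python.process.anonymous_traversal import traversal
from gremlin_python.driver.driver_remote_connection import DriverRemoteConnection

endpoint = "wss://my-neptune-cluster.cluster-xxxxxxxx.us-east-1.neptune.amazonaws.com:8182/gremlin"
conn = DriverRemoteConnection(endpoint, "g")
g = traversal().withRemote(conn)

print(g.V().limit(5).elementMap().toList())
conn.close()
```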
On Writing “Graph Databases In Action”
I underestimated the time and effort required to create a good teaching manuscript for graph databases. This book is my passion project. For the amount of time I put in, I’m unlikely to see a comparable return on investment.
This book covers everything a relational database developer needs to know to build graph-backed applications. The switch from an entity-only mindset to an entity-and-relationship mindset is important. My hope with the book is to make the adoption of graph databases much more common.
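As an illustration of that mindset shift (my sketch, not an excerpt from the book), the same multi-hop question reads very differently in SQL and in Gremlin. The table, vertex, and edge names below are invented, and the Gremlin endpoint is a placeholder.

```python
# Relational, entity-only mindset: relationships live in join tables, and every hop
# is another JOIN (invented schema, shown for comparison only):
#
#   SELECT fof.name
#   FROM people p
#   JOIN friendships f1 ON f1.person_id = p.id
#   JOIN friendships f2 ON f2.person_id = f1.friend_id
#   JOIN people fof     ON fof.id = f2.friend_id
#   WHERE p.name = 'Dave';
#
# Graph, entity-and-relationship mindset: the "knows" relationship is part of the
# data model, so the same question is a short traversal.
from gremlin_python.process.anonymous_traversal import traversal
from gremlin_python.driver.driver_remote_connection import DriverRemoteConnection

conn = DriverRemoteConnection("ws://localhost:8182/gremlin", "g")
g = traversal().withRemote(conn)
friends_of_friends = g.V().has("person", "name", "Dave").out("knows").out("knows").values("name").toList()
print(friends_of_friends)
conn.close()
```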
On Big Data and Distributed Systems Trends
The first trend is graph databases.
I am excited to see the next phase of the technology. For the past few years, graph databases have been hyped up in terms of what sorts of problems they can solve. Seeing real use cases accepted in the enterprise is exciting.
What goes along with this phase is the standardization and maturation of the property graph database space.
The second trend is data privacy.
From the societal aspect, how can we balance our desire for hyper-personalized experiences with our desire to maintain and control our information?
The third trend is the convergence of big data, graph databases, and distributed systems with machine learning.
I am interested in explainable AI — making systems whose behavior is more intelligible to humans.
Graphs can play a good part in providing context for both the inputs and the outputs of a model. They also help maintain the provenance of the data.
Show Notes
(2:10) Dave talked briefly about his Electrical Engineering studies at Rensselaer Polytechnic Institute back in the late 90s.
(4:03) Dave commented on his career phase working as a software engineer across various companies in Bozeman, Montana.
(7:38) Dave discussed his work as a senior architect and tech lead at Expero, a Houston-based startup that develops custom software exclusively for domain-expert users.
(11:26) Dave briefly defined common big data frameworks (Hadoop, Apache Spark) and data systems (Apache Cassandra, Apache Kafka).
(13:37) Dave went over the challenges during his time as a chief software architect at Gene by Gene, a biotech company focusing on DNA-based ancestry and genealogy.
(20:00) Dave shared the common patterns and anti-patterns of using graph databases (in reference to his talk “A Practical Guide to Graph Databases”).
(26:16) Dave walked through the three categories of graph technologies: Graph Computing Engine, RDF TripleStore, and Labeled Property Graph (in reference to his talk “A Skeptics Guide to Graph Databases”).
(33:03) Dave discussed his move to DataStax’s Global Graph Practice team as a solutions architect and graph database subject matter expert.
(36:00) Dave explained the design of DataStax’s enterprise solution called Customer 360, which collapses data silos to drive business value.
(41:16) Dave talked about his current experience as a Senior Graph Architect at AWS.
(43:51) Dave mentioned the challenges while writing “Graph Databases In Action” (published last October).
(47:25) Dave explained the open-source Apache TinkerPop framework and the Gremlin language used in the book for the uninitiated.
(51:04) Dave discussed trends in big data and distributed systems that he is most excited about.
(55:06) Closing segment.
His Contact Info
His Recommended Resources
“Graph Databases In Action” (Associated Code Repository)
Martin Fowler (Chief Scientist at ThoughtWorks)
Martin Kleppmann (Author of “Designing Data-Intensive Applications”)
Andrew Ng (Professor at Stanford, Co-Founder of Google Brain and Coursera, Ex-Chief Scientist at Baidu)
“The Pragmatic Programmer” (by Andy Hunt and Dave Thomas)
“The Five Dysfunctions of a Team” (by Patrick Lencioni)
“How To: Absurd Scientific Advice for Common Real-World Problems” (by Randall Munroe)
This 40% discount code is good for all of Manning’s products in all formats: poddcast19.
These are 5 free eBook codes, each good for one copy of “Graph Databases In Action”:
gdadcr-E55F
gdadcr-B896
gdadcr-8C53
gdadcr-AAE1
gdadcr-39F0
About the show
Datacast features long-form conversations with practitioners and researchers in the data community to walk through their professional journeys and unpack the lessons learned along the way. I invite guests from a wide range of career paths — from scientists and analysts to founders and investors — to analyze the case for using data in the real world and extract the mental models (“the WHY”) behind their pursuits. Hopefully, these conversations can serve as valuable tools for early-stage data professionals as they navigate their own careers in the exciting data universe.
Datacast is produced and edited by James Le. Get in touch with feedback or guest suggestions by emailing khanhle.1013@gmail.com.
Subscribe by searching for Datacast wherever you get podcasts or click one of the links below:
If you're new, see the podcast homepage for the most recent episodes to listen to, or browse the full guest list.