Big Data

Fugue - Reducing Spark Developer Friction

This is a guest article written by Han Wang and Kevin Kho, in collaboration with James Le. Han is a Staff Machine Learning Engineer at Lyft, where he serves as a Tech Lead of the ML Platform. He is also the founder of the Fugue Project. Kevin is an Open Source Engineer at Prefect, a workflow orchestration framework, and a contributor to Fugue. Opinions presented are their own and not the views of their employers.

Datacast Episode 52: Graph Databases In Action with Dave Bechberger

Dave Bechberger is known for his expertise in distributed data architecture and as a graph database SME. He takes a pragmatic approach to data architectures and has implemented large-scale distributed data architectures for big data analysis and data science workflows using various SQL and NoSQL data technologies. He is the author of "Graph Databases in Action" from Manning Publications and has spoken both nationally and internationally at conferences on subjects related to distributed data and graph databases.


Dave spent 20+ years developing, managing, and consulting on software projects and is currently a member of the Amazon Neptune service team. He works with both customers and engineering teams to simplify and speed the adoption of graph technologies.

Datacast Episode 46: From Building Recommendation Systems To Teaching Online Courses with Frank Kane

Frank Kane is the owner of Sundog Education, teaching machine learning and data science online to over 500,000 students worldwide. Before Sundog, Frank spent nine years at Amazon as a senior engineer and senior manager, specializing in recommender systems and running IMDb's engineering department. Frank also worked in the early days of video game development, dating back to the adventure games of Sierra Online in the early '90s, and has developed computer graphics software for flight simulators and military simulators around the world. Today Frank is focused on the world of online education and lives in the Orlando, Florida area with his family.

What I Learned From Attending #SparkAISummit 2020

One of the best virtual conferences I attended over the summer was Spark + AI Summit 2020, a one-stop shop for developers, data scientists, and tech executives seeking to apply the best data and AI tools to build innovative products. I picked up a ton of practical knowledge: new developments in Apache Spark, Delta Lake, and MLflow; best practices for managing the ML lifecycle; tips for building reliable data pipelines at scale; the latest advancements in popular frameworks; and real-world use cases for AI.

An Introduction to Big Data: Distributed Data Processing

This semester, I’m taking a graduate course called Introduction to Big Data. It provides a broad introduction to the exploration and management of the large datasets being generated and used in the modern world. In an effort to open-source this knowledge to the wider data science community, I will recap the materials I learn in the class on Medium. Having a solid understanding of the basic concepts, policies, and mechanisms for big data exploration and data mining is crucial if you want to build end-to-end data science projects.

An Introduction to Big Data: Clustering

This semester, I’m taking a graduate course called Introduction to Big Data. It provides a broad introduction to the exploration and management of the large datasets being generated and used in the modern world. In an effort to open-source this knowledge to the wider data science community, I will recap the materials I learn in the class on Medium. Having a solid understanding of the basic concepts, policies, and mechanisms for big data exploration and data mining is crucial if you want to build end-to-end data science projects.

An Introduction to Big Data: Data Integration

This semester, I’m taking a graduate course called Introduction to Big Data. It provides a broad introduction to the exploration and management of the large datasets being generated and used in the modern world. In an effort to open-source this knowledge to the wider data science community, I will recap the materials I learn in the class on Medium. Having a solid understanding of the basic concepts, policies, and mechanisms for big data exploration and data mining is crucial if you want to build end-to-end data science projects.

An Introduction to Big Data: Data Querying

This semester, I’m taking a graduate course called Introduction to Big Data. It provides a broad introduction to the exploration and management of the large datasets being generated and used in the modern world. In an effort to open-source this knowledge to the wider data science community, I will recap the materials I learn in the class on Medium. Having a solid understanding of the basic concepts, policies, and mechanisms for big data exploration and data mining is crucial if you want to build end-to-end data science projects.

Datacast Episode 9: Diving into Data Engineering with Mark Sellors

Mark Sellors is the Head of Data Engineering at Mango Solutions, a UK-based data science consultancy. He has more than a decade’s experience working with analytical computing environments, DevOps, and Unix/Linux. He uses that experience to help Mango’s customers transform their analytic capabilities so they can make the most of their data.