This semester, I’m taking a graduate course called Introduction to Big Data. It provides a broad introduction to the exploration and management of large datasets being generated and used in the modern world. In an effort to open-source this knowledge to the wider data science community, I will recap the materials I will learn from the class in Medium. Having a solid understanding of the basic concepts, policies, and mechanisms for big data exploration and data mining is crucial if you want to build end-to-end data science projects.
Here’s the roadmap for this introductory post:
Overview of database engines
1 - Overview of Database Engines
So why should we use a database? Well, the first reason is that a database gives a lot of useful abstractions. Secondly, it also has these properties known as ACID (Atomicity, Consistency, Isolation, Durability).
Atomicity: Operations executed by the database will be atomic / “all or nothing.” For example, if there are 2 operations, the database ensures that either both of them happen or none of them happens.
Consistency: Anyone accessing the database should see consistent results.
Isolation: If there are multiple clients trying to access the database, there will be multiple transactions happening simultaneously. The database needs to be able to isolate these transactions.
Durability: When writing a result into the database, we should be guaranteed that it won’t go away.
In a database engine, there are 2 main components: the storage manager and the query processor.
The storage manager is the interface between the database and the operating system. It is responsible for authorization, interaction with the OS file system (accessing storage and organizing files), and efficient data storage/modification (indexing, hashing, buffer management).
One very important piece of the storage manager is the transaction manager. It ensures the database is consistent (if a failure occurs) and atomic. It also does concurrency control to make sure multiple operations result in a consistent database.
For those who are not familiar, transactions are collections of operations for a single task. Examples include:
Assume a constraint balance > 0
Deduct 50 from A
Add 50 to the balance of B
Store the new balance
On the other hand, the query processor is responsible for 3 major jobs: parsing and translation, optimization, and evaluation. The diagram below gives an overview of the query processor:
Of course, all components must work together. The storage manager must make sure transactions are durable. The query processor uses indexes managed by the storage manager. And the transaction manager must provide consistent data to query processor.
Let’s look at how we actually interface with our database. As seen below, different users require different interfaces: app UX for naive users, app programs for app programmers, query tools for analysts, and admin tools for database admins.
For most of the time, we can think of our database as a black box, as seen in the diagram below (the SQL engine). We ask queries of our database (via SQL API), and the database gives us the answer. The front end that we see includes SQL user interface, forms interface, report generation tools, data mining/analysis tools…
2 — Data modeling
Let’s look at different ways that we can do modeling of data. There are usually 3 levels of abstraction that we can look at:
Physical layer — how data is stored on hardware (actual bytes, files on disk, etc.)
Logical layer — how data is stored in the database (types of records, relationships, etc.)
View layer — how applications access data (hiding record details, more convenience, etc.)
A data model is a bunch of tools for describing what our data looks like, the relationship between the data, what the data means, and constraints against our data. There are many examples of data model, including relational model, entity-relationship model, object-based model, semi-structured model, and network model.
Before looking at the relational model, we need to have a way to think about what our database needs to store. Many conceptual models exist that are independent of how a particular database stores data. A common choice is the ER (Entity-Relationship) model, which does not specify how data will actually be stored. It also does not specify the interface we will use to access the data. ER model is very useful for collecting requirements.
3 — Entity-relationship modeling
Let’s dig deeper into the main components of an ER model. The image below shows an example of an entity set for a doctor example:
An entity set (represented by a rectangle) is a type of thing in the real world. It is distinguishable from other types and also has a set of properties or attributes possessed by things of the same type.
Each attribute has an associated type which is normally atomic. One or more attributes called the primary key can uniquely identify an entity. The set of valid values for an attribute is called the domain. Lastly, attributes may be simple or complex.
A relationship (represented by the diamond) is used to document the interaction between 2 entities. In the example above, a patient has a primary doctor. Each relationship has a cardinality or a restriction on the number of entities. This helps implicitly define a role for each entity set in the relationship. For example, if a patient is supervised by a doctor, then the patient has a supervisee role and a doctor has a supervisor role.
Relationships may also have attributes. For example, in the diagram below, a patient (entity) can be insured by his/her policy number (relationship) with an insurance company (entity):
Again, cardinality refers to the maximum number of times an instance in one entity can relate to instances of another entity. There are 3 cardinalities that define the relationships between entity sets (explained by the diagram):
One-To-One: Each visit corresponds with one bill.
One-To-Many: One doctor can have many patients as their primary doctor.
Many-To-Many: Patients are allowed to pay multiple bills in one payment, and each bill may have multiple payments associated with it.
Another important concept in entity-relationship modeling is inheritance. For example, in the diagram below, both doctor and patient inherit the attributes of the person entity. That means we can identify any doctor and any patient by his/her unique SSN, first/middle/last name, phone number, birth date, gender, email, and occupation.
Each entity in an entity set must have some type of key. This is usually a subset of the attributes associated with an entity. Furthermore, the key should never or rarely change.
There are 3 main types of keys:
Super key is sets of keys that uniquely identify the entity.
Candidate key is a super key that guarantees to be unique.
Primary key is the candidate key that we actually pick to use in database design.
But what if there is no candidate key?
One solution is to generate an artificial ID attribute and ensure that a unique value is assigned. Some examples are order number, customer ID…
Another solution is to use a weak entity set. In the diagram below, the diamond ‘Attends’ represents a weak relationship and the ‘Visit’ is a weak entity set. If we use the SSN of the patient in addition the scheduled date & time of his/her visit, we will be able to identify a viable candidate key.
When designing an ER model, here are a couple of criteria to consider:
Whether you should choose attributes or entity sets?
Whether you should use entity sets or relationships?
Whether you should select strong or weak entity sets?
Remember that the ER model is conceptual and not what a database actually uses. We need a more concrete model to actually implement our application. Thus, let’s talk about the relational model.
4 — Relational Model
Database systems don’t use the ER model directly. We need to move on to the next stage and pick a logical model. The first we’ll explore is the relational model. Relational model is very common among modern database systems in the industry, including MySQL, Microsoft SQL Server, IBM DB2, Microsoft Access, Oracle DB, and PostgreSQL.
A relation is a group of related attributes like in an entity set. In a relational database, these are represented as tables. Each relation should have a primary ket. In the tables below, both Patient and Doctor tables have SSN as primary keys. Relations may also have foreign keys or attributes which refer to other relations. In the example below, the foreign key of the Patient table is the primaryDoctor that references the Doctor table.
A powerful function in relational database is the join function that can join two tables together according to a similar key, as seen in the example below.
To convert an ER model into a relational model, attributes of strong entity sets become attributes of the relation. The primary keys are maintained. For weak entity sets, we create a relation table and link that to our strong entity sets.
To deal with weak relationship sets, we can simply discard these since the relationship is captured by the weak entity set. In the example below, the Attends relationship is captured by the Visit relation created from the weak entity set Visit.
How about strong relationships? In the old ER model, Patient is insured by an Insurance Company by a policy number. In the relational model, we create 3 separate tables: Patient, InsuredBy, and InsuranceCompany. In the InsuredBy table, the patient attribute is used as a foreign key to reference the Patient table and the company attribute is used as a foreign key to reference the InsuranceCompany table.
Let’s look at a way to optimize our relational database design. In the diagram below, we don’t need to have a separate table for Primary. Instead, we only need Patient and Doctor because each patient can have at most one primary doctor, so the primaryDoctor attribute can be used a foreign key in the Patient table to reference the Doctor table.
Lastly, how can we deal with inheritance? Remember earlier, inheritance in ER model means that two or more entity sets have a lot of similar attributes. There are 3 approaches to convert them in relational model, and I’ll demonstrate them using the Patient & Doctor example above:
Whole hierarchy: Essentially, we can create 3 separate entity sets — Person, Patient, and Doctor; and link Patient and Doctor to Person. The Person entity set have ssn as its primary key, along with other attributes including first name, middle name, and last name. The Patient’s ssn and Doctor’s ssn are foreign keys that link to Person’s ssn.
Top hierarchy: There is only one entity set — Person. With primary key ssn, Person has all the other attributes of Patient and Doctor.
Bottom hierarchy: Only 2 entity sets — Patient and Doctor — are needed. We keep all the existing attributes for both of them.
If you’re interested in this material, follow the Cracking Data Science Interview publication to receive my subsequent articles on how to crack the data science interview process.