How and why the Relational Model works for databases

This is a note on Turing Award laureate Ted Codd's revolutionary paper, "A Relational Model of Data for Large Shared Data Banks". In this post, I will review the paper and add my comments from the perspective of modern distributed databases.

Tight coupling

How users interacted with databases used to be tightly coupled with implementation details, e.g. how bits are managed and represented on physical hardware. Users might expect to get replies in a certain order because data is sorted in a specific order on disk, without explicitly expressing any ordering requirement in the query (Ordering Dependence). Indices on data were exposed directly to users, which makes changing them (especially removing them) later difficult (Indexing Dependence). Data was organized in tree structures (like folders), e.g. employees nested under companies, and children nested under employees. The structure had to be exposed to users, who followed this path for data access. This means the tree structure can't be changed without breaking applications (Access Path Dependence).

Abstraction

Many problems in computer science are solved by introducing another level of indirection or abstraction. What if, instead of leaking data storage order, indices, or how we structure the data storage, we introduce a language that just describes the data itself? It would be completely declarative, decoupling how users reason about the data from how it's actually organized on disk. In Ted Codd's own words,

... the independence of application programs and terminal activities from growth in data types and changes in data representation ... (section 1.1)

This abstraction hides more complexity from the users and shifts more responsibility to the database itself. It sounds really nice. But how exactly do we describe data?

All data can be described as Relations

(Given domains S1, S2,... Sn, a relation) R is a subset of the Cartesian product S1 x S2 x ... x Sn. (section 1.3)

Relations can be of different degrees. A relation on a single domain is unary, one of degree 2 is binary, and one of degree n is n-ary. Relations are usually represented as an array of tuples, but this is only for expository purposes. Since we are using relations to describe the data itself and the data only, "an array of tuples" implies nothing about how the data would actually be organized on disk (e.g. column-oriented or otherwise). This is a profound insight by Dr. Codd in 1970!

For example, employment is a relation over the domains employees and companies. The existence of a tuple in this relation, e.g. <Bob, CakeCompany>, describes a specific employment relationship between Bob and a company named CakeCompany. A relation is a set, hence duplicate entries are removed. This makes sense intuitively: there's no point describing the employment relationship between Bob and the CakeCompany more than once. Notice that tuples in a relation are also not sorted. It doesn't make sense to say that <Alice, IceCreamCompany> should come before <Bob, CakeCompany>; otherwise, we would fall back to Ordering Dependence again. Ordering is the job of the query (which may use indices under the hood to speed things up).
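To make this concrete, here is a minimal Python sketch (mine, not from the paper) that models a relation as a set of tuples. The names and data are made up; the point is that a set gives us deduplication and the absence of ordering for free:

    # A relation modeled as a Python set of tuples. Sets deduplicate
    # and carry no order, matching the definition of a relation.
    employment = {
        ("Bob", "CakeCompany"),
        ("Alice", "IceCreamCompany"),
        ("Bob", "CakeCompany"),  # stating the same fact twice adds nothing
    }
    assert len(employment) == 2  # the duplicate collapses away

    # Ordering is a property of the query, not of the relation:
    print(sorted(employment))    # e.g. order by employee name on the way out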

What is a relation in English? Actually, the dictionary definition works pretty well in this case. The following definition comes from Merriam-Webster.

an aspect ... that connects two or more things or parts as being or belonging or working together

The dictionary definition excludes unary relations, but we can set that aside. A relation is an aspect that connects things together. This definition is so general that it can be used to describe all data. There's no formal proof of the relational model's universality when it comes to describing data. However, we can go through a thought experiment to convince ourselves that it can indeed describe all data.

Imagine a vacuum where there's absolutely nothing. Then comes data, one piece after another. There is no meaning associated with these data (imagine they are just randomly generated bits). All these data are within a domain S, which represents all possible data. That is our first unary relation. Even though the data itself is meaningless, randomly generated by the universe, we can run meaningful queries against it, e.g. have the same random bits been generated before? We can create a binary relation for this randomly generated data as well, e.g. former and latter, to capture the order in which elements arrived.

As time goes on, we have different kinds (domains) of data. Some are numbers, some are text, images, videos, or anything else that can be stored. If they all work in complete isolation, with no relations across these domains, as if they were in separate universes, it would be the same as our initial scenario. If data from multiple domains (not necessarily distinct) start to connect, more relations are formed. The meaning of connection here is in the most general sense: everything in this universe is connected. E.g. you, who are reading this post at this moment, are absolutely connected with me in a relation defined as people who read this post. E.g. whatever JWST observes in the future is also connected with you and me in a relation defined as things in this universe (or things under the influence of General Relativity, or you name it).

Relations can describe data in isolation as well as connections among data. All data can be described as relations. From this perspective, all data in a database is just a "collection of time-varying relations".

Ted Codd then went into detail about how to describe data as relations. There he defined the primary key, the foreign key, a process of normalization, and the linguistic aspects of this declarative language for describing relations. All these aspects of databases have remained virtually unchanged since 1970, which is absolutely remarkable. Thanks to the NoSQL movement, we know it's not for a lack of trying.

Relational vs. k/v and graph model

A key/value store is just a binary relation. A graph model is again nothing but a restricted relational model: each node can be described by a relation over its attributes, and each edge is nothing more than a binary relation. You can expand the degree beyond 2 if you store data on the edge itself. Comparing any data model against the relational model on expressiveness is a nonstarter.
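To illustrate, here is a small Python sketch (hypothetical names and data) of how both models reduce to relations over sets of tuples:

    # A key/value store is a binary relation over keys and values.
    kv_store = {("user:42", '{"name": "Bob"}')}

    # A graph's edge set is another binary relation over nodes.
    follows = {("alice", "bob"), ("bob", "carol")}

    # Storing data on the edge itself just widens the degree to 3.
    follows_weighted = {("alice", "bob", 0.9)}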

However, in practice, people are not just comparing data models but actual databases hosting these data models. Simple key/value stores and graph stores, by restricting their data models, are generally easier to scale in a distributed deployment. Describing data in terms of relations is incredibly powerful, but in a distributed system, you have to decide where to store a relation, how to shard a relation, etc. Here, the relational model's generality becomes a challenge.

Operations on Relations

Now we have relations that describe the data we have, which is great. However, a database needs to be queried to be useful (otherwise it's indistinguishable from just writing to /dev/null). Dr. Codd then went into detail about a set of operations that can be performed on relations: projection (π), natural join (*), and restriction (|).

A projection is an operation that selects a few columns from the operand relation. Its notation is

πi1,i2,...,ik(R)

for selecting columns i1, i2, ..., ik from relation R.
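A minimal Python sketch of projection, assuming the set-of-tuples representation from earlier (column indices stand in for the subscripts):

    # Projection: keep only the listed column indices from each tuple.
    # The result is itself a relation (a set), so duplicates collapse.
    def project(relation, cols):
        return {tuple(row[c] for c in cols) for row in relation}

    employment = {("Bob", "CakeCompany"), ("Alice", "IceCreamCompany")}
    print(project(employment, (0,)))  # unary relation of employee names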

The result of a natural join of binary relation R with binary relation S is defined as

R*S = {(a, b, c): R(a,b) ∧ S(b, c)}.
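Following that definition literally, a sketch in Python (with made-up data):

    # Natural join of binary relations R and S: emit (a, b, c) whenever
    # R contains (a, b) and S contains (b, c).
    def natural_join(r, s):
        return {(a, b, c) for (a, b) in r for (b2, c) in s if b == b2}

    works_at = {("Bob", "CakeCompany")}
    located_in = {("CakeCompany", "Berlin")}
    print(natural_join(works_at, located_in))
    # {('Bob', 'CakeCompany', 'Berlin')}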

For restriction,

RL|MS (the L,M restriction of R by S) is the maximal subset R' of R such that πL(R') = πM(S).

E.g. let R be the employment relation, πL(R) be all the employee names, and πM(S) be the single tuple <Bob>. Then RL|MS will give me Bob's employment tuple. Restriction is more general, of course, as it can be applied to multiple columns.
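A sketch of restriction in Python, reusing the set-of-tuples representation; L and M are encoded as tuples of column indices (my choice of encoding, not the paper's). It keeps the rows of R whose L-projection values appear in πM(S):

    def project(relation, cols):
        return {tuple(row[c] for c in cols) for row in relation}

    # (L, M) restriction of R by S: keep rows of R whose L-columns
    # match some tuple in the M-projection of S.
    def restrict(r, l_cols, s, m_cols):
        allowed = project(s, m_cols)
        return {row for row in r if tuple(row[c] for c in l_cols) in allowed}

    employment = {("Bob", "CakeCompany"), ("Alice", "IceCreamCompany")}
    s = {("Bob",)}
    print(restrict(employment, (0,), s, (0,)))  # {('Bob', 'CakeCompany')}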

Relational algebra was not invented in Dr. Codd's 1970 paper; he was applying "elementary relation theory". On Wikipedia, you can find other relational operations, e.g. selection, outer join, etc. Some of these can be expressed using just projection, natural join, and restriction.

My previous example of restriction is essentially a selection of Bob's employment tuple. Some selection operations might be more general, e.g. selecting employees who are older than 30. Well, that can be expressed as

RL|1S, where R is the employment relation, πL(R) returns the age column of R, and S = {x | x > 30}.

Again, these operations are completely decoupled from how queries are implemented. Users describe data as relations, which do not exist on disk in that form. Users describe operations upon these relations, which likewise imply nothing about how these operations would be implemented and translated to machine instructions. My previous restriction case is a good example: nowhere does it say that for S = {x | x > 30}, we have to store all the integers greater than 30 on disk. It's just a description of all integers greater than 30.
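One way to see this: S can be represented by its membership predicate rather than materialized. A Python sketch under that assumption (the schema and data are made up):

    # S = {x | x > 30} as a predicate; the set is never materialized.
    def restrict_by_predicate(r, col, predicate):
        return {row for row in r if predicate(row[col])}

    employment = {("Bob", 42), ("Alice", 28)}  # hypothetical (name, age) schema
    print(restrict_by_predicate(employment, 1, lambda age: age > 30))
    # {('Bob', 42)}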

Redundancy and Consistency

A set of relations is redundant if some relation in it has a projection that can be derived from projections of relations in the set. E.g. a relation employee (id#, name, manager_id#, managername) is redundant because

π34(employee) = π12(employee)1|1π3(employee).
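We can check this derivation with the earlier Python sketches (the data is hypothetical, and managers are assumed to be employees themselves):

    def project(relation, cols):
        return {tuple(row[c] for c in cols) for row in relation}

    def restrict(r, l_cols, s, m_cols):
        allowed = project(s, m_cols)
        return {row for row in r if tuple(row[c] for c in l_cols) in allowed}

    employee = {
        (1, "Carol", 2, "Dave"),
        (2, "Dave", 2, "Dave"),  # Dave manages himself
    }
    lhs = project(employee, (2, 3))                  # pi_34(employee)
    rhs = restrict(project(employee, (0, 1)), (0,),  # pi_12(employee) 1|1 ...
                   project(employee, (2,)), (0,))    # ... pi_3(employee)
    assert lhs == rhs  # the (manager_id#, managername) columns are derivable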

Having redundancy means there are multiple copies of the same data in various forms, and keeping them consistent is a challenge. Dr. Codd called this out in the last section of the paper. The good news is that Jim Gray, another Turing Award laureate, later came up with transactions, which solved the consistency problem and more.

50+ years and still counting

It's mind-boggling how much of Ted Codd's paper published in 1970 is still true and unchanged after more than 50 years, in a field where almost everything else is rapidly changing. It really speaks to the declarative power of the relational model. It offers tremendous expressiveness about data, and completely decouples the user's data model from the physical data representation (column-oriented databases are a good example). It's one of the greatest examples of solving a hard computer science problem by introducing another level of indirection/abstraction.