Codd's "completeness" paper may be the source of the "simple data" misconception about relational systemsThirty Years of Relational - Codd's Relational AlgebraIn 1972, Codd published another famous paper, "Relational Completeness of Data Base Sublanguages".3 Note that now he's talking about "data bases" instead of "data banks," yet he hasn't yet quite made it all the way to database as one word .(That didn't happen until 1979). In this paper, Codd gives formal definitions for both a relational algebra and a relational calculus; he also defines the notion of relational completeness, gives an algorithm for transforming an arbitrary calculus expression into an equivalent expression of the algebra (thereby proving that the relational algebra is relationally complete), and argues in favor of calculus over algebra as a basis for a practical database language. This is quite a lot of territory to cover for such a comparatively short paper (only 36 double- spaced pages, plus an abstract). An OverviewThis paper--which I'll refer to from now on as the "completeness paper"--is perhaps Codd's most formal. It's certainly the most difficult for the average reader. However, it's fundamental because it rounds out the material from the first two papers1, 2 to complete the definition of what we now think of as the original relational mode. Similar to its predecessors1, 2 the completeness paper continues to talk as if the relational model consisted of structural aspects only: "In previous papers ... we proposed a relational model of data ..." Codd writes. "[Now] we define a collection of operations on relations ..." In other words, you're not supposed to see those operations as part of the relational model per se. Nowadays, we definitely do regard the operations as an integral part of the model; what the completeness paper does is propose alternative but equivalent formalisms as a basis for that part. What does the completeness paper achieve? To quote the abstract: "In the near future, we can expect a great variety of [proposals for database languages]. This paper [provides] a theoretical basis which may be used to determine how complete a selection capability is provided in [such a language]." That theoretical basis is the "relational completeness" of the paper's title; a language is relationally complete if it's as powerful as the relational calculus (speaking a trifle loosely). Later in the paper, Codd asserts categorically that every general-purpose database language should be at least this powerful; in particular, he observes that in such a language, you can formulate queries without using "programming loops or any other form of branched execution--an important consideration when interrogating a data base from a terminal" (and important for access from within application programs, as we all now know). Languages should be "at least" relationally complete because according to Codd, such completeness is only a basic measure of a language's power. Practical languages need other capabilities as well. There are five principal sections in the paper: 1. Introduction 2. A Relational Algebra 3. Relational Calculus 4. Reduction 5. Calculus vs. Algebra Following the introduction, the paper gives a formal definition of a relational algebra--often now called Codd's relational algebra in order to distinguish it from other algebras that have been defined at various times--and a relational calculus. The paper then defines relational completeness in terms of the relational calculus. Next, it presents an algorithm ("Codd's reduction algorithm") for transforming an arbitrary expression of the calculus into an equivalent algebraic expression, proving that the algebra is at least as powerful as the calculus and is relationally complete. Finally, Codd offers some opinions regarding the relative merits of algebra and calculus as a basis for practical database language design. Relational Algebra OperatorsAccording to Chambers Twentieth Century Dictionary, an algebra is "any of a number of systems using symbols and involving reasoning about relationships and operations." More specifically, an algebra consists of a set of objects and a set of operators that satisfy certain axioms or laws (such as closure, commutativity, or associativity). The word "algebra" ultimately derives from the Arabic al-jebr, meaning a resetting (of something broken) or a combination. In relational algebra, the objects are relations; the operations are things such as restriction, projection, and join, and there are several axioms or laws, including closure - the crucially important one. The result of any relational operation is always another relation. As I've explained elsewhere,4 closure has the important consequence that lets us write nested relational expressions. The algebraic operations are all read only. (They all serve to derive some relation from other relations.) Therefore, they operate very specifically on relation values. From relational algebra's point of view, there's no need for -- and no such thing as -- a relation variable. Of course, a real database language will also require update operations and will require relation variables, too, but such matters are not part of the relational algebra per se. For "notational and expository convenience," Codd's paper assumes that "domains of relations" -- he really means attributes or columns -- are identified by ordinal position, not by name (though he does recognize that names are better in practice). As a result, he doesn't get into the question of column names of results, which is actually an important aspect of closure, and he doesn't discuss the need for a column rename operator. (See reference 4 for a detailed discussion of these issues.) In what follows, I'll use column names rather than numbers. The paper -- very unfortunately, in my opinion! -- says that "for data base purposes, we are concerned with data consisting of integers and character strings." This remark might possibly be the source of the misconception that relational systems are capable of dealing only with simple data such as numbers and strings. (The paper does also say that "other types of primitive elements may be included," but this remark merely brings us back to the thorny question of what "primitive," or "atomic," data might be.)8 The strange thing is that the paper nowhere relies on the "simple data" assumption, at least not in any significant way. Codd defines the following algebraic operations:
I don't propose to give definitions of all of these operations here because you can find such definitions in many other places (see reference 4 for an example). However, I will just note a few differences between Codd's definitions and those usually adopted today, as well as comment on certain related matters of interest. Cartesian product: Strictly speaking, the Cartesian product of two relations is a set of pairs of the form (a,b), where a is a tuple from the first operand relation and b is a tuple from the second. (The completeness paper seems to be the first in which Codd uses "tuple" as an abbreviation for n-tuple.) Because of the closure objective, however, we want the result to be a relation specifically. Codd defines an expanded version of Cartesian product, which produces a set of tuples instead of a set of pairs. More precisely, where the regular Cartesian product would have produced the pair (a,b), the expanded version produces a tuple consisting of all components of a together with all components of b. The expanded product operation --unlike the regular one -- is commutative and associative, meaning we can talk unambiguously of the product of n relations for any n. (We can even let n be one or zero, though Codd doesn't discuss these possibilities explicitly.) Union, intersection, difference: This paper was the first to mention the specifically relational versions of these operations, in which the operands are required to be "union-compatible" (so that again the result is always another relation). Union and intersection (but not difference) are both commutative and associative. O-restriction: "O" stands for any of the usual comparison operations (=, <). If A and B are attributes of relation R, the O-restriction of R on A and B is defined to be a relation with the same attributes as R and containing just those tuples of R for which the condition "A O B" is true. Note that this definition is different from the one given in reference 2. Note also that attributes are allowed to be "compound"--for example, the simple attributes STREET, CITY, STATE, and ZIP might be regarded as a compound attribute ADDR--though Codd doesn't address the question of what a comparison such as "A < B" might mean if A and B are compound in this sense. Projection: Codd makes the important observation that projection provides an algebraic counterpart to the existential relational calculus quantifier. Note: The paper adopts the convention that the projection of a relation R over no attributes at all is simply R. This convention is unfortunate, however, because projecting over no attributes at all should actually yield a nullary relation--specifically, either TABLE_DEE or TABLE_DUM.5 O-join and natural join: Codd defines natural join i terms of O-join (more precisely, as a projection of an equijoin). Today, we tend to regard natural join as being the more fundamental operation (so much so that the unqualified term "join" is usually taken to mean the natural join specifically). Indeed, you might have noticed that "O-joins" weren't even mentioned in Codd's first two papers.1, 2 Further, Codd does note that you can define O-join in terms of O-restriction, so it's a not a primitive operation. (Actually, the opposite is true--that is, you can define O-restriction in terms of O-join. Which collection of operations we choose to regard as primitive is somewhat arbitrary depending on where we stand. One commonly accepted primitive collection is restriction, projection, product, union, and difference.) Like the (expanded) product, union, and intersection operations, natural join is both commutative and associative. Division: The completeness paper was the first to mention this operation; in fact, it's clear that Codd introduced it here to serve as "an algebraic counterpart to the universal quantifier" in his reduction algorithm. He says that it isn't primitive; you can define this operation in terms of the operations already described. (As a matter of fact, the definition is not quite the same as the one we use today, but the differences are minor. It's possible to define a version of the operation that -- unlike the one defined in the paper -- allows any relation to be divided by any relation, as I've explained elsewhere.)6 Have you ever wondered why the operation is called "division?" The following identity shows why: ( R TIMES S ) DIVIDEBY S = R Division is a kind of inverse of Cartesian product, and it isn't quite the counterpart to the universal quantifier that it was meant to be. It suffers from problems having to do with empty sets and related matters.6 In fact, Codd offers an example that illustrates the problem: Given a relation SP { S#, P#, ... } showing which suppliers supply which parts, he claims that the expression SP { S#, P# } DIVIDEBY SP { P# } will give numbers for suppliers who provide all parts. However, if there aren't any parts, this expression gives the wrong answer. (It gives no supplier numbers, yet it should give them all). Relational comparisons provide a better basis for dealing with the kinds of problems that division was intended to solve -- but the relational model as originally defined by Codd didn't include such comparisons at all.7 Factoring: This operation (which today is more usually called nesting) converts a normalized relation to unnormalized form. Given the normalized relation EMP, for example, with attributes EMP# and DEPT#, you could use the operation to produce an unnormalized relation with attributes DEPT# and SET_OF_EMPS, in which each tuple contains a department number and a set of all corresponding employee numbers. This operation isn't used at all in the body of the paper; Codd relegates its definition to an appendix, suggesting it might be useful "for presentation purposes." Our understanding of normalization's true nature has improved since 1971; in fact, we now regard all relations as normalized. "Unnormalized relation" is a contradiction in terms. However, it's true that relations involving relation-valued attributes are often contraindicated. References1. E. F. Codd: "Derivability, Redundancy, and Consistency of Relations Stored in Large Data Banks." IBM Research Report RJ599 (August 19th, 1969). 2. E. F. Codd: "A Relational Model of Data for Large Shared Data Banks." CACM 13, No. 6 (June 1970). Republished in Milestones of Research--Selected Papers 1958-1982 (CACM 25th Anniversary Issue), CACM 26, No. 1 (January 1983). 3. E. F. Codd: "Relational Completeness of Data Base Sublanguages" (presented at Courant Computer Science Symposia Series 6, "Data Base Systems," New York City, N.Y., May 24th-25th, 1971). IBM Research Report RJ987 (March 6th, 1972). Republished in Randall J. Rustin (ed.), Data Base Systems: Courant Computer Science Symposia Series 6. Englewood Cliffs, N.J.: Prentice-Hall (1972). 4. C. J. Date: An Introduction to Database Systems (6th edition). Reading, Mass.: Addison-Wesley (1995). 5. C. J. Date: "Tables with No Columns." Database Programming & Design 6, No. 3 (March 1993). Republished in Relational Database Writings 1991-1994. Reading, Mass.: Addison-Wesley (1995). 6. C. J. Date: "Divide--and Conquer?" Database Programming & Design 7, No. 4 (April 1994). Republished in Relational Database Writings 1991-1994. Reading, Mass.: Addison-Wesley (1995). 7. C. J. Date: "Relations Beyond Compare." Database Programming & Design 7, No. 5 (May 1994). Republished (as "Relational Comparisons") in Relational Database Writings 1991-1994. Reading, Mass.: Addison-Wesley (1995). 8. C. J. Date: "Nested Relations." Part 1, Database Programming & Design 8, No. 3 (March 1995); Part 2, Database Programming & Design 8, No. 4 (April 1995).
C. J. Date is an independent author, lecturer, researcher, and consultant specializing in relational database systems. His most recent books are Foundation for Object/Relational Databases: The Third Manifesto (1998), coauthored with Hugh Darwen, and Relational Database Writings 1994-1997, both published by Addison-Wesley in 1998. You can reach him at iemagazine@mfi.com. |
Most Popular This Week
IE Weekly Newsletter
Subscribe to the newsletter
|
|
|




