Welcome Guest. | Log In| Register | Membership Benefits

Intelligent Enterprise

Better Insight for Business Decisions

Intelligent Enterprise - Better Insight for Business Decisions
search Intelligent Enterprise
Home
Digital Library
Events
RSS | Newsletters
Webcasts


March 30, 1999 Volume 2 Number 5

The beginnings of relational dependency theory

Thirty Years of Relational: The First Three Normal Forms

By C.J.Date

As if the invention of an entirely new theory of data (the relational model) wasn't contribution enough, Codd went on to use that theory as a basis for developing another completely new, and important, set of theoretical ideas -- namely, the ideas of further normalization. The field of further normalization (now more usually referred to as dependency theory, though Codd himself didn't use that term) can be regarded as a separate discipline, one that sits on top of the relational model but is not itself part of that model. As is well known, the most obvious practical application of dependency theory is in database design, but its usefulness is certainly not limited to that area alone. In fact, the "dependencies" of dependency theory are really a special case of integrity constraints in general, and many of the ideas developed as part of dependency theory turned out to be relevant to other kinds of integrity constraints as well.

By the way, there's no truth in the story (so far as I know) that Codd introduced the terminology of "normalizing relations" because Richard Nixon was trying to "normalize relations" with China at the time Codd was doing his research.

The Normalization Papers
Codd wrote three papers on the topic of further normalization. In chronological order, they are:

1. "The Second and Third Normal Forms for the Relational Model"2

2. "Further Normalization of the Data Base Relational Model"3

3. "Normalized Data Base Structure: A Brief Tutorial"5

The first of these papers, a brief (four-page) and rather terse technical memo, contains the first published definitions of second and third normal form -- though it's worth noting that it defines those concepts in terms of collections of relations, not in terms of relations per se:

"A collection C of relations is in second normal form if ... every relation R in C [is in second normal form as we now understand that term] ... A collection C of relations is in third normal form if ... every relation R in C [is in third normal form as we now understand that term]."

The other two papers, in contrast, frame their definitions and discussions in terms of individual relations, not collections.

I remark without further comment that this first paper unfortunately continues to talk in terms of domains instead of attributes; it also continues to talk as if the phrase "the relational model" referred to a view of the data in some specific database. (In fact, this latter criticism applies to the other two papers, as well.)

The second paper is the most formal of the three, and the one I want to concentrate on in this installment. It's really the primary source on the subject; indeed, it's often referred to, informally, just as "Codd's normalization paper." All quotes in what follows are taken from this paper unless otherwise indicated.

The third paper is, as its title makes clear, primarily tutorial in nature. It concerns itself not so much with the higher normal forms per se, but rather with the idea that relations in general can represent anything that other data structures -- hierarchies, networks (here called "plexes"), and so on -- can; moreover, they do it better! ("[Users often have] occasion to require tables to be printed out or displayed. What could be a simpler, more universally needed, and more universally understood data structure than a table? Why not permit ... users to view all the data ... in a tabular way?") The paper does go on to discuss second and third normal forms briefly, but its coverage is essentially limited to giving a single -- and very informal -- example in each case.

Note: Like the relational completeness paper4, the third paper also -- very unfortunately, in my opinion -- says that "each item [in a relation] is a simple number or a character string." As I wrote in this series previously11, the misconception that relational systems are capable of dealing only with very simple data such as numbers and strings remains widespread to this day, and remarks such as the one just quoted must be held partly responsible for that misconception.

Incidentally, while I'm on the subject of what can go inside relations, I should mention that (oddly enough) the sole appearance of a definition of first normal form in any of the three papers2,3,5is tucked away in an appendix to reference3:

"A relation is in first normal form if ... none of its domains has elements which are themselves sets. An unnormalized relation is one which is not in first normal form."

As I've mentioned before10, I don't quite accept this definition because I don't fully accept Codd's notion of data value atomicity. However, I don't want to get into that particular discussion here; if you're interested, you can find more details in a paper by Hugh Darwen7.

One last point: The three papers don't use the abbreviations 1NF, 2NF, 3NF; however, those abbreviations have since become virtually standard in the industry, and I'll use them myself in what follows.

Advantages of Further Normalization

Codd's "Further Normalization of the Data Base Relational Model"3 gives the following list of advantages that accrue from adopting a level of normalization higher than just first. (Actually, the paper refers to these advantages as objectives -- for further normalization, that is -- rather than as advantages as such. Please note that I've reworded them all somewhat here.)

1. It frees the database from certain insert, update, and delete anomalies.

2. It reduces the need to restructure the database as new kinds of data are introduced, thereby increasing application lifetimes.

3. It makes the database more informative to users.

4. It avoids biasing the database design in favor of certain queries at the expense of others.

The paper also goes on later to suggest that another advantage is that "the [higher] normal forms ... tend to capture some aspects of the semantics (minor, of course)." I love that parenthetical remark! What a contrast to some of the overblown claims we so often encounter regarding "semantic modeling" and the like elsewhere in the literature.

Note: The third (tutorial) paper mentions two further advantages of normalization (noting, however, that -- unlike the ones already quoted -- these two apply to mere first normal form rather than to 2NF and 3NF as such). Again I've paraphrased the original material somewhat:

5. It allows any relation to be represented as a table (basically because there are no repeating groups).

6. It allows the operations needed for data access to be simpler than they would otherwise have to be.

The first two papers also mention a couple possible disadvantages of further normalization: "[It incurs] the penalty of extra relation names ... [Also, some] queries will... need to employ more join terms than might otherwise be the case." However, they do also go on to say: "This potential burden... can be eased by [providing predefined views] for heavily used... queries."

Major New Concepts

Codd's normalization paper includes the first published definitions of an impressive number of highly important concepts. I've listed those concepts here, with commentary in each case; I've omitted most of the definitions, however, because as a database professional you should be thoroughly familiar with them already.

* Functional dependence

As Codd says, "[this concept] ... is fundamental to ... data base design." How true! Note: The concept was mentioned in the first paper also (necessarily so), but was there called internal dependence. It wasn't mentioned at all in the third paper. By the way, along with the functional dependence concept as such, Codd introduced the (fairly obvious) idea of dependency diagrams, although he didn't actually use that term. And, of course, he also introduced the standard notation "R.A - R.B". Here A and B are sets of attributes of relation R, and the notation is read "B is functionally dependent on A," or "A functionally determines B," in R. The notation can be simplified to just A - B if R is understood.

* Full dependence

Given the functional dependence A - B, B is said to be fully dependent on A if B isn't functionally dependent on any proper subset of A (remember that A and B are both sets of attributes). Note: "Full" isn't really the mot juste here; I prefer the term irreducible myself (but it's not a big deal). The first paper uses the term minimal.

* Transitive dependence

I assume you know the meaning of this term. However, I should mention that Codd also defines what he calls strict transitive dependence: Given a relation R, R.C is strictly transitively dependent on R.A if there exists some R.B such that R.A - R.B and R.B - R.C and R.C -/ R.B and R.B -/ R.A. This concept is used in the definition of "optimal 3NF," which I'll discuss in the next installment. Note: The first paper uses the term immediate in place of nontransitive (and, presumably, nonimmediate in place of transitive).

* Trivial dependence

A functional dependence A - B is said to be trivial if the set of attributes B is a subset (not necessarily a proper subset) of the set of attributes A.

* Candidate key

A candidate key K is required to be unique and nonredundant. (The latter term means that no attribute can be discarded from K without destroying the uniqueness property; again, I prefer the term irreducible for this concept, but again it's not a big deal.) Codd makes the important observations that (a) each attribute of relation R is functionally dependent on each candidate key of R, and (b) within any given candidate key, no proper subset of the attributes is functionally dependent on any other.

Note: It follows immediately from the first of these two observations that each attribute of R is functionally dependent on each superkey of R. (A superkey is a superset -- not necessarily a proper superset -- of a candidate key; in other words, we can obtain a definition of "superkey" by simply dropping the irreducibility requirement from the definition of "candidate key.)

Codd then goes on to say: "For each relation R in a data base, one of its candidate keys is arbitrarily designated as the primary key of R. The usual [sic!] operational distinction ... is that no tuple is allowed to have an undefined value for any of the primary key components, whereas ... components of [other candidate keys] may have an undefined value. This restriction is imposed because of the vital role played by primary keys in search algorithms."

You might recall that Codd said something similar to the foregoing in his 1970 paper, too1. Like many people, I've always been a bit bothered by the fact that the primary key is chosen "arbitrarily"; indeed, I've argued elsewhere9 that there are cases where we shouldn't have to choose a primary key at all. (The only hard requirement is that we must have at least one candidate key.) And I don't care very much for Codd's "usual operational distinction" between primary and nonprimary keys. (As you probably know, that distinction -- which was first mentioned, so far as I know, in the paper under discussion -- was later elevated into the entity integrity rule.6) Furthermore, I'm a little suspicious of Codd's justification for that "usual distinction," which seems to me to smack of implementation concerns.

I note in passing that the first and third papers in question both talk in terms of primary keys only, not alternate keys. (An alternate key is a candidate key that's not the primary key.) That is, in both papers Codd tacitly assumes that every relation has just one candidate key, which can thus be regarded, harmlessly, as the primary key. (Indeed, Codd doesn't use the term "alternate key" at all; in fact, I think the term is probably due to me -- I introduced it in the third edition of my book An Introduction to Database Systems8.)

* Insert/Update/Delete Anomaly

Actually it was the third paper that introduced the term "anomaly" -- the second paper used "dependency" -- but "anomaly" is the term that's passed into general use. Neither term was formally defined, but the concept is so important that it should be included in this list anyway. Note: Oddly enough, Codd nowhere points out explicitly that the anomalies in question are by redundancy; yet I know from my experience of teaching this material that when people first come to it, they immediately grasp the idea that low levels of normalization lead to redundancy (and "everyone knows" that redundancy is bad), whereas they have rather more difficulty in directly grasping the anomalies such redundancy can cause.

* Second Normal Form

Here's Codd's definition of 2NF: A relation R is in second normal form if it's in first normal form and every nonprime attribute is fully dependent on each candidate key of R. (A prime attribute is an attribute that participates in at least one candidate key -- not necessarily the primary key, despite the terminology! -- of the relation in question; a nonprime attribute is, of course, one that isn't prime.)

Now, what Codd was trying to do in introducing the prime attribute concept was head off the argument that, for example, the familiar shipments relation SP {S#,P#,QTY} wasn't in 2NF because attributes S# and P# weren't fully dependent on the candidate key {S#,P#}. As you probably know, however, the concepts did lead to some trouble later, especially in connection with 3NF. I'll have more to say on this topic in the next installment.

Incidentally, the second paper was the first one to make explicit the important point that the further normalization process is, precisely, a process of taking projections, and further that we can recover the original relation by taking the natural join of those projections. In other words, the process is reversible, or -- equivalently -- the decomposition is nonloss. Note, therefore, that dependency theory involves all three aspects of the original relational model (structure, integrity, manipulation), not just the structural aspect alone.

Note: The second paper doesn't really discuss the actual process of further normalization in any detail. To be more specific, it doesn't explicitly spell out the general rules regarding which projections to take, nor does it offer any proof of reversibility. However, a first cut at such rules and such a proof appeared in a more or less contemporary -- though much overlooked -- paper by Ian Heath.12

* Third Normal Form

Again, here's Codd's definition: A relation R is in 3NF if it's in 2NF and every nonprime attribute is nontransitively dependent on each candidate key of R. Note again the reference to nonprime attributes. Again, I'll have more to say on this topic next time.

C. J. Date is an independent author, lecturer, researcher, and consultant, specializing in relational database systems. His most recent books are Foundation for Object/Relational Databases: The Third Manifesto, coauthored with Hugh Darwen, and Relational Database Writings 1994-1997, both published by Addison-Wesley in 1998. Correspondence may be sent to him in care of Intelligent Enterprise, 411 Borel Ave. #100, San Mateo, CA 94402.

 

References

1. Codd, E. F. "A Relational Model of Data for Large Shared Data Banks." CACM 13(6), June 1970. Republished in Milestones of Research -- Selected Papers 1958-1982 (CACM 25th Anniversary Issue), CACM 26(1), January 1983.

2. Codd, E. F. "The Second and Third Normal Forms for the Relational Model." IBM technical memo (October 6th, 1970).

3. Codd, E. F. "Further Normalization of the Data Base Relational Model." (Presented at Courant Computer Science Symposia Series 6, "Data Base Systems," New York City, May 24th-25th, 1971.) IBM Research Report RJ909 (August 31st, 1971). Republished in Randall J. Rustin (ed.), Data Base Systems: Courant Computer Science Symposia Series 6. Prentice-Hall, 1972.

4. Codd, E. F. "Relational Completeness of Data Base Sublanguages." (Presented at Courant Computer Science Symposia Series 6, "Data Base Systems," New York City, May 24th-25th, 1971). IBM Research Report RJ987 (March 6th, 1972). Republished in Randall J. Rustin (ed.), Data Base Systems: Courant Computer Science Symposia Series 6. Prentice-Hall, 1972.

5. Codd, E. F. "Normalized Data Base Structure: A Brief Tutorial." Proc. 1971 ACM SIGFIDET Workshop on Data Description, Access, and Control, San Diego, November 11th-12th, 1971.

6. Codd, E. F. "Extending the Relational Database Model to Capture More Meaning." IBM Research Report RJ2599 (August 6th, 1979). Republished in ACM Transactions on Database Systems 4(4), December 1979.

7. Darwen, Hugh. "Relation-Valued Attributes; or, Will the Real First Normal Form Please Stand Up?", in C. J. Date and Hugh Darwen, Relational Database Writings 1989-1991. Addison-Wesley, 1992.

8. Date, C. J. An Introduction to Database Systems (6th edition). Addison-Wesley, 1995. (Note: The third edition of this book [mentioned in the body of this article] had a copyright date of 1981.)

9. Date, C. J. "The Primacy of Primary Keys: An Investigation." InfoDB 7(3), Summer 1993. Republished in C. J. Date, Relational Database Writings 1991-1994, Addison-Wesley, 1995.

10. Date, C. J. "The Birth of the Relational Model" (Part 3). Intelligent Enterprise 1(4), December 15, 1998.

11. Date, C. J. "Codd's Relational Algebra." Intelligent Enterprise 2(1), Online Only Feature, January 5, 1999.

12. Heath, I. J. Heath. "Unacceptable File Operations in a Relational Database." Proc. 1971 ACM SIGFIDET Workshop on Data Description, Access, and Control, San Diego, November 11th-12th, 1971.





IE Weekly Newsletter
Subscribe to the newsletter
    Email Address