Welcome Guest. | Log In| Register | Membership Benefits

Intelligent Enterprise

Better Insight for Business Decisions

Intelligent Enterprise - Better Insight for Business Decisions
search Intelligent Enterprise
Home
Digital Library
Events
RSS | Newsletters
Webcasts


December 21, 1999, Volume 2 - Number 18

Auxiliary Angst

SQL Techniques for Manipulating Tables

Every now and then I like to do a column which is strictly for programmers and geeks. It gives me a chance to get back to my roots and pass along tricks that might be useful to people doing real work.

There is a technique in SQL that I call auxiliary tables. The idea is that SQL is a lousy computational and procedural language, but it is good for doing table manipulations. Ergo, we need to play up the strength of the language.

I do not see this as a flaw, but the correct design decision. Mark Twain once said "The pumpkin is largely a failure as a shade tree" in one of his humorous essays. Fortran compiler writers worried about how to generate the tightest, most accurate code for doing algebra, while SQL compiler writers worry about the best code for doing joins. It is the programmer's fault if he keeps trying to use his pumpkin as a shade tree.

If you are really old, like me, you might remember that before calculators, you had to look up functions in tables in the back of your textbooks. Engineers looked up trig functions, finance majors looked up Net Present Values, and so forth. This is what an auxiliary table does.

I use the term "auxiliary" to stress that the table is not really part of the schema in the sense of representing an entity or relationship in the data model. The auxiliary table sits over to the side, and calculates a function for the queries.

The most common auxiliary tables are sequence tables and calendar tables of two types. The sequence table has the general declaration:

CREATE TABLE Sequence 
(nbr INTEGER NOT NULL PRIMARY KEY 
CONSTRAINT non_negative_nbr 
CHECK (nbr > 0) 
cardinal VARCHAR(100) NOT NULL, 
ordinal VARCHAR(100) NOT NULL, 
... 
CONSTRAINT numbers_are_complete 
CHECK ((SELECT COUNT(*) FROM Sequence) = 
(SELECT MAX(nbr) FROM Sequence)); 

with data like:

nbr cardinal ordinal ... 
======================================== 
1 'one' 'first' 
2 'two' 'second' 
3 'three' 'third' 
... ... ... 
101 'One hundred and one' 'One hundredth and first' 
... ... ... 

The name tells us everything. It is a list of all integers from 1 to large numbers. The other columns are simply examples of handy things that you might want to do with an integer, such as turn it into English words, which would be difficult in pure SQL.

I have found that is it a bad idea to start with zero, even though that seems more natural for computer programmers. The reason for omitting zero is that this auxiliary table is often used to provide row numbering by being CROSS JOIN-ed to another table and the zero would throw off the one-to-one mapping. As an example of the Sequence table in practice, assume that you are given a table with the monthly sales data shown as an attribute (the monthly amounts have to be NULL-able to hold missing values for the future), thus:

CREATE TABLE AnnualSales1 
(salesman CHAR(15) NOT NULL PRIMARY KEY, 
jan DECIMAL(5, 2), 
feb DECIMAL(5, 2), 
mar DECIMAL(5, 2), 
apr DECIMAL(5, 2), 
may DECIMAL(5, 2), 
jun DECIMAL(5, 2), 
jul DECIMAL(5, 2), 
aug DECIMAL(5, 2), 
sep DECIMAL(5, 2), 
oct DECIMAL(5, 2), 
nov DECIMAL(5, 2), 
"dec" DECIMAL(5, 2)); 

As a side note, the reason that "dec" is inside double quotes is that it is a reserved word in SQL-92 and the double quotes allow us to use it as a variable name (strings are enclosed in single quote marks). The goal is to "flatten" it out so that it looks like this:

CREATE TABLE AnnualSales2 
(salesman CHAR(15) NOT NULL PRIMARY KEY, 
month CHAR(3) NOT NULL 
CONSTRAINT valid_month_code 
CHECK (month IN ('Jan', 'Feb', 'Mar', 'Apr', 
'May', 'Jun', 'Jul', 'Aug', 
'Sep', 'Oct', 'Nov', 'Dec'), 
amount DECIMAL(5, 2) NOT NULL, 
PRIMARY KEY(salesman, month)); 

The trick is to build a VIEW of the original table with a number beside each month:

CREATE VIEW NumberedSales 
AS SELECT salesman, 
1 AS M01, jan, 
2 AS M02, feb, 
3 AS M03, mar, 
4 AS M04, apr, 
5 AS M05, may, 
6 AS M06, jun, 
7 AS M07, jul, 
8 AS M08, aug, 
9 AS M09, sep, 
10 AS M10, oct, 
11 AS M11, nov, 
12 AS M12, "dec" 
FROM AnnualSales1; 

Now you can use the auxiliary table of sequential numbers or you can use a VALUES table constructor to build one. The flatten VIEW is:

CREATE VIEW AnnualSales2 (salesman, month, amount) 
AS SELECT S1.salesman, 
(CASE WHEN A.nbr = M01 THEN 'Jan' 
WHEN A.nbr = M02 THEN 'Feb' 
WHEN A.nbr = M03 THEN 'Mar' 
WHEN A.nbr = M04 THEN 'Apr' 
WHEN A.nbr = M05 THEN 'May' 
WHEN A.nbr = M06 THEN 'Jun' 
WHEN A.nbr = M07 THEN 'Jul' 
WHEN A.nbr = M08 THEN 'Aug' 
WHEN A.nbr = M09 THEN 'Sep' 
WHEN A.nbr = M10 THEN 'Oct' 
WHEN A.nbr = M11 THEN 'Nov' 
WHEN A.nbr = M12 THEN 'Dec' 
ELSE NULL END), 
(CASE WHEN A.nbr = M01 THEN jan 
WHEN A.nbr = M02 THEN feb 
WHEN A.nbr = M03 THEN mar 
WHEN A.nbr = M04 THEN apr 
WHEN A.nbr = M05 THEN may 
WHEN A.nbr = M06 THEN jun 
WHEN A.nbr = M07 THEN jul 
WHEN A.nbr = M08 THEN aug 
WHEN A.nbr = M09 THEN sep 
WHEN A.nbr = M10 THEN oct 
WHEN A.nbr = M11 THEN nov 
WHEN A.nbr = M12 THEN "dec" 
ELSE NULL END) 
FROM AnnualSales AS S1 
CROSS JOIN 
(SELECT nbr FROM Sequence WHERE nbr <= 12) AS A(nbr); 

If your SQL product has derived tables, this can be written as a single VIEW query.

The Calendar Table

The Common Era calendar, formerly known as the Gregorian or Christian calendar before it became an ISO standard, is fairly irregular to start with. Furthermore, fiscal years and business days do not follow a pattern that is easy to calculate. This makes the calendar ideal for an auxiliary table.

The basic form of this table is given below; I will explain the columns in a minute.

CREATE TABLE Calendar 
(cal_date DATE NOT NULL PRIMARY KEY, 
julian_day INTEGER NOT NULL 
CONSTRAINT valid_julian_day 
CHECK (julian_day BETWEEN 1 AND 366), 
business_day INTEGER NOT NULL 
CONSTRAINT bus_day_flag 
CHECK (business_day IN (0, 1)), 
three_business_days DATE NOT NULL, 
fiscal_month INTEGER NOT NULL 
CONSTRAINT valid_month_nbr 
CHECK (fiscal_month BETWEEN 1 AND 12), 
fiscal_year INTEGER NOT NULL, 
...); 

But some events have a duration over a period of days. That leads to a second form of the Calendar table that maps a start and stop date pair into some value and has the general declaration:

CREATE TABLE EventCalendar 
(event VARCHAR(30) NOT NULL PRIMARY KEY, 
start_date DATE NOT NULL, 
end_date DATE, -- null means no ending date
..., 
CONSTRAINT started_before_endded 
CHECK (start_date <= end_date));

This is the most useful auxiliary table because the calendar is so irregular and there are so many different fiscal calendars in use.

As a demonstration of the irregularity, consider the SEC (Security and Exchange Commission) rule that a brokerage transaction must close within three business days. A business day does not include Saturdays, Sundays or holidays declared by the New York Stock Exchange. Oh, did I mention that in times of national emergency, the Federal Government can also close the exchange? You can compute the occurrences of Saturdays and Sundays with a "day of the week" library function (also known as Zeller's Congruence) in many SQL products, but not the holidays.

The data for the calendar table can be built with the help of a good spreadsheet, since spreadsheets usually have more temporal functions than databases.

Once we have a calendar table, what do we do with it? Let's assume that we have a table of sales slips that looks like this:

CREATE TABLE SalesSlips 
(salesman CHAR(15) NOT NULL PRIMARY KEY, 
sales_date DATE NOT NULL, 
item_nbr INTEGER NOT NULL REFERENCES Inventory(item_nbr), 
amount DECIMAL(10, 2) NOT NULL, 
...);

and we want to get a report on sales personnel performance based on the fiscal year. The query is simply a join:

SELECT S1.salesman, C1.fiscal_year, SUM(S1.amount) 
FROM SalesSlips AS S1. Calendar AS C1 
WHERE S1.sales_date = C1.cal_date 
GROUP BY S1.salesman, C1.fiscal_year;

A trick worth mentioning is the use of the NULL for an eternity token in the ending date of a duration. You will see this used in data warehouse applications where historical data is kept. For example, to find the price that was in effect for items on a particular date, you might write this query.

SELECT P1.item, P1.sales_date, P1.price 
FROM Price_History AS P1, Sales AS S1
WHERE S1.sales_date 
BETWEEN P1.start_date 
AND COALESCE (P1.end_date, S1.sales_date);

In effect, the COALESCE() function will return the last date that we know a particular price was in effect.

The Partitioning Table

The power of an auxiliary table can be illustrated with a problem that Francisco Moreno of Columbia sent me. We have a table of patient id numbers and the months, shown as numbers from 1 to 12, in which they visited us this past year.

CREATE TABLE Visits_1999 
(patient_nbr INTEGER NOT NULL 
REFERENCES Patients(patient_nbr), 
visit_month INTEGER NOT NULL 
CONSTRAINT valid_month_nbr 
CHECK (fiscal_month BETWEEN 1 AND 12)); 
INSERT INTO Visits_1999 VALUES (10, 1); 
INSERT INTO Visits_1999 VALUES (10, 2); 
INSERT INTO Visits_1999 VALUES (10, 3); 
INSERT INTO Visits_1999 VALUES (10, 4); 
INSERT INTO Visits_1999 VALUES (10, 5); 
INSERT INTO Visits_1999 VALUES (10, 8); 
INSERT INTO Visits_1999 VALUES (20, 2); 
INSERT INTO Visits_1999 VALUES (20, 3); 
INSERT INTO Visits_1999 VALUES (20, 5); 
INSERT INTO Visits_1999 VALUES (20, 6); 
INSERT INTO Visits_1999 VALUES (20, 7); 
INSERT INTO Visits_1999 VALUES (20, 7); 
INSERT INTO Visits_1999 VALUES (20, 8); 
INSERT INTO Visits_1999 VALUES (30, 6); 
INSERT INTO Visits_1999 VALUES (30, 7); 
INSERT INTO Visits_1999 VALUES (30, 9); 
INSERT INTO Visits_1999 VALUES (30, 9); 
INSERT INTO Visits_1999 VALUES (30, 11); 
INSERT INTO Visits_1999 VALUES (40, 1); 
INSERT INTO Visits_1999 VALUES (40, 2); 
INSERT INTO Visits_1999 VALUES (40, 5); 
INSERT INTO Visits_1999 VALUES (40, 5); 
INSERT INTO Visits_1999 VALUES (40, 5); 
INSERT INTO Visits_1999 VALUES (40, 8);

In fairness, this would probably be a VIEW and not a table in a real situation, but bear with me. We need a query that finds the id_numbers of the patients that have been to see us more than two times in any two consecutive months.

In the sample data, we can see that patient 10 is not selected because he had exactly two visits for any pair of consecutive months. Patient 20 saw us three times in months 6 and 7, and he also has three visits in months 7 and 8. He gets selected. Patient 30, had two visits in month 9, but he was not in to see us in month 10 or month 8. He must not be selected, since he does not meet the consecutive months criteria. Patient 40 he has three visits in month 5, although he was not in month 4 nor month 6, he has to be selected because he had more than two consultations, no matter if they were in the same month.

Moreno found several solutions, using Oracle. His first solution uses a UNION to build pairs of months to either side of the present month. Since the present month has at least one visit, the bordering months need to have more than two visits themselves.

SELECT DISTINCT patient_nbr 
FROM Visits_1999 AS a 
WHERE 2 
< ANY (SELECT COUNT(*) 
FROM Visits_1999 AS V2 
WHERE V1.patient_nbr = V2.patient_nbr 
AND V2.visit_month 
IN (V1.visit_month, V1.visit_month - 1) 
UNION 
SELECT COUNT(*) 
FROM Visits_1999 AS V2 
WHERE V1.patient_nbr = V2.patient_nbr 
AND V2.visit_month 
IN (V1.visit_month, V1.visit_month + 1));

His second solution is similar to solution number one, but avoids the UNION:

SELECT DISTINCT patient_nbr 
FROM Visits_1999 AS a 
WHERE 2 < (SELECT CASE 
WHEN (COUNT (CASE 
WHEN V2.visit_month 
= V1.visit_month + 1 
THEN NULL ELSE 1 END)) 
>= (COUNT (CASE 
WHEN V2.visit_month 
= V1.visit_month - 1 
THEN NULL ELSE 1 END)) 
THEN (COUNT (CASE 
WHEN V2.visit_month 
= V1.visit_month + 1 
THEN NULL ELSE 1 END)) 
ELSE (COUNT (CASE 
WHEN V2.visit_month 
= V1.visit_month - 1 
THEN NULL ELSE 1 END)) 
FROM Visits_1999 AS V2 
WHERE V1.patient_nbr = V2.patient_nbr 
AND V2.visit_month 
IN (V1.visit_month - 1, 
V1.visit_month, 
V1.visit_month + 1)); 

In the original code, Moreno used the GREATEST() function which is an Oracle extension. I translated it into the outermost CASE expression to make the code SQL-92 compliant. Frankly, the LEAST() and GREATEST functions are very useful and I wish they were standard.

Moreno's best answer in terms of performance was this one.

SELECT patient_nbr 
FROM Visits_1999 
GROUP BY patient_nbr 
HAVING COUNT (CASE WHEN visit_month IN (1,2) 
THEN 1 ELSE NULL END) > 2 
OR COUNT (CASE WHEN visit_month IN (2,3) 
THEN 1 ELSE NULL END) > 2 
OR COUNT (CASE WHEN visit_month IN (3, 4) 
THEN 1 ELSE NULL END) > 2 
OR COUNT (CASE WHEN visit_month IN (4,5) 
THEN 1 ELSE NULL END) > 2 
OR COUNT (CASE WHEN visit_month IN (5,6) 
THEN 1 ELSE NULL END) > 2 
OR COUNT (CASE WHEN visit_month IN (6, 7) 
THEN 1 ELSE NULL END) > 2 
OR COUNT (CASE WHEN visit_month IN (7, 8) 
THEN 1 ELSE NULL END) > 2 
OR COUNT (CASE WHEN visit_month IN (8, 9) 
THEN 1 ELSE NULL END) > 2 
OR COUNT (CASE WHEN visit_month IN (9, 10) 
THEN 1 ELSE NULL END) > 2 
OR COUNT (CASE WHEN visit_month IN (10, 11) 
THEN 1 ELSE NULL END) > 2 
OR COUNT (CASE WHEN visit_month IN (11, 12) 
THEN 1 ELSE NULL END) > 2;

He had a fourth answer, which was a theoretical exercise to see if a self-join would work. It did, but performance was horrible and the code was both highly proprietary and convoluted, so I will not translate and reproduce it. Instead, I will offer a prize book for the first three solutions I get in my email which use a self-join and is not too proprietary.

Let's go back and look at Moreno's best answer. The chain of COUNT() functions are really looking at a partition of the months into pairs and counting the visits in each partition. Hey, we can put all of that into an auxiliary table, like this.

CREATE TABLE TwoMonths 
(start_visit_month INTEGER NOT NULL, 
end_visit_month INTEGER NOT NULL, 
period CHAR(7) NOT NULL, 
CHECK (start_visit_month < end_visit_month); 
INSERT INTO TwoMonths VALUES (1, 2, 'jan-feb'); 
INSERT INTO TwoMonths VALUES (2, 3, 'feb-mar'); 
INSERT INTO TwoMonths VALUES (3, 4, 'mar-apr'); 
INSERT INTO TwoMonths VALUES (4, 5, 'apr-may'); 
INSERT INTO TwoMonths VALUES (5, 6, 'may-jun'); 
INSERT INTO TwoMonths VALUES (6, 7, 'jun-jul'); 
INSERT INTO TwoMonths VALUES (7, 8, 'jul-aug'); 
INSERT INTO TwoMonths VALUES (8, 9, 'aug-sep'); 
INSERT INTO TwoMonths VALUES (9, 10, 'sep-oct'); 
INSERT INTO TwoMonths VALUES (10, 11, 'oct-nov'); 
INSERT INTO TwoMonths VALUES (11, 12, 'nov-dec');

The name of the periods is informative and while the original problem does not call for it, displaying it could be useful. This gives us the final query.

SELECT A1.patient_nbr, T1.period
FROM Table_a AS A1, TwoMonths AS T1 
WHERE A1.visit_month 
BETWEEN T1.start_visit_month AND T1.end_visit_month
GROUP BY A1.patient_nbr, T1.period 
HAVING COUNT (T1.period) > 2;

If you want to show only the patient numbers, then change the SELECT clause to read "SELECT DISTINCT A1.patient_nbr" and you have it.


Joe Celko is an Atlanta-based independent consultant. He is the author of three books on SQL -- SQL For Smarties (Morgan-Kaufmann, 1995), SQL Puzzles and Answers (Morgan-Kaufmann, 1995), and Instant SQL Programming (Wrox Press, 1997) -- and wrote the SQL for Smarties column for DBMS magazine. You can contact him via email at www.celko.com or 71062.1056@compuserve.com.





IE Weekly Newsletter
Subscribe to the newsletter
    Email Address