|
December 21, 1999, Volume 2 - Number 18 Auxiliary AngstSQL Techniques for Manipulating TablesEvery now and then I like to do a column which is strictly for programmers and geeks. It gives me a chance to get back to my roots and pass along tricks that might be useful to people doing real work.
I do not see this as a flaw, but the correct design decision. Mark Twain once said "The pumpkin is largely a failure as a shade tree" in one of his humorous essays. Fortran compiler writers worried about how to generate the tightest, most accurate code for doing algebra, while SQL compiler writers worry about the best code for doing joins. It is the programmer's fault if he keeps trying to use his pumpkin as a shade tree. If you are really old, like me, you might remember that before calculators, you had to look up functions in tables in the back of your textbooks. Engineers looked up trig functions, finance majors looked up Net Present Values, and so forth. This is what an auxiliary table does. I use the term "auxiliary" to stress that the table is not really part of the schema in the sense of representing an entity or relationship in the data model. The auxiliary table sits over to the side, and calculates a function for the queries. The most common auxiliary tables are sequence tables and calendar tables of two types. The sequence table has the general declaration:
CREATE TABLE Sequence (nbr INTEGER NOT NULL PRIMARY KEY CONSTRAINT non_negative_nbr CHECK (nbr > 0) cardinal VARCHAR(100) NOT NULL, ordinal VARCHAR(100) NOT NULL, ... CONSTRAINT numbers_are_complete CHECK ((SELECT COUNT(*) FROM Sequence) = (SELECT MAX(nbr) FROM Sequence)); with data like:
nbr cardinal ordinal ... ======================================== 1 'one' 'first' 2 'two' 'second' 3 'three' 'third' ... ... ... 101 'One hundred and one' 'One hundredth and first' ... ... ... The name tells us everything. It is a list of all integers from 1 to large numbers. The other columns are simply examples of handy things that you might want to do with an integer, such as turn it into English words, which would be difficult in pure SQL.
I have found that is it a bad idea to start with zero, even though that seems
more natural for computer programmers. The reason for omitting zero is that
this auxiliary table is often used to provide row numbering by being
CREATE TABLE AnnualSales1 (salesman CHAR(15) NOT NULL PRIMARY KEY, jan DECIMAL(5, 2), feb DECIMAL(5, 2), mar DECIMAL(5, 2), apr DECIMAL(5, 2), may DECIMAL(5, 2), jun DECIMAL(5, 2), jul DECIMAL(5, 2), aug DECIMAL(5, 2), sep DECIMAL(5, 2), oct DECIMAL(5, 2), nov DECIMAL(5, 2), "dec" DECIMAL(5, 2)); As a side note, the reason that "dec" is inside double quotes is that it is a reserved word in SQL-92 and the double quotes allow us to use it as a variable name (strings are enclosed in single quote marks). The goal is to "flatten" it out so that it looks like this:
CREATE TABLE AnnualSales2
(salesman CHAR(15) NOT NULL PRIMARY KEY,
month CHAR(3) NOT NULL
CONSTRAINT valid_month_code
CHECK (month IN ('Jan', 'Feb', 'Mar', 'Apr',
'May', 'Jun', 'Jul', 'Aug',
'Sep', 'Oct', 'Nov', 'Dec'),
amount DECIMAL(5, 2) NOT NULL,
PRIMARY KEY(salesman, month));
The trick is to build a
CREATE VIEW NumberedSales AS SELECT salesman, 1 AS M01, jan, 2 AS M02, feb, 3 AS M03, mar, 4 AS M04, apr, 5 AS M05, may, 6 AS M06, jun, 7 AS M07, jul, 8 AS M08, aug, 9 AS M09, sep, 10 AS M10, oct, 11 AS M11, nov, 12 AS M12, "dec" FROM AnnualSales1;
Now you can use the auxiliary table of sequential numbers or you can use a
CREATE VIEW AnnualSales2 (salesman, month, amount) AS SELECT S1.salesman, (CASE WHEN A.nbr = M01 THEN 'Jan' WHEN A.nbr = M02 THEN 'Feb' WHEN A.nbr = M03 THEN 'Mar' WHEN A.nbr = M04 THEN 'Apr' WHEN A.nbr = M05 THEN 'May' WHEN A.nbr = M06 THEN 'Jun' WHEN A.nbr = M07 THEN 'Jul' WHEN A.nbr = M08 THEN 'Aug' WHEN A.nbr = M09 THEN 'Sep' WHEN A.nbr = M10 THEN 'Oct' WHEN A.nbr = M11 THEN 'Nov' WHEN A.nbr = M12 THEN 'Dec' ELSE NULL END), (CASE WHEN A.nbr = M01 THEN jan WHEN A.nbr = M02 THEN feb WHEN A.nbr = M03 THEN mar WHEN A.nbr = M04 THEN apr WHEN A.nbr = M05 THEN may WHEN A.nbr = M06 THEN jun WHEN A.nbr = M07 THEN jul WHEN A.nbr = M08 THEN aug WHEN A.nbr = M09 THEN sep WHEN A.nbr = M10 THEN oct WHEN A.nbr = M11 THEN nov WHEN A.nbr = M12 THEN "dec" ELSE NULL END) FROM AnnualSales AS S1 CROSS JOIN (SELECT nbr FROM Sequence WHERE nbr <= 12) AS A(nbr);
If your SQL product has derived tables, this can be written as a single The Calendar TableThe Common Era calendar, formerly known as the Gregorian or Christian calendar before it became an ISO standard, is fairly irregular to start with. Furthermore, fiscal years and business days do not follow a pattern that is easy to calculate. This makes the calendar ideal for an auxiliary table. The basic form of this table is given below; I will explain the columns in a minute.
CREATE TABLE Calendar (cal_date DATE NOT NULL PRIMARY KEY, julian_day INTEGER NOT NULL CONSTRAINT valid_julian_day CHECK (julian_day BETWEEN 1 AND 366), business_day INTEGER NOT NULL CONSTRAINT bus_day_flag CHECK (business_day IN (0, 1)), three_business_days DATE NOT NULL, fiscal_month INTEGER NOT NULL CONSTRAINT valid_month_nbr CHECK (fiscal_month BETWEEN 1 AND 12), fiscal_year INTEGER NOT NULL, ...); But some events have a duration over a period of days. That leads to a second form of the Calendar table that maps a start and stop date pair into some value and has the general declaration:
CREATE TABLE EventCalendar (event VARCHAR(30) NOT NULL PRIMARY KEY, start_date DATE NOT NULL, end_date DATE, -- null means no ending date ..., CONSTRAINT started_before_endded CHECK (start_date <= end_date)); This is the most useful auxiliary table because the calendar is so irregular and there are so many different fiscal calendars in use. As a demonstration of the irregularity, consider the SEC (Security and Exchange Commission) rule that a brokerage transaction must close within three business days. A business day does not include Saturdays, Sundays or holidays declared by the New York Stock Exchange. Oh, did I mention that in times of national emergency, the Federal Government can also close the exchange? You can compute the occurrences of Saturdays and Sundays with a "day of the week" library function (also known as Zeller's Congruence) in many SQL products, but not the holidays. The data for the calendar table can be built with the help of a good spreadsheet, since spreadsheets usually have more temporal functions than databases. Once we have a calendar table, what do we do with it? Let's assume that we have a table of sales slips that looks like this:
CREATE TABLE SalesSlips (salesman CHAR(15) NOT NULL PRIMARY KEY, sales_date DATE NOT NULL, item_nbr INTEGER NOT NULL REFERENCES Inventory(item_nbr), amount DECIMAL(10, 2) NOT NULL, ...); and we want to get a report on sales personnel performance based on the fiscal year. The query is simply a join:
SELECT S1.salesman, C1.fiscal_year, SUM(S1.amount) FROM SalesSlips AS S1. Calendar AS C1 WHERE S1.sales_date = C1.cal_date GROUP BY S1.salesman, C1.fiscal_year;
A trick worth mentioning is the use of the
SELECT P1.item, P1.sales_date, P1.price FROM Price_History AS P1, Sales AS S1 WHERE S1.sales_date BETWEEN P1.start_date AND COALESCE (P1.end_date, S1.sales_date);
In effect, the The Partitioning TableThe power of an auxiliary table can be illustrated with a problem that Francisco Moreno of Columbia sent me. We have a table of patient id numbers and the months, shown as numbers from 1 to 12, in which they visited us this past year.
CREATE TABLE Visits_1999 (patient_nbr INTEGER NOT NULL REFERENCES Patients(patient_nbr), visit_month INTEGER NOT NULL CONSTRAINT valid_month_nbr CHECK (fiscal_month BETWEEN 1 AND 12)); INSERT INTO Visits_1999 VALUES (10, 1); INSERT INTO Visits_1999 VALUES (10, 2); INSERT INTO Visits_1999 VALUES (10, 3); INSERT INTO Visits_1999 VALUES (10, 4); INSERT INTO Visits_1999 VALUES (10, 5); INSERT INTO Visits_1999 VALUES (10, 8); INSERT INTO Visits_1999 VALUES (20, 2); INSERT INTO Visits_1999 VALUES (20, 3); INSERT INTO Visits_1999 VALUES (20, 5); INSERT INTO Visits_1999 VALUES (20, 6); INSERT INTO Visits_1999 VALUES (20, 7); INSERT INTO Visits_1999 VALUES (20, 7); INSERT INTO Visits_1999 VALUES (20, 8); INSERT INTO Visits_1999 VALUES (30, 6); INSERT INTO Visits_1999 VALUES (30, 7); INSERT INTO Visits_1999 VALUES (30, 9); INSERT INTO Visits_1999 VALUES (30, 9); INSERT INTO Visits_1999 VALUES (30, 11); INSERT INTO Visits_1999 VALUES (40, 1); INSERT INTO Visits_1999 VALUES (40, 2); INSERT INTO Visits_1999 VALUES (40, 5); INSERT INTO Visits_1999 VALUES (40, 5); INSERT INTO Visits_1999 VALUES (40, 5); INSERT INTO Visits_1999 VALUES (40, 8);
In fairness, this would probably be a In the sample data, we can see that patient 10 is not selected because he had exactly two visits for any pair of consecutive months. Patient 20 saw us three times in months 6 and 7, and he also has three visits in months 7 and 8. He gets selected. Patient 30, had two visits in month 9, but he was not in to see us in month 10 or month 8. He must not be selected, since he does not meet the consecutive months criteria. Patient 40 he has three visits in month 5, although he was not in month 4 nor month 6, he has to be selected because he had more than two consultations, no matter if they were in the same month.
Moreno found several solutions, using Oracle. His first solution uses a
SELECT DISTINCT patient_nbr FROM Visits_1999 AS a WHERE 2 < ANY (SELECT COUNT(*) FROM Visits_1999 AS V2 WHERE V1.patient_nbr = V2.patient_nbr AND V2.visit_month IN (V1.visit_month, V1.visit_month - 1) UNION SELECT COUNT(*) FROM Visits_1999 AS V2 WHERE V1.patient_nbr = V2.patient_nbr AND V2.visit_month IN (V1.visit_month, V1.visit_month + 1));
His second solution is similar to solution number one, but avoids the
SELECT DISTINCT patient_nbr FROM Visits_1999 AS a WHERE 2 < (SELECT CASE WHEN (COUNT (CASE WHEN V2.visit_month = V1.visit_month + 1 THEN NULL ELSE 1 END)) >= (COUNT (CASE WHEN V2.visit_month = V1.visit_month - 1 THEN NULL ELSE 1 END)) THEN (COUNT (CASE WHEN V2.visit_month = V1.visit_month + 1 THEN NULL ELSE 1 END)) ELSE (COUNT (CASE WHEN V2.visit_month = V1.visit_month - 1 THEN NULL ELSE 1 END)) FROM Visits_1999 AS V2 WHERE V1.patient_nbr = V2.patient_nbr AND V2.visit_month IN (V1.visit_month - 1, V1.visit_month, V1.visit_month + 1));
In the original code, Moreno used the Moreno's best answer in terms of performance was this one.
SELECT patient_nbr FROM Visits_1999 GROUP BY patient_nbr HAVING COUNT (CASE WHEN visit_month IN (1,2) THEN 1 ELSE NULL END) > 2 OR COUNT (CASE WHEN visit_month IN (2,3) THEN 1 ELSE NULL END) > 2 OR COUNT (CASE WHEN visit_month IN (3, 4) THEN 1 ELSE NULL END) > 2 OR COUNT (CASE WHEN visit_month IN (4,5) THEN 1 ELSE NULL END) > 2 OR COUNT (CASE WHEN visit_month IN (5,6) THEN 1 ELSE NULL END) > 2 OR COUNT (CASE WHEN visit_month IN (6, 7) THEN 1 ELSE NULL END) > 2 OR COUNT (CASE WHEN visit_month IN (7, 8) THEN 1 ELSE NULL END) > 2 OR COUNT (CASE WHEN visit_month IN (8, 9) THEN 1 ELSE NULL END) > 2 OR COUNT (CASE WHEN visit_month IN (9, 10) THEN 1 ELSE NULL END) > 2 OR COUNT (CASE WHEN visit_month IN (10, 11) THEN 1 ELSE NULL END) > 2 OR COUNT (CASE WHEN visit_month IN (11, 12) THEN 1 ELSE NULL END) > 2; He had a fourth answer, which was a theoretical exercise to see if a self-join would work. It did, but performance was horrible and the code was both highly proprietary and convoluted, so I will not translate and reproduce it. Instead, I will offer a prize book for the first three solutions I get in my email which use a self-join and is not too proprietary.
Let's go back and look at Moreno's best answer. The chain of
CREATE TABLE TwoMonths (start_visit_month INTEGER NOT NULL, end_visit_month INTEGER NOT NULL, period CHAR(7) NOT NULL, CHECK (start_visit_month < end_visit_month); INSERT INTO TwoMonths VALUES (1, 2, 'jan-feb'); INSERT INTO TwoMonths VALUES (2, 3, 'feb-mar'); INSERT INTO TwoMonths VALUES (3, 4, 'mar-apr'); INSERT INTO TwoMonths VALUES (4, 5, 'apr-may'); INSERT INTO TwoMonths VALUES (5, 6, 'may-jun'); INSERT INTO TwoMonths VALUES (6, 7, 'jun-jul'); INSERT INTO TwoMonths VALUES (7, 8, 'jul-aug'); INSERT INTO TwoMonths VALUES (8, 9, 'aug-sep'); INSERT INTO TwoMonths VALUES (9, 10, 'sep-oct'); INSERT INTO TwoMonths VALUES (10, 11, 'oct-nov'); INSERT INTO TwoMonths VALUES (11, 12, 'nov-dec'); The name of the periods is informative and while the original problem does not call for it, displaying it could be useful. This gives us the final query.
SELECT A1.patient_nbr, T1.period FROM Table_a AS A1, TwoMonths AS T1 WHERE A1.visit_month BETWEEN T1.start_visit_month AND T1.end_visit_month GROUP BY A1.patient_nbr, T1.period HAVING COUNT (T1.period) > 2;
If you want to show only the patient numbers, then change the
|
Most Popular This Week
IE Weekly Newsletter
Subscribe to the newsletter
|
|
|




