Welcome Guest. | Log In| Register | Membership Benefits

Intelligent Enterprise

Better Insight for Business Decisions

Intelligent Enterprise - Better Insight for Business Decisions
search Intelligent Enterprise
Home
Digital Library
Events
RSS | Newsletters
Webcasts


January 20, 2000 Volume 3 - Number 2


DBAs and application programmers might rip out less of their hair

SQL-99’s New OLAP Functions


Richard Winter                

Have you ever noticed how difficult it is to write a SQL query that computes the percentage change in values between, say, this month and a year ago?

How about one that gets multiple rankings from the same set of query results (such as ranking sales by territory within one country and also globally?). Have you been frustrated by the difficulty of writing SQL to compute moving averages, cumulative sums, and other statistical functions that managers routinely request? Do these problems make you grab handfuls of your hair and think about yanking hard?

Well, don’t sign up with the Hair Club just yet. ANSI has adopted a set of online analytic processing (OLAP) functions as an amendment to SQL-99 that will enable these calculations as well as many others that used to be impossible or impractical within SQL.

IBM and Oracle jointly proposed these useful extensions in early 1999. Thanks to ANSI’s uncommonly rapid (and praiseworthy) actions, they’re already part of the standard. Just as remarkably, IBM implemented portions of the specifications in DB2 UDB 6.2, which was commercially available in some forms as early as mid-1999. Oracle8i version 2 and DB2 UDB 7.1, both released in late 1999, contain beefed-up implementations. Also significant is that other vendors got involved early in the process; contributors included tool vendors Brio, MicroStrategy, and Cognos and database vendor Informix, among others. In fact, it’s noteworthy that IBM and Oracle managed to put competitive differences aside in order to jointly start up the whole effort, which looks like a major win for users and a range of industry participants.

This refreshing development resulted from a team comprising various companies’ people, but Dr. Hamid Pirahesh of IBM’s Almaden Research Laboratory played a particularly important role. Pirahesh both exercised a notable technical influence on the concepts and took the lead in fostering cooperation between the companies, which resulted in rapid standardization and commercialization. After his team had researched the subject for about a year and come up with an approach to extending SQL in this area, he called Oracle. The companies then learned that each had independently done some significant work. With Andy Witkowski playing a pivotal role at Oracle, the two companies hammered out a joint standards proposal in about two months. They then invited input from leading tool vendors and approached ANSI. As work progressed during the year, more vendors lent their support, and some skilled work within the ANSI committees, by Jim Melton and others, resulted in improvements and rapid adoption.

Meanwhile, it bears mentioning that Red Brick was actually the first to implement several of the essential functions, albeit in advance of a standard and in a less complete form than what has recently been standardized.

I drew the following example, as well as much of the technical information in this article, from the ANSI document “Introduction to OLAP Functions” by Fred Zemke, Krishna Kulkarni, Andy Witkowski, and Bob Lyle. I made a few changes to the examples to suit my own style of presentation. Conclusions and opinions, of course, are my own.

Example Query

To set up an example, suppose that a business has several years’ sales history data with a row for each sales region for each month. Examples of regions are “Northeast,” “Northwest,” and “North Central.” Months are represented by the four-digit year followed by the two-digit month. (“199806” is June 1998). So a row in the sales history would look like: Northeast, 199806, 25, indicating that 25 units sold in the Northeast region in June 1998.

Now suppose you need to know, for each month and region, the sales for the month and the average sales for the three-month period ending with that month. For example, for the Northeast in June 1998, you want the sales for June and the average sales for the April-June period. (This measure is called the “moving average.”) The query results would therefore look like Figure 1.


FIGURE 1 Results for example query.

It is painful to contemplate the SQL previously required to produce this result. But with the new OLAP functions in SQL-99 as amended, you can express it thus:

SELECT Sh.Region, Sh.Month, Sh.Sales,

AVG (Sh.Sales)

OVER (PARTITION BY Sh.Region

ORDER BY Sh.Month ASC

ROWS 2 PRECEDING )

AS Moving_average

FROM Sales_history AS Sh

ORDER BY Sh.Month ASC;

Here, AVG (Sh.Sales) OVER (PARTITION BY…) is an OLAP function. The construct inside the parentheses defines the “window” of data to which AVG is applied.

Any SQL aggregate function could be used in place of AVG here. Thus, SUM, COUNT, and a wide variety of other aggregate functions are permitted.

The WINDOW construct defines a partitioned set of rows to which the aggregate function is applied. Conceptually, the WINDOW construct is saying to take the set of rows in Sales_History and:

1. Partition it by region

2. Order it by month

3. Group each row with the two preceding rows in the same region.

Thus, first it takes all the rows for the Northeast region and brings them together to form a partition. If there was sales history data for the years 1990-1999, there would be 120 rows in the partition, one for each month of that period. After the ordering is complete, these rows would start with January 1990 and end with December 1999. All of this should sound familiar so far.

The third step, though, is not a familiar SQL concept. Its effect is to replace the 120 rows in the query output with 120 groups of rows. Each of these groups includes one of the original rows plus the two preceding it. Therefore, the group corresponding to March 1990 includes the February and January rows.

(Note that these “steps” I describe here are not necessarily carried out by the database engine. Rather, the query is processed to get the same result as if they had been carried out.)

So, the WINDOW construct defines 120 groups of rows and an average is computed over each of those groups. That average is output with each row of the query result as the item named Moving_Average.

The Window Construct

The innovative WINDOW construct has three main parts: partitioning, ordering, and aggregation grouping.

A set of column names specifies the partitioning, which is applied to the rows that the preceding FROM, WHERE, GROUP BY, and HAVING clauses produced. If no partitioning is specified, the entire set of rows composes a single partition and the aggregate function applies to the whole set each time. Though the partitioning looks like a GROUP BY, it is not the same thing. A GROUP BY collapses the rows in a partition into a single row. The partitioning within a WINDOW, though, simply organizes the rows into groups without collapsing them.

The ordering within the WINDOWconstruct is like the ordering elsewhere in SQL: It includes a list of sort keys and indicates whether they should be sorted ascending or descending. The important thing to understand is that ordering is applied only within each partition.

The aggregation grouping is the most novel element of the specification, in my view. With aggregation grouping you can define a set of rows on which the aggregate operates for each row in the partition. Thus, in our example, for each month, you specify the set including it and the two preceding rows.

There are two main types of aggregation groups: physical and logical. In physical grouping, you count a specified number of rows — forward, backward, or both — in the ordering to form your group. The preceding example used physical grouping. In logical grouping, you include all the data in a certain interval, defined in terms of a quantity that can be added to, or subtracted from, the sort key. For instance, you create the same group whether you define it as the current month’s row plus:

• The two preceding rows or

• Any row containing a month no less than two months earlier.

Physical grouping works well for contiguous data; logical grouping works better for data that is sparse or irregular in frequency. Physical grouping works for a larger variety of sort-key types than logical grouping, because it does not require arithmetic operations on sort-key values; logical grouping works only if you can do arithmetic on the sort-key values (such as numeric quantities and dates).

There are other points of comparison, but in general, these two approaches complement each other well. In most situations, one or the other is going to work neatly. The specification provides an extensive and powerful set of options for specifying aggregation groupings that ought to be sufficient to support broad application of the window capability.

I could say much more about the OLAP functions. The ANSI document introducing the concepts runs 38 pages. The ANSI proposal for the specification runs 60. So, this is not a simple little change — it is a powerful aggregation (if you’ll excuse the pun) of capabilities. And there is some work involved in fully understanding them and how they relate to other concepts in SQL.

Implications

I believe that the OLAP functions are a major step forward for SQL and therefore for database users. The functions let you perform commercially important statistical operations in a single query within the database engine, when before, you either had to run multiple queries or perform them outside the engine entirely. This change has the following implications:

It becomes practical for many more users to specify the calculations and therefore get them done. In the past, it was often impractical or impossible to get these calculations specified at all, unless a high level of expertise was available; now, because the language constructs are there, some users will master them and use them directly. More important, many more tools will implement them, presumably providing end users a way of readily visualizing and specifying them, perhaps graphically.

It becomes practical for database engines to perform the calculations much more efficiently because the analytic functions are specified at a high level of abstraction. Nonprocedurally, the engine can recognize the overall intent of the operation, optimize data access and calculation, and parallelize. Also, it has the opportunity to integrate the analytic function with other elements of the query, therefore achieving yet higher degrees of optimization.

These efficiency benefits are more critical when data volumes are large. When large volumes of data have to be moved out of the engine and analyzed by end-user tools or applications, serious problems result. Generally, outside the database engine, the optimization and parallelization techniques are not as comprehensive or advanced. Statistical functions often reduce data, so it’s poor architecture to move a huge volume of data out of the engine to reduce it when it could more efficiently be reduced in place. The OLAP functions therefore eliminate or reduce an important class of scalability problems.

Norm Sun, VLDB Engineer at MicroStrategy, welcomes the OLAP functions. Operations that otherwise would be performed in the MicroStrategy analytic engine can now been done inside the database, with greater efficiency. In principal, MicroStrategy should be able to build and maintain less code because of the functions the database engines provide. In addition, some calculations that would be multipass in the outboard analytic engine can be performed in one pass within a database. Mr. Sun notes that it was particularly helpful that IBM and Oracle, in formulating their initiative, sought input from MicroStrategy and other tools companies early in the process.

Because these functions are now standard, they’re now much more economical for tool and application vendors to exploit. Thus, a vendor can generate SQL that uses an OLAP function, expecting that leading database products, if not ultimately all database products, will implement it — and do so in accordance with the standard.

Perhaps most important of all is the point that the OLAP functions in the SQL-99 Amendment 1 have significant value in the management of virtually all types of enterprise activity. These functions facilitate activities involving ranking, percentage increases, moving averages, and cumulative sums, to name some simple examples. OLAP functions are the stuff of everyday business, important to many users. And thanks to adept management of this technical advance by IBM, Oracle, ANSI, and a cast of supporting players, we have them available in a standard and a rapidly growing assortment of products.

Richard Winter is a specialist in large data base technology and implementation, and is president of Boston-based Winter Corp. You can reach him via email at Richard.Winter@wintercorp.com or by fax at 617-338-4499.

RESOURCES

 

ANSI: www.ansi.org
Brio: www.brio.com
Cognos: www.cognos.com
IBM: www.ibm.com
Informix: www.informix.com
Introduction to OLAP Functions:
ftp://jerry.ece.umassd.edu/isowg3/x3h2/1999docs, catalog# 99-154
MicroStrategy: www.microstrategy.com
Oracle: www.oracle.com

 





IE Weekly Newsletter
Subscribe to the newsletter
    Email Address