|
|
|||
|
http://www.intelligententerprise.com/010216/webhouse1_1.jhtml
|
|||
>Few queries and reports perform any calculation fancier than summing or counting, with the occasional ratio thrown in for excitement. Query and reporting tool vendors have done an excellent job of making the simple things easy. They do a nice job of providing functionality that is missing from SQL, such as subtotals and market contributions. Some can perform more complex calculations, but doing so requires extracting data from the database onto a client desktop or middle-tier server.
In a lot of cases, this multitiered design is not a problem. Many companies have good internal network bandwidth, and power analysts seem to prefer having local data sets that they can play with as they wish. Yet to make greatest use of the power analysts' efforts, the output from their work must be fed back into the operations of the enterprise systems. Such closed-loop systems are becoming increasingly common in CRM, webhouses, system operations, and many other applications.
We are led down this path by the predictable choice of SQL as the query and analysis language for the data warehouse. But SQL is not rich and flexible enough to do analysis on its own, so our best analysts use it to pull data out of the warehouse and into their preferred tools. This creates the multitiered architecture.
The OLAP extensions to SQL-99 reduce this problem, but SQL is fundamentally not an analytic language. We need an analytic engine and language that are tightly integrated with the data store, yet robustly designed for scalability and programmability. In this way, we can avoid convoluted programming and the awkward multitiered architecture.
Can SQL do the job for production reporting applications? Is SQL an analytic language? No, no, and again no. It requires convoluted syntax to perform trivial "analytics," such as market share, moving average, rank, percentile, delta, or standard deviation. SQL, as its name (Structured Query Language) says, is a query language.
SQL-92 is fairly flexible, and can perform more computations than most people realize. As I'll describe later, a SQL expert could write a query that computes market share and moving averages. However, it seems highly unlikely that this is something you'd really choose to do, and you'd have to write the query by hand, as no query tool is going to provide much help. Let's choose a simple example.
As everyone who has looked at an investment Web site must know, moving averages are a common tool in financial analysis. A moving average, very simply, is the average of a measure over a rolling window. For example, to calculate a three-day moving average of a stock price you'd average the price for today, yesterday, and the day before. Standard deviation, a measure of volatility, is also a common measure in investment analysis. As a starting point for our discussion, assume you want to compute the moving average and standard deviation of stock prices over appropriate time intervals.
A fundamental characteristic of a relational database is that the order of rows in a table has no meaning. SQL can order rows on output, but there is no syntax for ordering rows so as to operate on them in a query. To perform a moving average calculation in SQL-92, you'd need to join the fact table to itself three times. A 200-day moving average would require a 200 way self join, which is ridiculous.
Analysts have been forced to abandon SQL, using it only to extract data from the warehouse and feed it into their personal systems.
What about the much anticipated OLAP extensions to SQL-99? They are a genuine improvement to the query language. The primary feature of the OLAP extensions is the WINDOW clause. The WINDOW clause is designed to solve exactly this sort of problem: It's a way to specify in the query that you want to perform an action over the set of rows {today, yesterday, day before}.
Listing 1 shows the basic query drawn from the schema shown in Figure 1 using SQL-99 syntax.
The PARTITION BY clause is similar to a GROUP BY clause. It tells the "window" to start over when it sees a new stock. Without this clause, you'd be intermingling data from different companies. The ORDER BY clause ensures that you include the correct rows in the window. Remember that the relational database is inherently agnostic about the ordering of rows.
The ROWS 2 PRECEDING clause specifies that you want to perform some action on the current row and the two previous rows. A 200-day window, then, would simply replace the ROWS phrase with ROWS 199 PRECEDING. Finally, it's the AVG and STD functions in the SELECT list that indicate what action you want to perform on sets of rows you've grouped together in the window.
The SQL-99 syntax supports multiple windows in a query. Extending this query to compute moving average over one range, and standard deviation over a second range, would be as straightforward as defining a second WINDOW clause.
Just because your database engine implements the SQL-99 OLAP extensions doesn't mean it does so well. All it means to conform to the ANSI syntax standards is that your database engine accepts the syntax and returns correct results. Underneath the covers, the engine might be joining the fact table to itself.
What if you wanted to look at investment returns rather than prices? The desired output is a series of each stock's daily return and standard deviation, computed relative to the beginning of the year. Consider the query in Listing 2 (see page 26), derived from Figure 1, which employs a subquery to get the starting price used to calculate the investment return.
There are several additional common computations that the new SQL-99 OLAP extensions support. The following items are excruciatingly difficult to generate without the WINDOW clause:
It's important to note that neither example illustrated in this column is what I would call a complex question. These are complex queries that answer very simple questions.
The addition of the WINDOW clause to the SQL standard greatly increases the usefulness of the SQL language for numerical analysis. Analysts will receive the syntax gratefully, and I expect the burden of computation to shift somewhat toward the database engine and away from the end-user tool. But there are several types of analytic problems that are not addressed by the standard; two of which are:
These are second-order issues. The fundamental problem with using SQL - even SQL-99 - to perform complex computations is that it is just too hard to formulate the queries. Subqueries and HAVING clauses have been part of SQL for many years, yet the syntaxes are poorly supported by query tools. Unfortunately, I suspect it will be several years before we see many products with an intuitive interface for building ad hoc queries that use the new SQL-99 OLAP extensions.
The answer is to place an analytic tier between the data store and the users. This analytic tier will provide the kinds of functionality that Ralph Kimball described in his "dimensionally friendly" criteria. (See his Intelligent Enterprise columns "Is Your Dimensional Data Warehouse Expressive?" May 15, 2000, and "Rating Your Dimensional Data Warehouse," April 28, 2000.) For the purposes of the current discussion, I'm agnostic about whether the underlying data store is in the relational database or another format, although if the underlying database is relational, it should support the OLAP extensions. The analytic criteria should include:
The OLAP products on the market today address these criteria with varying, but generally good, success. OLAP technology is most widely known for solving the "aggregate navigation" problem, but it also provides an effective platform for interesting analytics. All OLAP products provide significantly greater flexibility for analysis than SQL does, at a cost: Each uses its own proprietary query or calculation language.
For example, Microsoft's analytic language MDX (from "multidimensional expressions") provides a facility for the power analyst to store an analytic expression in the Analysis Services database. The computed member, as these stored expressions are called, can be staggeringly complex: They can even be defined to call out to an external module that would perform multivariate time series analysis. All the functionality I discuss in this column is natively supported.
Returning to the business problem in Listing 1, the power analyst would define in the database a computed member, Price Moving Average as:
Avg(Time.CurrentMember.Lag(2):Time.
CurrentMember, [Closing Price])
The casual user of the system can drag this measure into a query. It will work correctly regardless of the requested level of aggregation and no matter which dimensions are displayed. You could create a more complex expression, transparently creating a 50-day moving average if the measure is viewed on the day level; a 10-week moving average at the week level; and a three-month moving average for data viewed by month.
Listing 3 (see page 28) shows a full query that uses this predefined measure.
No matter how complex the MDX of the computed measure, its use is straightforward. The MDX-enabled query and reporting tool can be designed to graphically present elements on rows, columns, and pages; the numeric measures that fill in the grid are computed by the Analysis Services code.
I don't mean to trivialize the complexity of MDX and the other analytic languages. They are difficult syntaxes to master. I have seen MDX that is breathtakingly complex, and there are few tools to help develop analytic measures. Again, the saving grace is that the diff icult work needs be done only once, and is easily shared with many.
Looking to the future, we can expect to see a broader use of analytic systems in closed-loop applications. Query and reporting requirements are well served by languages and programming interfaces that were designed to support ad hoc analysis. Similarly, the "graduate level" statistical requirements of closed-loop systems call for analysis systems vendors to incorporate data-mining APIs into their products. And the data-mining algorithm developers must take this movement seriously, or risk having their tool set dismissed as too difficult to use and too awkward to integrate into operational systems.
Joy Mundy (joy@microsoft.com) was one of those "power users" until the early 1990s when she joined data warehousing on the IT side. She has built, managed, analyzed, consulted, and productized data warehouses since that time. Joy currently works on Microsoft's SQL Server product team.
RESOURCESSQL-99 specification (SQL-3):www.anis.orgOLAP extensions discussion:ftp://jerry.ece.umassd.edu/isowg3/x3h2/1999docs,document99-154 |