Welcome Guest. | Log In| Register | Membership Benefits

Intelligent Enterprise

Better Insight for Business Decisions

Intelligent Enterprise - Better Insight for Business Decisions
search Intelligent Enterprise
Home
Digital Library
Events
RSS | Newsletters
Webcasts




June 30, 2003

Real Time: Get Real, Part II

Start by discarding your current concepts of ETL

by Neil Raden
edited by Ralph Kimball

Making the assumption that the primary database for the real-time data warehouse (RTDW) is a relational database, updating one in real time doesn't, on the face of it, seem like a big problem. After all, relational databases have been tuned for OLTP for years, allowing tens of thousands of updates per second. How could this be a problem for a data warehouse?

It is, because these databases designed for high-performance OLTP don't have active queries running against them. Indexing is light or nonexistent. Their highly normalized schemas often make operations affect only one table. The operations themselves are typically a single record, even if there are thousands of them per second. In addition, the two-phase commit process requires expensive overhead in maintaining a rollback log to recover from failures.

And, most important, the transaction speed measured is usually only a part of the total process. In the same way that a data warehouse load has to append new records to dimension and fact tables, these high-volume transaction systems do a lot of behind-the-scenes, after-the-fact, and even batch processing with those transactions. The whole application is kept coherent with these real-time updates by roping off the parts that operate in real time. Think about buying an airline ticket online and getting an instantaneous locator number. That's real time. But how long does it take to get an email notification? About a million transactions later.

In the past, in order to meet our refresh cycle window of eight hours or so, we in the data warehouse would violate many of these rules. For example, in order to get adequate query performance, we would build many indexes, create aggregate tables, turn logging off, and load large amounts of data via bulk loaders. These approaches aren't helpful when trying to update a database intraday while people are using it. It would create an unacceptable amount of downtime.

As you know, the primary schema for a data warehouse is either a star schema or a "normalized" schema. The latter is a term so loosely defined that it's hard to describe, but a normalized schema typically resembles a third (or higher) normal form (3NF) schema that's not dimensional. These 3NF designs don't support query and analysis. Their sole purpose is to act as a staging area, an upstream data repository for a series of star schemas, online analytic processing (OLAP) cubes, and other structures that are directly queried by analysts. The only routine exception to this is Teradata implementations: Because of the unique characteristics of the massively parallel architecture and database optimizer, Teradata can process analytical SQL against a 3NF schema with acceptable performance.

Because the 3NF approach is only a first step, the argument that schemas of this kind are perfect for data warehousing is misleading. The real-time landscape of an RTDW includes the whole process, or at least the process up to and including the physical structures that people query. Because the 3NF-centric approach just adds overhead, it is not, at this time, a good candidate for RTDW.

The star schema, on the other hand, can provide the one-stop load-and-query approach we need. Assuming there's just one RTDW database that's updated and queried directly (not four databases, as in the case of an operational data store, 3NF staging area, star schema, and multidimensional OLAP platform), we can assume that the time to add an incremental load is limited to the time to append new records, modify or rebuild aggregates, and refresh indexes — nothing else downstream. That's still a lot of work to do, but at least it's not being done twice. However, the star schema looks to be, on the surface, a challenge for this type of work because of its heavy indexing and denormalizations.

Setting up a star schema for real-time update is straightforward if you follow a few rules. Though it's possible to do this with hand-coded ETL, it's much easier to use an ETL tool that supports, at minimum, the following:

  • Slowly changing dimensions
  • Automatic generation of surrogate keys
  • Automatic calculation and update of aggregates.

Real-time trickle feed from the ETL tool — in other words, the ability for the ETL engine to run continuously and feed information as it appears — is needed in those cases where the application calls for continuous feed. However, if you're building a process that's still a batch update, even with short (say, five-minute) intervals, this feature isn't needed.

You may need to modify your data model to make this work. First of all, the "real-time" data you map to the data warehouse has to be at the same grain as the data warehouse. The reasoning is simple: If you have fact tables at that level of detail, appending new information in near real time is simply a matter of appending new records to existing fact tables (and other types of tables, too, of course). Ralph Kimball described this process in a column in 1997. (See www.dbmsmag.com/9712d05.html.)

Here are some of the popular choices for designing an RTDW:

  • A "trickle and flip" process
  • Table partitioning
  • Real-time partitions.








IE Weekly Newsletter
Subscribe to the newsletter
    Email Address