Real Time: Get Real, Part IIStart by discarding your current concepts of ETLby Neil Raden Continued from Page 1 Trickle and FlipThe concept of trickle and flip is very simple (refer to "Real-Time Data Warehousing"). A copy is made of the fact table and given a name that isn't in the catalog of any query tool, so that it can't be queried. New data trickles in. (The word "trickle" is misleading, as it could be thousands of records per second.) These records are merely appended to the copy fact table. At certain intervals (say, every five minutes), the trickle is halted, the copy fact table is copied, renamed to the active fact table name (the active fact table is deleted), and the process begins anew. This method poses a scalability problem. If the fact table is very large, all of this copying can be too slow. In addition, a very large fact table with indexing might not be able to keep up with the "trickle." Trickle and flip is a useful technique where the frequency of refresh matches the time it takes to perform the flip and when the rate of records flowing in doesn't overwhelm the database's ability to insert them and update the indexes. For those situations where trickle and flip won't work, you have to use one of the partition methods. Table PartitioningMost relational databases have a table-partitioning feature. The concept is simple: If tables get too large, it becomes difficult to manipulate them. The larger the table, generally, the larger the indexes ... and unless the index can be pinned in memory, performance degrades very quickly. Before there was partitioning, we had to break the fact tables apart ourselves. Doing this required giving them different names, which created a lot of maintenance problems and a lot of problems with the BI tools. The beauty of a partition is that the database actually creates separate tables, but they all have the same name. Partitioning allows us to create very large tables that are handled internally by the database as a series of smaller ones, each with its own indexes. One technique that works particularly well is to partition the fact table on the time key, and to have the real-time load coming into the newest partition. Depending on the relational database you use, there are ways to rope this partition off so that it isn't visible to active queries. The process is similar to trickle and flip, except that no copying and renaming is necessary. When you are ready to add the partition to the active data warehouse, you just drop the ropes. There's a drawback here, too. Because the only logical partition is time, what's the interval? If it's daily, there's little utility in this approach. If it's less than one day, how much less? For the sake of argument, let's assume the partitions are added every 15 minutes from 8:00 a.m. to 8:00 p.m., five days per week. That's 48 partitions a day, 240 per week or roughly 12,000 per year. Now, if this data warehouse carries three years of data, you'd have 36,000 partitions. The whole point of partitions is to make large tables more manageable. For queries, for example, the idea is to avoid reading the entire table, or even the entire index, but rather to resolve the query by using only one or a few of the partitions. If the database has to sift through dozens or hundreds of partitions, not only will the lift from partitioning be lost, the performance will be vastly worse than it would be without them. One solution is to maintain a set of fact tables at the end of each day that are partitioned at a more reasonable level. The grain of the data may not change at all, but the awkward partitioning would exist only in that day's fact tables. Another alternative, but one that's shown only limited success, is to allow the updating tables to be fully participating partitions in the active tables. This strategy alleviates the need for fine-grained partitions, but it opens up the problem of fact tables (or at least the most current partitions) being updated and satisfying queries at the same time. Real-Time PartitionsRalph Kimball described real-time partitions in an earlier column ("Realtime Partitions"; see Resources), so I won't repeat the whole process. Basically, you create new tables that resemble the active fact tables as closely as possible but are designed for fast updates. These tables are "interval" tables, meaning they contain data from only the last update. Unlike the other two alternatives, this one is truly real time in that it is visible to the BI tool and analyst. In fact, if the BI tool is smart enough, it can formulate a query to drill across the static fact tables and include the up-to-the-second records from the real-time partition. (Sadly, only a handful are smart enough: those few that actually generate SQL from a schema.) All of these approaches have merit. But so far, I've discussed only the fact tables. There are other tables in the data warehouse that are updated in tandem with the facts. Dimension tables have to be maintained in real time to preserve not only referential integrity, but to provide all the analytic horsepower of the data warehouse. Unlike the fact tables, which are almost always time-series, the dimension tables lack a time key for partitioning. Trickle and flip is one alternative, so that the updating fact table is being kept in sync with the associated dimension tables. Dimension tables tend to be smaller than fact tables, but we can't always count on that. If they are very large, they may overwhelm the trickle and flip approach. The same approach works for aggregate tables as well. You might be tempted to avoid the aggregates and catch up at the end of the day. But doing this means that queries returned from aggregate tables might return different answers from the rolled up detail, which is unacceptable. Also, aggregation is typically performed on any dimension, not just time, so this method could distort results in ways that aren't easy to spot. The RTDW is here to stay, and it will permeate the landscape over the next five to seven years. Many data warehousing "best practices" need modification or retirement to accommodate it, but many practices in place now will extend nicely to the new approach. It will be a disruptive development to some groups, so expect resistance. Ultimately, the greatest challenges will be on the front end, improving query performance and generating more intelligent SQL, a challenge that will be shared by database vendors, BI tools vendors, and practitioners who implement these systems. It's a risky undertaking; new development should proceed carefully as tools and methodologies improve. But the demand for real-time data mixed with historical and enterprise perspective is growing rapidly. Of course, your initial projects should be those that are driven strictly by solid business process requirements. Neil Raden [nraden@hiredbrains.com] designs and implements data warehouses and analytic applications for clients in North America and Europe. RESOURCES"Real-Time Data Warehousing: 5 Challenges and Solutions" Related Articles on Application Integration at IntelligentEnterprise.com: Part I: "Real Time: Get Real," June 17, 2003 "Realtime Partitions," Feb. 1, 2002
|
Most Popular This Week
IE Weekly Newsletter
Subscribe to the newsletter
|
|
|



