Look Before You LeapPrototyping your data warehouse can save your team time and money
By Vijay Saradhi & Martin Simoneau
Missing data, gaps in data, lost time, and mounting costs are just a few of the pitfalls that you may encounter while building a data warehouse or data mart. Despite the challenges (well-documented in popular books such as Ralph Kimball's The Data Warehouse Life Cycle Toolkit), a prototype can be a great tool in the development of any data warehouse if you properly execute it, carefully define objectives, and manage user expectations. As we'll explain, a data warehousing prototype could focus on anything from a feasibility study to a micro model of the project's data extract-transform-load (ETL) process. Why Prototype?Implementing simple prototyping techniques is a sensible way to mitigate the adverse effects of unforeseen problems. The development cycle of a typical data warehouse or data mart includes various stages such as data requirement analysis, modeling, extraction, transformation, loading, report development, and testing. Because of the significant time lapse between the requirement analysis phase and final testing of the data warehouse, extensive gaps in data often go undetected until the beta-testing phase of the data warehouse. Discovering missing data at a later stage of the project often comes with a significant price tag in terms of money and lost time. Prototypes illustrate the power and capabilities of data warehousing to the user community. They are especially helpful to companies that are attempting to build their first data warehouse. For example, a medium-sized financial institution could have as many as 5,000 or more data elements that are used throughout their enterprise. Unless your company is a startup, it is very likely that your source data is buried in legacy systems, with little or no documentation. Added demand for integration of electronic-business data (e-data) and traditional-business data (t-data) only makes these data warehousing efforts excessively complex, intimidating, and impractical to execute. With these conditions and challenging deadlines, even the most experienced data warehousing professionals are likely to miss addressing some of the critical functional and data issues in their logical and physical designs. What kind of prototyping can be done in the warehousing and webhousing worlds? Seven Classic Prototyping ScenariosMake enterprisewide data architecture decisions. In 2001, the vision of most corporate CIOs appears to be threefold; namely: have an enterprise data warehouse, achieve enterprisewide application integration, and customize customer applications to achieve one-to-one personalization. One way to achieve this dream is to take a top-down approach; build everything centrally and feed the individual applications. The problem with this approach? It is a never-ending project. The other common choice is to take a bottom-up approach: Build individual applications in different departments and try to integrate them in the end. Unfortunately, integrating these indigenously built applications often requires significant reengineering of one or all of the applications involved. Instead of waiting for each application development to finish and hoping that you will eventually be able to integrate all the applications, it is better to proactively assess the effort involved in the eventual integration. Clearly in this case, prototypes are an extremely helpful tool in assessing the complexities and streamlining the design of your project. Evaluate the software. As an IT professional, you need to purchase several software tools from an ever-increasing pool of software vendors. Your software needs can span a wide variety of choices such as query tools, distribution tools, process automation and scheduling tools, ETL tools, administration tools, application servers, RDBMS servers, security administration tools, and so on. Every vendor claims that its particular piece of software works well in a wide variety of hardware and software environments. But the truth of the matter is, each piece of software typically performs well in a particular set of conditions and hasa mediocre performance in other situations or could even pose unexpected problems in some situations. For example, a query tool designed for relational online analytic processing (ROLAP) type databases might not perform so well when the database is designed as a star schema. A multidimensional OLAP (MOLAP) tool might cause some serious network traffic problems when used in corporations connected via limited and shared bandwidth wide-area networks. Prototyping can proactively identify problems and limitations associated with any software. Evaluate the hardware and its layout architecture. In general, hardware architecture layouts are harder to prototype and far less reliable unless extreme care is taken in simulating realistic testing conditions. However, there are a few situations where prototyping is easy enough to do. For example, you may be designing the hardware layout for your enterprise data warehouse and your data center is located in one country and corporate users are based in different locations across the world. In this case, where would you place your database and application servers to achieve maximum availability to most of the users with the best possible performance? Perhaps your DBA suggested using replication to distribute data to major nodes and distribute data from the nodes to clients (four-tier architecture). You can test the feasibility of the design by prototyping. Evaluate the data model and test for data sufficiency. Your data modeler designed the logical model with the ability to support all the user's requirements. After finishing the modeling, she had to revise the model several times for a number of reasons such as unavailability of data, unclean data, and changes in business requirements and project scope. It is dangerous not to check your revised model for data sufficiency (making sure your revised model can still support user needs). Once the number of data elements exceeds a few hundred elements, it is very difficult and far less reliable to visually check the model for data sufficiency. With a prototype handy, you could accomplish this task fairly easily. (More on this approach later.) Design the ETL process. Here's another typical scenario: the source data is a mainframe MVS file system and DB2, your target database runs on Unix, and you are not using an ETL tool. Your choices are to run your transformation on the mainframe, load the data into a mainframe database such as DB2 and run the transformation process in DB2, FTP your files to your Unix environment and run it there, or load the data to a staging database and run your transformations there. You could also mix and match any of these choices to gain optimum results. A prototype is a very useful tool in determining which alternative ETL designs works best. Size the database. Some of the most common problems that plague data warehouses are underestimating or overestimating the database size. This often happens because your data model changed after you estimated the size and you didn't reestimate the size. If you are using a tool such as ERwin to do your estimates, you could also err on your estimates if you don't know how the tool estimates the database size. With the ever-increasing size of enterprise warehouses and webhouses today offset by limitations to any server architecture and RDBMS, accuracy in database sizing becomes more critical. With prototyping you could greatly improve the accuracy of your estimates by including characteristics of real data and measuring (instead of calculating) the data object sizes. Determine the partitioning strategy. The goal of partitioning is to achieve optimum performance out of your queries by distributing your I/O and taking advantage of all the database and hardware features available to you. You need to keep two issues in mind before deciding on a partitioning strategy: partitioning should serve as many queries as possible, and repartitioning and reorganizing large databases is costly and time-consuming. See the sidebar, "Critical Reminders," for some typical problems that every data architect faces.
|
Most Popular This Week
IE Weekly Newsletter
Subscribe to the newsletter
|
|
|



