http://www.intelligententerprise.com/010308/webhouse_1.jhtml
In the past three or four years, a segment of the data storage industry has been quietly building a new architecture that has real potential for our larger, busier data warehouses. This new architecture is called the storage area network (SAN). Think of a SAN as a way to take all the disk drives off all your mainframes and servers, concentrate the drives in a single location, and then allow all the servers to read and write to any combination of the drives simultaneously.
If you could concentrate all your storage technology in one location, together with universal access, you could realize some interesting economies of scale. You could also eliminate redundant costs, compared with the conventional processor-controls-its-own-storage architecture that you are probably using with most of your systems.
Let's take a quick look at a typical SAN configuration, which Figure 1 illustrates. As its name implies, a SAN is its own network, almost always based on fiber channel technology. Fiber channel technology is capable of very high bandwidths, matching the ability of high-performance disk drives to transfer data at their highest sustained rates. But unlike computer buses and SCSI chains, fiber channel can be extended across very large campuses. A SAN based on 9-micron fiber optics can extend to a 10km diameter. Keep this thought in mind when I discuss backup and disaster recovery.
SANs normally contain storage devices, servers, and switches. A server can be any of the familiar server types, including online transaction processing (OLTP) servers, data staging servers for your data warehouse back room, presentation servers for your data warehouse front room, and a wide variety of other servers. Other servers include those devoted to data administration and functions such as data mining, multimedia servers, conventional file servers, and "hot response caches" found in Web-centric data warehouses.
Every server that is part of the SAN normally has a fiber channel interface to connect inward to the SAN and a local network interface to connect outward to a conventional local area network (LAN). The SAN switches are capable of connecting every server to every storage device on the SAN, at fiber channel speeds.
At this point, you are probably thinking of a number of advantages that a SAN could bring to a large, busy data warehouse. Here's an attempt to list all the ways a SAN could be interesting to a data warehouse:
High-performance disk access. Above all, a SAN offers very high data transfer rates from disk to server and directly from disk to disk. SANs transfer data at 100MBps, with promises in the near future ranging up to 400MBps. The current speed of 100MBps is comparable to the speed of gigabit Ethernet, but a SAN has the immense advantage (compared to a LAN) that every server has intimate access to every storage device. Some people have described the SAN as "SCSI on steroids."
High-performance transfer between applications. A typical data warehouse operation is bottlenecked by two, or possibly three, major data transfer steps. The OLTP system must transfer the primary production data to the staging area (back room) of the data warehouse. Or maybe this first step transfers data to an operational data store (ODS), which we will think of as being in the back room. In either case, a lot of very granular data must be physically copied from one storage device to another. A large retailer could transfer 50 million sales transaction records per day to the staging area. A Regional Bell Operating Company could transfer 200 million call detail records to the staging area each day. And finally, a huge Internet site, such as AOL or Microsoft, could transfer several billion page event records from production Web servers to a staging area each day. The secret is to have both the production servers and the components of the data warehouse back rooms and front rooms all on the same SAN.
A second transfer in the data warehouse must take place after the data goes through all the cleaning steps in the data staging area. In this second step, a "dimension authority" replicates conformed dimensions to many distributed data marts. Because an entire enterprise can use a single SAN, the separate data marts can all be resident on the SAN and can receive the conformed dimensions at high data rates. This possibility raises an interesting, subtle point. The data warehouse can still be a highly distributed affair with separate data marts organized around primary data sources. Having a SAN does not require you to build a monolithic, centralized data warehouse!
A third data transfer step might take place for certain kinds of data warehouse clients, such as data miners, who need to transfer very large "observation sets" from the normal presentation services of the data warehouse into their specialized tools - such as decision tree, neural network, and memory-based reasoning tools. These same specialized end users may also transfer large data sets back into the data warehouse after they have run what-if scenarios or after they have computed behavior scores for all the enterprise's customers.
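To get a feel for what these transfer steps cost at SAN speeds, here is a rough back-of-envelope sketch. The record counts and the 100MBps and 400MBps rates come from the discussion above; the 100-byte average record size is purely an assumption for illustration, as is the function name, and real sustained rates will depend on the devices and the workload.

```python
# Back-of-envelope transfer times for the daily loads mentioned above.
# BYTES_PER_RECORD is a hypothetical figure; the article gives no record sizes.

BYTES_PER_RECORD = 100   # assumed average record size, for illustration only
MB = 1_000_000           # decimal megabyte


def transfer_seconds(records: int, rate_mb_per_s: float,
                     bytes_per_record: int = BYTES_PER_RECORD) -> float:
    """Seconds needed to move `records` rows at a sustained rate given in MB/s."""
    total_bytes = records * bytes_per_record
    return total_bytes / (rate_mb_per_s * MB)


daily_loads = {
    "retailer sales transactions": 50_000_000,
    "RBOC call detail records": 200_000_000,
}

for name, records in daily_loads.items():
    for rate in (100, 400):  # current and promised SAN rates
        secs = transfer_seconds(records, rate)
        print(f"{name}: {secs:.1f} s at {rate} MBps")
```

Under these assumptions, the retailer's daily load moves in under a minute and the call detail load in a few minutes, which suggests the raw copy is unlikely to be the bottleneck once every server sits on the same SAN.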
High-performance direct transfer from disk to disk. Data warehouse operations have all sorts of needs for major data copying from disk to disk. This kind of data copying does not involve a complex application. You may need to move a suite of databases from a test machine onto a production machine. Or perhaps you will need to physically replicate an entire application in order to scale it upward to meet increased demand. (And of course, you can scale down an application pretty easily on a SAN when the surge of demand passes.) Maybe we should call this kind of scaling "naive parallelization" because it lets you copy both servers and data without very much planning. All that seems to be required is to segment the incoming demand so that each of the servers on the SAN can satisfy its part separately. Busy Web sites take this approach, bringing "proxy servers" that are identical clones of each other online to handle high load conditions.
An interesting wrinkle in the copying-to-accomplish-scaling scenario allows a DBA to copy granular base data onto separate storage devices but not necessarily copy aggregated data. Separate application servers on the SAN could all have their own copies of the base data but navigate to the same aggregate tables on a single storage device if it makes sense to do so. Given the usual rule of thumb that the total size of aggregated data is equal to the base data, the disk requirements for a parallelized application might be reduced significantly, under the right conditions.
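The disk savings from sharing aggregates can be sketched with simple arithmetic. The only figure taken from the text is the rule of thumb that aggregates are about the same size as the base data; the four-server, 1TB example and the function name are hypothetical.

```python
# Rough storage estimate for "naive parallelization" with and without
# shared aggregate tables. Example sizes and server counts are assumptions.

def replicated_storage_tb(base_tb: float, servers: int,
                          aggregate_ratio: float = 1.0,
                          share_aggregates: bool = False) -> float:
    """Total disk (TB) when every server gets its own copy of the base data.

    aggregate_ratio=1.0 encodes the rule of thumb that aggregates are
    about as large as the base data.
    """
    aggregate_tb = base_tb * aggregate_ratio
    # Either one shared copy of the aggregates, or one copy per server.
    aggregate_copies = 1 if share_aggregates else servers
    return servers * base_tb + aggregate_copies * aggregate_tb


full = replicated_storage_tb(base_tb=1.0, servers=4)
shared = replicated_storage_tb(base_tb=1.0, servers=4, share_aggregates=True)
saving = (full - shared) / full

print(f"full copies: {full} TB, shared aggregates: {shared} TB, "
      f"saving: {saving:.1%}")
```

In this hypothetical case, four full copies need 8TB while four base copies plus one shared set of aggregates need 5TB. Whether the shared aggregate device can keep up with four servers reading it is a separate question that this arithmetic does not answer.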
Justified economics for higher-performance devices. Although this benefit may sound like vendorspeak, it is probably true that the economics of centralizing the physical storage really do justify investing in higher-performance devices. This seems especially true for high-performance tape-backup devices. It might not make sense to have a terabyte-capacity tape-backup system for any one application, but it probably would if the device could be shared by the whole enterprise. High-end (expensive) tape subsystems can handle 20TB of data with transfer rates of 500GB per hour. This scenario supports the next point as well.
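A quick calculation shows what the quoted tape figures imply for backup windows. The 20TB capacity and 500GB-per-hour rate are from the text; the 1TB single-application case and the function name are assumptions added for comparison, and the sketch uses decimal units (1TB = 1,000GB).

```python
# Backup-window estimate from the tape subsystem figures quoted above.

GB_PER_TB = 1_000  # decimal units


def backup_hours(data_tb: float, rate_gb_per_hour: float = 500.0) -> float:
    """Hours to stream `data_tb` terabytes to tape at a sustained rate."""
    return data_tb * GB_PER_TB / rate_gb_per_hour


print(f"Full 20TB subsystem: {backup_hours(20):.1f} hours")
print(f"Hypothetical 1TB application: {backup_hours(1):.1f} hours")
```

At the quoted rate, filling the whole 20TB subsystem takes about 40 hours, while a single 1TB application would occupy the device for only about two. That gap is one way to see why such a device is hard to justify for one application but plausible when shared across the enterprise.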
Elimination of intensive bandwidth data transfers from the LAN. The huge data transfers needed to back up all the various stages of the data warehouse can be removed from the main LAN.
More efficient use of expensive administrative personnel. The arguments for centralizing backup facilities apply equally to centralizing the personnel who perform these functions. These folks can work more efficiently when they carry the full production-level responsibilities of the whole enterprise, and the enterprise can justify hiring more skilled people and paying them more.
A single centralized scratch space for major database table manipulations. The standard formula of planning for five times the storage space of your largest fact table can be significantly scaled back because a number of large applications can share a common scratch space for temporary copies of database tables.
No need for applications to know where the data is physically located. The SAN is between the application and the physical storage.
Openness, letting multiple technologies access the stored data. A typical data warehouse staging area could have the production OLTP system running under Unix, the data extract-transform step running under NT, and various data marts running under Unix or Windows. The extract-transform-load tool could control the flow of data as it is being transformed by reading and writing from the native files of each of these systems with the performance of local disk storage.
Configurable to support disaster recovery and fault-tolerant computing. You can extend the SAN to include a separate physical facility up to 10km away from the other devices.
SAN vendors have not developed their technology exclusively for data warehousing, but given the length of the preceding list, we should be a principal target for these vendors. As we in the data warehouse community become more and more distracted by the physical issues of managing our huge data sets, we should welcome this new set of services and products offered by the storage technology vendors.