Welcome Guest. | Log In| Register | Membership Benefits

Intelligent Enterprise

Better Insight for Business Decisions

Intelligent Enterprise - Better Insight for Business Decisions
search Intelligent Enterprise
Home
Digital Library
Events
RSS | Newsletters
Webcasts




December 5, 2001

Clickstream Data Mart

It's part of the 360-degree customer view. It's being requested more. Are you ready?

By Joe Caserta
Edited by Ralph Kimball

The clickstream's image is finally transforming, deservedly so, from that of a glorified "hit counter" to that of a crucial customer information source. Case in point: Over the past few months, many people from even non-dot-com corporate environments have asked me about clickstream data marts. Most of this change of perception probably owes to the endless effort businesses are exerting to represent all 360 degrees of the customer in their data warehouses. After all, the clickstream data mart is a strategy that allows companies to truly complete the customer profile, by providing a way to integrate all Web-based customer activity data with conventional transactions.

SHEER VOLUME

Important as the clickstream is, as most extract-transform-load (ETL) developers have come to realize, analysis of the raw Web logs that yield clickstream data can be a daunting task. The sheer volume and unformatted nature of a Web log makes for very complex parsing strategies. Data warehouse architects and developers already deal with very large volumes of data, but Web logs can easily be 10 times larger. Therefore, it is 10 times more critical that the dimensional data model be precise and that the ETL process be efficient.

Because of the effort required to conquer this Goliath data source, many companies purchase preprogrammed applications to gather the statistics from their Web sites, creating isolated reporting solutions for single subject areas. Any data warehouse manager knows that creating such stovepipes is one of the taboos of a successful data warehouse implementation. Companies with these solutions in place are now realizing the shortcomings of these disconnected applications.

Web log data has much more to offer than basic Web statistics, a fact that cannot be fully appreciated unless the Web log is integrated into the data warehouse. This important data includes key customer behavior habits that have never before been available to corporate decision makers.

SAME, BUT DIFFERENT

The fact is the clickstream data mart can be implemented just like any other data mart in your data warehouse — with some slight modifications. The data warehouse manager faces the following challenges in the clickstream data mart development life cycle:

  • Unlike most data mart implementations, in which primarily DBAs and data administrators own the knowledge of the source data, the data warehouse team will need to work closely with webmasters and programmers. These groups may be the only resources who can supply the crucial information necessary for finding the key elements the business requires.
  • The types of substring and instring functions required to process the Web logs during the ETL process are terribly inefficient in most DBMSs and ETL tools. Because of the daunting size, you must optimize the clickstream ETL process for performance. Inefficient ETL processing will not reach completion in the allocated load window.
  • The vast amount of data makes it necessary to limit the revisiting or reloading of logs to an absolute minimum. Therefore, the ETL process must discover, transform, and load new pages, page types, Web servers, and customers during the first pass of the Web log. Additionally, the process must also handle new robots, exclusions, and altered business rules gracefully.

The logs themselves, regardless of the particular Web server that creates them, conform to a field layout standard that the World Wide Web Consortium (W3C) sets. As the W3C standardizes available data elements, the content of these elements varies. Additionally, the data within some of these fields is highly customizable, especially the query string and the cookie.

BUS TO THE DATA WAREHOUSE

The query string and cookie fields are what let us "plugin" a properly designed clickstream fact table into the existing data warehouse, using the data warehouse bus architecture described in Ralph Kimball's books and articles. You should use existing conformed dimensions in the data warehouse to add dimensionality to the clickstream data mart where possible. As Figure 1 shows, the Page Events Fact table utilizes the Customer, Product, Promotion, Date, and Time dimensions. These are all conformed dimensions that have been built previously, during unrelated data mart projects. This technique not only enables timely delivery of the data mart, but also provides "drill-across" access to other facts throughout the data warehouse — thus enabling you to link Web-based and conventional customer activity.

Figure 1 Use existing conformed dimensions in the data warehouse to add dimensionality to the clickstream data mart where possible.

PANNING FOR GOLD

Finding and parsing the data required for joining each clickstream fact to the conformed dimensions is another challenge. You must analyze each Web log entry, searching for specific name=value pairs that have meaning to a particular application the Web page supports. These name=value pairs are primarily stored in the cookie and user request line (also referred to as the query string).






IE Weekly Newsletter
Subscribe to the newsletter
    Email Address