January 05, 1999, Volume 2 Number 1
The clickstream data mart can tell you a lot about your customers and their inclinations
Clicking with Your Customer
Ralph Kimball
The Web is pummeling the data warehouse marketplace on all sides. The Web's presence is changing our clients' and servers' architecture by forcing the front-end tool vendors to move the bulk of their technologies toward the server. This shift is taking place because the vendors can no longer control the software content and because everyone is demanding that a Web browser, not a dedicated application, must paint the bits on the screen.
End-user application developers are increasingly building applications around Web pages. The user interface development environment of choice is now a Web page development environment. Just set up a Web page and embed various data- aware objects in it. Endow the Web page's user interface with a few data warehouse gestures, and you have a sophisticated drag-and-drop, drill-up-and-down, pivoting application. An interesting side effect of this forced migration of user interfaces to Web technology is that it's as easy to deliver an application over the Web as it is to deliver one to a dedicated single user.
As a result, our data warehouses are becoming data webhouses. The medium is not only our delivery vehicle, but our business. Increasingly, we mix the fundamental business content of what we sell with Web services and capabilities.
An interesting part of the emerging data webhouse is the data mart that stores and presents the Web activity for later analysis. Fundamentally, we want to analyze all the hits on our Web site. We want to build a comprehensible view of the immense stream of clicks arriving at our sites, whether we're dealing with our intranet users or with customers on our public Web site. We call this aspect of our data webhouse the clickstream data mart.
The Goals of the Clickstream Data Mart
The clickstream data mart can tell us a great deal about detailed customer behavior. If we have information on our customers' every click, gesture, and trajectory through our Web site, we should be able to answer such questions as:
What parts of our Web site get the most visitors?
What parts of the Web site do we associate most frequently with actual sales?
What parts of the Web site are superfluous or visited infrequently?
Which pages on our Web site seem to be "session killers," where the remote user stops the session and leaves?
What is the new-visitor click profile on our site?
What is the click profile of an existing customer? A profitable customer? A complaining customer that all too frequently returns our product?
What is the click profile of a customer about to cancel our service, complain, or sue us?
How can we induce the customer to register with our site so we learn some useful information about that customer?
How many visits do unregistered customers typically make with us before they are willing to register? Before they buy a product or service?
These questions involve analysis at a detailed, behavioral level. We describe much of this behavior as a series of sequential steps; because nearly all Web sites are organized as a hierarchical tree branching out from the home page ("index.html"), these steps necessarily describe the steps of traversing a tree -- actually, a huge tree. The biggest Web sites, such as Microsoft's, have several hundred thousand pages.
Given this information, can we imagine building the clickstream data mart using conventional slice and dice dimensional models? And if we manage to build it, how can we hope to analyze the clickstream data mart to answer all these questions?
To tackle the clickstream data mart, let's use a simple, four-step methodology to build the dimensional model. In sequence, we will define the source of our data, choose the grain of our fact table, choose the dimensions appropriate for that grain, and choose the facts appropriate for that grain. I have used this simple four-step methodology hundreds of times over the past 20 years when designing dimensional models for data warehouses.
The Data Source for the Clickstream Data Mart
We need to go after the most granular and detailed data possible describing the clicks on our Web server. Each Web server will potentially report different details, but at the lowest level we should be able to obtain a record for every page hit with the following information: precise date and time of page click; remote client's (requesting user's) IP address; page requested (with path to page starting at the server machine); specific control downloaded; and cookie information, if available.
This level of detail is both good and bad. As many transactional sources of data do, this one provides exquisite detail. But in many ways, there is too much detail, and it may be difficult to see the forest with all the trees in the way. The most serious problem, which permeates all analyses of Web clicking behavior, is that the page hits are often stateless. Without surrounding context, a page hit may just be a random isolated event that is difficult to interpret as part of a user session. Perhaps the user linked to this page from some remote Web site and then left the page five seconds later without returning. It is difficult to make much sense out of such an event, so our first goal is to identify and label complete sessions.
The second serious problem is whether we can make any sense out of the remote client's IP address. If the only client identification is the IP address, we cannot glean much. Most Internet users come through an Internet service provider (ISP) that assigns IP addresses dynamically. Thus, remote users will have a different address in a later session than they have at the moment. We can probably track the individual session reliably, but we can't be sure when the user returns to the site in a different session.
We can significantly reduce these problems if our Web server creates cookies on the requesting user's machine. A cookie is a piece of information that the requesting user "agrees" to store and may agree to send to your Web server each time his browser opens one of your pages. A cookie usually does not contain much information, but it can identify the requesting user's computer unambiguously. Furthermore, it provides a way to link page hits across a complete user session. A cookie may contain significant information if the user has voluntarily registered with your Web server and provided other information, such as a true human name and a company affiliation. In such a case, the cookie provides an ID to data you have stored in one of your own databases.
In order to make the raw clickstream data usable in our data webhouse, we need to collect and transform the data so it has a session perspective. This process will be a significant step in the back room. We assume that we have some kind of cookie mechanism that lets us transform our data source into the following format:
Precise date and time of page hit
Identity of requesting user (consistent from session to session)
Session ID
Page and event requested.
The Fundamental Grain of the Clickstream Data Mart
We now see that each event an individual user invokes in a special session is the grain of our clickstream data mart fact table. Each event is an individual record, and each record is a event on a Web page. Note that the Web server may not notice events within the user interface of a downloaded Web page unless we have programmed the Web page to specifically alert the server when the event occurs. In the back room's extract and transform process, we filter out automatic events and focus on ones more related to page formatting than to user actions. These kinds of filtered events include the download of graphic images, such as GIFs adorning a requested page. So if we have 100,000 user sessions per day on our Web site, and if each session involves an average of eight meaningful events, then we will collect 800,000 records per day.
The Dimensions (Step 3) and the Facts (Step 4)
See Figure 1 for the clickstream data mart's dimensional model. The dimensions are universal date, universal time, local date, local time, user, page, event, and session. We split the date from the time-of-day because these two time components have quite different descriptors. Date relates to calendar, weekdays, and seasons; and time-of-day relates to the specific spot we are in within a day. Most data warehouse fact tables that track specific times split the date from the time-of-day in this way. The date dimension clearly has a real join to a real dimension table with many textual attributes. The time-of-day dimension may be rather dull as a dimension unless we have some specific intervals during the day to which we are willing to assign names.
Figure 1.
The dimesional model for the clickstream datamart. The grain of the fact table is every meaningful event in every individual user session.
The fact is the number of seconds the user waits before the next event
We provide two versions of the date and time -- universal and local -- that let us align clickstream events in absolute time as well as relative to the user's wall clock. The analyst's query tool can perform this alignment, but this extra logic imposes an unreasonable burden on the application. Therefore, we prefer to provide two hard-wired entry points into each event's date/timestamp.
The user dimension should contain some useful information about who the user is, other than just a consistent machine ID. However, this will depend on whether we have coaxed the user into revealing facts about his or her identity.
The page dimension is important because it contains the meaningful context that tells the analyst the user's Web site location. Each Web page must contain some simple descriptors identifying the location and type of page. A complete path name is not nearly as interesting as such basic attributes as "Product Info," "Company Info," "Frequently Asked Questions," and "Order Form." A large Web site should have a hierarchical description associated with each page that gives progressively more detail about what the page is. This information needs to be stored in the page dimension and be maintained consistently as we update and modify the Web site. In other words, we have to update the production transaction system (the Web server) responsively to meet the needs of the data webhouse analysts.
Finally, the session dimension is more than just a tag that groups together all the page events that constitute a single user's session. This dimension is also where we label the session and trace its activity. We might characterize sessions as "Searching for Information," "Random Browsing," "Price and Feature Shopping," or "Placing an Order." We may be able to create these tags with simple criteria regarding what the user does during the session, or we may turn over the session record to a full-blown link analysis data mining package. In either case, the result is a set of descriptive tags we can put in the session dimension. The session should also be characterized by what we currently know about the customer, such as "Recent Large Purchaser," "Not Yet a Customer," or "Chronic Product Returner."
Our clickstream fact table only contains one fact, and that fact ("time spent") is an estimate. We try to accurately record the length of time the user spent on the page after the last click and before moving on. Because page browsing is basically stateless, we can never be completely sure whether the user has perhaps minimized the window or clicked on an unrelated site. We can only make an accurate estimate of the time spent on the page if we have a following event that is part of the session, but we have to be careful not to interpret long "times spent" too seriously.
Analyzing the Clickstream Data Mart
This dimensional design lets us perform many powerful queries. It is fairly easy to find the most frequently visited parts of the Web site and identify the most frequent users. We can also correlate pages and users to our more valuable customers because we know who places the order on the Web site.
The analysis of clicking histories and clicking trajectories through our Web site depends heavily on how we have built the session dimension. If we have good descriptive labels on each session and on the current state of the customer, we can perform all the necessary behavioral analyses.
The good news about this design is that we have successfully established a framework for collecting and analyzing all the clicks on our Web site. The bad news is that we really haven't shed much light on whether we are selling products or Web services. That confusion is quite deeply rooted and is one of the reasons that the Internet revolution is so interesting and important.
Ralph Kimball, Ph.D., co-invented the Star Workstation at Xerox and founder of Red Brick Systems, works as an independent consultant designing large data warehouses. He is the author of The Data Warehouse Toolkit (Wiley, 1996) and the newly published The Data Warehouse Lifecycle Toolkit (Wiley, 1998). You can reach him through his Web page at www.ralphkimball.com.