
Intelligent Enterprise

Better Insight for Business Decisions



April 10, 2000 Volume 3 - Number 6


How data warehouse vendors think you can channel the clickstream deluge

More Than You Hoped For


Richard Winter                

Be careful what you wish for these days. Roaring technological change may well make it come true — and then what will you do?

Consider the marketing analysts. Five years ago they were saying, “Soon, we will have databases recording every customer interaction, no matter how seemingly insignificant….” Then, they would go on to predict how that holy grail, the complete customer record, would allow us to understand customer behavior, interests, and needs in greater depth than ever before. From that understanding would come growth, profit, customer satisfaction — everything a business executive could want, they said.

Well, when it comes to the complete customer record, the Internet has let the genie out of the bottle. The clickstream, that record of every mouse click or keystroke of every visitor to a Web site, has generated a more complete customer record than has ever existed in any form of commerce.

In a retail store, the most you can get today is the full detail the point-of-sale device records: products purchased, quantities, prices, time of day, and sometimes, by whom. But on the Web, what glorious new possibilities exist! You can know a visitor’s every single action: questions asked, products viewed, alternatives compared, prices viewed, last action before leaving, and so on. Of course, you get full point-of-sale detail, too.

In fact, clickstream warehousing (the concept of storing, analyzing, and mining the clickstream, and integrating it with extensive customer data) has fired the imagination of marketers like few concepts in recent memory. It is, without doubt, the hottest concept going in the data warehouse field. The Data Webhouse Toolkit (John Wiley & Sons, 2000), which my fellow columnist Ralph Kimball cowrote with Richard Merz, is a very good book on the subject. And there are more products than you can shake a stick at, either on the market or coming to it, that can help you with your clickstream warehouse.

So, it seems the marketing analysts’ dream has already come true. We have at hand today all the information we could possibly want about every action of every visitor to a Web site.

What’s the problem, then? There is way too much of the stuff. The group of Web sites Microsoft operates (microsoft.com, expedia.com, msnbc.com, hotmail.com… a total of 30 services in 27 countries) collectively generates about two billion hits a day, according to Microsoft’s John Eng, a data warehouse product manager. This number is growing rapidly. The Web server logs every hit with a unique page ID, a user ID, a timestamp, and other data. Microsoft can understand its customers only by integrating all these separate logs, with one another and with other customer data (25 million uniquely identified individuals in Microsoft’s case), and then processing, storing, and analyzing them.

What comes out of those Web servers is thus more than 200GB of data per day! That’s more than 70TB per year, and it’s growing fast—more data than most of us would want to be responsible for understanding. And, by the way, there are Web-site operators with more hits per day than Microsoft.
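The figures above can be checked with a little arithmetic. The sketch below assumes decimal units (1GB = 10^9 bytes, 1TB = 10^12 bytes) and a flat daily volume with no growth; the function names are illustrative only.

```python
# Back-of-the-envelope check of the volumes quoted in the text.
# Assumption: decimal units (1 GB = 10**9 bytes, 1 TB = 10**12 bytes).
HITS_PER_DAY = 2_000_000_000      # "about two billion hits a day"
BYTES_PER_DAY = 200 * 10**9       # "more than 200GB of data per day"


def bytes_per_hit(bytes_per_day, hits_per_day):
    """Average size of one logged hit."""
    return bytes_per_day / hits_per_day


def terabytes_per_year(bytes_per_day, days=365):
    """Yearly volume if the daily volume stayed flat (it will not)."""
    return bytes_per_day * days / 10**12


print(bytes_per_hit(BYTES_PER_DAY, HITS_PER_DAY))   # 100.0
print(terabytes_per_year(BYTES_PER_DAY))            # 73.0
```

In other words, each hit averages roughly 100 bytes of log data, and 200GB a day works out to about 73TB a year before any growth is counted.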

Not to mention, Web-site traffic has just begun to build. In November, 75 million people reportedly logged on to the Internet as the record-breaking Christmas shopping season took hold. But wait until there are a few hundred million Web-enabled information appliances operating out there. Then users won’t have to be at a PC and in many cases can be wireless and mobile — and the click volumes will really build.

In other words, massive clickstreams growing at astonishing rates are here for a while. And, at the same time marketing analysts are wondering if they should have wished more carefully, database professionals are trying to figure out what in cyberspace to do.

Several approaches are in use. According to Microsoft’s Tarek Najm, Microsoft DBAs prefer to first collect, parse, and transform the clickstream, discarding some of the less meaningful information. They then summarize it, after which it amounts to about 1GB per day, and store it in a SQL Server database. They keep these low-level aggregates for four months, which yields a summarized clickstream database of about 120GB, a manageable size. They further aggregate the data into a set of cubes stored with Microsoft OLAP Services. About 350 analysts throughout Microsoft and partner organizations can access the OLAP cubes and drill through to the SQL Server database. In addition, the fully detailed, unaggregated clickstream resides in a separate database for batch reporting and data mining. Microsoft does not publicly disclose the size and retention period of this batch database. However, I believe it could be taking in as much as 100GB per day and that Microsoft stores it in SQL Server.
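To make the parse-then-summarize step concrete, here is a minimal sketch. It is not Microsoft’s pipeline: the tab-separated log layout, the sample data, and the function names are all assumptions made for illustration. It parses raw hits, drops lines it cannot read, and keeps only a count per day and page.

```python
from collections import Counter
from datetime import datetime

# Hypothetical log layout (an assumption, not Microsoft's format):
# "timestamp<TAB>user_id<TAB>page_id"
SAMPLE_LOG = [
    "2000-04-10T09:15:02\tu1\thome",
    "2000-04-10T09:15:40\tu1\tproducts",
    "2000-04-10T10:01:11\tu2\thome",
    "malformed line",
    "2000-04-11T08:00:00\tu1\thome",
]


def parse_hit(line):
    """Return (day, user_id, page_id), or None if the line cannot be parsed."""
    parts = line.split("\t")
    if len(parts) != 3:
        return None
    try:
        day = datetime.fromisoformat(parts[0]).date()
    except ValueError:
        return None
    return day, parts[1], parts[2]


def summarize(lines):
    """Count hits per (day, page_id), discarding unparseable lines."""
    counts = Counter()
    for line in lines:
        hit = parse_hit(line)
        if hit is not None:
            day, _user, page = hit
            counts[(day, page)] += 1
    return counts


daily = summarize(SAMPLE_LOG)
for (day, page), hits in sorted(daily.items()):
    print(day, page, hits)
```

The summary is far smaller than the raw log because it grows with the number of distinct days and pages rather than with the number of hits, which is the effect the 1GB-per-day figure reflects. At that rate, four months (about 120 days) of summaries comes to roughly 120GB.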

Najm thinks the biggest challenge right now is the daily processing that must be completed before loading the database. The problem of bringing together logs from hundreds of separate servers and transforming, integrating, analyzing, and aggregating them has required significant innovation, given the rapidly growing volume, which already totals 200GB per day. Much of Microsoft’s work has gone into learning how to solve that problem efficiently, in part via parallel processing. Microsoft developed the results of that work into an offering, Business Internet Analytics (BIA), announced in November.

DataSage approaches the problem somewhat differently, using its netCustomer Analysis Engine, according to VP of engineering Jon Lunny. Using its proprietary Parallel Data Channel Architecture, netCustomer processes the full detail of the clickstream, joins it with customer data, and builds a profile of each customer. The profile is then stored in a relational database (such as Oracle), where it is available for analysis. In the DataSage approach, the clickstream itself is not necessarily retained—only the analysis by customer. DataSage has been acquired by Vignette, a provider of e-business applications—perhaps an indication of the degree to which business intelligence analysis, and hence, clickstream analysis, is increasingly considered essential to e-business.
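The profile-per-customer idea can be sketched in a few lines. This is not DataSage’s engine or its Parallel Data Channel Architecture; the `Profile` fields, the `segment` attribute, and the sample data are hypothetical. The sketch reads each click once, joins it with the customer record, updates that customer’s profile, and keeps nothing else.

```python
from dataclasses import dataclass, field


@dataclass
class Profile:
    customer_id: str
    segment: str                                  # copied from the customer record
    hits: int = 0
    pages: dict = field(default_factory=dict)     # page_id -> view count


def build_profiles(clicks, customers):
    """Single pass over the clicks; only per-customer profiles are retained."""
    profiles = {}
    for customer_id, page_id in clicks:
        record = customers.get(customer_id)
        if record is None:
            continue                              # unidentified visitor: skipped in this sketch
        profile = profiles.get(customer_id)
        if profile is None:
            profile = Profile(customer_id, record["segment"])
            profiles[customer_id] = profile
        profile.hits += 1
        profile.pages[page_id] = profile.pages.get(page_id, 0) + 1
    return profiles


# Hypothetical customer data and click stream.
customers = {"c1": {"segment": "outdoor"}, "c2": {"segment": "apparel"}}
clicks = iter([("c1", "skis"), ("c2", "sweaters"), ("c1", "skis"),
               ("c9", "home"), ("c1", "boots")])
profiles = build_profiles(clicks, customers)
print(profiles["c1"])
```

Because the clicks arrive as an iterator and are never stored, memory use depends on the number of customers, not the number of clicks. The trade-off is that a question nobody thought to encode in the profile cannot be answered later.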

According to Charlie Berger, Oracle’s director of marketing for data mining, the biggest challenge is to get business-level knowledge for strategic conclusions from the clickstream. When a given customer visits the site, you want to know what to present first: sweaters or shirts, cotton or wool. Such conclusions are far more advanced than aggregating hits by URL—they require application of true data mining capability to extremely large volumes of data.

Oracle acquired Thinking Machines’ Darwin product partly to attack the requirements clickstreams present. Darwin is now part of Oracle’s Intelligent Webhouse offering; Berger believes that what it lets you do with clickstream data is typical of what customers want. As an example of the kind of results sought, Berger described how Darwin helped a client select the 5,000 people out of 10 million who were most likely to buy a ski vacation costing $2,000 or more. The project required analysis of 650 data items per person to identify the particular combination of traits that indicated a propensity to buy. The customer data that needed sampling and analyzing had 200 million purchase transactions associated with it. People will continue to mine samples, perhaps samples of a million clicks. Scalable data mining technology is therefore an important requirement, with scalability beyond many existing products’ abilities.

Allen Razdow, CEO of Torrent Systems, thinks the key issue is development of an infrastructure that enables scalable marketing. The data transformation, integration, aggregation, analysis, and mining that will be required, he says, is well beyond the capabilities of the marketing infrastructure available in most solutions today. Torrent’s Orchestrate enables the effective development, testing, and operation of multistep, parallel business processes—such as those required for clickstream analysis and other aspects of e-business.

Glen Sheffield, DB2 UDB senior product manager at IBM, says Razdow describes what IBM did at Fingerhut, where Orchestrate, combined with IBM’s DB2 on an RS/6000, formed the basis of a system that can manage clickstream data—in combination with direct mail, catalog, and call-center data—to increase efficiency at one of the largest catalog businesses in the world. This system manages terabytes of data, including more than 1,400 attributes for each of 12 million customers, to support an operation that includes a traditional catalog business and an e-business.

Razdow points out that a good many e-business operators are not interested in storing large volumes of clickstream data. In contrast to the warehouse approach, Razdow says that many just want to save the analysis results. They recognize that there are risks in doing it this way, but they prefer those risks to a large investment in storage and warehouse infrastructure. For these companies, the key is being able to complete a large analysis quickly—typically in a single pass of the data. A strength of Orchestrate is that it can be deployed either with or without a data warehouse.

Bob Terdeman, chief warehouse architect at EMC, says that most clickstream analysis to date has focused on operational use: analyzing traffic and usage patterns to understand how better to serve customer demand and meet response time objectives. The next stage, which will add much more value, is using the clickstream to better understand customers and what they want. Terdeman and Berger independently commented that the big payoff in e-business comes when you build a long-term relationship with the customer. Gaining real insight through the clickstream could be a significant contributor to that process. Terdeman says he has seen customers dealing with peak hit rates exceeding a billion an hour. The leading-edge e-business operators are capturing the clickstream data, processing it to the extent they can, and retaining it for as long as they believe they can afford. The next challenge, which Terdeman feels has not yet been met on a large scale, is figuring out the data warehouse design that makes the concept work. Terdeman predicts many e-businesses will work toward storing 15 to 27 months of clickstream data, just as they frequently store that much, or more, transaction data today. Given the data volumes and business needs involved, Terdeman believes we will soon see 100TB clickstream warehouses—and we won’t think it’s a big deal.
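A rough sizing shows why a 100TB figure is plausible. The sketch below is an illustration under loud assumptions: it applies the 200GB-per-day rate quoted earlier for Microsoft to Terdeman’s 15-to-27-month retention range, uses decimal units, treats a month as 365/12 days, and ignores growth, compression, and indexes. None of these assumptions comes from Terdeman.

```python
GB_PER_DAY = 200                 # assumption: Microsoft's quoted daily volume
DAYS_PER_MONTH = 365 / 12        # assumption: average month length


def retained_terabytes(months, gb_per_day=GB_PER_DAY):
    """Raw clickstream held after `months` of retention at a flat daily rate."""
    days = months * 365 / 12
    return gb_per_day * days / 1000   # 1 TB = 1000 GB (decimal units)


low = retained_terabytes(15)
high = retained_terabytes(27)
print(f"{low:.2f} TB to {high:.2f} TB")
```

At that flat rate, 15 months of detail comes to a little over 90TB and 27 months to more than 160TB, so a site of this size would pass 100TB well inside the retention range Terdeman describes.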

The perspective is similar at NCR, where director of retail e-commerce solutions Barry Rickman argues that you must capture and retain the full detail of the clickstream for an extended period. The only way to extract business insight from it, then, is to warehouse it. That way, it can be fully integrated with data about customers, products, Web sites, site performance, and so on.

Why retain a history? To give one example, perhaps the most valuable thing you can do on a site is attract and retain a profitable customer. The typical method is to build a relationship with the customer over time. It may take months, or even more than a year, for the customer relationship to become profitable and exhibit continuity. When such relationships develop, though, the company needs to mine the data associated with them to determine, first, how to create a similar experience for other customers and, second, which other customers might be inclined to follow a similar pattern. This process requires going back through the history and analyzing all the factors relevant at the time, which calls for an extended clickstream history integrated into a fully realized data warehouse. NCR offers products and services that help retailers undertake this task, as well as ones tailored to other industries.

My own view is that most businesses will need to store the clickstream for a long while, for analysis and mining, in order to reach a fully competitive position. However, in the Internet world where change occurs at blistering speed and market turbulence is one of the few guarantees, I think many will opt for a more tactical approach, at least in the short term. But regardless of how they approach it, almost everyone is going to have to do it. E-businesses need good intelligence about their customers—and about how people react to their Web sites. They need to personalize the customer experience and identify who their customers are and determine how to build relationships with them. E-businesses need to offer the right products and services configured the right way for a new, rapidly changing world. And e-businesses need the results of clickstream analysis to figure this out.

Because virtually every business, to some degree, is becoming an e-business, it looks like we all have a lot of clickstream analysis ahead.

Richard Winter (Richard.Winter@wintercorp.com, fax: 617-338-4499) is a specialist in large database technology and implementation, and president of Waltham, Mass.-based Winter Corp. (www.wintercorp.com).

RESOURCES

Oracle: www.oracle.com
DataSage: www.datasage.com
Vignette: www.vignette.com
Torrent: www.torrent.com

 

 

 

Copyright © 2004 CMP Media Inc. ALL RIGHTS RESERVED
No Reproduction without permission
     



