Data Frontiers, by Curt Monash
Curt Monash runs Monash Research, which provides strategic, analysis-based advice to users and vendors of advanced information technology. He also writes the blogs DBMS2, Text Technologies, and Strategic Messaging.
eBay's Enormous Data Warehouses Detailed
A few weeks ago, I had the chance to visit eBay, meet briefly with Oliver Ratzesberger and his team, and then catch up later with Oliver for dinner. I've already alluded to those discussions in a couple of posts, specifically on MapReduce (which eBay doesn't like) and the astonishingly great difference between high- and low-end disk drives (to which eBay clued me in). Now I'm finally getting around to writing about the core of what we discussed, which is two of the very largest data warehouses in the world.
Metrics on eBay's main Teradata data warehouse include:
Metrics on eBay's Greenplum data warehouse (or, if you like, data mart) include:
eBay's Teradata installation is a full enterprise data warehouse. Besides size and scope, it is most notable for its implementation of Oliver's misleadingly named analytics-as-a-service vision. In essence, eBay spins out dozens of virtual data marts, which:
The whole scheme relies heavily on Teradata's workload management software to deliver with assurance on many SLAs (Service-Level Agreements) at once. Resource partitions are a key concept in all this.
So far as I can tell, eBay uses Greenplum to manage one kind of data -- Web and network event logs. These seem to be managed primarily at two levels of detail -- Oliver said that the 17 trillion event detail records reduce to 1 trillion real event records. When I asked where the 17:1 ratio comes from, Oliver explained that a single web page click -- which is what is memorialized in an event record -- results in 50-150 details. That leaves a missing factor of 3-8X, but perhaps other, less complex kinds of events are also mixed in.
Two uses of eBay's Greenplum database are disclosed -- whittling down from detailed to click-level event data, and sessionization. The latter seems to be done in batch runs and to take 30 minutes per day. A couple of other uses are undisclosed. I assume eBay is doing something that requires UDFs (User-Defined Functions), because Oliver remarked that he likes the language choices offered by Greenplum's Postgres-based UDF capability.
But basically eBay's Greenplum database is used for and evidently does very nicely at:
eBay's Teradata database handles the rest.
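The "missing factor" in the event-record discussion above can be checked with a little arithmetic. The sketch below uses only the figures quoted in this post (17 trillion detail records, 1 trillion event records, 50-150 details per page click); the constant and function names are hypothetical, chosen for illustration, and nothing here reflects how eBay actually computes anything.

```python
# Figures quoted in the post; all names below are illustrative assumptions.
DETAIL_RECORDS = 17 * 10**12   # event detail records
EVENT_RECORDS = 1 * 10**12     # "real" event records
DETAILS_PER_CLICK = (50, 150)  # details produced by one web page click


def missing_factor(details_per_click: int) -> float:
    """Ratio of the per-click detail count to the observed average."""
    observed_ratio = DETAIL_RECORDS / EVENT_RECORDS  # 17.0 details per event
    return details_per_click / observed_ratio


low, high = (missing_factor(d) for d in DETAILS_PER_CLICK)
print(f"missing factor: {low:.1f}x to {high:.1f}x")
```

Run as written, this gives roughly 2.9x to 8.8x, which is in line with the "3-8X" gap noted above and with the guess that simpler, lower-detail events are mixed in with page clicks.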





