CSI: Data Warehouse > > Intelligent Enterprise: Better Insight for Business Decisions

Welcome Guest. | Log In| Register | Membership Benefits

Intelligent Enterprise

Better Insight for Business Decisions

Intelligent Enterprise - Better Insight for Business Decisions
search Intelligent Enterprise
Home
Digital Library
Events
RSS | Newsletters
Webcasts


  • EMAIL
  • PRINT
  • REPRINTS
  • Follow Us on Twitter
  • FOLLOW US
  • Share

CSI: Data Warehouse


Armed with a knowledge of patterns, you can root out bad data.


By Joe Celko
December 4, 2004

My wife and I love all the CSI police procedural dramas that are so popular now. The crime lab crew gets to the crime site knowing nothing about the situation and finds the bad guy in 60 minutes. How about a show called CSI: Data Warehouse on Tech TV?

Imagine that you walk into a client who has a large amount of data, and he wants to know if his data is real or fake. You don't know anything about his data, or even his industry. It turns out that data qua data actually has some patterns that are fairly easy to find in a modern database. Let me give a quick overview, without much mathematics, of some of the easy ones.

Seeing the Pattern

In the old days, before we had pocket calculators and cheap computers, engineers solved equations with slide rulers and books of mathematical tables. A slide ruler is accurate to three decimal places and most of the books were accurate to five or six places at least. (If you don't know what a slide ruler is, then Google it.)

In 1938, Dr. Frank Benford was a physicist at General Electric who noticed that pages in the logarithm tables that corresponded to numbers starting with the numeral one were getting worn out faster than the other pages. Most of us would have stopped at that point and not thought about it very much.

Dr. Benford could see no immediate reason that physicists and engineers would prefer logarithms starting with one. He started data mining unrelated sets of numbers. I mean really unrelated — geography, census data, baseball statistics, numbers in magazine articles, and just about anything he could find. After looking at 20,229 sets of numbers, he found that they all followed the same pattern he had seen in the logarithm tables.

The usual guess would be that all digits, one through nine, would be equally likely to pop up at the start of a string of digits. Nope, not true; Benford's Law says that it can be approximated by the formula P(d is first digit) = LOG10(1.0 + 1.0/d). The pattern is 30.1 percent for one, 17.6 percent for two, and down to 4.6 percent for nine. You can get some confirmation of this in "The First-Digit Phenomenon" by T. P. Hill (American Scientist, July-August 1998). Benford's Law gets better as the sample gets larger and more varied.

What makes Benford's Law useful to a data miner is that you don't have to understand the data. If the data drifts from the pattern, you know to look for a systematic bias or faked data. Like any statistic, it isn't a certainty, but it's a good place to start. In fact, there are fraud detection packages based on Benford's Law that look at patterns in expense reports and other financial data.

A Run of Luck

Another data mining trick is looking at runs. If you toss a fair coin 200 times, you're almost certain to encounter a run of six heads or six tails. But when people fake data, they don't like to repeat long runs of one value. This idea can be extended to multiple values and patterns of runs, but the coin toss is the simplest case and very easy to test.

We can generalize the idea of patterns over time. Given a jar of (n) numbered marbles, draw one out of the jar at random and put it back. How many draws do you have to make to have a better than 50 percent chance of getting a previous marble that was drawn (p) draws ago? You can make this scenario into customers who come back to the store, trucks that pull into the motor pool, or whatever. Most people guess (n/2), but it's much lower. The formula is a bit messy, but here's a quick table.

n

p

10

4

100

12

1000

37

10,000

118

100,000

372

In the real world, breaks in the data patterns are often the result of something systematic and not criminal. For example, the digit nine might show up a lot in an expense report sample because there's a $100.00 limit on something and people are padding their reports.

But the digit nine might also show up too often because the data warehouse handles missing data by filling codes that start with nines. Now we have an indication that we have a lot of missing data. We might want to look at this data and see if we need more detailed codes for kinds of missing data, if we need to improve data capture, or if we have to adjust the certainty of predictive reports to allow for this fact.

Joe Celko is an independent consultant in Austin, Texas and the author of Joe Celko's Trees and Hierarchies in SQL for Smarties (Morgan Kaufmann, 2004) and Joe Celko's SQL for Smarties: Advanced SQL Programming (Morgan Kaufmann, 1999).


  • EMAIL
  • PRINT
  • REPRINTS
  • Follow Us on Twitter
  • FOLLOW US
  • Share


 





New on the BLOG
5 Opportunities and 3 Threats for Oracle
02. 9.2010
blog author
Rajan Chandras
With the acquisition of Sun, Oracle now has a few things going for it, including something no other IT giant has -- not IBM, not Microsoft, and not SAP. And lurking also are a few challenges.

Read more from Rajan Chandras >>

Cindi Howson
Is Gartner's Quadrant the Problem, Or Is It How It's Used?
Bashing Gartner's Magic Quadrants seems to be a popular industry pastime, but in truth, I kind of like the quadrants. My biggest gripe is in how the quadrants are used, not necessarily the quadrants themselves...

02. 8.2010
Read more from Cindi Howson >>

Clarabridge Asks, Are You Customer Experienced?
02. 5.2010
blog author
Seth Grimes
Add "customer" to Jimi Hendrix' song title and you have a question central to last week's Clarabridge Customer Connections (C3) conference, Are You Customer Experienced?

Read more from Seth Grimes >>



Intelligent Enterprise Newsletters
Subscribe Here:
*Email:
 First Name:
 Last Name:
  Intelligent Enterprise Blogosphere Newsletter:
  Intelligent Enterprise Newsletter:

Email Type: