Intelligent Enterprise

Better Insight for Business Decisions

May 31, 2003

It's All Subjective

Which, if any, is the right textual business intelligence solution for your company?

by Jeanette Burriesci

What, if anything, is a zebra? Natural historian Stephen Jay Gould poses this question in the title of a famous essay discussing how classification systems are anything but straightforward. In it, Gould writes, "The world is much more interesting than ideal."

Text is no exception. Categorizing text is the basis of extracting business intelligence (BI) from unstructured data, which is commonly estimated to make up about 80 percent of an enterprise's information assets.

Unlike predefined, granular data elements, freeform text doesn't fall into simple, easily delineated categories. Start at the simplest level, that of a single word. For example, what is "Java"? Morphologically, "Java" is the sequence of letters you see, and nothing more. But contextual clues might show that one instance of Java refers to an island, while another refers to a computer language. Etymologically, "java" comes between them: The slang term "java" for coffee comes from the island's name; Java the computer language was named for coffee. Capitalization rules convert "java" to "Java" when it's the first word in a sentence or when it appears in a title or business name.

Therefore, when it comes to search and retrieval — an important component of textual BI as well as knowledge management (KM) — keyword searches can get us only so far. We can improve precision on a search for Java the language by adding other keywords likely to appear in a programming context but not in reference to islands or coffee. But doing so might exclude documents that don't happen to include those precise words.

You can use a reverse strategy: block documents with the word "island" or "coffee." But then you inadvertently exclude documents that briefly discuss the origin of Java's name or the information systems phrase "island of information." (And still you will capture documents about the island or coffee that don't use those exact words.) Search and retrieval systems often present this trade-off between recall, or the ability to find appropriate files, and precision, which is the ability to exclude irrelevant ones.
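The trade-off is easier to see with numbers. What follows is a minimal sketch in Python, not a description of any product mentioned in this article: the seven-document corpus, its relevance labels, and the helper names `search` and `precision_recall` are all invented for illustration. Precision is computed as the share of returned documents that are relevant, and recall as the share of relevant documents that are returned.

```python
# Hypothetical toy corpus. The boolean marks whether a document is relevant
# to the information need "Java the programming language".
DOCS = [
    ("Java generics and the compiler", True),
    ("Tuning the JVM garbage collector for server code", True),  # never says "Java"
    ("Java is named for coffee, not for the island", True),
    ("Java services that bridge each island of information", True),
    ("Travel guide to the island of Java", False),
    ("Brewing a strong cup of java coffee", False),
    ("Hiking the volcanoes of Java in Indonesia", False),
]


def words(text):
    """Lowercase the text and split it into a set of words."""
    return set(text.lower().replace(",", " ").split())


def search(required=(), blocked=()):
    """Return indexes of documents with every required word and no blocked word."""
    hits = []
    for index, (text, _) in enumerate(DOCS):
        found = words(text)
        if all(t in found for t in required) and not any(t in found for t in blocked):
            hits.append(index)
    return hits


def precision_recall(hits):
    """Score a result list against the hand-assigned relevance labels."""
    relevant = {i for i, (_, is_relevant) in enumerate(DOCS) if is_relevant}
    true_positives = len(relevant & set(hits))
    precision = true_positives / len(hits) if hits else 0.0
    recall = true_positives / len(relevant)
    return precision, recall


# Plain keyword search: good recall, poor precision.
print(precision_recall(search(required=("java",))))             # (0.5, 0.75)
# Extra keyword: perfect precision, but most relevant documents are lost.
print(precision_recall(search(required=("java", "compiler"))))  # (1.0, 0.25)
# Blocking words: relevant documents are dropped and a travel piece still gets in.
print(precision_recall(search(required=("java",), blocked=("island", "coffee"))))  # (0.5, 0.25)
```

In this made-up corpus, neither refinement beats the plain search on both measures at once, and the JVM tuning document is never found because it doesn't contain the word "Java" at all.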

This Java search example is very simple, however, compared to many text searches that people attempt in reality. Many documents highly relevant to a certain concept may not overtly use the words we can think of to search for them. Just as bad, all the words we can think of may also appear in many documents that are completely unrelated to the subject we're researching.

Precision Plus Recall?

Of course, nobody wants recall and precision to be mutually exclusive. How can both be improved simultaneously? For a long time now, the dichotomy has been broken with another trade-off: labor expense. Companies that rely heavily on unstructured information to conduct business have often poured a lot of human resources into adding metadata to their documents. Many times they'll hire small armies of library scientists or other experts to create classification systems (taxonomies and ontologies) and tag documents accordingly.

Of course, most companies without such experts and classification systems just muddle through with keyword searches. When employees or executives at such a company assemble their hard-won knowledge in a document, one that may largely duplicate material already existing somewhere in the company, this new document will likely never be used again, even if it's precisely what others need. Even with human experts, who can typically tag only 20 to 100 documents an hour and must continually update their classification system, many items will fall through the cracks. This is a KM problem, but failure to solve it makes BI on the same unstructured data an impossibility. In fact, Mahendra Vora, CEO and founder of Web search enhancer Intelliseek Inc., tells me that BI and KM have been converging over time.
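To put the manual tagging rate in perspective, here is a back-of-the-envelope calculation in Python. Only the rate of 20 to 100 documents an hour comes from the figures above; the backlog of one million documents and the 2,000-hour working year are assumptions chosen purely for illustration.

```python
# Assumptions for illustration only (not figures from any company):
BACKLOG = 1_000_000          # hypothetical number of untagged documents
HOURS_PER_WORK_YEAR = 2_000  # assumed full-time working year


def tagging_hours(num_documents, docs_per_hour):
    """Person-hours needed to tag num_documents at a steady hourly rate."""
    return num_documents / docs_per_hour


fastest = tagging_hours(BACKLOG, 100)  # at 100 documents an hour
slowest = tagging_hours(BACKLOG, 20)   # at 20 documents an hour

print(fastest, slowest)                # 10000.0 50000.0
print(fastest / HOURS_PER_WORK_YEAR,   # 5.0 person-years at best
      slowest / HOURS_PER_WORK_YEAR)   # 25.0 person-years at worst
```

Under these assumptions, the backlog alone would take between 5 and 25 person-years to tag, before counting any time spent updating the classification system.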

From KM to BI

Whether or not a company uses human experts to augment a search engine, a search engine alone will not enable it to act quickly in response to trends, problems, or opportunities that incoming unstructured data might reveal. Unstructured data, in aggregate, can often contain important, actionable information long before the structured, financial data reflects it.

Island Data Corp.'s InsightRT applies unstructured-data analysis to email and other call center communications to yield just this sort of information. By automatically analyzing call center notes, comments on surveys, email messages, and the like for the writer's tone, general subjects brought up, and other fuzzy categories, InsightRT is supposed to be able to generate sales leads, improve marketing campaigns, lower the customer attrition rate, point to cross-sell and up-sell opportunities, improve product development, provide other customer insights, and generate automated responses to customer correspondence.

Island Data names its Concept Recognition Engine as its biggest differentiator. Generically speaking, it's an entity extractor. Whereas the best-selling textual BI and KM companies have to date based their products on machine-assisted taxonomies or ontologies (think of Applied Semantics Inc., Autonomy Corporation plc, IBM's Lotus software brand, Interwoven Inc., Open Text Corp., Stratify Inc., and Verity Inc.), a few companies are emerging that use entity extraction algorithms instead.

ClearForest Corp., Inxight Software Inc., Megaputer Intelligence Inc., Mohomine Inc. (recently acquired by Kofax), and Recommind Inc. also use proprietary entity extraction algorithms to automatically identify concepts (not just words) in the text base, and then relate documents by concept. In contrast, taxonomy-dependent systems require a large investment of human effort to train, tune, and maintain the system.







