Five Principles of Intelligent Content Management
Developing a clear technology strategy is vital to the success of your B2B projects and initiatives
By Dan Sullivan
Continued from Page 1
PRINCIPLE 4: SUPPORT RICH SEARCHING
Efforts to improve searching have led to a variety of techniques for representing documents, enhancing user queries, and finding correlations among terms. Nevertheless, most information retrieval systems still hit a wall around the 60 to 70 percent range of precision and recall because they depend primarily on statistical techniques instead of linguistic understanding (which is not yet feasible on a general scale). Although we could continue to squeeze marginal gains from keyword searching, a better approach is to combine three techniques: keyword searching, clustering, and visualization.
The most effective keyword search techniques expand the user's query. A thesaurus automatically adds synonyms to queries, so a search for "stocks" becomes a search for "stocks or equities." "Stemmers" are used to account for inflections and derivations, so searches for "African" will also find "Africa," and "banks" will check for "bank." Soundex and fuzzy matching are also useful for compensating for misspellings.
But even with relatively high precision and recall, keyword searches can yield a seemingly unmanageable number of hits. Clustering is an effective way to address this problem. Hierarchical clustering is the process of building a tree structure in which the root of the tree contains all documents, internal nodes contain groups of similar documents, and the size of the groups decreases as you move farther from the root until the leaf nodes contain only single documents. These clusters provide a familiar taxonomy-like structure that lets users navigate from broad collections of topics to more narrowly focused texts.
Another technique that has proven quite effective in reducing the time it takes users to find relevant content is the scatter/gather algorithm. With scatter/gather, a result set is clustered into a small, fixed number of groups. (Five seems to be a good size.) Users then select the most relevant of the groups, and the documents within those groups are clustered again into the same number of groups. The user can keep drilling down into the most relevant cluster, and each time the remaining documents are regrouped into semantically related clusters. The advantage of this approach is that users can dynamically direct the clustering process as they focus on the most relevant topics.
Clustering effectively groups documents based upon content, but sometimes users need to explore the areas around hyperlinked documents.
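The scatter/gather loop described in this section can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the algorithm as any product implements it: the function names `vectorize`, `cosine`, `scatter`, and `gather` are hypothetical, documents are represented as plain word counts, and a simple k-means-style procedure with deterministic seeding stands in for whatever clustering method a real system would use.

```python
import math
import re
from collections import Counter


def vectorize(text):
    """Bag-of-words term counts for one document (lowercased words only)."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(count * b[term] for term, count in a.items() if term in b)
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def scatter(docs, k=5, rounds=10):
    """Partition docs (a dict of id -> text) into at most k clusters of ids."""
    ids = sorted(docs)
    if not ids:
        return []
    vectors = {i: vectorize(docs[i]) for i in ids}
    # Assumed seeding rule: start from the first id, then keep adding the
    # document that is least similar to the seeds chosen so far.
    seeds = [ids[0]]
    while len(seeds) < min(k, len(ids)):
        rest = [i for i in ids if i not in seeds]
        seeds.append(min(
            rest,
            key=lambda i: max(cosine(vectors[i], vectors[s]) for s in seeds),
        ))
    centroids = [Counter(vectors[s]) for s in seeds]
    clusters = []
    for _ in range(rounds):
        # Assign every document to its most similar centroid.
        clusters = [[] for _ in centroids]
        for i in ids:
            best = max(range(len(centroids)),
                       key=lambda c: cosine(vectors[i], centroids[c]))
            clusters[best].append(i)
        # Recompute each centroid as summed term counts of its members
        # (cosine similarity ignores scale, so no averaging is needed).
        updated = []
        for c, members in enumerate(clusters):
            total = Counter()
            for i in members:
                total.update(vectors[i])
            updated.append(total if members else centroids[c])
        if updated == centroids:
            break
        centroids = updated
    return [members for members in clusters if members]


def gather(docs, clusters, chosen):
    """Merge the user's chosen clusters into a smaller document set."""
    return {i: docs[i] for c in chosen for i in clusters[c]}
```

A session alternates the two calls: `scatter` the result set, let the user pick one or more cluster indexes, `gather` those clusters into a new, smaller document set, and `scatter` again until the remaining documents are few enough to read.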
Consider a member of a geographically distributed sales team who looks into sales to a consumer electronics retail chain and finds several types of documents, ranging from meeting notes of other team members to news feeds from Comtex News, Factiva, or other business content aggregators. Coarse-grained navigation tools, such as Inxight's Tree Studio, display hyperbolic trees where each node in the tree represents a document labeled with a title or other descriptive text. Instead of clicking through to each linked page individually, a user can quickly assess a neighborhood of hyperlinked documents and focus in on topics of particular interest.
When users are looking for targeted information, a more fine-grained navigation tool is appropriate. For example, if an account manager is looking for information about sales of mobile phone service and needs to distinguish key marketing terms among a variety of plans, then a tool such as Megaputer Intelligence Inc.'s TextAnalyst allows users to quickly focus on particular terms and discover their relationship to other terms in the text.
Of course, searching, clustering, and navigation all presume a sufficiently large repository of relevant content. That fact leads us to the final principle.
PRINCIPLE 5: KEEP CONTENT TIMELY, AUTOMATICALLY
Some aspects of content management should be automated to keep pace with the available supply of potentially useful content. First of all, you can use harvesters, crawlers, and file retrieval programs to gather documents for inclusion in the content repository. These programs are themselves driven by metadata about which sites to search and which directories or document management systems to scan for relevant content. In many cases, only document metadata and indexing details need to be stored in the portal or document warehouse, and the documents themselves can be retrieved on an as-needed basis.
Automatically gathered documents may require file format or character set conversion before indexing, clustering, metadata tagging, and other text analysis tools can go to work. Automatically managing content is as difficult as, or more difficult than, the extraction, transformation, and load process in data warehousing because the structure, format, and range of topics are more varied. This process will require a series of filters, transformations, and analysis steps as text moves into the content repository.
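The series of filters and transformations can be pictured as a chain of small functions applied in order. The sketch below is only an illustration under stated assumptions: the names `decode`, `strip_markup`, `normalize`, and `ingest` are hypothetical, the candidate encodings are a guess at what a harvester might meet, and the tag-stripping step is far cruder than a real file format converter.

```python
import html
import re
import unicodedata


def decode(raw, encodings=("utf-8", "cp1252")):
    """Character set conversion: try each candidate encoding in turn."""
    for encoding in encodings:
        try:
            return raw.decode(encoding)
        except UnicodeDecodeError:
            continue
    # latin-1 maps every byte to a character, so this fallback cannot fail.
    return raw.decode("latin-1")


def strip_markup(text):
    """Rough format conversion: drop HTML tags, then unescape entities."""
    return html.unescape(re.sub(r"<[^>]+>", " ", text))


def normalize(text):
    """Normalize Unicode forms and collapse runs of whitespace."""
    return re.sub(r"\s+", " ", unicodedata.normalize("NFKC", text)).strip()


def ingest(raw, source, steps=(strip_markup, normalize)):
    """Decode raw bytes, run each step in order, and attach basic metadata."""
    text = decode(raw)
    for step in steps:
        text = step(text)
    return {"source": source, "text": text, "length": len(text)}
```

Keeping each step a separate function means further filters, such as language detection or metadata tagging, can be appended to `steps` without touching the others.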
Unlike data warehouses, which tend to keep historical data, portals should purge content that is no longer useful. Again, metadata about document types and sources will drive this process. For example, analysts' predictions about earnings reports become irrelevant when an actual earnings report is issued (unless, of course, you want to track the accuracy of the past predictions). In other cases, we might want to keep only the summary of a text, such as a product recall notice or a competitor's press release more than two years old. Tracking when documents arrived, where they came from, who created them, and other attributes will provide the grist for a number of content management processes.
REMEMBER THE FIVE
Free-form text is often called unstructured, but that term is a misnomer. Language's rich structure succinctly represents complex concepts and relationships, but accessing that information effectively requires techniques that account for that structure and let users bridge the gap from their interests to information retrieval. Your organization can realize intelligent content management by adhering to these five basic principles. All are based on the realization that users need to find small amounts of targeted information in sprawling repositories of enormous scale, such as the aptly named World Wide Web. As long as we use language, we will always use words with multiple meanings and express concepts with a variety of words, and we will constantly confront less-than-perfect precision and recall. Making the implicit explicit through metadata; modeling user interests; protecting access to content; supporting search, organization, and navigation tools; and keeping content up to date all chip away at the inherent structural problems of dealing with large volumes of unstructured texts.
Dan Sullivan [DSullivan@RedmontCorp.com] is CTO of Redmont Technologies, a consulting firm specializing in business intelligence and content management systems.
SEARCH ENGINES: THEY ARE NOT ALL THE SAME
Although search engines on their own are insufficient for reaching high levels of precision and recall in information retrieval systems, they are a solid starting point. Some search engines work from the most basic principle - indexing words that appear in documents - while others analyze patterns without regard to language specifics, or exploit syntactic and semantic knowledge of language to identify concepts represented in texts.
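The most basic principle mentioned above, indexing the words that appear in documents, amounts to building an inverted index. The following is a minimal sketch rather than a description of any vendor's engine: the names `tokenize`, `build_index`, and `search` are hypothetical, and the optional `thesaurus` argument is an assumed, simplified form of the synonym expansion discussed under Principle 4.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Split text into lowercase words."""
    return re.findall(r"[a-z]+", text.lower())


def build_index(docs):
    """Map each word to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in tokenize(text):
            index[word].add(doc_id)
    return index


def search(index, query, thesaurus=None):
    """Return ids of documents that match every query term.

    A term matches a document if the word itself, or any synonym listed
    for it in the thesaurus, appears in that document.
    """
    thesaurus = thesaurus or {}
    result = None
    for word in tokenize(query):
        variants = {word, *thesaurus.get(word, ())}
        hits = set()
        for variant in variants:
            hits |= index.get(variant, set())
        result = hits if result is None else result & hits
    return result or set()
```

With a thesaurus entry mapping "stocks" to "equities", a search for "stocks" also returns documents that mention only "equities", which is the query expansion the article describes.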
RESOURCES
Sullivan, Dan. Document Warehousing and Text Mining (Wiley, 2001). Available at the IntelligentEnterprise.com bookstore.
Autonomy: www.autonomy.com
Inxight: www.inxight.com
Klarity: www.klarity.com.au
Megaputer Intelligence: www.megaputer.com
Oracle: www.oracle.com
Semio: www.semio.com
Solutions-United: www.solutions-united.com
Verity: www.verity.com
Related Articles on IntelligentKM.com:
"Extracting Knowledge," May 7, 2001: www.intelligentkm.com/feature/010507/feat1.jhtml
"Word Wranglers," Jan. 1, 2001: www.intelligentkm.com/feature/010101/feat1.jhtml