Five Principles of Intelligent Content Management

As the amount of unstructured text steadily grows, intelligent information retrieval techniques are all the more important for keeping decision-makers on track

By Dan Sullivan

Enterprise information portals and other content management applications have many purposes, but most of them share a common goal: to give large numbers of knowledge workers access to large amounts of unstructured text.
Content management applications have become increasingly integrated and sophisticated. Nevertheless, no single tool, algorithm, or technique can solve the problems posed by enormous volumes of information expressed in complex and ambiguous natural languages. The range of possible topics, our means of expressing them, and the difficulties in teasing out the structural relationships within natural language texts are so demanding that we are unlikely to find a "silver bullet" anytime soon.

In this article, I'll explain how "intelligent" content management techniques that explicitly represent and exploit implicit information and underlying relationships in unstructured text can help senior IT managers and business leaders give their users better control over the information retrieval process. We can classify these techniques into five design principles that serve the same objective: reduce the effort required to find relevant content such as marketing reports, customer communications, and news in ever-expanding repositories of text.

PRINCIPLE 1: MAKE METADATA KING

Metadata about content serves several functions. First and foremost, metadata describes the essential aspects of text, such as main topics, author, language, and publication and revision dates. This type of metadata is designed to improve the precision of full-text and keyword searching by letting users specify additional document attributes. It is also helpful for classifying and routing content, purging expired texts, and determining the need for additional processing, such as translation.

Managing content metadata involves two main challenges: extracting it from text, and then storing it. The extraction process needs to detect information ranging from the author's name to the dominant themes in the text, while the storage process must efficiently support a number of access and retrieval methods.
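To illustrate how descriptive metadata can sharpen a keyword search, here is a minimal Python sketch. The `Document` record, its field names, the `search` function, and the sample documents are all hypothetical; they stand in for whatever metadata schema and search engine a real portal would use.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Document:
    """Hypothetical record pairing a text with its descriptive metadata."""
    doc_id: str
    text: str
    author: str
    language: str
    topics: tuple
    published: date


def search(documents, keyword, *, language=None, author=None,
           topic=None, published_after=None):
    """Return IDs of documents containing the keyword, narrowed by
    whichever metadata attributes the caller specifies."""
    keyword = keyword.lower()
    hits = []
    for doc in documents:
        if keyword not in doc.text.lower():
            continue
        if language is not None and doc.language != language:
            continue
        if author is not None and doc.author != author:
            continue
        if topic is not None and topic not in doc.topics:
            continue
        if published_after is not None and doc.published <= published_after:
            continue
        hits.append(doc.doc_id)
    return hits


# Illustrative sample data, not taken from any real repository.
DOCS = [
    Document("r1", "Quarterly report on Latin America investment trends",
             "Ana Ruiz", "en", ("finance", "latin-america"), date(2001, 3, 1)),
    Document("r2", "Informe sobre inversion en America Latina",
             "Ana Ruiz", "es", ("finance",), date(2001, 4, 1)),
    Document("m1", "Draft memo: investment ideas for next year",
             "Lee Chan", "en", ("finance",), date(1999, 6, 15)),
]

all_hits = search(DOCS, "investment")
recent_hits = search(DOCS, "investment", published_after=date(2000, 1, 1))
spanish_hits = search(DOCS, "america", language="es")
```

The keyword alone matches both the report and the older memo. Adding a publication-date or language attribute trims the result to the documents the user actually wants, which is the precision gain described above.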
Manual extraction of metadata is effective on a limited scale, but in general, automated methods such as Solutions-United's MetaMarker, Klarity's eponymous metadata extraction tool, and Categorizer from Inxight Software Inc. (a Xerox spin-off) are more helpful. Automated methods will not produce the same quality of metadata as a human being would, but they are effective for most purposes.

Regardless of how metadata is generated, you can store it in two ways. First, you can manage it separately from documents in a relational database. This approach is typical in document warehousing, where tight integration with other database applications, such as data warehouses, is required. Alternatively, you can store the metadata directly within the document. In this case, the XML-based standard Resource Description Framework (RDF) is your best bet. RDF is not tied to a particular metadata standard, so you can apply it to a variety of requirements.

Metadata can also enable access control. You could use explicit attributes of a text resource, such as a copyright or license agreement, to manage the distribution of content. For example, corporate libraries that license electronic subscriptions to business and scientific journals might need to track the number of article downloads or restrict access to particular departments or a fixed number of concurrent users. Unlike content metadata, access-control metadata exists at several logical points, such as at the document, journal, magazine, or publisher level.

Quality control and document ranking are another (and often overlooked) use for metadata. Not all texts are created equal, and your information retrieval programs should reflect that fact. For example, you can safely assume an article from The Wall Street Journal on foreign investments in Latin America is accurate, but what about a piece from an obscure Web site dedicated to the same topic? Even internal documents vary in importance.
A final report to the COO should have more weight than draft memos circulating among analysts, yet the memos could easily rank higher in a search based simply on word frequencies.

Metadata is not limited to content description, access control, and quality control; you can also extend it to include automatically generated summaries and clustering data. Depending on the application, however, both summary and clustering information might be generated more effectively on an as-needed basis rather than explicitly stored.

PRINCIPLE 2: KNOW THE USER

Representing a user's long-term interests through a profile is yet another key to improving the precision and recall of information retrieval. Profiles, which are explicit representations of these interests, generally use the same representation schemes (such as keyword vectors) as the metadata describing the contents of documents. Tsvi Kuflik and Peretz Shoval, who research information filtering techniques at Ben-Gurion University in Israel, have identified six different kinds of profiles:
Each of these techniques has its benefits and drawbacks: some require users to update their profiles manually when interests change, while others adapt slowly to such changes. But whichever one you use, the generated profile provides a long-term resource for filtering, disambiguation, and document gathering.

PRINCIPLE 3: CONTROL ACCESS TO CONTENT

In contrast to the free-flowing information model of the Web, effective portals require controlled access to content because users are more likely to share information when they know it is distributed within the bounds of well-defined security business rules. In general, content can be grouped into three broad access control areas: open-access information, license-restricted information, and privileged information.
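One way to picture the three access control areas is as a simple check applied before a document is served. The Python sketch below is illustrative only: the `AccessLevel`, `Resource`, and `User` types are hypothetical, and the rule attached to each area (department-based licensing, a named list of authorized users) is an assumption chosen for the example, not a rule the article prescribes.

```python
from dataclasses import dataclass
from enum import Enum


class AccessLevel(Enum):
    OPEN = "open-access"
    LICENSED = "license-restricted"
    PRIVILEGED = "privileged"


@dataclass(frozen=True)
class Resource:
    """Hypothetical content item carrying access-control metadata."""
    name: str
    level: AccessLevel
    licensed_departments: frozenset = frozenset()
    authorized_users: frozenset = frozenset()


@dataclass(frozen=True)
class User:
    name: str
    department: str


def can_read(user, resource):
    """Apply an assumed rule for each of the three access areas."""
    if resource.level is AccessLevel.OPEN:
        return True  # open-access content is visible to everyone
    if resource.level is AccessLevel.LICENSED:
        # Assumption: the license covers specific departments.
        return user.department in resource.licensed_departments
    # Assumption: privileged content is limited to named users.
    return user.name in resource.authorized_users


# Illustrative sample data.
news = Resource("press release", AccessLevel.OPEN)
journal = Resource("science journal", AccessLevel.LICENSED,
                   licensed_departments=frozenset({"research"}))
memo = Resource("board memo", AccessLevel.PRIVILEGED,
                authorized_users=frozenset({"dana"}))

alice = User("alice", "research")
dana = User("dana", "finance")
```

With this data, `can_read(alice, journal)` is true because her department holds the license, while `can_read(alice, memo)` is false because she is not on the memo's authorized list. A production portal would likely add further license terms, such as download counts or concurrent-user limits.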
When content is adequately protected, you can then turn your attention to creating a navigable repository.




