Analysis of different types of indexing and retrieval (Kirill Fesenko, 2006)

Introduction

Evolving formats and publishing technologies have supported the growth of knowledge and an ever-increasing output of information throughout human history. As one writer pointed out, several hundred years ago it was possible to meet a capable person who could claim to have read all the material stored in the largest library of her time. The common complaint of our contemporaries that they cannot follow even the major publications in their own narrow disciplines indicates the magnitude of the change that has taken place. This problem is often referred to as information overload.

Review of the literature relating to human indexing

One way in which humans help each other in this situation is by indexing. Various definitions of indexing boil down to the core idea of providing readers with useful access points to information through which it can be located and retrieved. Despite the seeming simplicity of this concept, a student of indexing soon discovers a mature science and art with its own developed terminology and methods. Bill Kasdorf notes three different uses of the word “indexing”, a distinction which may serve well as a starting point for an overview of indexing:

(1) Creation of “brilliant and sophisticated” back-of-the-book indexes (BOTBI), a process which is also referred to as closed-system indexing. An indexer identifies the topics, concepts and relationships between them that she finds in a book and creates an index. This index is specific to the book, and it includes terms that may or may not occur there. Pat Booth suggests that this index should fit harmoniously with the book and be concise, comprehensive, user-friendly and made with the intended readership in mind. In order to produce such an index, a human indexer should examine the internal components of the book, consider its conceptual content and meaningful matter, and make a mental map and a summary of the content and structure as a whole.

(2) Assigning subject classification and terms to documents in a collection (indexing of newspaper and journal articles, for example). This type of indexing is called open-system indexing. Its purpose is not so much to represent an individual article’s concepts and their interconnections thoroughly, as is the case with BOTBI, as to direct readers to whole documents in a collection which are about certain topics. Open-system indexes, Bill Kasdorf notes, benefit from a consistent, and often controlled, vocabulary. These indexes are typically flat lists of keywords which may be assigned by librarians or catalogers who may not think of themselves as indexers. The indexing process for articles is also “lighter” compared to books. Frank Kellerman’s advice for article indexers is the following: there is no need to read the entire article; look at the title to grasp the author’s main points; read the introductory material; scan the methods, results, and conclusions sections to detect the substantive points of the article; make the index headings as specific as is warranted by the article; and read the abstract last to double check that no major points were missed.

(3) Computer-generated list of words from the document(s) with records of their location. This and other types of automated indexing will be addressed later in this paper.
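The third use can be illustrated with a minimal sketch: a list of words with records of where each one occurs. The function name, the sample documents and the choice to record positions as (document, word number) pairs are assumptions made for this illustration and are not taken from Kasdorf's article.

```python
import re
from collections import defaultdict


def build_word_index(documents):
    """Map each word to the (document id, word position) pairs where it occurs."""
    index = defaultdict(list)
    for doc_id, text in documents.items():
        # Lower-case the text and treat runs of letters or digits as words.
        words = re.findall(r"[a-z0-9]+", text.lower())
        for position, word in enumerate(words):
            index[word].append((doc_id, position))
    return dict(index)


# Hypothetical two-document collection.
documents = {
    "doc1": "Indexing provides access points to information.",
    "doc2": "Automatic indexing of images is still in its infancy.",
}
word_index = build_word_index(documents)
print(word_index["indexing"])
```

Such a list records every occurrence of every word without any judgement about which words matter, which is why it differs so sharply from the first two uses.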

This brief overview of human indexing would not be complete without mentioning that indexing is not limited to articles and books. Just about anything that humans direct their attention to is indexable matter, including images and sounds. Although human indexing techniques differ for these other formats, the goal remains the same: the provision of useful access points through which the items may be located and retrieved. It should also be noted, as we get closer to the automatic indexing part of the paper, that it is much more difficult to train computers to index images and sounds. Automatic indexing of images and sounds, James Anderson and Jose Perez-Carballo point out, is still in its infancy compared to fifty years of work on automatic indexing of language texts.

Automatic indexing review

Automatic (machine) indexing is very different from the human indexing techniques briefly outlined above. James Anderson and Jose Perez-Carballo define machine indexing as the analysis of text by means of computer algorithms:

(a) machine recognition and definition of a word;

(b) extraction from text and compilation of lists of words and phrases using stop lists and stemming;

(c) counting words in a document and collections to measure what’s important in them and grouping them together on that basis;

(d) latent semantic indexing (LSI). This technique is one of the most recent attempts to replicate human ability to see underlying ideas behind the surface of words by grouping documents together based on the co-occurrence of terms in them.
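Steps (a) to (c) can be sketched in a few lines. This is an illustration under stated assumptions, not the procedure Anderson and Perez-Carballo describe: the stop list is a tiny sample, the stemmer is a deliberately crude suffix stripper standing in for a real one, and the weighting uses TF-IDF, which is only one common way of measuring what is important in a document relative to a collection. Step (d) is not covered here.

```python
import math
import re
from collections import Counter

# Illustrative stop list; real stop lists are much longer.
STOP_WORDS = {"a", "an", "and", "are", "by", "in", "is", "of", "the", "to"}


def stem(word):
    """Crude suffix stripping; a stand-in for a real stemming algorithm."""
    for suffix in ("ing", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def extract_terms(text):
    """(a) recognise words, then (b) drop stop words and stem the rest."""
    words = re.findall(r"[a-z]+", text.lower())
    return [stem(w) for w in words if w not in STOP_WORDS]


def weight_terms(documents):
    """(c) count terms per document and weight them against the collection."""
    term_counts = {doc_id: Counter(extract_terms(text))
                   for doc_id, text in documents.items()}
    doc_freq = Counter()
    for counts in term_counts.values():
        doc_freq.update(counts.keys())  # number of documents containing each term
    n_docs = len(documents)
    return {
        doc_id: {term: count * math.log(n_docs / doc_freq[term])
                 for term, count in counts.items()}
        for doc_id, counts in term_counts.items()
    }


# Hypothetical three-document collection.
documents = {
    "news": "The news archive stores news articles.",
    "books": "Indexing books and indexing articles.",
    "images": "Indexing of images.",
}
weights = weight_terms(documents)
for doc_id, doc_weights in weights.items():
    print(doc_id, sorted(doc_weights, key=doc_weights.get, reverse=True))
```

Note that the crude stemmer turns "news" into "new", a small example of the kind of error that makes human review of machine-extracted terms necessary.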

This is by no means a complete list of what computers can do with texts automatically, or semi-automatically with human participation. Tom Reamy gives other examples of computer use for indexing: vector machines that represent every word and its frequency with a vector; neural networks; linguistic interfaces; the use of pre-existing sets of categories; and techniques that seed categories with keywords.
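Two of these ideas, representing word frequencies as a vector and seeding categories with keywords, can be combined in a loose sketch. The category names, seed keywords and the use of cosine similarity to compare vectors are assumptions chosen for illustration; Reamy's article does not prescribe this particular method.

```python
import math
import re
from collections import Counter


def to_vector(text):
    """Represent a text as a sparse vector of word frequencies."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse frequency vectors."""
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def categorize(text, category_seeds):
    """Return the best-matching category and the score of every category."""
    doc_vector = to_vector(text)
    scores = {name: cosine_similarity(doc_vector, to_vector(seed))
              for name, seed in category_seeds.items()}
    return max(scores, key=scores.get), scores


# Hypothetical categories, each seeded with a few keywords.
category_seeds = {
    "sports": "match team score league",
    "finance": "bank market shares profit",
}
best, scores = categorize("The bank reported profit as market shares rose",
                          category_seeds)
print(best, scores)
```

When no seed keyword occurs in a document, every score is zero and the choice of category is arbitrary, so a practical system would need a threshold below which the document is left uncategorized.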

Despite vigorous research and development efforts in the area of automated indexing, these authors point out that its quality remains much lower than that of human indexing. Tom Reamy notes that automatic indexing, in most cases, produces a shallow list of terms and that one out of ten documents is indexed wrongly. Comparing the indexing results produced by various software packages is also a challenge.

Still, information retrieval companies are increasingly using automated indexing, which they see as an opportunity to add value to databases and become more competitive. Automatic extraction of terms and categorization allow the addition of browse displays to databases and improvements to full text searching. Automatic indexing also saves manual indexing labor, and the software has generally become much cheaper to develop. Tom Reamy notes that among the most active users of automated indexing are news and content provider companies and intelligence agencies.

Ongoing investigations/implementations in automatic indexing

For a novice researcher, searching the internet and licensed databases for information on ongoing research and implementation in automated indexing does not produce results that show the field to be flourishing. There are a few experimental automated indexing projects for images and electronic documents, which are in the midst of various research cycles and have unclear operational prospects. These search results also leave an impression which supports Tom Reamy’s point that automated indexing is most widely used by news companies and content aggregators, businesses and government agencies for shorter documents and records management.

Bernard Chester gives several examples of successful implementations of automatic categorization tools: Estates Gazette Interactive uses GammaWare to sort articles automatically into three different taxonomies and place them in the content store; CNN segments broadcasts and assigns metadata using Virage VideoLogger; the U.S. government uses Hummingbird SearchServer to separate electronic documents into private communication and public records, categorizing the latter for routing and for the assignment of retention schedules.

LEXIS-NEXIS, one of the earlier implementers of computer-aided indexing, now uses SmartIndexing technology in its news products to index thousands of documents daily for various topics and proper names. During the indexing process, SmartIndexing software analyses documents and labels them with standardized terms which reflect their main topics. High accuracy of the assigned terms is assured by human subject matter specialists who take part in writing, testing and maintaining the indexing algorithms. In this approach, Carol Tenopir notes, human experts take part only in the initial stage, when term profiles are developed. A term profile is a collection of words which collectively describe a controlled subject. SmartIndexing analyses documents to detect sets of words which match controlled subject profiles and assigns these subjects to the document with an indication of their relevance. This allows researchers to limit search queries to documents in which a given subject appears as a major, strong or weak topic. Term relevance is based on criteria such as location in the document, frequency and weight. Researchers can also combine controlled subject terms to limit the search results.
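The general idea of matching documents against term profiles and grading the match can be sketched as follows. This is not the SmartIndexing algorithm, whose details are not described in the sources; the profiles, word weights, headline boost and score thresholds are all hypothetical values chosen to show how location, frequency and weight might combine into a major, strong or weak label.

```python
import re

# Hypothetical term profiles: each controlled subject is described by
# weighted words. Real profiles would be written by subject specialists.
TERM_PROFILES = {
    "Mergers & Acquisitions": {"merger": 3.0, "acquisition": 3.0,
                               "takeover": 2.0, "bid": 1.0},
    "Labor Unions": {"union": 3.0, "strike": 2.0, "wages": 1.0},
}
HEADLINE_BOOST = 2.0  # assumption: a word in the headline counts double


def score_subject(headline, body, profile):
    """Sum profile weights over every occurrence, boosting headline words."""
    score = 0.0
    for word in re.findall(r"[a-z]+", headline.lower()):
        score += profile.get(word, 0.0) * HEADLINE_BOOST
    for word in re.findall(r"[a-z]+", body.lower()):
        score += profile.get(word, 0.0)
    return score


def relevance_label(score):
    """Translate a score into a relevance level; thresholds are illustrative."""
    if score >= 10:
        return "major"
    if score >= 5:
        return "strong"
    if score > 0:
        return "weak"
    return None


def assign_subjects(headline, body):
    """Return the controlled subjects matched by a document, with relevance."""
    assigned = {}
    for subject, profile in TERM_PROFILES.items():
        label = relevance_label(score_subject(headline, body, profile))
        if label is not None:
            assigned[subject] = label
    return assigned


subjects = assign_subjects(
    "Takeover bid rejected",
    "The board rejected the merger and the acquisition plan after a union objected.",
)
print(subjects)
```

A search limited to documents where a subject is a "major" topic would then keep this document for Mergers & Acquisitions and skip it for Labor Unions.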

The company’s web site states that LEXIS-NEXIS currently maintains index terms for 330,000 company names, 20,000 personal names, 10,000 organizational names, 950 geographic locations and thousands of subject and industry terms covering business, industry and news. New terms are introduced weekly based on the needs of a particular taxonomy and on customer feedback. New terms and term name changes are published on the company’s web site.

Another example is the use by the U.S. Department of Education (DoEd) of artificial neural network technology to analyze and categorize 4 gigabytes of e-mails and half a gigabyte of word processing documents in a demonstration project with STG Inc., a company based in Fairfax, Virginia. This test set of documents belonged to one collection produced by individuals who left DoEd at the end of the Clinton Administration. Hummingbird’s Knowledge Manager Workstation artificial neural network software was used to analyze the frequency and placement of words and concepts within this collection of documents and to place them in a multidimensional grid. By changing the grid’s various parameters, a human operator can control the level of inclusiveness of certain ideas or concepts within specific groups of documents. This manual adjustment of parameters helped increase the accuracy of categorization. It was also noticed that accuracy could be further increased by narrowing the subject scope of the analyzed documents: the software showed better results when applied to groups of documents with limited subject coverage (for example, documents produced by a specific task group). By focusing the software on these individual collections of documents, the team produced groups of words and concepts which formed certain clusters of knowledge. The individual groups of documents were further divided into smaller groups until the desired level of categorization was achieved. This process of examination and adjustment of the clusters was referred to as “training” of the artificial neural network.

Yet, with all the fine-tuning of the software, it did not successfully categorize all documents. Short e-mail messages, jokes and cartoon messages presented a particular problem for categorization. This experiment also highlighted the fact that the higher an individual’s position in the organizational hierarchy, the more likely it is that a larger portion of that individual’s material has enduring value. That was true for both electronic and paper documents. However, there appeared to be a higher percentage of material of transitory value in electronic format than on paper. Scheduling a meeting, for example, may generate a number of messages in which participants discuss a time convenient for all.

Donald Schewe, a consultant for STG Inc., points in his article to certain advantages of automated indexing for DoEd if the department implements the system: (1) improved access to documents through the indexing of phrases and concepts; (2) records managers would spend more time managing information rather than coaching staff about how to file; (3) increased importance of records managers, who would play the role of designers of the indexing system and supervisors of its implementation.

Human vs. machine indexing

The literature and automated indexing practice indicate that current automated indexing is mostly limited to computer processing of relatively short records, documents and articles. This software can count words, measure their weight and extract them from the documents and group the documents together. However, even this basic computer work is not free from errors and it requires extensive human participation for a meaningful result. Automated indexing of more complex documents like books or non-textual formats with quality comparable to human indexing is still problematic after decades of research.

Bill Kasdorf emphasizes in this regard that only “rudimentary” forms of indexing can be automated because computers cannot understand texts in the way humans can. It takes a sense of proportion and an understanding of subtle nuances that only an educated and intelligent human being possesses. James Anderson and Jose Perez-Carballo also emphasize that humans, compared to computers, make documents accessible to other humans by identifying themes, relationships, slants, points of view, values, purposes, research methods and other aspects of texts that automated indexing is incapable of capturing.

We can also add that human indexing deals with depths of interpretation which are impenetrable for computers. Let’s take the indexing of images, for example, which may be compared to the indexing of texts when an image is thought of as a communication message. Art historian Erwin Panofsky, as Paula Berinstein notes in her article, advises indexers to pay attention to the various levels of meaning during subject description: the first level is the picture’s literal description (the things that appear in it, a bird and a plane, for example) and the second is the contextual or symbolic level (not just a bird and a plane but also Superman). This second, specific and sometimes symbolic level depends on the viewer’s familiarity with the icons and conventions of a given culture. Here is another piece of advice, on the treatment of detail, from the same article: (1) do not name that which is an integral part of a larger whole, but only the whole, and (2) name a detail if it represents a meaningful whole in the picture or if it is so rare and unusual as to be meaningful. And we also know that a human indexer needs to possess certain empathic abilities to index for a specific readership, while machines are incapable of compassion.

What machines are capable of is algorithmic analysis of short texts for the extraction of terms and phrases and the basic categorization of documents. Continued research in this area should eventually lead to improvements in automatic indexing. These would indeed be timely innovations, given the growing information overload, which expensive human indexing alone cannot control. Full text searching, together with automatic indexing and mature browse displays, will be very helpful to future readers. It is also safe to assume that human indexers will have to learn the new skills needed to manage artificial intelligence software, neural network technologies and other future tools which will increase their productivity. As current examples of practical implementation of automatic indexing demonstrate, forthcoming indexing systems will require human indexers’ analysis and oversight. Still, valuable lengthier texts will have to be indexed by humans equipped with more powerful software tools so that future readers can also enjoy “brilliant and sophisticated” indexes to books, collections of images and important documents.

During these times of rapid technological innovation, the American Society of Indexers should actively participate in automatic indexing research projects and closely monitor and evaluate developments in this area. It may also serve as a source of initiatives, standards and requirements for automatic indexing software and processes, and of methods for evaluating and comparing automatic and semi-automatic indexing technologies. Finally, the society should actively promote the values, science and art of indexing to the LIS community, to information professionals in other fields and to readers.

It is certainly hard to foresee the future, but an old piece of advice from Francis Bacon may stand the test of time for human indexers: “And lastly, that the Novelty, though it be not rejected, yet be held for a Suspect: And as the Scripture saith; That we make a stand upon the Ancient Way, and then looke about us, and discover, what is the straight, and right way, and so to walke in it.”

Literature used

Anderson, James D. and Perez-Carballo, Jose. "The Nature of Indexing: How Humans and Machines Analyze Messages and Texts for Retrieval. Part II: Machine Indexing, and the Allocation of Human Versus Machine Effort." Information Processing and Management 37:255-277, 2001.

Bacon, Francis. “Of Innovations.” In: The Essays of Francis Bacon. New York, Crowell, 1901, p. 100.

Bates, Marcia J. "Indexing and Access for Digital Libraries and the Internet: Human, Database, and Domain Factors." Journal of the American Society for Information Science 49(13):1185-1205, 1998.

Berinstein, Paula. "Do You See What I See? Image Indexing Principles for the Rest of Us." Online 23(2):85-86,88, March/April, 1999.

Booth, Pat F. Chapter 3, "What (and Whether) to Index." In: Indexing: The Manual of Good Practice. Munich, K G Saur, 2001. pp.49-66.

Chester, Bernard. "Auto-Categorization and Records Management." AIIM E-Doc Magazine 18(2):16-18, March/April, 2004.

Kasdorf, Bill. "Indexers and XML: An Overview of the Opportunities." The Indexer 24(2):75-78, October, 2004.

Kellerman, Frank R. Chapter 2, "Indexing, Index Medicus, and Other Traditional Abstracting and Indexing Services." In: Introduction to Health Sciences Librarianship. Westport, CT, Greenwood Press, 1997. pp.21-48.

Lexis-Nexis. Company flyer on SmartIndexing technology. Available at <www.lexis-nexis.com/custserv/pdfs/business/Nx572_Indexing_Flier.pdf> (accessed on May 1, 2006)

Lexis-Nexis. SmartIndexing web site: <http://www.lexisnexis.com/infopro/smartindexing> (accessed on May 1, 2006)

Reamy, Tom. "Auto-Categorization." EContent 25(11):16-18, 20-22, November, 2002.

Schewe, Donald B. “Classifying Electronic Documents: A New Paradigm.” Information Management Journal 36(2): 54, 56-59, March/April 2002.

Tenopir, Carol. "Human or Automated, Indexing is Important." Library Journal 124(18):34,38, November 1, 1999.

Examples:

Back-of-the-book index

Algorithmic analysis, 5. See also Automatic indexing, Computer(s)

American Society of Indexers

role of, 5

Ancient Way, 6

Anderson, James, 2, 5

Articles. See also Books, Images

indexing of, 1-2

Artificial intelligence software, 5. See also Software

Artificial neural network

training of, 4

Automatic categorization. See also Automatic indexing, GammaWare, Software, Virage VideoLogger, Hummingbird SearchServer

accuracy of, 4

problems of, 4

tools, 3

Automatic indexing, 5. See also Automatic categorization, Computers, Indexing, Human indexing, Latent semantic indexing, Linguistic interfaces, Neural networks, SmartIndexing technology, Vector machines

advantages of, 4

benefits of for information retrieval companies, 3

definition of, 2

of images and sound, 2

implementation of, 3

quality of, 2

of language texts, 2

software, 2

compared to human indexing, 4-5

Back-of-the-book indexes, 1, 5. See also Books

Bacon, Francis, 6

Berinstein, Paula, 5

Books. See also Articles, Images

indexing of, 1

Booth, Pat, 1

BOTBI. See Back-of-the-book indexes

Browse displays, 3, 5

Journal article index

American Society of Indexers

Automatic indexing

Computer-aided indexing

Human indexing

Humans

Indexing

Information retrieval

Innovation

Interpretation

Machine indexing

Understanding

United States