Published in:  J. STROBL and C. BEST (Eds.), 1998: Proceedings of the Earth Observation & Geo-Spatial Web and Internet Workshop '98 = Salzburger Geographische Materialien, Volume 27. Instituts für Geographie der Universität Salzburg. ISBN: 3-85283-014-1


Digital Library Approaches to Resource Discovery in Earth and Space Science

Howard Burrows and Ramachandran Suresh

NASA Digital Library Program, Raytheon STX
7701 Greenbelt Road #423, Greenbelt, MD, USA
Hburrows@stx.com

Contents:

  1. Abstract
  2. Introduction
  3. Traditional Libraries
  4. Roget's Thesaurus
  5. Encyclopedia Britannica
  6. Conclusions
  7. References

Keywords:

metadata, resource discovery, knowledge structures, digital library


1. Abstract

Navigation, browsing, and effective search depend on an underlying organization of information. "Metadata" is produced and made prominent according to expectations determined by this underlying information infrastructure. Selection decisions regarding access, persistence in archives, and expected utility and audience for datasets and documents are established using this organized underlying intellectual construct. For centuries traditional librarians have dealt with these same issues in stocking shelves with a selection of books and journals that match the expected needs of their local patrons. Over the years, to facilitate this process, librarians have formalized a catalog that captures one underlying organization. Similarly, encyclopedists have had to work with an intellectual organization of knowledge in order to select the content of their books and create suitable indexing for the expected readers. In this presentation we will explore the lessons learned in library cataloging and encyclopedia organization and indexing. Then we will explore how these lessons might be extrapolated to take advantage of the new digital library technologies.

We will suggest that with new technologies there is an opportunity to shift to a new paradigm in which the information infrastructure is based on the questions that have been asked. A dynamic infrastructure based on questions would support the diversity of human interests and understandings at any given time. If the delivery system for information has a mandate to provide suggested answers and explanations, an infrastructure based on an organization of questions provides the necessary scaffold to reveal the appropriate raw datasets and documents. The revealed metadata, and decisions about content and archiving, would be derived from this underlying infrastructure rather than from the current content of the repository.

We will explore three different approaches to such an infrastructure that have been used successfully in the past. Since the Internet and the World Wide Web support multiple relationships, a combination of these approaches might lead to a better tool for orienting and advertising content to potential users. The scientific questions that led to the acquisition of a dataset begin the formal infrastructure. As the data becomes useful in other conceptual domains, the original infrastructure is elaborated. This infrastructure becomes the ontology and forms the raw material for the construction of new questions and new data collection.



2. Introduction

Large volumes of data will be generated from Earth observing satellites in the next five years. Scientists from a variety of backgrounds and disciplines must transform these data into useful information. Conventional ways of describing and searching for critical data within this large space of raw numbers will not be sufficient. Current approaches primarily focus on search for metadata or data descriptions. This approach helps to locate the data and provide some high level information about data sets. A new knowledge-based approach is required to search for information from large volumes of data. Such a knowledge-based search requires revisiting fundamental issues of science:

What is the pattern of assumptions that scientists work from when they set out to make discoveries? How do they evaluate proposals and rank them?

In order to have the notion of progress and growth in a field, there must be an underlying intellectual infrastructure that defines the direction forward and constructs the notion of size and expansion. In this article we look at the way that such an infrastructure is revealed in the works of three scholars: a librarian, a scientist/linguist, and an encyclopedist. We wish to begin a dialogue to discover how each of these efforts from other domains might be extended to fit the needs of Earth and space science. How can we link observations to their potential implications for scientific conjecture?

Resource discovery entails an assessment of the value of various pieces of scientific information. This assessment depends on the underlying structure of beliefs within the field. What data and what types of data support or deny these beliefs? Only by capturing this infrastructure will we be able to predict and advertise potential implications of our datasets.



3. Traditional Libraries

3.1 Context and Purpose

Librarians have been the custodians of our literature spanning the universe of knowledge. Shiyali Ramamrita Ranganathan started his career as a mathematician. In the 1920s he turned his attention to organizing libraries and devoted the next fifty years to the task of developing a "System of Knowledge" rich enough to support not just our documents, but even our nascent "spot thought." Ranganathan was active in international library organizations, and his influence is reflected in cataloguing systems throughout the world. The system that he developed, called "Colon Classification," is still in wide use in India and may be better suited to the new technologies than are the better-known Library of Congress Classification, Dewey Decimal Classification, or Universal Decimal Classification.

Ranganathan's first book, and perhaps the most important, was "Five Laws of Library Science," published in 1931. In this book he set out the purpose of classifying. The laws are very simple but, with the word "book" replaced by the word "data," they serve as a good set of laws for those maintaining datasets. The laws are:

  1. Books are for use.
  2. Every reader his book.
  3. Every book its reader.
  4. Save the time of the reader.
  5. The library is a growing organism.

To serve these requirements Ranganathan approached the problem of library classification with incredible diligence. The three editions of his book, "Prolegomena to Library Classification," trace the progress of his systematic efforts to understand the issues in developing a system of knowledge suitable for the patrons of his libraries. He built on a deep understanding of his native Vedic System. In one publication he compares his task to the "Avatar Khurma," the great turtle that first brought substance on its mighty shell out of the muddy depths at the beginning of creation.

The background for Ranganathan's library science spread well beyond India's intellectual and spiritual history. In the 1957 edition of the Prolegomena a large section is devoted to the "pre-history" of library classification systems. Here he cites Richardson's 1901 book "Classification: Theoretical and Practical" in which 116 systems of knowledge are enumerated. He discusses the systems of Aristotle, Plato, Francis Bacon, Kant, Hegel, Comte, and others. In the rest of the book, he takes this background and applies a mathematician's attention to notation and formal exposition to clarify and extend the science of classification.

At one point he was brought to the United States by the Rockefeller Foundation with the thought that he might be able to "lay a foundation of library classification as a language…, so as to make it serve, if possible, as an international language of communication free from the fussiness usually caused in a natural language by drifting folk…"

3.2 Organizational Structure

There are many ways to sort books on a shelf. The First Law directed the classification effort to provide a topical arrangement since it was found that most readers seek books by subject matter. "Books are for use as embodied thought, not as physical commodities…" Ranganathan noted that few readers are able to "name exactly the specific subjects of their interest at the moment; they usually think of a broader or a narrower one." The reader should find in the library items that "he was only vaguely conscious of wanting…" To do this at least one arrangement of items should be according to "the degree of their mutual relation or affinity."

These dictates of Ranganathan's fundamental laws required deeper analysis of the way readers think about elements in the "universe of knowledge." Ranganathan provided that analysis, together with a cataloging notation and strategy that may have value in the Earth and space communities as they seek ways to link scientific hypotheses to the data elements that could validate or reject them.

Ranganathan's books detail strategies for forming subject headings and sequencing the headings and subheadings so that they attract readers. He introduced the notion of "facets", which allow the classifier to represent significant aspects of a document outside the body of the hierarchical classification structure.
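
The idea of facets can be illustrated with a minimal sketch. It assumes a hypothetical catalog in which each record holds one position in a subject hierarchy plus a set of free-standing facets. The facet names (phenomenon, region, period), the dataset identifiers, and the helper find_by_facets are invented for illustration and are not taken from Ranganathan's scheme or from any actual catalog.

```python
from dataclasses import dataclass, field


@dataclass
class Record:
    """A catalog entry: one place in the hierarchy plus free-standing facets."""
    identifier: str
    subject_path: tuple  # position in the hierarchical classification
    facets: dict = field(default_factory=dict)  # aspects outside the hierarchy


def find_by_facets(records, **wanted):
    """Return identifiers of records whose facets match every requested value."""
    return [
        r.identifier
        for r in records
        if all(r.facets.get(name) == value for name, value in wanted.items())
    ]


# Hypothetical records; identifiers and facet values are illustrative only.
CATALOG = [
    Record("ds-001", ("Earth science", "Oceans"),
           {"phenomenon": "sea surface temperature",
            "region": "Pacific", "period": "1997"}),
    Record("ds-002", ("Earth science", "Atmosphere"),
           {"phenomenon": "precipitation",
            "region": "Pacific", "period": "1997"}),
    Record("ds-003", ("Earth science", "Oceans"),
           {"phenomenon": "sea surface temperature",
            "region": "Atlantic", "period": "1997"}),
]

# Records from different branches of the hierarchy are found together
# because the query cuts across the hierarchy by facet.
pacific_1997 = find_by_facets(CATALOG, region="Pacific", period="1997")
```

In this sketch the two Pacific records sit under different subject headings, yet a single facet query retrieves both. That is the sense in which facets represent aspects of an item outside the body of the hierarchy.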

3.3 Potential Extensions in Earth and Space Science

These considerations make clear that we are not discussing a passive archive of numbers. We are fast closing the gap from the granularity of long rigid books, through mixed bound journals, to the more dynamic fine-grained space of natural language. We can now think to include raw datasets and their metadata in this fluid medium, this new language for expressing our ideas and scientific beliefs.



4. Roget's Thesaurus

4.1 Scientific Context and Purpose

In 1848 Peter Mark Roget retired after more than 20 years as Secretary of the Royal Society of London. At about age 70 he began work on what would be published in 1852 as the well-known Roget's Thesaurus of English Words and Phrases. Why would a scientist be the one to see the power and develop the structure for such a book?

Roget was a public health physician who spent most of his career cleaning streets and sewers to stop the spread of epidemic diseases. Nevertheless, his interests ranged widely over many other areas of science. He was admitted to the Royal Society for a paper in which he described a mechanical device for computing that later became the log-log slide rule. In another example, he noted while shaving one morning that spokes on a wagon wheel appear bent when viewed through the slits in a picket fence. This led to a publication on visual memory that was instrumental in the development of the motion picture industry.

With such broad interests, it is not surprising that Roget became involved in the organization of the library at the Royal Society. Long before Dewey, Bliss, and Ranganathan he developed a strong set of skills for giving intellectual structure to a large set of concepts. This practice, together with debates on proper organization with the traditional librarian Panizzi, left Roget well prepared to begin a project that involved a deeper analysis of the conceptual organization of human thoughts. He brought his strong sense of order and writing skills to the Encyclopedia Britannica and contributed many articles throughout his career.

When Roget retired from the Royal Society, he had a plan to develop a new tool that might help with the "analysis and classification of our ideas." He wanted this tool to "determine the principles through which a strictly Philosophical Language might be constructed" that might bring about "the removal of that barrier to the interchange of thought and mutual good understanding between man and man." Though his notes were largely lost over the course of several fires, his introductions to the original editions indicate that Roget's Thesaurus was an attempt by a noted scientist to provide a unified language for science.

4.2 Organizational Structure

Roget's first task was to create a structure, "a system of classification of the ideas that are expressible by language" that would point out similarities in concepts. The top level of his Synopsis of Categories consists of six topics, three in the physical realm and three in the mental. Over a depth of five hierarchical levels, Roget's taxonomy includes 1000 terms, which provided a scaffold remarkably effective in teasing out the subtle differences in meaning between all the words and phrases in the English language. Recognizing that categories often represented a whole graded dimension of human observations, Roget frequently chose two or three terms to anchor the extremes of a concept, for example, taking "Loudness" and "Faintness" in the class for "Sound" or "Master" and "Servant" in the class for "Intersocial Volition."
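
A minimal sketch of such a structure follows, assuming a simple tree in which each category may carry the terms that anchor its extremes. Only the two classes and four terms named above are included. The intermediate levels of Roget's hierarchy are omitted, and the names Category, path_to, and opposite_of are hypothetical helpers introduced for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Category:
    """A taxonomy node; `poles` holds the terms anchoring a graded concept."""
    name: str
    poles: tuple = ()
    children: list = field(default_factory=list)


def path_to(node, term, trail=()):
    """Depth-first search: return the chain of names leading to `term`."""
    trail = trail + (node.name,)
    if term in node.poles:
        return trail + (term,)
    for child in node.children:
        found = path_to(child, term, trail)
        if found:
            return found
    return None


def opposite_of(node, term):
    """Return the other pole(s) of the category that contains `term`."""
    if term in node.poles:
        return tuple(p for p in node.poles if p != term)
    for child in node.children:
        found = opposite_of(child, term)
        if found is not None:
            return found
    return None


# Flattened illustration: the real taxonomy has several levels between
# the top of the Synopsis of Categories and these classes.
ROOT = Category("Synopsis of Categories", children=[
    Category("Sound", poles=("Loudness", "Faintness")),
    Category("Intersocial Volition", poles=("Master", "Servant")),
])

where = path_to(ROOT, "Faintness")
counterpart = opposite_of(ROOT, "Master")
```

Storing the poles on the category, rather than as unrelated leaf terms, keeps the graded dimension explicit: a lookup on one extreme can always recover the other.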

4.3 Potential Extensions in Earth and Space Science

It is not clear from the sketchy remaining records how far Roget would have developed his notion. It seems likely that he would have extended his ontology from words and phrases to include sentences and propositions, thus providing a framework for registering and comparing related ideas. Can we now develop and extend his language to provide the structure for a social process of knowledge engineering?

How could such a tool help us develop a new approach to resource discovery that directly links datasets to human knowledge?



5. Encyclopedia Britannica

5.1 Context and Purpose

Mortimer Adler came to the Encyclopedia Britannica one hundred and forty years after Roget. His original purpose was to help generate interest in a set of books Britannica was publishing called "The Great Books of the Western World." The Great Books series included such works as the Bible, all of Shakespeare, the Greek philosophers, even works by mathematicians and scientists. What Adler determined to do was to develop an analysis of the great ideas contained in the books, as if the authors were involved in a Great Conversation. He developed the notion of a "Syntopicon" and brought a team of scholars together to produce a new type of index to the concepts in the documents.

Key to the success of his enterprise was Adler's notion of a "Summa Dialectica." His notion of a dialectical task was that of "rendering an objective, impartial, and neutrally formulated report of a many-sided discussion." Similar to Hegel's notion of the dialectic involving three parts, thesis, antithesis, and synthesis, Adler thought to establish the extremes of human insight into an issue and use the disambiguation that this produces to help resolve differences of opinion.

An important aspect of this enterprise is its social context. Over 10,000 concepts were identified in the Great Books by Adler's team of scholars. Analysis of the frequency of appearance of each concept allowed them to rank the concepts in order of significance to humanity. Through this process Adler found 102 Great Ideas that distinguished themselves above the others.
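
The ranking step can be sketched as a simple frequency count. This is only an illustration under stated assumptions: the (work, concept) pairs below are invented, the function rank_concepts is hypothetical, and nothing here reproduces the actual procedure or data used by Adler's team.

```python
from collections import Counter


def rank_concepts(index_entries, top_n):
    """Count how often each concept is indexed; return the top_n most frequent."""
    counts = Counter(concept for _work, concept in index_entries)
    return counts.most_common(top_n)


# Hypothetical (work, concept) index entries; not taken from the Syntopicon.
ENTRIES = [
    ("Work A", "justice"), ("Work A", "truth"), ("Work B", "justice"),
    ("Work B", "motion"), ("Work C", "justice"), ("Work C", "truth"),
]

top_two = rank_concepts(ENTRIES, 2)
```

A raw count is a crude proxy for significance. A fuller treatment would presumably also weigh how many distinct works discuss a concept, but the same tally-and-sort pattern applies.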

5.2 Organizational Structure

Before 1910 the Encyclopedia Britannica had only an alphabetical arrangement of topics. While this served to provide an organized repository of information, it was not a very good tool for a reader who wanted a deeper knowledge of the relationships among topics. Even though the encyclopedia contained the information needed for an in-depth study of a larger concept, there was no easy way to locate all the relevant topics. In the 1910 edition Britannica partially remedied this by including a "Classified Table of Contents."

Adler's team in working with the concepts in the Great Books took this notion one step further. Using thousands of index cards they developed a sorting scheme that allowed hierarchies of related ideas. This scheme allowed a location for all the concepts in the books and even provided a way to develop broader categories to subsume many of the details when readers wished to browse for general content.

In 1974 Adler extended this concept again in developing the Outline of Knowledge for a "Propaedia" to the Britannica itself. This outline was not alphabetical and was created to give the reader the ability to "make a complete study of a given topic." The entire realm of human intellectual thought and experience was segmented into ten components. Each component was then further segmented into more and more particular sections. A commitment was made to one set of organizing principles and one apportionment of scope and detail. This underlying organization of knowledge, developed prior to the consignment of articles, became the driving force for the selection and acceptance of content.

5.3 Potential Extensions in Earth and Space Science

Developing this underlying organization of knowledge in Earth and space science will require a team effort. It will require what Ranganathan calls "classificationists" in addition to classifiers. Classificationists look deeper into the way we make use of the content and attempt to establish strategies for setting types to the concepts and relationships. With today's technology we can extend this to include the toolsets and instruments that we have to make use of our ideas.



6. Conclusions

The volume of data that will be generated in the next few years will be more than all the data collected from the five Landsat missions. The EOSDIS itself will generate nearly one terabyte of data per day. Conventional ways of describing and searching for data will not be sufficient for users to quickly access data and information.

It is by capturing an underlying organization in Earth and space science that we can effectively apportion and catalogue our datasets and their recorded implications. New information technologies using computers, the Internet, and the World Wide Web provide ample power to extend and employ a strong ontology that we can structure using ideas from the classification system of Ranganathan. The necessary resiliency can be obtained using the flexibility of Roget's conceptual spaces, and the collaborative power to quickly build such a structure can be obtained using Adler's notion of a summa dialectica.

The examples from history cited here offer the possibility of a new knowledge-based search powerful enough to reveal the information content in the large volumes of data we are receiving. Moreover, they can advertise this information to diverse user communities that need it in a wide range of circumstances and for a wide set of goals. Such knowledge-based search requires the development and use of many algorithms, many of which are now being studied in the digital library programs. Currently many such algorithms are used to produce higher-level data products. The seamless integration of such algorithms to search large volumes of data from multiple sources will advance knowledge-based search.

Classification strategies must be extended to identify events and anomalies in the data as they relate to our current understanding of cause and effect. Simple geolocation is one component, but may be only a small part of the structure we will need to make sense of our datasets. What can we design to allow us to see trends that drift in elevation and geolocation, or to help us understand the dangers of changes to global systems? Can we use successful examples from the past to accelerate the development of the tools we need now?



7. References



© 1998 Department of Geography, Salzburg University