D-Lib Magazine
November 1999

Volume 5 Number 11

ISSN 1082-9873

The NASA ADS Abstract Service and the Distributed Astronomy Digital Library

Michael J. Kurtz, Guenther Eichhorn, Alberto Accomazzi
Carolyn S. Grant, Markus Demleitner, and Stephen S. Murray
Harvard-Smithsonian Center for Astrophysics
Cambridge, MA 02138, USA
Point of Contact: kurtz@maury.harvard.edu

Introduction

Astronomy has a fully functional digital library. A majority of working astronomers use it almost daily, and substantially more articles per month are read through it than in all the traditional print libraries worldwide combined.

This has come about through the close collaboration of the major journals, the Strasbourg Data Center (CDS) and several other data centers, and the NASA Astrophysics Data System Abstract Service (ADS). This collaboration, which Peter Boyce (Boyce 1996) called Urania (after the muse of Astronomy), has fully revolutionized the way astronomers use the literature and can serve as an example for other disciplines.

This article describes the central role that the ADS has played, and plays, in developing and enabling this revolution, as well as some of the technical details which have helped the system to work.

A Worked Example

Before discussing the system in any detail, it is reasonable to use it. In this section, we will lead the reader through a live example. The original query will be "canned," but the rest will be real-time responses of the system to the queries. As such, the responses may change over time, especially as new relevant articles are written. We would like to thank the Astrophysical Journal for kindly removing the access restrictions on one of their articles to allow this demonstration.

To begin, click on the ADS example. (This will create a second browser window; you may wish to resize them.)

This is the standard ADS query form, with a specific query filled in for the purpose of this demo. Each of the boxes allows a query to the system (an author query, an object query, a title query, a text query). The queries are merged using weights and logic set by the user, further filtered to the user's specifications (e.g., must be from a refereed journal, or must have on-line data), and finally returned. To see the options which may be set by the user, scroll the page down to the bottom, but please do not change any of them now.

The current query is set up to return all papers on the metallicity of globular clusters in M87, and nothing else. To do this, four different queries are made simultaneously.

The CDS/SIMBAD database is queried for articles concerning the astronomical object M87. (Note that for astronomers an "object" is something in the sky, such as a star or a galaxy.) Most objects have many different names; it is the job of the CDS to associate the objects with the papers about them. For example, the bright galaxy M87 currently has 58 different names known to SIMBAD, and there are about 1700 references concerning M87 currently in SIMBAD.

The ADS phrase database is queried for the phrase "globular cluster." A globular cluster is a tightly bound group of stars, typically containing 500,000 stars. There are about 7500 papers with the phrase "globular cluster" in their abstracts in the ADS.

"Metallicity" is looked up in the ADS synonym list, and its synonyms found. Astronomers call any element heavier than Helium a "metal," so among its many (currently 57) synonyms is "abundance." The synonym list is highly astronomy-specific.

The ADS word database is queried for "metallicity" and its synonyms, and the 30,000 papers containing that word or its synonyms are returned.

The queries are merged and sorted by total relevance, according to user specified logic. For this example we will require exact matches and sort on publication date.
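The merge step for this example can be sketched roughly as follows. This is only an illustration under stated assumptions: the toy indexes, the paper identifiers (`p1` to `p5`), the publication years, and the function names are all hypothetical, and the real ADS engine also supports weighted relevance ranking, which is not modeled here. The sketch shows synonym expansion for the word query, AND logic across the three result sets ("exact matches"), and sorting on publication date.

```python
from typing import Dict, List, Set

# Hypothetical toy data; none of this comes from the ADS or SIMBAD.
SYNONYMS: Dict[str, Set[str]] = {"metallicity": {"metallicity", "abundance"}}
WORD_INDEX: Dict[str, Set[str]] = {
    "metallicity": {"p1", "p4"},
    "abundance": {"p2", "p3"},
}
PHRASE_INDEX: Dict[str, Set[str]] = {"globular cluster": {"p1", "p2", "p5"}}
OBJECT_INDEX: Dict[str, Set[str]] = {"M87": {"p1", "p2", "p3"}}
PUB_YEAR: Dict[str, int] = {"p1": 1999, "p2": 1997, "p3": 1995, "p4": 1998, "p5": 1996}


def word_query(word: str) -> Set[str]:
    """Return papers containing the word or any of its synonyms."""
    hits: Set[str] = set()
    for term in SYNONYMS.get(word, {word}):
        hits |= WORD_INDEX.get(term, set())
    return hits


def merged_query(obj: str, phrase: str, word: str) -> List[str]:
    """AND-combine the object, phrase, and word queries; newest papers first."""
    hits = (
        OBJECT_INDEX.get(obj, set())
        & PHRASE_INDEX.get(phrase, set())
        & word_query(word)
    )
    # Sort by year descending, then by identifier for a stable order.
    return sorted(hits, key=lambda paper: (-PUB_YEAR[paper], paper))


print(merged_query("M87", "globular cluster", "metallicity"))  # ['p1', 'p2']
```

Only `p1` and `p2` appear in all three result sets, so they are the only papers returned; `p2` matches through the synonym "abundance" rather than the literal word.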

Clicking on the "Send Query" button now initiates the query. You should see a list of the (currently 69) papers that are about the metallicity of the M87 globular clusters. A query of this type was, essentially, impossible before the establishment of joint ADS-SIMBAD queries in 1993.

Looking at the returned page we see the list of relevant articles. The first article (currently; it will move down the list as more articles are published) is a paper by Kundu, et al.: 1999ApJ...513..733K. This is the standard ADS return format for queries. More than 10,000,000 article titles per month are returned in this format.

From this page there are a number of ways to go further. One can check off articles of interest and then perform actions on them at the bottom of the page, or one can go directly to data related to the articles by using the shortcut links located on the right (AEF RC SN). We will go to the main entry for the Kundu, et al. paper by clicking on the link labeled 1999ApJ...513..733K; please do that now.

Now you should be seeing the main ADS page for the paper by Kundu, et al. In the upper left corner is a list of available data, below that is basic bibliographic information, and below that is the abstract provided to the ADS by the journal.

Next we will follow the "Electronic Refereed Journal Article" link to the Astrophysical Journal. Ordinarily, this link could be used only by subscribers, but Astrophysical Journal has kindly removed the access restrictions from this article for the purpose of this demonstration.

Click now on the "Electronic Refereed Journal Article" link.

This should have brought you to the website of the University of Chicago Press, and to the HTML version of the Kundu, et al. paper. Normally, at this point one would read the paper; we, however, will continue with the demo.

Immediately after the abstract, you will see an outline of the article labeled "CONTENTS." In this outline, please click on "REFERENCES."

This is the References section of the Kundu, et al. article. Notice that the vast majority of articles cited have "NASA ADS" links. To continue the demo, please look at the first paper on the list (Aguilar, et al.) and click on the "NASA ADS" link.

We are now back at the ADS web site, viewing the ADS page for the Aguilar, et al. paper. From here we will look at other information available via ADS. We will look at some of the available items, and return to this page via the browser's BACK button.

First, click on the "Full Refereed Scanned Article" link. This brings up a scanned image of the article. A large and growing number of research journals of astronomy have been scanned by ADS back to volume one; nearly all the major journals have been scanned going back more than 25 years. A 600 dpi printable version of this article can be sent to the printer from this page.

Now go BACK to the abstract page and click on the "Citation Links." Here you see a list of articles which cited the Aguilar, et al. paper. Note that the Kundu, et al. paper is (currently) second on the list. We get our citation information from three sources:

  1. We parse the machine readable reference sections of participating journals;
  2. We purchase citations from the Institute for Scientific Information (ISI) (the American Astronomical Society, as part of the electronic Astrophysical Journal, paid for a large fraction of these); and
  3. We OCR and parse the reference sections of the articles we scan.
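Whatever the source of the reference lists, producing "Citation Links" amounts to inverting them: each paper's list of references is turned into, for every referenced paper, a list of the papers that cite it. A minimal sketch of that inversion follows; the function name and the toy identifiers are hypothetical, and this is not a description of the ADS code.

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def build_citation_index(references: Dict[str, Iterable[str]]) -> Dict[str, List[str]]:
    """Invert {citing paper: papers it references} into {cited paper: citing papers}."""
    cited_by = defaultdict(set)
    for citing, refs in references.items():
        for cited in refs:
            cited_by[cited].add(citing)
    # Sorted lists give a deterministic result.
    return {cited: sorted(citing) for cited, citing in cited_by.items()}


# Hypothetical reference lists, keyed by made-up identifiers.
toy_references = {
    "paperA": ["paperC", "paperD"],
    "paperB": ["paperC"],
}
print(build_citation_index(toy_references))
# {'paperC': ['paperA', 'paperB'], 'paperD': ['paperA']}
```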

Now go BACK again to the abstract page, and click on the "SIMBAD Objects" link. This takes you to the Strasbourg Data Center's SIMBAD database. You will see a list of objects (in this case, globular clusters) which are referred to in the Aguilar, et al. paper. Clicking on any of the object names will get you more information on the object -- measurements, maps of their surroundings, references in the literature, etc. SIMBAD is access controlled, so many D-Lib Magazine readers will not be able to go past this point; however, essentially all astronomers have access to SIMBAD.

This ends the demo.

History of the ADS

The ADS Abstract Service was conceived in the late 1980s, and a prototype text retrieval engine was built in 1990. In 1991, a plan for the system was presented publicly. The plan described how a natural language index of the abstracts, combined with the citation information and the CDS/SIMBAD object index, could form the core of a distributed, Internet-based retrieval system for journal articles (both scanned and original electronic articles) and data. Save that the World Wide Web has replaced the proprietary network system which had been developed for NASA, the current system is a mirror of the 1991 plan.

The abstract retrieval system was built in 1992 and released to the public (using the proprietary network system) in early 1993. By the end of 1993, the connection with the CDS/SIMBAD in Strasbourg had been made, allowing real time simultaneous queries of the two databases, as in the example above.

In February 1994, we ported the system to the WWW. The number of users quadrupled within five weeks (from 400 to 1600). During 1994, ADS began receiving abstracts directly from the publishers. The first journals to send us original author abstracts were Astronomy & Astrophysics Supplements, The Publications of the Astronomical Society of the Pacific and The Astrophysical Journal (Letters). Towards the end of 1994, we began to scan and put on-line bitmap images of journal articles; the first journal to be scanned was The Astrophysical Journal (Letters). Also during 1994, we made the first links to external data tables, via links to the CDS.

In July 1995, The Astrophysical Journal (Letters) (ApJL) went on-line in HTML and PDF formats. From the first electronic issue, the reference section of the journal pointed to the ADS abstracts for the referenced papers. At that time, the scanning was complete going back ten years, so the ApJL was able to offer a fully usable on-line alternative to paper from the start.

During 1996, the American Astronomical Society purchased a subset of the Science Citation Index for the ADS, thus beginning the ADS citation indexing and completing the vision set out in the 1991 plan.

By 1997, essentially every original source of bibliographic information in astronomy was collaborating with the ADS to the extent they found to be technically feasible. Now, in late 1999, all of the most important journals for astronomers are on-line in HTML and point in their reference sections to ADS. ADS collaborates closely with the major data centers, including the CDS, the National Extragalactic Database in Pasadena (NED), and the Astrophysics Data Center at Goddard Space Flight Center (ADC).

The Bibcode (a.k.a. DOI)

The connection of the ADS with the data centers (CDS, NED, ADC) and with the journals is enabled by an article identifier code, the bibcode (Schmitz, et al., 1995). The bibcode can be calculated independently by the separate organizations. This code was developed about ten years ago to facilitate the exchange of references between CDS and NED.

The original bibcode was designed only to refer to well known serials (about 200 of them); it was not intended to solve the general problem of literature matching, only the specific one at hand. The ADS has extended the definition to be more general, but as a consequence, the bibcode can no longer be calculated independently with great certainty.

The form of the bibcode is:

YYYYJJJJJVVVVLPPPPA

The publication year is YYYY; the (agreed upon) abbreviation for the journal name is JJJJJ; the volume is VVVV; if the article is in the letters section of the journal, the L flag is set to "L"; the page number is PPPP; and the first author's initial is A.
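The fixed-width layout can be made concrete with a short sketch. The field widths come from the form given above; the padding rules (dots as fill characters, journal abbreviation left-justified, volume and page right-justified) are an assumption inferred from the bibcodes quoted in this article, such as 1999ApJ...513..733K, and the function names are hypothetical. The full rules have more special cases than this sketch covers.

```python
from typing import Dict, Union


def make_bibcode(
    year: int,
    journal: str,
    volume: Union[int, str],
    page: Union[int, str],
    author_initial: str,
    qualifier: str = ".",
) -> str:
    """Assemble a 19-character YYYYJJJJJVVVVLPPPPA code (padding rules assumed)."""
    if len(qualifier) != 1 or len(author_initial) != 1:
        raise ValueError("qualifier and author initial must be single characters")
    code = "".join(
        (
            f"{year:04d}",              # YYYY
            journal.ljust(5, "."),      # JJJJJ, left-justified
            str(volume).rjust(4, "."),  # VVVV, right-justified
            qualifier,                  # L flag, "." when unset
            str(page).rjust(4, "."),    # PPPP, right-justified
            author_initial,             # A
        )
    )
    if len(code) != 19:
        raise ValueError(f"a field is too long for the fixed layout: {code!r}")
    return code


def split_bibcode(code: str) -> Dict[str, str]:
    """Slice a 19-character code back into its fields, stripping dot padding."""
    if len(code) != 19:
        raise ValueError("expected 19 characters")
    return {
        "year": code[0:4],
        "journal": code[4:9].rstrip("."),
        "volume": code[9:13].lstrip("."),
        "qualifier": code[13],
        "page": code[14:18].lstrip("."),
        "author_initial": code[18],
    }


print(make_bibcode(1999, "ApJ", 513, 733, "K"))  # 1999ApJ...513..733K
```

Under these assumptions the two entries in the References section are also reproduced: `make_bibcode(1996, "AAS", 189, "0603", "B")` gives 1996AAS...189.0603B, and `make_bibcode(1995, "VA", 39, 272, "S", qualifier="R")` gives 1995VA.....39R.272S. The second shows that the flag position can carry values other than "L".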

This code works directly for about 70% of all articles referenced in the reference sections of the main journals of astronomy. Thus the links in the reference sections are 70% complete without any further interaction between the publishers and ADS. The ADS provides publishers with tools to create bibcodes and confirm that they point to an ADS reference; some publishers use them, other publishers do all the work with their own tools.

Another 10% to 20% of the references in astronomy journal articles are to books and conference proceedings. ADS creates bibcodes for these references, but they cannot, in general, be unique. For those cases where our rules create multiple codes, we create more than one code and link them internally. Thus it is often possible to guess the bibcode. For example, the first paper describing the ADS Abstract Service is:

Kurtz, M. J., Karakashian, T., Grant, C. S., Eichhorn, G., Murray, S. S., Watson, J. M., Ossorio, P. G., & Stoner, J. L. 1993, Intelligent Text Retrieval in the NASA Astrophysics Data System, ASP Conf. Ser. 52: Astronomical Data Analysis Software and Systems II, 132

This paper was presented at the second Astronomical Data Analysis Software and Systems Conference, and was published in the ASP Conference Series as number 52. We therefore created two bibcodes for the paper, representing the conference and the publication: 1993adass...2..132 and 1993aspc...52..132. Both retrieve the scanned image of the same conference proceeding paper.

In order to create links within our database, e.g., to create a citation index, we must parse the reference string in the back of the journal articles, or parse similar information which we get from ISI. We are currently able to parse and recognize about 80% of these references (much higher for ISI, but we only get articles appearing in major journals from them).

There are a number of reasons why we may not be able to recognize a reference:

  1. Our database is not 100% complete; we can only recognize a reference we have.
  2. The title of the reference is ambiguous, and ADS does not recognize the variation of the title that is used.
  3. The reference has errors in it; this is a very common occurrence.
  4. The reference is not to an article (for example, preprints and personal communications).

In order to parse references and resolve them to unique article identifiers, substantial intelligence and error correction are required. For us, the intelligence is mainly in the requirement that the reference be present in the ADS system. Thus it must not be from a similar sounding source from a distant discipline. This requirement allows a substantial amount of error correction that otherwise would not be possible. Things that we can sometimes correct include misspellings, wrong publication dates, wrong author order, wrong page, wrong volume, wrong journal name (!), and incomplete book title. References that we can parse, but not identify, are saved for examination and possible future inclusion in the database.
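The idea that membership in the database makes error correction possible can be illustrated with a deliberately simple sketch. It assumes a parsed reference has five fields and accepts a database record when at most one field disagrees and no other record matches equally well. The field set, the one-mismatch threshold, the tie rule, and the second toy record are all hypothetical; the actual ADS resolver is more elaborate.

```python
from typing import Dict, NamedTuple, Optional


class Ref(NamedTuple):
    author: str
    year: int
    journal: str
    volume: int
    page: int


def resolve(ref: Ref, database: Dict[str, Ref], max_mismatches: int = 1) -> Optional[str]:
    """Return the identifier of the unique best-matching record, or None."""
    candidates = []
    for identifier, record in database.items():
        # Count the fields on which the parsed reference and the record disagree.
        mismatches = sum(a != b for a, b in zip(ref, record))
        if mismatches <= max_mismatches:
            candidates.append((mismatches, identifier))
    if not candidates:
        return None  # not in the database: saved for later examination
    candidates.sort()
    if len(candidates) > 1 and candidates[0][0] == candidates[1][0]:
        return None  # ambiguous: two records match equally well
    return candidates[0][1]


# One record taken from the bibcode quoted in this article, one made-up record.
DATABASE = {
    "1999ApJ...513..733K": Ref("Kundu", 1999, "ApJ", 513, 733),
    "toy-record": Ref("Example", 1990, "ToyJ", 1, 10),
}

# A reference with the wrong publication year still resolves.
print(resolve(Ref("Kundu", 1998, "ApJ", 513, 733), DATABASE))  # 1999ApJ...513..733K
```

A reference that matches nothing in the database returns `None` instead of being forced onto the nearest record, which mirrors the requirement that the reference be present in the system.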

The URANIA Collaboration

URANIA (named for the muse of astronomy [Boyce, 1996]) is the collaboration of the journals, the data centers, and the ADS. The success of this effort, in which ADS plays a central organizing role, is quite measurable: the total efficiency of the discipline has increased by 3% - 5% as a result of this distributed digital library. For a discussion of this estimate see the OVERVIEW paper referenced below.

We suggest that the experience of astronomy can provide an example for other disciplines; here we discuss aspects of URANIA which we believe have been important to its success.

1. The existence of ADS. The ADS provides a central organizing structure that is independent of any publisher or society. We act as neutral agents while collating information and directing our end users (research astronomers) to it. The research efficiencies engendered by ADS exceed the ADS budget by a very large factor (hundreds); it would make tremendous financial sense to copy this structure in other fields.

2. The convergence of goals. Essentially all the organizations involved in URANIA -- the data centers, the scientific societies which own the journals, and the ADS -- have as their main mission to further astronomical research, not to make money or make the organization grow. This has allowed the general cooperation and sharing of information that has made URANIA work. The establishment of URANIA has not been without difficulties; for example, some journals have had to change publishers.

3. A shared vision. At the same 1991 meeting where M. Kurtz presented the plan for an intelligent index for astronomical literature and data based on abstracts, P. Boyce presented the plans of the American Astronomical Society to publish electronic journals. At that time, the CDS already had network access to the SIMBAD database, and were already entering into collaborations with other groups. The system that evolved is a natural extension of the shared vision of these groups.

Further Reading

There will be a special issue of Astronomy & Astrophysics Supplements devoted to the ADS and the CDS in March 2000. The four ADS articles for it are currently available as preprints in PostScript format. OVERVIEW provides an introduction to the system and describes its use. SEARCH describes the search engine and user interface. ARCHITECTURE discusses how the system is structured from a hardware and software point of view. Finally, DATA talks about our data handling methods, input and maintenance procedures, and sources. It also discusses our bibcode rules at length.

A list of publications about ADS

A recent reference to CDS

References

Boyce, P. 1996, 1996AAS...189.0603B

Schmitz, M., Helou, G., Dubois, P., Lague, C., Madore, B., Corwin, H. G. J., & Lesteven, S. 1995, 1995VA.....39R.272S

Copyright © 1999 Michael J. Kurtz, Guenther Eichhorn, Alberto Accomazzi, Carolyn S. Grant, Markus Demleitner, and Stephen S. Murray

DOI: 10.1045/november99-kurtz