A Service-Oriented Framework for Bibliography Management

Despite the importance of bibliographies in scientific/technical documents, a global solution to the bibliography management problem has still not been developed. Current solutions are limited in the sense that they are word-processor oriented, whereas users must often write documents with different tools. This obligates authors to use different bibliography managers, and even different collections (with subsequent problems such as consistency enforcement, updating, etc.) depending on the word processor used. In this article, we introduce Bibshare, a new framework for bibliography management that allows writers to use the same bibliography collection(s) regardless of the word processing system they use. Moreover, both personal and external collections can be used to retrieve the bibliographic information to be inserted into documents. Bibshare is an example of the new generation of applications that have been built using a Service-Oriented approach. It is open and extensible, so new collections and word processors can be added in a straightforward way. In addition, it is available free of charge. This article outlines the architecture of Bibshare and enumerates some of Bibshare's features, emphasizing its federated search service.

1. Introduction

Research communities highly value having accurate and well-managed bibliographic collections. However, the continuous growth of information sources and the subsequent increase in the size of collections has made bibliography management one of the most frustrating tasks researchers face. To make this task easier, a number of utilities for bibliography management have been developed in the last few decades (e.g., EndNote [1], ProCite [2], Biblioscape [3], and the older Refer [4] and BibTeX [5]). Briefly, each tool is associated with a specific word processing system and allows a writer to insert citations into documents and to generate a bibliography list automatically at a later time. In most cases, the citations come from bibliographic collections compiled over time by a user or a research group. However, this leads to a huge amount of duplicate information, as it is likely that the bibliographic records of the most relevant works in a given field will be present in almost every bibliography collection in that field. Hence, there are commercial tools that provide users with access to large bibliographic collections, which helps an author avoid having to maintain personal collections of citations.

The main drawback of these commercial tools is that they are designed to work with only a small number of the word processing systems currently available. Specifically, EndNote, ProCite and Biblioscape were created to support bibliography management in Microsoft® Word (with little support for other word processors); Refer is part of UNIX troff, and BibTeX works only with the LaTeX typesetting system [5]. However, there are situations in which the rules imposed by specific editors and/or institutions obligate researchers to prepare documents using a word processor different from the one they customarily use. Currently, the only way to reuse bibliographic collections is based on format conversions, that is, the development of tools able to read collections in a particular tool's format and transform the bibliographic records into another tool's format. Although this obviously works, format conversion is far from being the best solution, because when converting records we are actually duplicating collections. Furthermore, all the conversion procedures must be done by the user, resulting in a loss of time that could be used for other purposes.

A definitive solution to the bibliography management problem should be one in which a user could use the same bibliographic source(s) regardless of the word processing system used, and that these source(s) would be accessible from anywhere. Such a solution should also be scalable, that is, new sources and/or word processors could be seamlessly integrated as needed.

In this article, we introduce Bibshare [6], a new framework for bibliography management based on the Service-Oriented Computing paradigm. As stated at the web site for the International Conference on Service-Oriented Computing [7], the notion of Service refers to "autonomous platform-independent computational elements that can be described, published, discovered, orchestrated and programmed using standard protocols for the purpose of building agile networks of collaborating applications distributed within and across organizational boundaries". Following this principle, an XML Web Service (or Web service, for short) is a self-describing, self-contained application that is accessible over the Web; all the communication between a Web service and its clients is done using XML as the information representation model and HTTP as the communication protocol [8].

Unlike the above-mentioned solutions, Bibshare is not a single tool but is rather a set of bibliography management Web services designed to be called by different text processing systems, avoiding the need for different bibliography collections and formats. The layered structure of Bibshare allows the development of clients for different word processors that use the bibliography management Web services (search, format conversion, etc.). These services can be remotely invoked using a protocol that is independent from the word processing system. In addition, Bibshare services include access to free, well managed and up-to-date bibliographic collections available on the Internet so that writers don't need to worry about collection maintenance. Bibshare is a clear example of how Service-Oriented Computing is changing the way software is being developed and used in the current stage of Internet computing.

This article is structured as follows. In section 2, we describe the overall architecture of the Bibshare framework. In section 3, we focus on federated searches and explain the management of the federated collections. In section 4, we describe the Federated Search Web service. In sections 5 and 6, we illustrate the use of Bibshare in Word and LaTeX, respectively.

2. Architecture of Bibshare

Bibshare was conceived to serve as large and heterogeneous a community of users as possible. This can be achieved using the layered architecture shown in Figure 1.

At the Client Layer, Bibshare clients are plugged into the host word processing applications. Using a client, a user can retrieve bibliographic information from different collections, insert it into the document, and generate the bibliography automatically using different bibliographic styles, labels, etc. The functionality of a typical client is split into three components. The citation handler is dependent on the word processor and uses the application programming interface (API) of the word processor to insert citations into the document and to generate the bibliography. The citation seeker helps the user to retrieve the citations from the bibliographic sources. It is called by the citation handler, and its behaviour is totally independent from the word processor. This means that only one citation seeker implementation is needed for each operating system, and new clients can be added to the framework just by implementing the corresponding citation handler. There are currently a number of citation seekers available. For Windows, we have implemented a Dynamic Link Library (DLL) that can be used in any Windows-based application, as illustrated in section 5 for MS Word. LaTeX needs a different solution that includes several options, as explained in section 6. Finally, the private collection manager supports the interaction with private bibliography collections and includes services for basic collection maintenance (add, delete and modify citations) and for citation retrieval via browsing and/or search.

The Collection Layer is composed of a number of autonomous, heterogeneous and distributed collections of bibliographic data. These collections range from those that serve a large community of users (usually through their own web-based access) to private bibliography collections (whose owner is an individual or a group). Examples of the former are DBLP [9], one of the largest collections of Computer Science bibliography, containing around 500,000 records; RePEc [10], a bibliographic collection for the Economics field; and PubMedCentral [11] for the health sciences. These collections can be searched at the user's request in a federated way, as explained below.

Private collections do not participate in federated searches. They are implemented as XML files containing references in the Bibshare Bibliographic Format (BBF), which is essentially an XML encoding of the BibTeX format. A private collection is called a local collection if it is stored on the same computer that a client is running on; otherwise it is called remote. The latter is the case in which a group maintains and uses a private collection that is accessed by group members working remotely. Access to a private collection is provided by an API that implements the basic services; for remote access, the API is wrapped with a Web service interface.

The Services Layer includes private search and federated search Web services. Private search services allow the retrieval of bibliographic records from a private collection, in either local or remote mode. For federated searches, this layer includes the Federated Search Engine (FSE), a Web Service that accepts a query plus a list of collections as input and then returns the set of bibliographic records matching the search criteria. The answers are put together in an XML document, which is processed by a duplicate detector, and sent back to the caller.
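As an illustration of the merging step, the following Python sketch combines the answers of several collections and drops duplicates. The matching rule (normalised title plus first author) and the record layout are assumptions made for this example; the article does not specify how the Bibshare duplicate detector decides that two records are the same.

```python
import re
import unicodedata


def normalise(text):
    """Lower-case, strip accents and punctuation, and collapse whitespace."""
    decomposed = unicodedata.normalize("NFKD", text)
    ascii_only = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return " ".join(re.sub(r"[^a-z0-9 ]", " ", ascii_only.lower()).split())


def merge_answers(answers):
    """Merge per-collection result lists, keeping the first copy of each record.

    `answers` maps a collection id to a list of records (dicts with "title"
    and "authors"). Two records count as duplicates when their normalised
    title and first author match (an assumed heuristic, not Bibshare's rule).
    """
    seen = set()
    merged = []
    for source_id, records in answers.items():
        for record in records:
            first_author = record["authors"][0] if record["authors"] else ""
            key = (normalise(record["title"]), normalise(first_author))
            if key in seen:
                continue
            seen.add(key)
            merged.append({**record, "source": source_id})
    return merged


# Made-up records: the same work reported by two collections, plus one more.
answers = {
    "dblp": [{"title": "An Example Title: Part One", "authors": ["A. Author"]}],
    "repec": [
        {"title": "an example title - part one", "authors": ["A. Author"]},
        {"title": "Another Example", "authors": ["B. Writer"]},
    ],
}
merged = merge_answers(answers)
print(len(merged))
```

A key built from normalised fields is cheap but will miss duplicates whose titles differ by more than case, accents or punctuation; a production detector would likely need a fuzzier comparison.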

In this article, we focus on federated collections, and we show how they are managed and accessed from the different word processing systems in the next sections.

3. Managing Federated Collections

A collection can participate in federated searches provided that it has been registered. Registered collections are organized hierarchically according to a thematic classification schema. Each collection has its own metadata format(s); that means that different XSL templates are needed to convert records from the collection's format(s) to the BBF in order to unify the answers of a federated search. If the source is a Bibshare private collection, this step is trivial, as these collections use the BBF as the metadata format. Otherwise, the XSL template must be generated before registration. However, writing an XSL template is not a trivial task. To overcome this, we have developed an assistant that helps collection owners create XSL templates, making the registration of a collection very easy. The registration process can be summarized as follows:

3.1 Accessing the Open Archives from Bibshare

As a special case of federated search, we have developed OAI-Bibshare, a service that allows the insertion of citations coming from (a subset of) the Open Archives Initiative (OAI) [12] data providers; the list of data providers is customizable so that users can select which repositories must be searched for citations. All the communications between OAI-Bibshare and the data sources are made via the OAI Protocol for Metadata Harvesting; the format conversion is performed in a way similar to the one described above for collection registration.
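To make the harvesting step concrete, the sketch below builds an OAI-PMH `ListRecords` request for unqualified Dublin Core (`oai_dc`) and extracts a few fields from a response. The base URL and the sample record are placeholders, and the mapping to `id`, `title` and `authors` is an assumption for illustration rather than the conversion OAI-Bibshare actually applies.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

# Namespaces defined by OAI-PMH 2.0 and Dublin Core.
NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "oai_dc": "http://www.openarchives.org/OAI/2.0/oai_dc/",
    "dc": "http://purl.org/dc/elements/1.1/",
}


def list_records_url(base_url, set_spec=None):
    """Build an OAI-PMH ListRecords request for unqualified Dublin Core."""
    params = {"verb": "ListRecords", "metadataPrefix": "oai_dc"}
    if set_spec:
        params["set"] = set_spec
    return base_url + "?" + urlencode(params)


def parse_records(xml_text):
    """Extract identifier, title and creators from a ListRecords response."""
    root = ET.fromstring(xml_text)
    records = []
    for rec in root.findall(".//oai:record", NS):
        records.append({
            "id": rec.findtext("oai:header/oai:identifier", default="", namespaces=NS),
            "title": rec.findtext(".//dc:title", default="", namespaces=NS),
            "authors": [c.text for c in rec.findall(".//dc:creator", NS)],
        })
    return records


# Hand-written sample response (placeholder data, not from a real provider).
SAMPLE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header><identifier>oai:example.org:1</identifier></header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>An example title</dc:title>
          <dc:creator>Author, First</dc:creator>
          <dc:creator>Author, Second</dc:creator>
        </oai_dc:dc>
      </metadata>
    </record>
  </ListRecords>
</OAI-PMH>"""

print(list_records_url("http://example.org/oai"))
print(parse_records(SAMPLE))
```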

OAI-Bibshare is currently in its last testing phase before being registered as an OAI service provider. A distinctive feature, which in our opinion opens up an interesting trend in the OAI world, is that OAI-Bibshare is not an end-user-oriented service; it is instead a tool-oriented service as the intended clients are the Bibshare citation seekers. According to the new trends in computing models, specifically the Service-Oriented paradigm, OAI Web-service providers could have a significant impact in the near future.

4. The Bibshare Federated Search Web Service

We have defined a protocol to search references and obtain the metadata associated with a bibliographic reference. The protocol is implemented as an XML Web Service that is available at <http://www.bibshare.org/BibShareSearchEngine/BibShareSearchEngine.asmx>. The Web Service includes five operations that are described in the remainder of this section. (Other operations are available for preserving backward compatibility.)

4.1 GetRepositories

Description: returns the names and identifiers of all the bibliographic collections registered in the Federated Search Engine.

Output values: an XML string with an enumeration of the registered repositories classified by categories. The XML string has the following structure:

<?xml version="1.0" encoding="UTF-8"?>
<BDsResult>
<Category id="id of the category" level="1">
<Name> name of the category </Name>
<Category id="id of the category" level="2">
<Name> name of the category </Name>
<Source id="id of the source"> name of the source </Source>
</Category>
. . .
<Category id="id of the category" level="2">
. . .
</Category>
</Category>
. . .
<Category id="id of the category" level="1">
. . .
</Category>
</BDsResult>

where each "Source" element refers to one federated repository, and the attribute "level" of the "Category" element defines the level of the category in the hierarchy.
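A client could walk this hierarchy to offer the user a list of collections to search. The Python sketch below does so for a small hand-written document; the category names and source identifiers are invented for the example.

```python
import xml.etree.ElementTree as ET

# Hand-written sample in the GetRepositories layout (ids and names invented).
SAMPLE = """<BDsResult>
  <Category id="c1" level="1">
    <Name> Computer Science </Name>
    <Category id="c1.1" level="2">
      <Name> General </Name>
      <Source id="s1"> DBLP </Source>
    </Category>
  </Category>
  <Category id="c2" level="1">
    <Name> Economics </Name>
    <Category id="c2.1" level="2">
      <Name> General </Name>
      <Source id="s2"> RePEc </Source>
    </Category>
  </Category>
</BDsResult>"""


def list_sources(element, path=()):
    """Yield (source id, source name, category path) for every Source."""
    for child in element:
        if child.tag == "Category":
            name = (child.findtext("Name") or "").strip()
            yield from list_sources(child, path + (name,))
        elif child.tag == "Source":
            yield child.get("id"), (child.text or "").strip(), path


sources = list(list_sources(ET.fromstring(SAMPLE)))
for source_id, name, path in sources:
    print(" / ".join(path), "->", name, f"({source_id})")
```

The recursion does not depend on the "level" attribute, so it would keep working if deeper category levels were introduced.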

4.2 Query

Description: allows users to search for references in a federated repository or a set of federated repositories.

Input arguments: an XML string with the following structure:

<?xml version="1.0" encoding="UTF-16"?>
<search>
<source>id of the source1</source>
<source>id of the source2</source>
<source>id of the source3</source>
. . .
<source>id of the sourceN</source>
<author>author selection criteria</author>
<title>title selection criteria</title>
<keywords>keywords selection criteria</keywords>
</search>

where the values of the "source" elements are the repositories where the user wants to search and the values of the "author", "title" and "keywords" elements are the criteria for searching. A query must have at least one search criterion.

Output values: an XML string with the following structure:

<?xml version="1.0" encoding="utf-16"?>
<!DOCTYPE QueryReturn SYSTEM "http://bibshare.dsic.upv.es/FSE/QueryReturn.dtd">
<QueryReturn>
<source id="id of the source1" num="number of selected Bibitems ">
<Bibitem id="id of the first reference">
<Authors>names of the authors of the Bibitem</Authors>
<Keywords>keywords for the Bibitem</Keywords>
<Title>title of the Bibitem</Title>
<Type>type of the Bibitem</Type>
</Bibitem>
<Bibitem id="id of the second reference">
<Authors>names of the authors of the Bibitem</Authors>
<Keywords>keywords for the Bibitem</Keywords>
<Title>title of the Bibitem</Title>
<Type>type of the Bibitem</Type>
</Bibitem>
. . .
</source>
</QueryReturn>

where the references matching the search criteria are listed for each "source". The value of each element "Bibitem" is a subset of the BBF record associated with the citation.
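The two documents of the Query operation can be produced and consumed with a few lines of code. The Python sketch below builds the "search" document and reads a "QueryReturn" document; it does not show the call to the Web service itself. Leaving out unused criteria (rather than sending empty elements) is an assumption, and the sample response is hand-written.

```python
import xml.etree.ElementTree as ET


def build_query(sources, author=None, title=None, keywords=None):
    """Serialise a search request; at least one criterion is required."""
    criteria = {"author": author, "title": title, "keywords": keywords}
    if not any(criteria.values()):
        raise ValueError("a query must have at least one search criterion")
    root = ET.Element("search")
    for source_id in sources:
        ET.SubElement(root, "source").text = source_id
    for tag, value in criteria.items():
        if value:  # assumption: unused criteria are simply omitted
            ET.SubElement(root, tag).text = value
    return ET.tostring(root, encoding="unicode")


def parse_query_return(xml_text):
    """Map each source id to its list of Bibitem summaries."""
    results = {}
    for source in ET.fromstring(xml_text).findall("source"):
        results[source.get("id")] = [
            {
                "id": item.get("id"),
                "authors": item.findtext("Authors", default=""),
                "title": item.findtext("Title", default=""),
                "type": item.findtext("Type", default=""),
            }
            for item in source.findall("Bibitem")
        ]
    return results


# Hand-written sample response (placeholder data).
SAMPLE_RETURN = """<QueryReturn>
  <source id="s1" num="1">
    <Bibitem id="ref-1">
      <Authors>A. Author</Authors>
      <Keywords>example</Keywords>
      <Title>An example title</Title>
      <Type>article</Type>
    </Bibitem>
  </source>
</QueryReturn>"""

print(build_query(["s1"], author="Borgman"))
print(parse_query_return(SAMPLE_RETURN))
```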

4.3 GetRecord

Description: allows users to get the full BBF record corresponding to a specific reference from a federated repository.

Input arguments: an XML string with the following structure:

<?xml version="1.0" encoding="UTF-8"?>
<GetRecord>
<sourceId>id of the source</sourceId>
<refId>id of the reference</refId>
</GetRecord>

where the value of "sourceId" is the identifier of the repository where the bibliographic reference is stored and "refId" is the identifier of the reference.

Output values: an XML string with the full BBF record, structured as follows:

<?xml version="1.0" encoding="UTF-8"?>
<GetRecordReturn>
<id>identifier of the reference</id>
<type>type of the reference</type>
<author>authors names</author>
<title>title of the referenced Bibitem</title>
. . .
</GetRecordReturn>
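Because the BBF is essentially an XML encoding of the BibTeX format, a returned record maps almost directly onto a BibTeX entry. The Python sketch below builds a GetRecord request and renders a hand-written "GetRecordReturn" document as BibTeX. It is only a client-side illustration under the assumption that every child element other than "id" and "type" is a plain BibTeX field; it is not the conversion performed by the service.

```python
import xml.etree.ElementTree as ET


def build_get_record(source_id, ref_id):
    """Serialise the input argument of the GetRecord operation."""
    root = ET.Element("GetRecord")
    ET.SubElement(root, "sourceId").text = source_id
    ET.SubElement(root, "refId").text = ref_id
    return ET.tostring(root, encoding="unicode")


def record_to_bibtex(xml_text):
    """Render a GetRecordReturn document as a BibTeX entry."""
    root = ET.fromstring(xml_text)
    fields = {child.tag: (child.text or "").strip() for child in root}
    key = fields.pop("id", "unknown")
    entry_type = fields.pop("type", "misc")
    # Assumption: all remaining elements are simple name/value BibTeX fields.
    body = ",\n".join(f"  {name} = {{{value}}}" for name, value in fields.items())
    return f"@{entry_type}{{{key},\n{body}\n}}"


# Hand-written sample record (placeholder data).
SAMPLE_RECORD = """<GetRecordReturn>
  <id>ref-1</id>
  <type>article</type>
  <author>A. Author</author>
  <title>An example title</title>
</GetRecordReturn>"""

print(build_get_record("s1", "ref-1"))
print(record_to_bibtex(SAMPLE_RECORD))
```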

4.4 GetFormattedRecord

Description: allows users to get the full record corresponding to a specific reference from a federated repository in a format other than BBF.

Input Arguments: an XML String with the same structure as the GetRecord input argument, plus a string with the name of the required output format.

Output values: String with the full bibliographic record in the required format.

4.5 GetOutputFormats

Output values: an XML string listing the available output formats, with the following structure:

<?xml version="1.0" encoding="UTF-8"?>
<OutTrans>
<translation>
<label>label of the output format</label>
<name>name of the output format</name>
<xslt>uri where the xsl template is stored</xslt>
</translation>
. . .
</OutTrans>
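A client would typically read this list once and then pass the chosen format name to GetFormattedRecord. The Python sketch below parses a hand-written "OutTrans" document; the label, name and template location in the sample are invented.

```python
import xml.etree.ElementTree as ET


def parse_output_formats(xml_text):
    """Return one dict per <translation> element."""
    return [
        {
            "label": t.findtext("label", default="").strip(),
            "name": t.findtext("name", default="").strip(),
            "xslt": t.findtext("xslt", default="").strip(),
        }
        for t in ET.fromstring(xml_text).findall("translation")
    ]


# Hand-written sample (placeholder values).
SAMPLE_FORMATS = """<OutTrans>
  <translation>
    <label>BibTeX entry</label>
    <name>bibtex</name>
    <xslt>http://example.org/templates/bbf2bibtex.xsl</xslt>
  </translation>
</OutTrans>"""

formats = parse_output_formats(SAMPLE_FORMATS)
print([f["name"] for f in formats])
```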

5. Using Bibshare in Microsoft Word

The citation handler for Microsoft (MS) Word extends its functionality in the form of a new menu option plus a new toolbar. The added functions allow the user to search for references (via the citation seeker for Windows), insert citations into the document, and generate the bibliography list using different styles.

Inserting a new citation from the federated collections is done in several steps. First, the Add Reference -> from Bibshare.org option at the BibShare menu must be selected; then, the citation seeker form appears (see Figure 2), giving the user the chance to select references. Second, the user must select the collection(s) to be searched (DBLP in Figure 2) as well as some of the search criteria (in this case author = "Borgman"); then, after clicking the "Search" button, the results of the search for publications registered at the DBLP collection that have an author with name Borgman are shown. To see more information about a specific reference, one can select it and click on the "View info" button. The amount of metadata available varies for the different collections, though almost all of the most important fields from the bibliographic perspective are present.

Third, if this is the reference sought, a citation is inserted into the document by clicking on the "Cite" button. All the inserted citations are used to automatically generate a bibliography list with different bibliographic styles.

Figure 3 shows a document whose bibliography has been generated after inserting several references. Notice that the generated bibliography includes some mistakes; for instance, references to articles only include the journal name or acronym and publication year, and a separation space between them is missing. In another reference, no mention of the conference where the paper was presented appears.

Inconsistencies may exist for several reasons: incomplete bibliographic information in the source collection may cause some fields to be missing, and ill-defined bibliographic styles may also cause the incorrect presentation of the information. In the list in Figure 3, the journal acronym and year appear without separation because the corresponding style did not include it; also, the conference name is missing because the style designer forgot to include it in the set of fields to be shown. The appearance of the acronym instead of the journal name, however, comes from the fact that, in the source collection, the conference and journal names are replaced in the bibliographic records by their acronyms, and separate declarations define the value of every acronym. Making the right conversion automatically is an issue that is independent from the word processor and is currently under consideration.

The Bibshare client for Microsoft Word can be downloaded from the Bibshare Website [6]. It has been implemented using the .NET Framework; specifically, the citation handler is a Word add-in implemented in C#; the same language was used to implement the citation seeker. To support the development of clients for other word processors, we have packaged the citation seeker for Windows as a component in the form of a DLL which can be called by the citation handler of any Windows-based word processor; we have used it in the implementation of the Bibshare client for MS Word, and it can be downloaded separately.

6. Bibshare for LaTeX Users

LaTeX is a word processing system that is widely used in disciplines dealing with large numbers of mathematical formulae, such as Mathematics and Physics, and also by a large number of computer scientists. Unlike MS Word, which follows a WYSIWYG (What You See Is What You Get) approach to word processing, LaTeX runs in batch mode, that is, a writer prepares a manuscript as a plain text document in which different formatting tags are inserted in the appropriate places. Later, the manuscript is processed, producing an output in a special format (the .dvi format) in which all the tags have been replaced by the corresponding formatting operations. The .dvi file is then converted to printable formats (normally PostScript or PDF), leading to a high quality final document. The bibliography utility associated with LaTeX is called BibTeX; it is based on the use of local collections which are stored as plain text files (.bib files) containing bibliographic records in the BibTeX format. A citation is inserted into a document by using the "\cite{id}" tag, where "id" is the identifier of the bibliographic record to insert. Later, at document processing time, BibTeX is run using the document and the .bib file containing the bibliographic collection as input, and producing the document with the "\cite{...}" tags transformed into citations and the reference list generated in the document as output.
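Any tool that prepares the .bib file for a manuscript first needs to know which identifiers the manuscript cites. The Python sketch below collects them from the "\cite" tags of a LaTeX source. It is a simplified illustration: it handles comma-separated keys and an optional note in square brackets, but it ignores commented-out lines and citation commands defined by other packages.

```python
import re

# Matches \cite{...} and \cite[note]{...}; the keys are captured as one group.
CITE = re.compile(r"\\cite\s*(?:\[[^\]]*\])?\s*\{([^}]*)\}")


def cited_keys(tex_source):
    """Return the distinct citation keys in order of first appearance."""
    keys = []
    for group in CITE.findall(tex_source):
        for key in group.split(","):
            key = key.strip()
            if key and key not in keys:
                keys.append(key)
    return keys


sample = r"See \cite{a, b} and \cite[p.~3]{c}; again \cite{a}."
print(cited_keys(sample))
```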

From the Bibshare perspective, the case of LaTeX is more complex, as manuscripts are written in plain text, which means that different authors can use different tools to create documents. The tools range from plain text editors (e.g., notepad, vi and emacs) to more sophisticated environments including syntax-oriented editors (e.g., WinShell [13] and WinEdt [14]). There are two ways to serve LaTeX users in such a heterogeneous context:

7. Conclusions and Further Work

The Service-Oriented Computing paradigm is emerging as an evolution of the Internet computing models that will make the development of complex systems easier. In this article, we have shown how the Service-Oriented approach enabled a general solution for the bibliography management problem. Bibshare exploits the power of Web services to provide a framework for bibliography management in which any word processing system can invoke citation retrieval services. From the user's perspective, this avoids the need to have different collections for different word processors, as well as the need for collection management.

Further work must be done in all the layers of the Bibshare architecture. With regard to collections, we are seeking new bibliographies to be added to the Bibshare framework so that users from different disciplines can become members of the Bibshare federation. From the client side, we encourage developers of word processors to build their clients for Bibshare, as we firmly believe that sharing is the way to make bibliography management an affordable task. Finally, having a unified solution would permit the development of new services such as citation analysis, search logs and others.

8. Acknowledgements

Bibshare is a project funded by Microsoft Research Cambridge under Grant no. 2003-191. The authors thank Dr. Eduardo Mena from the University of Zaragoza, Spain, and Dr. Michael Ley from the University of Trier, Germany and manager of the DBLP collection, for their valuable feedback during the development of the project. Andrea Goytre and Emilio Sánchez made contributions in the first stages of the project. Finally, we thank Stephen Eglen, who helped us to improve the emacs client.

9. References

[5] Goossens, M., Mittelbach, F. and Samarin, A. The LaTeX Companion. Addison-Wesley, 1993.

[7] Call for papers, 2nd International Conference on Services Computing, 2004. <http://www.icsoc.org/>.

[8] Tsalgatidou, A. and Pilioura, T. An Overview of Standards and Related Technology in Web Services. Distributed and Parallel Databases, 12, 135-162, 2002.

D-Lib Magazine
November 2004

Volume 10 Number 11

ISSN 1082-9873