Customizing information capture and access

Published: 01 January 1997

Abstract

This article presents a customizable architecture for software agents that capture and access information in large, heterogeneous, distributed electronic repositories. The key idea is to exploit underlying structure at various levels of granularity to build high-level indices with task-specific interpretations. Information agents construct such indices and are configured as a network of reusable modules called structure detectors and segmenters. We illustrate our architecture with the design and implementation of smart information filters in two contexts: retrieving stock market data from Internet newsgroups and retrieving technical reports from Internet FTP sites.
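The abstract describes agents configured as a network of reusable segmenters and structure detectors that build an index. The sketch below is one minimal way to picture that composition; it is not the authors' implementation. All names (`paragraph_segmenter`, `looks_tabular`, `build_index`) and the whitespace heuristic are hypothetical and chosen only for illustration.

```python
import re
from typing import Callable, Dict, List

# Hypothetical type aliases: a segmenter splits text into pieces,
# a detector decides whether a piece has some structure of interest.
Segmenter = Callable[[str], List[str]]
Detector = Callable[[str], bool]


def paragraph_segmenter(text: str) -> List[str]:
    """Split text into blocks separated by blank lines."""
    blocks: List[str] = []
    current: List[str] = []
    for line in text.splitlines():
        if line.strip():
            current.append(line)
        elif current:
            blocks.append("\n".join(current))
            current = []
    if current:
        blocks.append("\n".join(current))
    return blocks


def looks_tabular(block: str) -> bool:
    """Toy detector (an assumption, not the paper's algorithm):
    a block is table-like if it has at least two lines and every
    line contains a gap of two or more spaces between fields."""
    lines = block.splitlines()
    if len(lines) < 2:
        return False
    gap = re.compile(r"(?<=\S) {2,}(?=\S)")
    return all(gap.search(line) for line in lines)


def build_index(text: str, segmenter: Segmenter,
                detectors: Dict[str, Detector]) -> Dict[str, List[str]]:
    """Run every detector over every segment and group the hits by label."""
    index: Dict[str, List[str]] = {name: [] for name in detectors}
    for segment in segmenter(text):
        for name, detect in detectors.items():
            if detect(segment):
                index[name].append(segment)
    return index


if __name__ == "__main__":
    sample = "Shares rose today on strong earnings.\n\nIBM   101.5\nDEC   42.0\n"
    print(build_index(sample, paragraph_segmenter, {"table": looks_tabular}))
```

Because segmenters and detectors are plain callables here, a different task could swap in another segmenter or add detectors without touching `build_index`, which is the kind of reuse the abstract points to.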

Reviews

Richard S. Marcus

The authors report on their development and evaluation of software agents that identify structural elements in documents from which useful information can be data mined and retrieved. The primary example is an algorithm for recognizing tables. Remarkably good results are reported for the analysis of 711 news items in a Usenet newsgroup on the computer hardware business. Of the 151 tables in the text, the algorithm failed to identify only 4, with a like number of pieces of text wrongly identified as tables—that is, it had recall and precision figures of about 97 percent. These results were achieved by selecting optimizing values for three parameters relating to horizontal and vertical spacing and variations from nominal values for these parameters in tabular formats. While some “sensitivity” analysis with respect to these parameters is discussed, it is not clear what kind of average performance can be expected if the parameters are held fixed over a variety of text samples. The authors discuss a variety of issues, such as how to apply the table identification algorithm to actual information access tasks; the appropriate level of specificity and context for development of the algorithms and how to customize the software for more particular cases; and how to integrate these structural techniques into more standard word-based techniques. The authors are to be congratulated for developing table identification algorithms that appear to be impressively effective and for raising issues related to the structural analysis of text. 
That there is much work to do before such issues are resolved is attested to by the fact that the following (additional) questions remain unanswered: After automatic identification of tables, how well could the software identify the categories of data in the columns and rows? How much do the structural algorithms improve performance in standard tasks such as document retrieval? How much effort should be given to this kind of development of tools for structural analysis of messy, heterogeneous text versus the encouragement of standardization of text representation, for example by encouraging the use of tags for HTML and other standard schemes for identifying structures such as tables and their columns and rows, measurement units, and such?
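The recall and precision figures quoted in the review can be checked with a few lines of arithmetic. The sketch below assumes that "a like number" of false identifications means exactly 4; the function name `precision_recall` is hypothetical.

```python
def precision_recall(true_positives: int, false_positives: int,
                     false_negatives: int) -> tuple[float, float]:
    """Return (precision, recall) from raw detection counts."""
    precision = true_positives / (true_positives + false_positives)
    recall = true_positives / (true_positives + false_negatives)
    return precision, recall


total_tables = 151   # tables present in the 711 news items (from the review)
missed = 4           # tables the algorithm failed to identify (from the review)
false_alarms = 4     # assumption: "a like number" is read as exactly 4

found = total_tables - missed  # correctly identified tables
precision, recall = precision_recall(found, false_alarms, missed)
print(f"precision={precision:.2%} recall={recall:.2%}")
```

Under that assumption both ratios come out to 147/151, consistent with the "about 97 percent" stated in the review.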

Published In

ACM Transactions on Information Systems  Volume 15, Issue 1
Jan. 1997
101 pages
ISSN:1046-8188
EISSN:1558-2868
DOI:10.1145/239041
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from ACM.

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. information gathering
  2. software agents
  3. table recognition

Qualifiers

  • Article

Cited By

  • (2020) Analysis and research of news gathering and editing process based on data Mining. 2020 2nd International Conference on Applied Machine Learning (ICAML), 220-223. DOI: 10.1109/ICAML51583.2020.00053. Online publication date: Oct-2020
  • (2008) Service oriented architecture for financial customer relationship management. Proceedings of the second international conference on Distributed event-based systems, 301-304. DOI: 10.1145/1385989.1386027. Online publication date: 1-Jul-2008
  • (2006) Table-processing paradigms: a research survey. International Journal of Document Analysis and Recognition (IJDAR) 8:2-3, 66-86. DOI: 10.1007/s10032-006-0017-x. Online publication date: 9-May-2006
  • (2005) Extraction of Keyterms by Simple Text Mining for Business Information Retrieval. Proceedings of the IEEE International Conference on e-Business Engineering, 332-339. DOI: 10.1109/ICEBE.2005.66. Online publication date: 12-Oct-2005
  • (2005) Information retrieval, information structure, and information agents. Intelligent Hypertext, 145-182. DOI: 10.1007/BFb0023964. Online publication date: 10-Jun-2005
  • (2004) Three-tier multi-agent architecture for asset management consultant. IEEE International Conference on e-Technology, e-Commerce and e-Service, 2004. EEE '04. 2004, 173-176. DOI: 10.1109/EEE.2004.1287305. Online publication date: 2004
  • (2003) A lightweight tool for easy Web site navigation. Proceedings of the 7th International Conference on Properties and Applications of Dielectric Materials (Cat. No.03CH37417), 134-143. DOI: 10.1109/WISE.2003.1254477. Online publication date: 2003
  • (2003) Personalizing Interactions with Information Systems, 323-382. DOI: 10.1016/S0065-2458(03)57007-3. Online publication date: 2003
  • (2002) A multi-agent decision support system for stock trading. IEEE Network: The Magazine of Global Internetworking 16:1, 20-27. DOI: 10.1109/65.980541. Online publication date: 1-Jan-2002
  • (2001) Information and knowledge exchange in a multi-agent system for stock trading. 2001 Enterprise Networking, Applications and Services Conference Proceedings.. EntNet@SUPERCOMM2001 (Cat. No.01EX543), 47-55. DOI: 10.1109/ENTNET.2001.981989. Online publication date: 2001
