How to build a WebFountain: An architecture for very large-scale text analytics

Published: 01 January 2004

Abstract

WebFountain is a platform for very large-scale text analytics applications. The platform allows uniform access to a wide variety of sources, scalable system-managed deployment of a variety of document-level "augmenters" and corpus-level "miners," and finally creation of an extensible set of hosted Web services containing information that drives end-user applications. Analytical components can be authored remotely by partners using a collection of Web service APIs (application programming interfaces). The system is operational and supports live customers. This paper surveys the high-level decisions made in creating such a system.

Published In

IBM Systems Journal, Volume 43, Issue 1
January 2004
199 pages

Publisher

IBM Corp.

United States