A Workflow-Based Approach for Creating Complex Web Wrappers

  • Conference paper
Web Information Systems Engineering - WISE 2008 (WISE 2008)

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 5175)

Abstract

For software programs to access and use the information and services provided by web sources, wrapper programs must be built to provide a “machine-readable” view over them. Although the research literature on web wrappers is vast, the problem of how to specify the internal logic of complex wrappers in a simple, graphical way remains largely ignored. In this paper, we propose a new language for addressing this task. Our approach uses existing work on intelligent web data extraction and automatic web navigation as building blocks, and takes a workflow-based approach to specifying the wrapper control logic. The language features were chosen based on a study of a wide range of real web automation applications from different business areas. We also present the most salient results of that study.

This research was partially supported by the Spanish Ministry of Education and Science under project TSI2005-07730.

Editor information

James Bailey, David Maier, Klaus-Dieter Schewe, Bernhard Thalheim, Xiaoyang Sean Wang

Copyright information

© 2008 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Montoto, P., Pan, A., Raposo, J., Losada, J., Bellas, F., López, J. (2008). A Workflow-Based Approach for Creating Complex Web Wrappers. In: Bailey, J., Maier, D., Schewe, KD., Thalheim, B., Wang, X.S. (eds) Web Information Systems Engineering - WISE 2008. WISE 2008. Lecture Notes in Computer Science, vol 5175. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-85481-4_30

  • DOI: https://doi.org/10.1007/978-3-540-85481-4_30

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-85480-7

  • Online ISBN: 978-3-540-85481-4

  • eBook Packages: Computer Science, Computer Science (R0)
