Multidimensional Content eXploration

Published: 01 August 2008

Abstract

Content Management Systems (CMS) store enterprise data such as insurance claims, insurance policies, legal documents, and patent applications, as well as archival data, as in the case of digital libraries. Search over content supports information retrieval but does not give users much insight into the data. A more analytical view is needed, through aggregations, groupings, trends, pivot tables, charts, and so on. Multidimensional Content eXploration (MCX) is about effectively analyzing and exploring large amounts of content by combining keyword search with OLAP-style aggregation, navigation, and reporting. We focus on unstructured data or, more generally, documents or content with limited metadata, as typically encountered in CMS. We formally present how CMS content and metadata should be organized in a well-defined multidimensional structure, so that sophisticated queries can be expressed and evaluated. The CMS metadata provide traditional OLAP static dimensions, which are combined with dynamic dimensions discovered from the analyzed keyword search result, as well as with measures for document scores based on the link structure between the documents. In addition, we provide means for multidimensional content exploration through traditional OLAP roll-up/drill-down operations on the static and dynamic dimensions, solutions for multi-cube analysis, and dynamic navigation of the content. We present our prototype, called DBPubs, which stores research publications as documents that can be searched and, most importantly, analyzed and explored. Finally, we present experimental results on the efficiency and effectiveness of our approach.
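
The combination of keyword search with OLAP-style aggregation described above can be illustrated with a minimal Python sketch. It is not the DBPubs implementation: the toy corpus, the precomputed `score` field (standing in for a link-based document score), the stopword list, and the function names `keyword_search`, `dynamic_dimension`, and `aggregate` are all assumptions made for illustration. Static dimensions come from document metadata (`venue`, `year`), while a dynamic dimension is approximated here by the most frequent non-query terms in the search result.

```python
from collections import Counter, defaultdict

# Hypothetical toy corpus: text plus static metadata and an assumed,
# precomputed link-based score for each document.
DOCS = [
    {"id": 1, "venue": "VLDB", "year": 2004, "score": 0.9,
     "text": "keyword search in databases using authority ranking"},
    {"id": 2, "venue": "SIGMOD", "year": 2005, "score": 0.5,
     "text": "extending xquery for analytics and aggregation"},
    {"id": 3, "venue": "VLDB", "year": 2007, "score": 0.7,
     "text": "online analysis of text streams with keyword search"},
    {"id": 4, "venue": "SIGMOD", "year": 2007, "score": 0.6,
     "text": "keyword driven analytical processing over text"},
]

STOPWORDS = {"in", "of", "for", "and", "with", "using", "over"}


def keyword_search(docs, query):
    """Return the documents that contain every query term."""
    terms = query.lower().split()
    return [d for d in docs if all(t in d["text"].split() for t in terms)]


def dynamic_dimension(result, query, top_n=3):
    """Pick the top_n terms by document frequency within the result set.

    Query terms and stopwords are excluded; ties are broken alphabetically.
    """
    query_terms = set(query.lower().split())
    counts = Counter()
    for d in result:
        for term in set(d["text"].split()) - STOPWORDS - query_terms:
            counts[term] += 1
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [term for term, _ in ranked[:top_n]]


def aggregate(result, dims):
    """Group the result by static dimensions; count documents and sum scores."""
    cells = defaultdict(lambda: {"count": 0, "score": 0.0})
    for d in result:
        key = tuple(d[dim] for dim in dims)
        cells[key]["count"] += 1
        cells[key]["score"] += d["score"]
    return dict(cells)


if __name__ == "__main__":
    result = keyword_search(DOCS, "keyword")
    print(aggregate(result, ["venue", "year"]))  # drill-down: venue and year
    print(aggregate(result, ["venue"]))          # roll-up: venue only
    print(dynamic_dimension(result, "keyword"))  # terms found in the result
```

Calling `aggregate` with fewer dimensions corresponds to a roll-up, and adding a dimension corresponds to a drill-down. A real system would replace the linear scan with an inverted index and would derive the score from the link structure between documents.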


Published In

Proceedings of the VLDB Endowment, Volume 1, Issue 1
August 2008
1216 pages

Publisher

VLDB Endowment

