research-article

LlamaFur: learning latent category matrix to find unexpected relations in Wikipedia

Authors:

Corrado Monti

WebSci '16: Proceedings of the 8th ACM Conference on Web Science

Pages 218 - 222

https://doi.org/10.1145/2908131.2908153

Published: 22 May 2016

Abstract

Besides finding trends and unveiling typical patterns, modern information retrieval is increasingly interested in the discovery of serendipitous and surprising information. In this work we focus on finding unexpected links in hyperlinked corpora whose documents are assigned to categories. To this end, we determine a latent category matrix that explains common links, using a highly scalable margin-based online learning algorithm that lets us process graphs with 10⁸ links in less than 10 minutes. We show that our method provides better accuracy than all existing text-based techniques, with higher efficiency and while relying on a much smaller amount of information. It also provides higher precision than standard link prediction, especially at low recall levels; the two methods are in fact shown to be orthogonal to each other and can therefore be fruitfully combined.
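The abstract does not spell out the scoring function or the update rule, so the following is only a minimal sketch of how a margin-based online learner could fit a category-by-category matrix. It assumes a bilinear score (the sum of matrix entries over the category pairs of the two documents) and a passive-aggressive-style hinge update; the names `score`, `pa_update`, `train`, and `rank_unexpected` are hypothetical and not taken from the paper. Links that receive a low score under the learned matrix are the candidates for "unexpected".

```python
from itertools import product


def score(W, cats_src, cats_dst):
    """Bilinear score: sum of W[a][b] over the category pairs of two documents."""
    return sum(W[a][b] for a, b in product(cats_src, cats_dst))


def pa_update(W, cats_src, cats_dst, y):
    """One passive-aggressive-style step (assumed rule, not the paper's).

    y is +1 for an observed link and -1 for a sampled non-link.
    Category collections are assumed to contain no duplicates.
    """
    loss = max(0.0, 1.0 - y * score(W, cats_src, cats_dst))
    # Squared norm of the 0/1 outer product of the two category indicators.
    norm_sq = len(cats_src) * len(cats_dst)
    if loss > 0.0 and norm_sq > 0:
        tau = loss / norm_sq
        for a, b in product(cats_src, cats_dst):
            W[a][b] += tau * y
    return loss


def train(num_categories, examples, epochs=1):
    """Process (cats_src, cats_dst, y) examples one at a time, starting from W = 0."""
    W = [[0.0] * num_categories for _ in range(num_categories)]
    for _ in range(epochs):
        for cats_src, cats_dst, y in examples:
            pa_update(W, cats_src, cats_dst, y)
    return W


def rank_unexpected(W, links):
    """Sort existing links from least to most explained by W."""
    return sorted(links, key=lambda link: score(W, link[0], link[1]))
```

Each update touches only the entries indexed by the categories of the two endpoints, which is one plausible reason an approach of this kind can scale to very large link sets; the actual algorithm and its cost are described in the paper itself.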


Cited By

Monti C, D'Ignazi J, Starnini M, De Francisci Morales G (2023). Evidence of Demographic rather than Ideological Segregation in News Discussion on Reddit. Proceedings of the ACM Web Conference 2023, pages 2777-2786. doi:10.1145/3543507.3583468. Online publication date: 30-Apr-2023.
https://dl.acm.org/doi/10.1145/3543507.3583468
Jatowt A, Hung I, Färber M, Campos R, Yoshikawa M (2021). Exploding TV Sets and Disappointing Laptops: Suggesting Interesting Content in News Archives Based on Surprise Estimation. Advances in Information Retrieval, pages 254-269. doi:10.1007/978-3-030-72113-8_17. Online publication date: 27-Mar-2021.
https://doi.org/10.1007/978-3-030-72113-8_17
Hung I, Färber M, Jatowt A (2018). Towards Recommending Interesting Content in News Archives. Maturity and Innovation in Digital Libraries, pages 142-146. doi:10.1007/978-3-030-04257-8_13. Online publication date: 15-Nov-2018.
https://doi.org/10.1007/978-3-030-04257-8_13


Published In

WebSci '16: Proceedings of the 8th ACM Conference on Web Science

May 2016

392 pages

ISBN: 9781450342087

DOI: 10.1145/2908131

General Chairs: Wolfgang Nejdl (Leibniz University Hannover & L3S Research Center, Germany), Wendy Hall (University of Southampton, UK)

Program Chairs: Paolo Parigi (Stanford University), Steffen Staab (University of Koblenz, Germany)

Copyright © 2016 ACM.


Sponsors

SIGWEB: ACM Special Interest Group on Hypertext, Hypermedia, and Web

Publisher

Association for Computing Machinery

New York, NY, United States


Qualifiers

Research-article

Conference

WebSci '16

Sponsor:

SIGWEB

WebSci '16: ACM Web Science Conference

May 22 - 25, 2016

Hannover, Germany

Acceptance Rates

WebSci '16 Paper Acceptance Rate 13 of 70 submissions, 19%;

Overall Acceptance Rate 245 of 933 submissions, 26%


