Hybrid entity clustering using crowds and data

Jongwuk Lee¹,
Hyunsouk Cho¹,
Jin-Woo Park¹,
Young-rok Cha¹,
Seung-won Hwang¹,
Zaiqing Nie² &
Ji-Rong Wen³

Abstract

Query result clustering has attracted considerable attention as a means of providing users with a concise overview of results. However, little research effort has been devoted to organizing query results for entities, which refer to real-world concepts such as people, products, and locations. Entity-level result clustering is more challenging because diverse notions of similarity between entities must be supported across heterogeneous domains; for example, image resolution is an important feature for cameras but not for fruits. To address this challenge, we propose a hybrid relationship clustering algorithm, called Hydra, that uses co-occurrence and numeric features. Hydra captures diverse user perceptions from co-occurrence and disambiguates different senses using feature-based similarity. In addition, we extend Hydra into $\mathsf{Hydra}_{\mathsf{gData}}$, which incorporates different sources, i.e., entity types and crowdsourcing. Experimental results show that the proposed algorithms are effective and efficient on real-life and synthetic datasets.
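
The hybrid idea in the abstract, combining a co-occurrence signal with feature-based similarity, can be illustrated with a small sketch. This is not the Hydra algorithm, whose details are not given on this page. It is a toy example under stated assumptions: the Jaccard co-occurrence score, the distance-based feature score, the mixing weight `alpha`, the `threshold`, the single-link grouping, and all entity data are hypothetical choices made for illustration only.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity of two sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def feature_similarity(x, y):
    """Map the Euclidean distance of two numeric vectors into (0, 1]."""
    dist = sum((p - q) ** 2 for p, q in zip(x, y)) ** 0.5
    return 1.0 / (1.0 + dist)


def hybrid_similarity(e1, e2, alpha=0.5):
    """Weighted mix of co-occurrence and feature similarity (alpha is assumed)."""
    co = jaccard(e1["contexts"], e2["contexts"])
    feat = feature_similarity(e1["features"], e2["features"])
    return alpha * co + (1 - alpha) * feat


def cluster(entities, threshold=0.5, alpha=0.5):
    """Single-link grouping: join entities whose hybrid similarity >= threshold."""
    parent = list(range(len(entities)))

    def find(i):
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(len(entities)), 2):
        if hybrid_similarity(entities[i], entities[j], alpha) >= threshold:
            parent[find(i)] = find(j)

    groups = {}
    for i, entity in enumerate(entities):
        groups.setdefault(find(i), []).append(entity["name"])
    return sorted(sorted(names) for names in groups.values())


# Hypothetical data: "contexts" are IDs of queries in which an entity appeared,
# "features" are numeric attributes already scaled to [0, 1].
entities = [
    {"name": "camera_a", "contexts": {"q1", "q2", "q3"}, "features": [0.9, 0.8]},
    {"name": "camera_b", "contexts": {"q1", "q2"}, "features": [0.8, 0.9]},
    {"name": "apple", "contexts": {"q4", "q5"}, "features": [0.1, 0.2]},
    {"name": "banana", "contexts": {"q4", "q5", "q6"}, "features": [0.2, 0.1]},
]

print(cluster(entities))
```

With these made-up inputs, the two cameras share query contexts and have close feature vectors, so they fall into one group, while the two fruits form another. Lowering `alpha` shifts weight toward the numeric features, and raising it shifts weight toward co-occurrence.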

Notes

www.freebase.com.
www.mpi-inf.mpg.de/yago-naga/yago.
We adopt the refined $R_{ij}$ from [53] to discourage the extreme case of merging two distant clusters with a large size difference, i.e., $|C_{i_1}| \gg |C_{i_2}|$. More details on this refined notion can be found in [53].
These entity types in Table 1 are collected from Freebase (www.freebase.com).

References

Aggarwal, C.C.: A human-computer cooperative system for effective high dimensional clustering. In: KDD (2001)
Aggarwal, C.C., Wolf, J.L., Yu, P.S., Procopiuc, C., Park, J.S.: Fast algorithms for projected clustering. In: SIGMOD (1999)
Aggarwal, C.C., Yu, P.S.: Finding generalized projected clusters in high dimensional spaces. In: SIGMOD (2000)
Agrawal, R., Gehrke, J., Gunopulos, D., Raghavan, P.: Automatic subspace clustering of high-dimensional data for data mining applications. In: SIGMOD (1998)
Agrawal, R., Gollapudi, S., Halverson, A., Ieong, S.: Diversifying search results. In: WSDM, pp. 5–14 (2009)
Ananthakrishna, R., Chaudhuri, S., Ganti, V.: Eliminating fuzzy duplicates in data warehouses. In: VLDB, pp. 586–597 (2002)
Arasu, A., Götz, M., Kaushik, R.: On active learning of record matching packages. In: SIGMOD Conference, pp. 783–794 (2010)
Basu, S., Bilenko, M., Mooney, R.J.: A probabilistic framework for semi-supervised clustering. In: KDD, pp. 59–68 (2004)
Bazzanella, B., Stoermer, H., Bouquet, P.: Entity type disambiguation in user queries. JIKM 10(3), 209–224 (2011)
Bilenko, M., Basu, S., Sahami, M.: Adaptive product normalization: Using online learning for record linkage in comparison shopping. In: ICDM (2005)
Bouquet, P., Palpanas, T., Stoermer, H., Vignolo, M.: A conceptual model for a web-scale entity name system. In: ASWC, pp. 46–60 (2009)
Carterette, B., Chandar, P.: Probabilistic models of ranking novel documents for faceted topic retrieval. In: CIKM, pp. 1287–1296 (2009)
Cheng, C.-H., Fu, A.W., Zhang, Y.: Entropy-based subspace clustering for mining numerical data. In: KDD (1999)
Cheng, D., Kannan, R., Vempala, S., Wang, G.: A divide-merge methodology for clustering. In: TODS (2005)
Chierichetti, F., Kumar, R., Pandey, S., Vassilvitskii, S.: Finding the jaccard median. In: SODA, pp. 293–311 (2010)
Cohen, W.W.: Integration of heterogeneous databases without common domains using queries based on textual similarity. In: SIGMOD, pp. 201–212 (1998)
Cui, Y., Hasler, N., Thormählen, T., Seidel, H.-P.: Scale invariant feature transform with irregular orientation histogram binning. In: ICIAR, pp. 258–267 (2009)
Doan, A., Ramakrishnan, R., Halevy, A.Y.: Crowdsourcing systems on the world-wide web. Commun. ACM 54(4), 86–96 (2011)
Franklin, M.J., Kossmann, D., Kraska, T., Ramesh, S., Xin, R.: CrowdDB: answering queries with crowdsourcing. In: SIGMOD, pp. 61–72 (2011)
Goil, S., Nagesh, H., Choudhary, A.: Mafia: efficient and scalable subspace clustering for very large data sets. Technical Report, Northwestern University (1999)
Gomes, R., Welinder, P., Krause, A., Perona, P.: Crowdclustering. In: NIPS, pp. 558–566 (2011)
Hearst, M.A., Pedersen, J.O.: Re-examining the cluster hypothesis: Scatter/Gather on retrieval results. In: SIGIR (1996)
Jain, A., Pennacchiotti, M.: Open entity extraction from web search query logs. In: COLING, pp. 510–518 (2010)
Jang, M., Park, J.-W., Hwang, S.: Predictive mining of comparable entities from the web. In: AAAI (2012)
Ji, X., Xu, W., Zhu, S.: Document clustering with prior knowledge. In: SIGIR (2006)
Jindal, N., Liu, B.: Identifying comparative sentences in text documents. In: SIGIR, pp. 244–251 (2006)
Lee, J., Hwang, S., Nie, Z., Wen, J.-R.: Query result clustering for object-level search. In: KDD, pp. 1205–1214 (2009)
Lee, J., Hwang, S., Nie, Z., Wen, J.-R.: Navigation system for product search. In: ICDE, pp. 1113–1116 (2010)
Lee, T., Wang, Z., Wang, H., Hwang, S.: Web scale taxonomy cleansing. PVLDB 4(12), 1295–1306 (2011)
Li, S., Lin, C.-Y., Song, Y.-I., Li, Z.: Comparable entity mining from comparative questions. In: ACL, pp. 650–658 (2010)
Liu, Y., Li, W., Lin, Y., Jing, L.: Spectral geometry for simultaneously clustering and ranking query search results. In: SIGIR (2008)
Marcus, A., Wu, E., Madden, S., Miller, R.C.: Crowdsourced databases: Query processing with people. In: CIDR, pp. 211–214 (2011)
Mecca, G., Raunich, S., Pappalardo, A.: A new algorithm for clustering search results. Data Knowl. Eng. 62(3), 504–522 (2007)
Nie, Z., Ma, Y., Shi, S., Wen, J.-R., Ma, W.-Y.: Web object retrieval. In: WWW (2007)
Nie, Z., Wen, J.-R., Ma, W.-Y.: Object-level vertical search. In: CIDR (2007)
Nie, Z., Wen, J.-R., Ma, W.-Y.: Statistical entity extraction from the web. Proc. IEEE 100(9), 2675–2687 (2012)
Nie, Z., Zhang, Y., Wen, J.-R., Ma, W.-Y.: Object-level ranking: bringing order to web objects. In: WWW (2005)
Parameswaran, A.G., Polyzotis, N.: Answering queries using humans, algorithms and databases. In: CIDR, pp. 160–166 (2011)
Parsons, L., Haque, E., Liu, H.: Subspace clustering for high dimensional data: a review. SIGKDD Newsletter 6(1), 90–105 (2004)
Patrikainen, A., Meila, M.: Comparing subspace clusterings. TKDE 18(7), 902–916 (2006)
Radlinski, F., Dumais, S.T.: Improving personalized web search using result diversification. In: SIGIR, pp. 691–692 (2006)
Scripps, J., Tan, P.-N.: Clustering in the presence of bridge-nodes. In: SDM (2006)
Selke, J., Lofi, C., Balke, W.-T.: Pushing the boundaries of crowd-enabled databases with query-driven schema expansion. PVLDB 5(6), 538–549 (2012)
Song, Y., Wang, H., Wang, Z., Li, H., Chen, W.: Short text conceptualization using a probabilistic knowledgebase. In: IJCAI, pp. 2330–2336 (2011)
Wagstaff, K., Cardie, C., Rogers, S., Schroedl, S.: Constrained k-means clustering with background knowledge. In: ICML (2001)
Wang, J., Kraska, T., Franklin, M.J., Feng, J.: CrowdER: crowdsourcing entity resolution. PVLDB 5(11), 1483–1494 (2012)
Wang, X., Zhai, C.: Learn from web search logs to organize search results. In: SIGIR (2007)
Wang, X.-J., Ma, W.-Y., He, Q.-C., Li, X.: Grouping web image search result. In: ACM Multimedia, pp. 436–439 (2004)
Whang, S.E., Benjelloun, O., Garcia-Molina, H.: Generic entity resolution with negative rules. VLDB J. 18(6), 1261–1277 (2009)
Whang, S.E., Lofgren, P., Garcia-Molina, H.: Question selection for crowd entity resolution. In: PVLDB (2013)
Woo, K.-G., Lee, J.-H., Kim, M.-H., Lee, Y.-J.: FINDIT: a fast intelligent subspace clustering algorithm using dimension voting. Inform. Softw. Technol. 46(4), 255–271 (2004)
Xu, W., Liu, X., Gong, Y.: Document clustering based on non-negative matrix factorization. In: SIGIR (2003)
Yip, K.Y., Cheung, D.W., Ng, M.K.: HARP: A practical projected clustering algorithm. TKDE 16(11), 1387–1397 (2004)
Yip, K.Y., Cheung, D.W., Ng, M.K.: On discovery of extremely low-dimensional clusters using semi-supervised projected clustering. In: ICDE (2005)
Zamir, O., Etzioni, O.: Web document clustering: a feasibility demonstration. In: SIGIR (1998)
Zeng, H.-J., He, Q.-C., Chen, Z., Ma, W.-Y., Ma, J.: Learning to cluster web search results. In: SIGIR (2004)
Zhu, X., Ghahramani, Z., Lafferty, J.D.: Semi-supervised learning using gaussian fields and harmonic functions. In: ICML, pp. 912–919 (2003)

Acknowledgments

This research was supported by the Ministry of Knowledge Economy (MKE), Korea, and Microsoft Research, under the IT/SW Creative research program supervised by the NIPA (National IT Industry Promotion Agency) (NIPA-2012-H0503-12-1036).

Author information

Authors and Affiliations

Pohang University of Science and Technology (POSTECH), Pohang, Republic of Korea
Jongwuk Lee, Hyunsouk Cho, Jin-Woo Park, Young-rok Cha & Seung-won Hwang
Microsoft Research Asia, Beijing, People’s Republic of China
Zaiqing Nie
Renmin University of China, Beijing, People’s Republic of China
Ji-Rong Wen

Corresponding author

Correspondence to Seung-won Hwang.

About this article

Cite this article

Lee, J., Cho, H., Park, JW. et al. Hybrid entity clustering using crowds and data. The VLDB Journal 22, 711–726 (2013). https://doi.org/10.1007/s00778-013-0328-8

Received: 26 September 2012
Revised: 08 June 2013
Accepted: 01 July 2013
Published: 13 August 2013
Issue Date: October 2013
DOI: https://doi.org/10.1007/s00778-013-0328-8
