DOI: 10.1145/3209978.3210158
short-paper

Do Not Pull My Data for Resale: Protecting Data Providers Using Data Retrieval Pattern Analysis

Published: 27 June 2018

Abstract

Data providers contribute substantially to many fields, such as finance, economics, and academia, by offering both web-based and API-based query services for specialized data. Some data users are resellers who abuse the query APIs to retrieve the data and resell it for profit, which harms the data provider's interests and infringes copyright. In this work, we define the "anti-data-reselling" problem and propose a systematic method that combines feature engineering and machine learning models to address it. We apply our method to a real query log of over 9,000 users, with limited labels, provided by a large financial data provider, and obtain reasonable results, insightful observations, and real deployments.
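
The abstract does not spell out which features or models are used, so the following is only an illustrative sketch of the general idea: derive per-user features from a query log, then flag users whose retrieval pattern is an outlier. The log layout `(user_id, timestamp, item_id)`, the feature names, the function names, and the threshold are all assumptions made for this example, and the median/MAD rule stands in for whatever models the paper actually evaluates.

```python
from collections import defaultdict
from statistics import median


def user_features(log):
    """Aggregate a query log into per-user features.

    `log` is an iterable of (user_id, timestamp_seconds, item_id) tuples;
    this layout is an assumption, not the provider's real schema.
    """
    by_user = defaultdict(list)
    for user, ts, item in log:
        by_user[user].append((ts, item))

    feats = {}
    for user, events in by_user.items():
        items = {item for _, item in events}
        feats[user] = {
            "n_queries": len(events),
            "n_distinct": len(items),
            # Close to 1.0 when a user almost never re-queries an item.
            "distinct_ratio": len(items) / len(events),
        }
    return feats


def robust_scores(values):
    """Robust z-scores based on the median and the median absolute deviation."""
    values = list(values)
    med = median(values)
    mad = median(abs(v - med) for v in values)
    scale = 1.4826 * mad
    if scale == 0:
        scale = 1.0  # all values (nearly) identical; avoid division by zero
    return [(v - med) / scale for v in values]


def flag_suspects(log, threshold=3.5):
    """Return users whose distinct-item count is unusually high.

    The threshold of 3.5 is a common rule of thumb, chosen here only
    for illustration.
    """
    feats = user_features(log)
    users = sorted(feats)
    scores = robust_scores(feats[u]["n_distinct"] for u in users)
    return [u for u, s in zip(users, scores) if s > threshold]
```

In practice a single count is easy to evade, which is one reason to combine several engineered features with a learned model, and to use the limited labels to validate what the detector flags.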

Published In

SIGIR '18: The 41st International ACM SIGIR Conference on Research & Development in Information Retrieval
June 2018
1509 pages
ISBN: 9781450356572
DOI: 10.1145/3209978

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. anti-data-reselling
  2. anti-scraping
  3. behavioral analysis
  4. data retrieval
  5. outlier detection

Qualifiers

  • Short-paper

Acceptance Rates

SIGIR '18 Paper Acceptance Rate 86 of 409 submissions, 21%;
Overall Acceptance Rate 792 of 3,983 submissions, 20%
