
Automated Categorization of Onion Sites for Analyzing the Darkweb Ecosystem

Published: 13 August 2017

Abstract

Onion sites on the darkweb operate using the Tor Hidden Service (HS) protocol to shield their locations on the Internet, which (among other features) enables these sites to host malicious and illegal content while being resistant to legal action and seizure. Identifying and monitoring such illicit sites in the darkweb is of high relevance to the Computer Security and Law Enforcement communities. We have developed an automated infrastructure that crawls and indexes content from onion sites into a large-scale data repository, called LIGHTS, with over 100M pages. In this paper we describe Automated Tool for Onion Labeling (ATOL), a novel scalable analysis service developed to conduct a thematic assessment of the content of onion sites in the LIGHTS repository. ATOL has three core components -- (a) a novel keyword discovery mechanism (ATOLKeyword) which extends analyst-provided keywords for different categories by suggesting new descriptive and discriminative keywords that are relevant for the categories; (b) a classification framework (ATOLClassify) that uses the discovered keywords to map onion site content to a set of categories when sufficient labeled data is available; (c) a clustering framework (ATOLCluster) that can leverage information from multiple external heterogeneous knowledge sources, ranging from domain expertise to Bitcoin transaction data, to categorize onion content in the absence of sufficient supervised data. The paper presents empirical results of ATOL on onion datasets derived from the LIGHTS repository, and additionally benchmarks ATOL's algorithms on the publicly available 20 Newsgroups dataset to demonstrate the reproducibility of its results. On the LIGHTS dataset, ATOLClassify gives a 12% performance gain over an analyst-provided baseline, while ATOLCluster gives a 7% improvement over state-of-the-art semi-supervised clustering algorithms. We also discuss how ATOL has been deployed and externally evaluated, as part of the LIGHTS system.
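
The abstract describes keyword discovery only at a high level: analyst-provided keywords for a category are extended with new terms that are descriptive and discriminative for it. The sketch below illustrates that general idea and should not be read as ATOLKeyword's actual algorithm, which the abstract does not specify. It assumes a simple scoring rule (smoothed log-odds of a term's document frequency in seed-matched pages versus all other pages); the function names, the tokenizer, and the toy documents are all hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase a page and split it into alphabetic tokens (assumed tokenizer)."""
    return re.findall(r"[a-z]+", text.lower())


def suggest_keywords(docs, seeds, top_k=3):
    """Rank non-seed terms by how strongly they favor seed-matched documents.

    Assumption: a document belongs to the category if it contains any seed
    keyword. Each candidate term is scored by the add-one smoothed log-odds
    of its document frequency inside versus outside that set.
    """
    seed_set = set(seeds)
    token_sets = [set(tokenize(d)) for d in docs]
    matched = [t for t in token_sets if t & seed_set]
    others = [t for t in token_sets if not (t & seed_set)]

    in_df = Counter(term for t in matched for term in t)
    out_df = Counter(term for t in others for term in t)

    scores = {}
    for term, n_in in in_df.items():
        if term in seed_set:
            continue  # only suggest keywords the analyst did not already give
        p_in = (n_in + 1) / (len(matched) + 2)
        p_out = (out_df[term] + 1) / (len(others) + 2)
        scores[term] = math.log(p_in / p_out)

    # Highest score first; ties broken alphabetically for determinism.
    return sorted(scores, key=lambda t: (-scores[t], t))[:top_k]


# Toy corpus (invented for illustration, not drawn from LIGHTS).
example_docs = [
    "buy bitcoin wallet escrow market",
    "market escrow vendor bitcoin",
    "forum discussion about privacy tools",
    "privacy blog and forum archive",
]
print(suggest_keywords(example_docs, ["bitcoin"]))
```

Terms that co-occur with the seed across many matched pages and rarely appear elsewhere rise to the top, which is one plausible reading of "descriptive and discriminative"; a production system would likely need stop-word handling and a much larger corpus.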


Published In

KDD '17: Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining
August 2017
2240 pages
ISBN:9781450348874
DOI:10.1145/3097983

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. classification
  2. clustering
  3. darkweb
  4. keyword discovery
  5. onion sites
  6. semi-supervised learning

Qualifiers

  • Research-article

Conference

KDD '17

Acceptance Rates

KDD '17 Paper Acceptance Rate 64 of 748 submissions, 9%;
Overall Acceptance Rate 1,133 of 8,635 submissions, 13%
