DOI: 10.1145/3052973.3053017
research-article

Gossip: Automatically Identifying Malicious Domains from Mailing List Discussions

Published: 02 April 2017

Abstract

Domain names play a critical role in cybercrime because they identify hosts that serve malicious content (such as malware, Trojan binaries, or malicious scripts), operate as command-and-control servers, or carry out some other role in the malicious network infrastructure. To defend against Internet attacks and scams, operators widely use blacklisting to detect and block malicious domain names and IP addresses. Existing blacklists are typically generated by crawling suspicious domains, manually or automatically analyzing malware, and collecting information from honeypots and intrusion detection systems. Unfortunately, such blacklists are difficult to maintain and are often slow to respond to new attacks. Security experts set up and join mailing lists to discuss and share threat intelligence, and these discussions offer a better chance of identifying emerging malicious activity. In this paper, we design Gossip, a novel approach that automatically detects malicious domains by analyzing discussions in technical mailing lists (particularly on security-related topics) using natural language processing and machine learning techniques. We identify a set of effective features, extracted from email threads, the users participating in the discussions, and content keywords, to infer malicious domains from mailing lists without the need to actually crawl the suspect websites. Our results show that Gossip achieves high detection accuracy. Moreover, our system often detects malicious domains days or weeks earlier than existing public blacklists.
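To make the idea of thread, user, and keyword features concrete, here is a minimal Python sketch. It is not Gossip's implementation: the abstract does not specify the actual feature set or classifier, so the regular expression, the defanging rules, the message fields (`thread`, `sender`, `body`), and the function names `extract_domains` and `thread_features` are all assumptions made for illustration.

```python
import re
from collections import defaultdict

# Simplified pattern for domain-like tokens (an assumption, not Gossip's extractor).
# It will also match things like file names, so a real system would need to
# validate candidates, for example against a public suffix list.
DOMAIN_RE = re.compile(
    r"\b(?:[a-z0-9](?:[a-z0-9-]{0,61}[a-z0-9])?\.)+[a-z]{2,}\b",
    re.IGNORECASE,
)


def refang(text):
    """Undo two common 'defanging' conventions used when sharing indicators."""
    return text.replace("[.]", ".").replace("hxxp", "http")


def extract_domains(text):
    """Return the set of lowercase domain-like tokens found in the text."""
    return {m.group(0).lower() for m in DOMAIN_RE.finditer(refang(text))}


def thread_features(messages, keywords):
    """Aggregate per-domain counts over a list of messages.

    Each message is a dict with hypothetical keys 'thread', 'sender', 'body'.
    Returns {domain: {'mentions', 'n_threads', 'n_senders', 'keyword_hits'}}.
    """
    stats = defaultdict(
        lambda: {"threads": set(), "senders": set(), "mentions": 0, "keyword_hits": 0}
    )
    for msg in messages:
        body = msg["body"].lower()
        # Count keyword occurrences in the message that mentions the domain.
        hits = sum(body.count(k.lower()) for k in keywords)
        for domain in extract_domains(msg["body"]):
            entry = stats[domain]
            entry["threads"].add(msg["thread"])
            entry["senders"].add(msg["sender"])
            entry["mentions"] += 1
            entry["keyword_hits"] += hits
    return {
        domain: {
            "mentions": entry["mentions"],
            "n_threads": len(entry["threads"]),
            "n_senders": len(entry["senders"]),
            "keyword_hits": entry["keyword_hits"],
        }
        for domain, entry in stats.items()
    }
```

Calling `thread_features(messages, ["malware", "phishing"])` on a list of parsed messages yields one small feature dictionary per candidate domain. Such vectors could then be passed to any supervised classifier; which features and which classifier work well is exactly what the paper evaluates, and this sketch makes no claim about that.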

    Published In

    ASIA CCS '17: Proceedings of the 2017 ACM on Asia Conference on Computer and Communications Security
    April 2017
    952 pages
    ISBN:9781450349444
    DOI:10.1145/3052973
    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. blacklists
    2. malware detection
    3. natural language processing

    Qualifiers

    • Research-article

    Conference

    ASIA CCS '17

    Acceptance Rates

    ASIA CCS '17 Paper Acceptance Rate 67 of 359 submissions, 19%;
    Overall Acceptance Rate 418 of 2,322 submissions, 18%
