DOI: 10.1145/2911451.2914677
short-paper

ArabicWeb16: A New Crawl for Today's Arabic Web

Published: 07 July 2016

Abstract

Web crawls provide valuable snapshots of the Web that enable a wide variety of research, be it distributional analysis to characterize Web properties or use of language, content analysis in social science, or Information Retrieval (IR) research to develop and evaluate effective search algorithms. While many English-centric Web crawls exist, public Arabic Web crawls are quite limited, which constrains research and development. To remedy this, we present ArabicWeb16, a new public Web crawl of roughly 150M Arabic Web pages with significant coverage of dialectal Arabic as well as Modern Standard Arabic. For IR researchers, we expect ArabicWeb16 to support various research areas: ad-hoc search, question answering, filtering, cross-dialect search, dialect detection, entity search, blog search, and spam detection. Combined use with a separate Arabic Twitter dataset that we are also collecting may provide further value.

    Published In

    SIGIR '16: Proceedings of the 39th International ACM SIGIR conference on Research and Development in Information Retrieval
    July 2016
    1296 pages
    ISBN:9781450340694
    DOI:10.1145/2911451
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. arabic search
    2. evaluation
    3. multi-dialect
    4. web collection

    Qualifiers

    • Short-paper

    Conference

    SIGIR '16
    Acceptance Rates

    SIGIR '16 Paper Acceptance Rate 62 of 341 submissions, 18%;
    Overall Acceptance Rate 792 of 3,983 submissions, 20%

    Cited By

    • (2024) Kashif: A Chrome Extension for Classifying Arabic Content on Web Pages Using Machine Learning. Applied Sciences, 14:20 (9222). DOI: 10.3390/app14209222. Online publication date: 11-Oct-2024
    • (2023) Challenges and Progress in Constructing Arabic Dialect Corpora and Linguistic tools: A Focus on Moroccan and Tunisian Dialects. 2023 7th IEEE Congress on Information Science and Technology (CiSt) (293-298). DOI: 10.1109/CiSt56084.2023.10410009. Online publication date: 16-Dec-2023
    • (2022) An effective approach for Arabic document classification using machine learning. Global Transitions Proceedings, 3:1 (267-271). DOI: 10.1016/j.gltp.2022.03.003. Online publication date: Jun-2022
    • (2021) Systematic Literature Review of Dialectal Arabic: Identification and Detection. IEEE Access, 9 (31010-31042). DOI: 10.1109/ACCESS.2021.3059504. Online publication date: 2021
    • (2021) Corpulyzer: A Novel Framework for Building Low Resource Language Corpora. IEEE Access, 9 (8546-8563). DOI: 10.1109/ACCESS.2021.3049793. Online publication date: 2021
    • (2020) ArTest: The First Test Collection for Arabic Web Search with Relevance Rationales. Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval (2017-2020). DOI: 10.1145/3397271.3401223. Online publication date: 25-Jul-2020
    • (2019) The Role of Transliteration in the Process of Arabizi Translation/Sentiment Analysis. Recent Advances in NLP: The Case of Arabic Language (101-128). DOI: 10.1007/978-3-030-34614-0_6. Online publication date: 30-Nov-2019
    • (2017) Comparing the Archival Rate of Arabic, English, Danish, and Korean Language Web Pages. ACM Transactions on Information Systems, 36:1 (1-34). DOI: 10.1145/3041656. Online publication date: 5-Jun-2017
