The WEKA data mining software: an update

Published: 16 November 2009

Abstract

More than twelve years have elapsed since the first public release of WEKA. In that time, the software has been rewritten entirely from scratch, evolved substantially and now accompanies a text on data mining [35]. These days, WEKA enjoys widespread acceptance in both academia and business, has an active community, and has been downloaded more than 1.4 million times since being placed on SourceForge in April 2000. This paper provides an introduction to the WEKA workbench, reviews the history of the project, and, in light of the recent 3.6 stable release, briefly discusses what has been added since the last stable version (WEKA 3.4), released in 2003.

References

[1]
I. Altintas, C. Berkley, E. Jaeger, M. Jones, B. Ludäscher, and S. Mock. Kepler: An extensible system for design and execution of scientific workflows. In SSDBM, pages 21--23, 2004.
[2]
K. Bennett and M. Embrechts. An optimization perspective on kernel partial least squares regression. In J.S. et al., editor, Advances in Learning Theory: Methods, Models and Applications, volume 190 of NATO Science Series, Series III: Computer and System Sciences, pages 227--249. IOS Press, Amsterdam, The Netherlands, 2003.
[3]
L. Breiman, J.H. Friedman, R.A. Olshen, and C.J. Stone. Classification and Regression Trees. Wadsworth International Group, Belmont, California, 1984.
[4]
S. Celis and D.R. Musicant. Weka-parallel: machine learning in parallel. Technical report, Carleton College, CS TR, 2002.
[5]
C.-C. Chang and C.-J. Lin. LIBSVM: a library for support vector machines, 2001. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm.
[6]
T.G. Dietterich, R.H. Lathrop, and T. Lozano-Pérez. Solving the multiple instance problem with axis-parallel rectangles. Artif. Intell., 89(1-2):31--71, 1997.
[7]
J. Dietzsch, N. Gehlenborg, and K. Nieselt. Mayday - a microarray data analysis workbench. Bioinformatics, 22(8):1010--1012, 2006.
[8]
L. Dong, E. Frank, and S. Kramer. Ensembles of balanced nested dichotomies for multi-class problems. In Proc 9th European Conference on Principles and Practice of Knowledge Discovery in Databases, Porto, Portugal, pages 84--95. Springer, 2005.
[9]
R.-E. Fan, K.-W. Chang, C.-J. Hsieh, X.-R. Wang, and C.-J. Lin. LIBLINEAR: A library for large linear classification. Journal of Machine Learning Research, 9:1871--1874, 2008.
[10]
E. Frank and S. Kramer. Ensembles of nested dichotomies for multi-class problems. In Proc 21st International Conference on Machine Learning, Banff, Canada, pages 305--312. ACM Press, 2004.
[11]
R. Gaizauskas, H. Cunningham, Y. Wilks, P. Rodgers, and K. Humphreys. GATE: an environment to support research and development in natural language engineering. In Proceedings of the 8th IEEE International Conference on Tools with Artificial Intelligence, pages 58--66, 1996.
[12]
J. Gama. Functional trees. Machine Learning, 55(3):219--250, 2004.
[13]
A. Genkin, D.D. Lewis, and D. Madigan. Large-scale Bayesian logistic regression for text categorization. Technical report, DIMACS, 2004.
[14]
J.E. Gewehr, M. Szugat, and R. Zimmer. BioWeka-extending the weka framework for bioinformatics. Bioinformatics, 23(5):651--653, 2007.
[15]
M. Hall and E. Frank. Combining naive Bayes and decision tables. In Proc 21st Florida Artificial Intelligence Research Society Conference, Miami, Florida. AAAI Press, 2008.
[16]
K. Hornik, A. Zeileis, T. Hothorn, and C. Buchta. RWeka: An R Interface to Weka, 2009. R package version 0.3-16.
[17]
L. Jiang and H. Zhang. Weightily averaged one-dependence estimators. In Proceedings of the 9th Biennial Pacific Rim International Conference on Artificial Intelligence, PRICAI 2006, volume 4099 of LNAI, pages 970--974, 2006.
[18]
R. Khoussainov, X. Zuo, and N. Kushmerick. Grid-enabled Weka: A toolkit for machine learning on the grid. ERCIM News, 59, 2004.
[19]
M.-A. Krogel and S. Wrobel. Facets of aggregation approaches to propositionalization. In T. Horvath and A. Yamamoto, editors, Work-in-Progress Track at the Thirteenth International Conference on Inductive Logic Programming (ILP), 2003.
[20]
I. Mierswa, M. Wurst, R. Klinkenberg, M. Scholz, and T. Euler. Yale: Rapid prototyping for complex data mining tasks. In L. Ungar, M. Craven, D. Gunopulos, and T. Eliassi-Rad, editors, KDD '06: Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 935--940, New York, NY, USA, August 2006. ACM.
[21]
D. Nadeau. Balie-baseline information extraction: Multilingual information extraction from text with machine learning and natural language techniques. Technical report, University of Ottawa, 2005.
[22]
G. Piatetsky-Shapiro. KDnuggets news on SIGKDD service award. http://www.kdnuggets.com/news/2005/n13/2i.html, 2005.
[23]
R Development Core Team. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria, 2006. ISBN 3-900051-07-0.
[24]
J.J. Rodriguez, L.I. Kuncheva, and C.J. Alonso. Rotation forest: A new classifier ensemble method. IEEE Transactions on Pattern Analysis and Machine Intelligence, 28(10):1619--1630, 2006.
[25]
K. Sandberg. The Haar wavelet transform. http://amath.colorado.edu/courses/5720/2000Spr/Labs/Haar/haar.html, 2000.
[26]
M. Seeger. Gaussian processes for machine learning. International Journal of Neural Systems, 14:2004, 2004.
[27]
C. Shearer. The CRISP-DM model: The new blueprint for data mining. Journal of Data Warehousing, 5(4), 2000.
[28]
H. Shi. Best-first decision tree learning. Master's thesis, University of Waikato, Hamilton, NZ, 2007. COMP594.
[29]
N. Slonim, N. Friedman, and N. Tishby. Unsupervised document classification using sequential information maximization. In Proceedings of the 25th International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 129--136, 2002.
[30]
J. Su, H. Zhang, C.X. Ling, and S. Matwin. Discriminative parameter learning for Bayesian networks. In ICML 2008, 2008.
[31]
D. Talia, P. Trunfio, and O. Verta. Weka4WS: a WSRF-enabled Weka toolkit for distributed data mining on grids. In Proc. of the 9th European Conference on Principles and Practice of Knowledge Discovery in Databases (PKDD 2005), pages 309--320. Springer-Verlag, 2005.
[32]
K.M. Ting and I.H. Witten. Stacking bagged and dagged models. In D. H. Fisher, editor, Fourteenth international Conference on Machine Learning, pages 367--375, San Francisco, CA, 1997. Morgan Kaufmann Publishers.
[33]
J.S. Vitter. Random sampling with a reservoir. ACM Transactions on Mathematical Software, 11(1):37--57, 1985.
[34]
I.H. Witten and E. Frank. Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations. Morgan Kaufmann, San Francisco, 2000.
[35]
I.H. Witten and E. Frank. Data Mining: Practical Machine Learning Tools and Techniques. Morgan Kaufmann, San Francisco, 2nd edition, 2005.
[36]
I.H. Witten, G.W. Paynter, E. Frank, C. Gutwin, and C.G. Nevill-Manning. Kea: Practical automatic keyphrase extraction. In Y.-L. Theng and S. Foo, editors, Design and Usability of Digital Libraries: Case Studies in the Asia Pacific, pages 129--152. Information Science Publishing, London, 2005.
[37]
X. Xu. Statistical learning in multiple instance problems. Master's thesis, Department of Computer Science, University of Waikato, 2003.
[38]
Y. Yang, X. Guan, and J. You. CLOPE: a fast and effective clustering algorithm for transactional data. In Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining, pages 682--687. ACM New York, NY, USA, 2002.
[39]
F. Zheng and G.I. Webb. Efficient lazy elimination for averaged one-dependence estimators. In Proceedings of the Twenty-third International Conference on Machine Learning (ICML 2006), pages 1113--1120. ACM Press, 2006.

Published In

ACM SIGKDD Explorations Newsletter  Volume 11, Issue 1
June 2009
56 pages
ISSN:1931-0145
EISSN:1931-0153
DOI:10.1145/1656274

Publisher

Association for Computing Machinery

New York, NY, United States
