Mulan: A Java Library for Multi-Label Learning

Datasets

The following multi-label datasets are formatted for use with Mulan. We first provide a table of dataset statistics, followed by the files themselves and their sources.

Statistics

			attributes
name	domain	instances	nominal	numeric	labels	cardinality	density	distinct
bibtex	text	7395	1836	0	159	2.402	0.015	2856
birds	audio	645	2	258	19	1.014	0.053	133
bookmarks	text	87856	2150	0	208	2.028	0.010	18716
CAL500	music	502	0	68	174	26.044	0.150	502
corel5k	images	5000	499	0	374	3.522	0.009	3175
corel16k (10 samples)	images	13811±87	500	0	161±9	2.867±0.033	0.018±0.001	4937±158
delicious	text (web)	16105	500	0	983	19.020	0.019	15806
emotions	music	593	0	72	6	1.869	0.311	27
enron	text	1702	1001	0	53	3.378	0.064	753
EUR-Lex (directory codes)	text	19348	0	5000	412	1.292	0.003	1615
EUR-Lex (subject matters)	text	19348	0	5000	201	2.213	0.011	2504
EUR-Lex (eurovoc descriptors)	text	19348	0	5000	3993	5.310	0.001	16467
flags	images (toy)	194	9	10	7	3.392	0.485	54
genbase	biology	662	1186	0	27	1.252	0.046	32
mediamill	video	43907	0	120	101	4.376	0.043	6555
medical	text	978	1449	0	45	1.245	0.028	94
NUS-WIDE	images	269648	0	128/500	81	1.869	0.023	18430
rcv1v2 (subset1)	text	6000	0	47236	101	2.880	0.029	1028
rcv1v2 (subset2)	text	6000	0	47236	101	2.634	0.026	954
rcv1v2 (subset3)	text	6000	0	47236	101	2.614	0.026	939
rcv1v2 (subset4)	text	6000	0	47229	101	2.484	0.025	816
rcv1v2 (subset5)	text	6000	0	47235	101	2.642	0.026	946
scene	images	2407	0	294	6	1.074	0.179	15
tmc2007	text	28596	49060	0	22	2.158	0.098	1341
yahoo	text	5423±1259	0	32786±7990	31±6	1.481±0.154	0.051±0.012	321±139
yeast	biology	2417	0	103	14	4.237	0.303	198
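
The page does not define the last three columns. The sketch below assumes the usual multi-label definitions: cardinality is the average number of labels per instance, density is cardinality divided by the number of labels, and distinct is the number of different label combinations in the data. The table appears consistent with this reading (for emotions, 1.869 / 6 ≈ 0.311). The class name LabelStats and the toy label matrix are hypothetical and are not part of Mulan.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class LabelStats {

    /** Label cardinality: average number of relevant labels per instance. */
    public static double cardinality(boolean[][] labels) {
        long total = 0;
        for (boolean[] row : labels) {
            for (boolean relevant : row) {
                if (relevant) {
                    total++;
                }
            }
        }
        return (double) total / labels.length;
    }

    /** Label density: cardinality divided by the number of labels. */
    public static double density(boolean[][] labels) {
        return cardinality(labels) / labels[0].length;
    }

    /** Number of distinct label combinations that occur in the data. */
    public static int distinctLabelSets(boolean[][] labels) {
        Set<String> seen = new HashSet<>();
        for (boolean[] row : labels) {
            seen.add(Arrays.toString(row)); // one string key per label combination
        }
        return seen.size();
    }

    public static void main(String[] args) {
        // Toy label matrix (made-up data): 4 instances, 3 labels.
        boolean[][] y = {
            {true, false, false},
            {true, true, false},
            {true, false, false},
            {false, true, true}
        };
        System.out.println("cardinality = " + cardinality(y));
        System.out.println("density     = " + density(y));
        System.out.println("distinct    = " + distinctLabelSets(y));
    }
}
```

For the toy matrix, the six relevant labels over four instances give a cardinality of 1.5 and a density of 0.5, and three of the four rows are distinct label sets.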

Files and Sources

bibtex
files (sparse): Train and test sets along with their union and the XML header [bibtex.rar]
source: I. Katakis, G. Tsoumakas, I. Vlahavas, "Multilabel Text Classification for Automated Tag Suggestion", Proceedings of the ECML/PKDD 2008 Discovery Challenge, Antwerp, Belgium, 2008.
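
Entries marked "files (sparse)" are presumably stored in Weka's sparse ARFF row syntax, where each data row lists only the non-zero attributes as "{index value, index value, ...}" with 0-based indices. That is an assumption about these archives, not something stated on this page. Under that assumption, a minimal sketch of expanding one such row into a dense array could look as follows; the class SparseRow is hypothetical, and it handles only numeric-looking values (no quoted strings or nominal names).

```java
import java.util.Arrays;

public class SparseRow {

    /** Expands one sparse ARFF data row, e.g. "{1 1, 4 1}", into a dense numeric array. */
    public static double[] toDense(String row, int numAttributes) {
        double[] dense = new double[numAttributes]; // omitted attributes stay 0
        String body = row.trim();
        body = body.substring(1, body.length() - 1).trim(); // strip the surrounding braces
        if (body.isEmpty()) {
            return dense; // "{}" means every attribute is 0
        }
        for (String pair : body.split(",")) {
            String[] parts = pair.trim().split("\\s+");
            int index = Integer.parseInt(parts[0]); // 0-based attribute index
            dense[index] = Double.parseDouble(parts[1]);
        }
        return dense;
    }

    public static void main(String[] args) {
        // Made-up row with 6 attributes, of which attributes 1 and 4 are set.
        System.out.println(Arrays.toString(toDense("{1 1, 4 1}", 6)));
    }
}
```

In practice the files would be read through Weka or Mulan rather than parsed by hand; the sketch only illustrates what the sparse layout encodes.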

birds
files: Train and test set along with the XML header [birds.rar]
source: F. Briggs, Yonghong Huang, R. Raich, K. Eftaxias, Zhong Lei, W. Cukierski, S. Hadley, A. Hadley, M. Betts, X. Fern, J. Irvine, L. Neal, A. Thomas, G. Fodor, G. Tsoumakas, Hong Wei Ng, Thi Ngoc Tho Nguyen, H. Huttunen, P. Ruusuvuori, T. Manninen, A. Diment, T. Virtanen, J. Marzat, J. Defretin, D. Callender, C. Hurlburt, K. Larrey, M. Milakov. "The 9th annual MLSP competition: New methods for acoustic classification of multiple simultaneous bird species in a noisy environment", in proc. 2013 IEEE International Workshop on Machine Learning for Signal Processing (MLSP).

bookmarks
files (sparse): Union of train and test sets along with the XML header [bookmarks.rar]
source: I. Katakis, G. Tsoumakas, I. Vlahavas, "Multilabel Text Classification for Automated Tag Suggestion", Proceedings of the ECML/PKDD 2008 Discovery Challenge, Antwerp, Belgium, 2008.

CAL500
files: Dataset along with the XML header [CAL500.rar]
source: Douglas Turnbull, Luke Barrington, David Torres and Gert Lanckriet, Semantic Annotation and Retrieval of Music and Sound Effects, IEEE Transactions on Audio, Speech and Language Processing 16(2), pp. 467-476, 2008.
More information: http://cosmal.ucsd.edu/cal/projects/AnnRet/

corel5k
files: Train and test sets along with their union and the XML header [corel5k.rar] [corel5k-sparse.rar]
source: Pinar Duygulu, Kobus Barnard, Nando de Freitas, and David Forsyth, Object recognition as machine translation: Learning a lexicon for a fixed image vocabulary, 7th European Conference on Computer Vision, pp. IV:97-112, 2002.
More information: http://kobus.ca/research/data/eccv_2002/

corel16k
files: 10 different samples containing the train, test and test3 disjoint sets along with their union and the XML header [corel16k.rar]
source: "Matching Words and Pictures", by Kobus Barnard, Pinar Duygulu, Nando de Freitas, David Forsyth, David Blei, and Michael I. Jordan, Journal of Machine Learning Research, Vol 3, pp 1107-1135.
More information: http://kobus.ca/research/data/jmlr_2003/

delicious
files (sparse): Train and test sets along with their union and the XML header [delicious.rar]
source: G. Tsoumakas, I. Katakis, I. Vlahavas, "Effective and Efficient Multilabel Classification in Domains with Large Number of Labels", Proc. ECML/PKDD 2008 Workshop on Mining Multidimensional Data (MMD'08), Antwerp, Belgium, 2008.

emotions
files: Train and test sets along with their union and the XML header [emotions.rar]
source: K. Trohidis, G. Tsoumakas, G. Kalliris, I. Vlahavas. "Multilabel Classification of Music into Emotions". Proc. 2008 International Conference on Music Information Retrieval (ISMIR 2008), pp. 325-330, Philadelphia, PA, USA, 2008.

enron
files (sparse): Train and test sets along with their union, the XML header and a file describing the categories [enron.rar]
sources: a) Jesse Read's Web Page, b) UC Berkeley Enron Email Analysis Project

EUR-Lex
files: Cross-validation splits of the TF-IDF representation of the documents, restricted to the 5000 most frequent features, as used in the experiments of the source paper. Usable for a direct comparison. XML header included. [eurlex-directory-codes.rar] [eurlex-subject-matters.rar] [eurlex-eurovoc-descriptors.rar]
source: Eneldo Loza Mencía and Johannes Fürnkranz. Efficient pairwise multilabel classification for large-scale problems in the legal domain. In Proceedings of the European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECML-PKDD-2008), Part II, pages 50-65, Antwerp, Belgium, 2008. Springer-Verlag.
More information: Knowledge Engineering Group, TU Darmstadt

flags
files: Train and test sets along with their union, the XML header and a readme file [flags.zip]
source: The dataset was used for Multi-label Classification in "Gonçalves, Eduardo Corrêa, Alexandre Plastino, and Alex A. Freitas. A Genetic Algorithm for Optimizing the Label Ordering in Multi-Label Classifier Chains. ICTAI 2013." The original data can be found at the UCI repository.

genbase
files: Train and test sets along with their union and the XML header [genbase.rar]
source: S. Diplaris, G. Tsoumakas, P. Mitkas and I. Vlahavas. Protein Classification with Multiple Algorithms, Proc. 10th Panhellenic Conference on Informatics (PCI 2005), pp. 448-456, Volos, Greece, November 2005.
note: The first attribute in this dataset is just an identification of the instance. There are several attributes with constant values (yes/no).

mediamill
files: Train and test sets along with their union and the XML header [mediamill.rar]
source: C.G.M. Snoek, M. Worring, J.C. van Gemert, J.-M. Geusebroek, and A.W.M. Smeulders. The Challenge Problem for Automated Detection of 101 Semantic Concepts in Multimedia. In Proceedings of ACM Multimedia, pp. 421-430, Santa Barbara, USA, October 2006.
related URL: The Mediamill challenge

medical
files: Train and test sets along with their union and the XML header [medical.rar]
source: John P. Pestian, Christopher Brew, Pawel Matykiewicz, D. J. Hovermale, Neil Johnson, K. Bretonnel Cohen, and Wodzislaw Duch. 2007. A shared task involving multi-label classification of clinical free text. In Proceedings of the Workshop on BioNLP 2007: Biological, Translational, and Clinical Language Processing (BioNLP '07). Association for Computational Linguistics, Stroudsburg, PA, USA, 97-104.

NUS-WIDE
We provide two versions of the full NUS-WIDE dataset. In the first version, images are represented using 500-D bag of visual words features provided by the creators of the dataset [1]. In the second version, images are represented using 128-D cVLAD+ features described in [2]. In both cases, we provide train and test sets (split as described in [1]). The first attribute in all datasets is the image id.
files: 128-D cVLAD+ [nuswide-cVLADplus.rar] / 500-D bag of visual words [nuswide-bow.rar]
[1] Tat-Seng Chua, Jinhui Tang, Richang Hong, Haojie Li, Zhiping Luo, and Yan-Tao Zheng. "NUS-WIDE: A Real-World Web Image Database from National University of Singapore", ACM International Conference on Image and Video Retrieval. Greece. Jul. 8-10, 2009.
[2] E. Spyromitros-Xioufis, S. Papadopoulos, Y. Kompatsiaris, G. Tsoumakas, I. Vlahavas, "A Comprehensive Study over VLAD and Product Quantization in Large-scale Image Retrieval", IEEE Transactions on Multimedia, 2014.

rcv1v2 subsets
files (sparse): Train and test sets along with their union and the XML header [rcv1subset1.rar] [rcv1subset2.rar] [rcv1subset3.rar] [rcv1subset4.rar] [rcv1subset5.rar]
source: David D. Lewis, Yiming Yang, Tony G. Rose, and Fan Li. RCV1: A new benchmark collection for text categorization research. Journal of Machine Learning Research, 5:361-397, 2004.

scene
files: Train and test sets along with their union and the XML header [scene.rar]
source: M.R. Boutell, J. Luo, X. Shen, and C.M. Brown. Learning multi-label scene classification. Pattern Recognition, 37(9):1757-1771, 2004.

tmc2007
files (sparse): Train and test sets along with their union and the XML header [tmc2007.rar]
A reduced version of this dataset, obtained by feature selection (top 500 features), is also available:
files: [tmc2007-500.rar]
source: A. Srivastava, B. Zane-Ulman: Discovering recurring anomalies in text reports regarding complex space systems. In: 2005 IEEE Aerospace Conference. (2005)
related URL: SIAM Text Mining Workshop 2007

yahoo
files: 11 train and test sets along with their union and the XML header [yahoo.rar]
source: N. Ueda, K. Saito: Parametric mixture models for multi-labeled text, In Neural Information Processing Systems 15 (NIPS 15), MIT Press, pp. 737-744, 2002.

yeast
files: Train and test sets along with their union and the XML header [yeast.rar]
source: A. Elisseeff and J. Weston. A kernel method for multi-labelled classification. In T.G. Dietterich, S. Becker, and Z. Ghahramani, (eds), Advances in Neural Information Processing Systems 14, 2001.

Links