Datasets
The following multi-label datasets are properly formatted for use with Mulan. We first provide a table of dataset statistics, followed by the files and their sources.
Statistics
| name | domain | instances | nominal attributes | numeric attributes | labels | cardinality | density | distinct |
|---|---|---|---|---|---|---|---|---|
| bibtex | text | 7395 | 1836 | 0 | 159 | 2.402 | 0.015 | 2856 |
| birds | audio | 645 | 2 | 258 | 19 | 1.014 | 0.053 | 133 |
| bookmarks | text | 87856 | 2150 | 0 | 208 | 2.028 | 0.010 | 18716 |
| CAL500 | music | 502 | 0 | 68 | 174 | 26.044 | 0.150 | 502 |
| corel5k | images | 5000 | 499 | 0 | 374 | 3.522 | 0.009 | 3175 |
| corel16k (10 samples) | images | 13811±87 | 500 | 0 | 161±9 | 2.867±0.033 | 0.018±0.001 | 4937±158 |
| delicious | text (web) | 16105 | 500 | 0 | 983 | 19.020 | 0.019 | 15806 |
| emotions | music | 593 | 0 | 72 | 6 | 1.869 | 0.311 | 27 |
| enron | text | 1702 | 1001 | 0 | 53 | 3.378 | 0.064 | 753 |
| EUR-Lex (directory codes) | text | 19348 | 0 | 5000 | 412 | 1.292 | 0.003 | 1615 |
| EUR-Lex (subject matters) | text | 19348 | 0 | 5000 | 201 | 2.213 | 0.011 | 2504 |
| EUR-Lex (eurovoc descriptors) | text | 19348 | 0 | 5000 | 3993 | 5.310 | 0.001 | 16467 |
| flags | images (toy) | 194 | 9 | 10 | 7 | 3.392 | 0.485 | 54 |
| genbase | biology | 662 | 1186 | 0 | 27 | 1.252 | 0.046 | 32 |
| mediamill | video | 43907 | 0 | 120 | 101 | 4.376 | 0.043 | 6555 |
| medical | text | 978 | 1449 | 0 | 45 | 1.245 | 0.028 | 94 |
| NUS-WIDE | images | 269648 | 0 | 128/500 | 81 | 1.869 | 0.023 | 18430 |
| rcv1v2 (subset1) | text | 6000 | 0 | 47236 | 101 | 2.880 | 0.029 | 1028 |
| rcv1v2 (subset2) | text | 6000 | 0 | 47236 | 101 | 2.634 | 0.026 | 954 |
| rcv1v2 (subset3) | text | 6000 | 0 | 47236 | 101 | 2.614 | 0.026 | 939 |
| rcv1v2 (subset4) | text | 6000 | 0 | 47229 | 101 | 2.484 | 0.025 | 816 |
| rcv1v2 (subset5) | text | 6000 | 0 | 47235 | 101 | 2.642 | 0.026 | 946 |
| scene | image | 2407 | 0 | 294 | 6 | 1.074 | 0.179 | 15 |
| tmc2007 | text | 28596 | 49060 | 0 | 22 | 2.158 | 0.098 | 1341 |
| yahoo | text | 5423±1259 | 0 | 32786±7990 | 31±6 | 1.481±0.154 | 0.051±0.012 | 321±139 |
| yeast | biology | 2417 | 0 | 103 | 14 | 4.237 | 0.303 | 198 |
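The cardinality, density and distinct columns can be reproduced from a binary label matrix. The sketch below is illustrative only and is not Mulan code. It assumes the usual definitions: cardinality is the average number of labels per instance, density is cardinality divided by the number of labels, and distinct is the number of different label combinations that occur. The function name `label_statistics` and the toy matrix are hypothetical.

```python
from typing import Sequence, Tuple


def label_statistics(label_matrix: Sequence[Sequence[int]]) -> Tuple[float, float, int]:
    """Return (cardinality, density, distinct) for a 0/1 label matrix.

    Each row is one instance, each column is one label.
    """
    num_instances = len(label_matrix)
    num_labels = len(label_matrix[0])
    # Cardinality: mean number of active labels per instance.
    cardinality = sum(sum(row) for row in label_matrix) / num_instances
    # Density: cardinality normalised by the number of labels.
    density = cardinality / num_labels
    # Distinct: number of unique label combinations in the data.
    distinct = len({tuple(row) for row in label_matrix})
    return cardinality, density, distinct


# Hypothetical toy data: 4 instances, 3 labels.
toy = [
    [1, 0, 0],
    [1, 1, 0],
    [0, 0, 1],
    [1, 1, 0],
]
cardinality, density, distinct = label_statistics(toy)
print(cardinality, density, distinct)
```

Under these definitions the table is internally consistent: for emotions, 1.869 / 6 ≈ 0.311, and for scene, 1.074 / 6 ≈ 0.179.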
Files and Sources
- birds
files: Train and test set along with the XML header [birds.rar]
source: F. Briggs, Yonghong Huang, R. Raich, K. Eftaxias, Zhong Lei, W. Cukierski, S. Hadley, A. Hadley, M. Betts, X. Fern, J. Irvine, L. Neal, A. Thomas, G. Fodor, G. Tsoumakas, Hong Wei Ng, Thi Ngoc Tho Nguyen, H. Huttunen, P. Ruusuvuori, T. Manninen, A. Diment, T. Virtanen, J. Marzat, J. Defretin, D. Callender, C. Hurlburt, K. Larrey, M. Milakov.
"The 9th annual MLSP competition: New methods for acoustic classification of multiple simultaneous bird species in a noisy environment",
in Proc. 2013 IEEE International Workshop on Machine Learning for Signal Processing (MLSP).
- CAL500
files: Dataset along with the XML header [CAL500.rar]
source: Douglas Turnbull, Luke Barrington, David Torres and Gert Lanckriet, Semantic Annotation and Retrieval of Music and Sound Effects, IEEE Transactions on Audio, Speech and Language Processing 16(2), pp. 467-476, 2008.
More information: http://cosmal.ucsd.edu/cal/projects/AnnRet/
- corel5k
files: Train and test sets along with their union and the XML header [corel5k.rar]
[corel5k-sparse.rar]
source: Pinar Duygulu, Kobus Barnard, Nando de Freitas, and David Forsyth, Object recognition as machine translation: Learning a lexicon for a fixed image vocabulary, 7th European Conference on Computer Vision, pp IV:97-112, 2002.
More information: http://kobus.ca/research/data/eccv_2002/
- corel16k
files: 10 different samples containing the train, test and test3 disjoint sets along with their union and the XML header [corel16k.rar]
source: "Matching Words and Pictures", by Kobus Barnard, Pinar Duygulu, Nando de Freitas, David Forsyth, David Blei, and Michael I. Jordan, Journal of Machine Learning Research, Vol 3, pp 1107-1135.
More information: http://kobus.ca/research/data/jmlr_2003/
- emotions
files: Train and test sets along with their union and the XML header [emotions.rar]
source: K. Trohidis, G. Tsoumakas, G. Kalliris, I. Vlahavas. "Multilabel Classification of Music into Emotions". Proc. 2008 International Conference on Music Information Retrieval (ISMIR 2008), pp. 325-330, Philadelphia, PA, USA, 2008.
- EUR-Lex
files: Cross-validation splits of the TF-IDF representation of the documents, restricted to the 5000 most frequent features, as used in the experiments of the source paper, so that results can be compared directly. XML header included. [eurlex-directory-codes.rar] [eurlex-subject-matters.rar] [eurlex-eurovoc-descriptors.rar]
source: Eneldo Loza Mencía and Johannes Fürnkranz. Efficient pairwise multilabel classification for large-scale problems in the legal domain. In Proceedings of the European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECML-PKDD-2008), Part II, pages 50-65, Antwerp, Belgium, 2008. Springer-Verlag.
More information: Knowledge Engineering Group, TU Darmstadt
- flags
files: Train and test sets along with their union, the XML header and a readme file [flags.zip]
source: The dataset was used for Multi-label Classification in "Gonçalves, Eduardo Corrêa, Alexandre Plastino, and Alex A. Freitas. A Genetic Algorithm for Optimizing the Label Ordering in Multi-Label Classifier Chains. ICTAI 2013." The original data can be found at the UCI repository.
- genbase
files: Train and test sets along with their union and the XML header [genbase.rar]
source: S. Diplaris, G. Tsoumakas, P. Mitkas and I. Vlahavas. Protein Classification with Multiple Algorithms, Proc. 10th Panhellenic Conference on Informatics (PCI 2005), pp. 448-456, Volos, Greece, November 2005.
note: The first attribute in this dataset is just an identification of the instance. There are several attributes with constant values (yes/no).
- mediamill
files: Train and test sets along with their union and the XML header [mediamill.rar]
source: C.G.M. Snoek, M. Worring, J.C. van Gemert, J.-M. Geusebroek, and A.W.M. Smeulders. The Challenge Problem for Automated Detection of 101 Semantic Concepts in Multimedia. In Proceedings of ACM Multimedia, pp. 421-430, Santa Barbara, USA, October 2006.
related URL: The Mediamill challenge
- medical
files: Train and test sets along with their union and the XML header [medical.rar]
source: John P. Pestian, Christopher Brew, Pawel Matykiewicz, D. J. Hovermale, Neil Johnson, K. Bretonnel Cohen, and Wodzislaw Duch. 2007. A shared task involving multi-label classification of clinical free text. In Proceedings of the Workshop on BioNLP 2007: Biological, Translational, and Clinical Language Processing (BioNLP '07). Association for Computational Linguistics, Stroudsburg, PA, USA, 97-104.
- NUS-WIDE
We provide two versions of the full NUS-WIDE dataset. In the first version, images are represented using 500-D bag of visual words features provided by the creators of the dataset [1]. In the second version, images are represented using 128-D cVLAD+ features described in [2]. In both cases, we provide train and test sets (split as described in [1]). The first attribute in all datasets is the image id.
files: 128-D cVLAD+ [nuswide-cVLADplus.rar] / 500-D bag of visual words [nuswide-bow.rar]
[1] Tat-Seng Chua, Jinhui Tang, Richang Hong, Haojie Li, Zhiping Luo, and Yan-Tao Zheng. "NUS-WIDE: A Real-World Web Image Database from National University of Singapore", ACM International Conference on Image and Video Retrieval. Greece. Jul. 8-10, 2009.
[2] E. Spyromitros-Xioufis, S. Papadopoulos, Y. Kompatsiaris, G. Tsoumakas, I. Vlahavas, "A Comprehensive Study over VLAD and Product Quantization in Large-scale Image Retrieval", IEEE Transactions on Multimedia, 2014.
- scene
files: Train and test sets along with their union and the XML header [scene.rar]
source: M.R. Boutell, J. Luo, X. Shen, and C.M. Brown. Learning multi-label scene classification. Pattern Recognition, 37(9):1757-1771, 2004.
- tmc2007
files (sparse): Train and test sets along with their union and the XML header [tmc2007.rar]
A reduced version of this dataset, with the top 500 features retained after feature selection, is also available:
files: [tmc2007-500.rar]
source: A. Srivastava, B. Zane-Ulman: Discovering recurring anomalies in text reports regarding complex space systems. In: 2005 IEEE Aerospace Conference. (2005)
related URL: SIAM Text Mining Workshop 2007
- yahoo
files: 11 train and test sets along with their union and the XML header [yahoo.rar]
source: N. Ueda, K. Saito: Parametric mixture models for multi-labeled text. In Neural Information Processing Systems 15 (NIPS 15), MIT Press, pp. 737-744, 2002.
- yeast
files: Train and test sets along with their union and the XML header [yeast.rar]
source: A. Elisseeff and J. Weston. A kernel method for multi-labelled classification. In T.G. Dietterich, S. Becker, and Z. Ghahramani, (eds), Advances in Neural Information Processing Systems 14, 2001.
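Most archives above ship an XML header next to the data files; the header names the attributes that act as labels. The sketch below shows one way to read those names outside Mulan. It rests on an assumption about the header layout, namely a root `labels` element containing `label` elements with a `name` attribute, so check it against the header in the archive you download. The sample header text and the function name `read_label_names` are hypothetical.

```python
import xml.etree.ElementTree as ET
from typing import List

# Hypothetical header content, written in the layout assumed above.
SAMPLE_HEADER = """<labels xmlns="http://mulan.sourceforge.net/labels">
  <label name="label-a"></label>
  <label name="label-b"></label>
</labels>
"""


def read_label_names(xml_text: str) -> List[str]:
    """Collect the name attribute of every label element, in document order."""
    root = ET.fromstring(xml_text)
    names = []
    for element in root.iter():
        # Drop any XML namespace prefix of the form {uri}tag.
        tag = element.tag.rsplit("}", 1)[-1]
        name = element.get("name")
        if tag == "label" and name is not None:
            names.append(name)
    return names


print(read_label_names(SAMPLE_HEADER))
```

The number of names returned for a real header should match the labels column in the statistics table for that dataset.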