Continuous Learning Graphical Knowledge Unit for Cluster Identification in High Density Data Sets
Figure 1. Definition of marker in graphical knowledge unit (GKU): the marker is a circle (radius = 10 pixels), and the RGB colour value of the circle is (0, 0, X), where 0 ≤ X ≤ 255. The centre of the circle represents the data point (highlighted).
Figure 2. (A) Two overlapped markers; and (B) overlapped markers. The data point is represented by the pixel in the centre of the marker (the data point is highlighted in orange).
Figure 3. GKU with borders to record missing and out of range values: (1) y > y_max at x = x; (2) x > x_max and y > y_max; (3) x > x_max at y = y; (4) x > x_max and y < y_min; (5) y < y_min at x = x; (6) x < x_min and y < y_min; (7) x < x_min at y = y; (8) x < x_min and y > y_max; (9) y is missing and x < x_min; (10) y is missing at x = x; (11) y is missing and x > x_max; (12) both x and y are missing; (13) x is missing and y > y_max; (14) x is missing at y = y; and (15) x is missing and y < y_min. * Shading is used in the figure to highlight different areas. In the real GKU, there will be no shading.
Figure 4. GKUs for the same data set with 35,620 data points using a circle as the marker with different sizes and colours. The correct selection of shape, size and initial colour of the data point will produce clusters that are visually clear and separated by colour borders similar to contour lines. For data set of plots in this figure, see Supplementary Materials, File S1. (A)–(C): diameter 5 pixels with RGB colour (0, 0, 1), (0, 0, 5) and (0, 0, 10), respectively; (D)–(F): diameter 10 pixels with RGB colour (0, 0, 1), (0, 0, 5) and (0, 0, 10), respectively; (G)–(I): diameter 20 pixels with RGB colour (0, 0, 1), (0, 0, 5) and (0, 0, 10), respectively.
Figure 5. GKUs for the same data set with 35,620 data points using a square as the marker with different sizes and colours. The correct selection of shape, size and initial colour of the data point will produce clusters that are visually clear and separated by colour borders similar to contour lines. For data set of plots in this figure, see Supplementary Materials, File S1. (A)–(C): length 10 pixels with RGB colour (0, 0, 1), (0, 0, 5) and (0, 0, 10), respectively; (D)–(F): length 20 pixels with RGB colour (0, 0, 1), (0, 0, 5) and (0, 0, 10), respectively; (G)–(I): length 40 pixels with RGB colour (0, 0, 1), (0, 0, 5) and (0, 0, 10), respectively.
Figure 6. Relation between bitmap and matrix versions of a GKU. A GKU matrix is a simple way to represent the same GKU. Marker: circle, radius: 10 pixels, marker colour: (0, 0, 254). The table shows the colour values of 10 × 10 pixels in the bitmap, which is a portion of the GKU matrix.
Figure 7. Contour lines in a GKU. Representation of 35,620 data points; marker: circle, radius: 20 pixels, marker colour: (0, 0, 1). This shows clear colour borders that can be considered as contour lines. Contour lines are numbered from the outside to the inside of the cluster. Contour lines with the same contour line number have the same green channel value. For such contour lines, blue channel values are in the same range. The higher the number of contour lines, the higher the data density. Therefore, it is possible to understand cluster density without a colour scale or legend. For data set of plots in this figure, see Supplementary Materials, File S1.
Figure 8. Development of a GKU over time. Bitmaps (A)–(D) show GKUs with 5000, 10,000, 20,000 and 35,620 data points, respectively. Marker: circle, radius: 10 pixels, initial colour of the data point: (0, 0, 1). For data set of plots in this figure, see Supplementary Materials, File S1.
Figure 9. Representation of 35,864 data points in a GKU with borders to record missing values, out of range values and GKU specific information. Marker: circle, radius: 20 pixels, colour of the data point: (0, 0, 50). * Refer to Figure 3 for structure information and usage. ** Refer to Table 3 for structure information about the GKU specific information. For data set of plots in this figure, see Supplementary Materials, File S2.
Figure 10. Outlier identification using GKU by defining a border manually. Areas with low colour values are defined as outliers (noise) and vice versa. Shape of the data point: circle, radius: 20 pixels, colour of the data point: (0, 0, 254). For data set of plots in this figure, see Supplementary Materials, File S1.
Figure 11. Visualization of 35,620 data points with: (A) scatter plot; (B) heat map; and (C) contour plot. The scatter plot shows the distribution of the data, whereas the heat map and the contour plot indicate density. However, compared to the GKU, the heat map and contour plot do not show clearly separated density clusters. For data set of plots in this figure, see Supplementary Materials, File S1.
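The marker definition (Figure 1), the overlap of markers (Figure 2) and the green-channel steps described for Figure 7 can be illustrated with a minimal Python sketch. It rests on assumptions that the captions do not state explicitly: every marker adds its colour value to each pixel it covers, and a pixel's 24-bit RGB value is handled as a single integer, so the blue channel carries over into the green channel. The names `make_gku`, `add_point` and `to_rgb` are hypothetical and chosen only for this illustration.

```python
def make_gku(width, height):
    """Create an empty GKU matrix; each cell holds one pixel as a 24-bit integer."""
    return [[0] * width for _ in range(height)]


def add_point(gku, cx, cy, radius=10, colour=1):
    """Stamp a filled circular marker centred on the data point (cx, cy).

    Assumption: overlapping markers add their colour values pixel by pixel.
    """
    height, width = len(gku), len(gku[0])
    for y in range(max(0, cy - radius), min(height, cy + radius + 1)):
        for x in range(max(0, cx - radius), min(width, cx + radius + 1)):
            if (x - cx) ** 2 + (y - cy) ** 2 <= radius ** 2:
                # Saturating at white is a choice made for this sketch only.
                gku[y][x] = min(gku[y][x] + colour, 0xFFFFFF)


def to_rgb(value):
    """Split a 24-bit pixel value into its (R, G, B) channels."""
    return (value >> 16) & 0xFF, (value >> 8) & 0xFF, value & 0xFF


gku = make_gku(50, 50)
add_point(gku, 20, 20)   # first data point
add_point(gku, 25, 20)   # second data point, marker overlaps the first
print(to_rgb(gku[20][22]))  # pixel covered by both markers
```

Under these assumptions, a pixel covered by 256 markers of colour (0, 0, 1) reads (0, 1, 0), which is one way the green channel could come to mark successive contour levels.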
Abstract
1. Introduction
1.1. Related Work
2. Methodology
2.1. Colour Coding Method
2.2. Data Preparation
2.3. Visualization of Missing and Out of Range Values
2.4. Embed GKU Specific Information into Bitmap
2.5. GKU Evaluation Method
3. Results and Discussion
3.1. Reading GKUs
3.2. Anytime Cluster Formation
3.3. Representation of Missing and Out of Range Values and GKU Specific Data
3.4. GKU as an Outlier Detection Method
4. Conclusions
Supplementary Materials
Acknowledgments
Author Contributions
Conflicts of Interest
References
Value Type | Transformation Technique |
---|---|
Negative integer values | Base line correction. This will convert all negative values to positive values while maintaining the same regression. |
Very large values | Base line correction. This will convert large numbers to small numbers while maintaining the same regression. |
Decimal values | Multiplication by 10^d (d ∈ {1, 2, 3, …}). This will convert decimal values to integers (we call d the “decimal to integer factor”). |
Small or large range | Scale up or down. This will change the range. |
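The transformations in the table above can be combined into a single mapping from a raw value to an integer pixel coordinate. The following Python sketch is one possible reading, not the authors' implementation: the order of the steps (decimal to integer conversion, then base line correction against the minimum, then scaling) and the function name `to_pixel_coordinate` are assumptions made for illustration.

```python
def to_pixel_coordinate(value, minimum, d=0, scale=1.0):
    """Map a raw value to a non-negative integer coordinate.

    value   -- raw data value (may be negative, large or decimal)
    minimum -- smallest expected value, used for base line correction
    d       -- decimal to integer factor: values are multiplied by 10**d
    scale   -- scale up/down factor applied last to change the range
    """
    as_integer = round(value * 10 ** d)      # decimal values -> integers
    baseline = round(minimum * 10 ** d)      # same conversion for the minimum
    corrected = as_integer - baseline        # base line correction: minimum -> 0
    return round(corrected * scale)          # scale the range up or down


# A negative decimal value with one decimal place and a minimum of -7.3
print(to_pixel_coordinate(-2.5, -7.3, d=1))
# A large integer value scaled down by a factor of two
print(to_pixel_coordinate(9055, 223, scale=0.5))
```

Because every step is a shift or a multiplication by a constant, the relative positions of the data points are preserved, which is what the table means by maintaining the same regression.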
Header Section | Offset | Size/Bytes | Value | Description |
---|---|---|---|---|
Bitmap (BMP) Header (14 Bytes) | 0 | 2 | “BM” | Identification (ID) field |
 | 2 | 4 | Size of BMP header, DIB header, and Image | Size of the BMP file |
 | 6 | 2 | Unused* | Application specific |
 | 8 | 2 | Unused | Application specific |
 | 10 | 4 | 54 Bytes (14 + 40) | Offset where the pixel array (bitmap data) can be found |
Device-independent bitmap (DIB) header | 14 | 40 Bytes | | |
 | … | | | |
 | 53 | | | |
Bitmap data | 54 | m × n × 4 Bytes | | |
 | … | | | |
 | … | | | |
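The header layout in the table above can be reproduced with Python's standard `struct` module. This is a sketch under stated assumptions: it uses the common layout in which the 14-byte BMP header is followed directly by a 40-byte DIB header (BITMAPINFOHEADER), so that the pixel array starts at byte 54 as the table states, and it uses 4 bytes per pixel to match the "m × n × 4 Bytes" entry. The function name `build_bmp_headers`, its parameters and the resolution value 2835 pixels per metre are illustrative choices, not taken from the article.

```python
import struct


def build_bmp_headers(width, height, bits_per_pixel=32, reserved1=0, reserved2=0):
    """Return the 14-byte BMP header followed by a 40-byte DIB header."""
    row_bytes = (width * bits_per_pixel + 31) // 32 * 4  # rows padded to 4 bytes
    image_size = row_bytes * height
    pixel_offset = 14 + 40                               # 54 bytes, as in the table

    # Offsets 0, 2, 6, 8, 10: ID, file size, two application-specific
    # fields, and the offset of the pixel array.
    file_header = struct.pack(
        "<2sIHHI", b"BM", pixel_offset + image_size,
        reserved1, reserved2, pixel_offset)

    # 40-byte DIB header: header size, width, height, colour planes,
    # bits per pixel, compression (0 = none), image size, horizontal and
    # vertical resolution, palette size, important colours.
    dib_header = struct.pack(
        "<IiiHHIIiiII", 40, width, height, 1, bits_per_pixel,
        0, image_size, 2835, 2835, 0, 0)
    return file_header + dib_header


headers = build_bmp_headers(4, 3)
print(len(headers))                              # total header length in bytes
print(struct.unpack_from("<I", headers, 10)[0])  # offset of the pixel array
```

The two fields at offsets 6 and 8 are exposed as `reserved1` and `reserved2` because the table marks them as application specific.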
GKU Specific Data | Offset of Pixels | No. of Pixels | Content in the Pixels, According to the Order | Pixel Format Used to Store Information | Example |
---|---|---|---|---|---|
Properties of point marker | K | 3 | Data point = Circle (1 = circle, 2 = square, …), radius of the circle, colour of the circle. | unsigned 24-bit pixel format | 1, 10, 1 |
Border widths | K + 1 | 5 | Out of range border, missing value border, GKU specific data border, border padding, offset. | unsigned 24-bit pixel format | 10, 10, 10, 1, 10 |
X value information | K + 2 | 8 | Minimum value, maximum value, decimal to integer factor, scale up/down factor. | two signed 24-bit pixel format | (65, 0), (90, 0), (10, 0), (2, 0) |
Y value information | K + 3 | 8 | Minimum value, maximum value, decimal to integer factor, scale up/down factor. | two signed 24-bit pixel format | (223, −2), (9055, −3), (10, 0), (3, 0) |
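The table above stores signed quantities such as −2 and −3 in a "signed 24-bit pixel format", but this section does not say how the sign is encoded. The Python sketch below assumes two's complement across the three colour channels, purely as one plausible encoding; the function names `encode_signed24` and `decode_signed24` are hypothetical, and the meaning of each pair of values in the table is not interpreted here.

```python
def encode_signed24(value):
    """Encode a signed integer as an (R, G, B) pixel.

    Assumption: two's complement over 24 bits; the article's actual
    encoding is not given in this section.
    """
    if not -(1 << 23) <= value < (1 << 23):
        raise ValueError("value does not fit into a signed 24-bit pixel")
    raw = value & 0xFFFFFF
    return (raw >> 16) & 0xFF, (raw >> 8) & 0xFF, raw & 0xFF


def decode_signed24(rgb):
    """Decode an (R, G, B) pixel back into a signed integer."""
    r, g, b = rgb
    raw = (r << 16) | (g << 8) | b
    return raw - (1 << 24) if raw & 0x800000 else raw


# The pair (9055, -3) from the example column, one pixel per number
pixels = [encode_signed24(9055), encode_signed24(-3)]
print(pixels)
print([decode_signed24(p) for p in pixels])
```

With this encoding, each number in the example column occupies exactly one pixel, so a pair of values takes two pixels and four pairs take the eight pixels listed for the X and Y value information.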
© 2016 by the authors; licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC-BY) license (http://creativecommons.org/licenses/by/4.0/).
Adikaram, K.K.L.B.; Hussein, M.A.; Effenberger, M.; Becker, T. Continuous Learning Graphical Knowledge Unit for Cluster Identification in High Density Data Sets. Symmetry 2016, 8, 152. https://doi.org/10.3390/sym8120152