Nonsmooth Clustering
Big-Clust Software
BigClust [15] is a stochastic nonsmooth optimization-based incremental clustering software for solving the minimum sum-of-squares clustering (MSSC) problem in very large-scale and big data sets. BigClust consists of two different algorithms: an incremental algorithm is used to solve clustering problems globally. At each iteration of this algorithm, the stochastic limited memory bundle algorithm (SLMBA) is used to solve both the clustering and the auxiliary clustering problems with different starting points. In addition to the k-partition problem, BigClust solves all intermediate l-partition problems where l=1,…,k-1 due to the incremental approach used.
Code (version 0.1)
The BigClust code is available at github.
LMBM-Clust Software
LMBM-Clust [1,2] is a nonsmooth optimization (NSO) based clustering algorithm for solving the minimum sum-of-squares clustering (MSSC) problem in large data sets. The LMBM-Clust -method consists of two different algorithms: an incremental algorithm is used to solve clustering problems globally. At each iteration of this algorithm, the LMBM algorithm [3] is used to solve both the clustering and the auxiliary clustering problems with different starting points. In addition to the k-partition problem, LMBM-Clust solves all intermediate l-partition problems where l=1,…,k-1 due to the incremental approach used.
The software is free for academic teaching and research purposes but I ask you to refer to the corresponding reference given below if you use it.
Code (version 2)
lmbmclust.f95 | - Mainprogram for LMBM-Clust software. |
---|---|
parameters.f95 | - Parameters. |
initlmbmclust.f95 | - Initialization of clustering parameters and LMBM method. |
lmbmclustmod.f95 | - Subroutines for clustering software. |
functionmod.f95 | - Computation of function and subgradient values for clustering software. |
lmbm.f95 | - LMBM. |
objfun.f95 | - Computation of the function and (sub)gradients values for LMBM. |
subpro.f95 | - subprograms for lmbm. |
Makefile | - makefile. |
iris.txt | - Sample data set obtained from UCI machine learning repository [4]. |
lmbm_clust_v2.tar.gz | - all the above in compressed form. |
read_me | - Instructions file. |
To use the software modify initlmbmclust.f95 as needed and give your data set similar to iris.txt here.
You can also try the older version of the code.
Code (version 1)
lmbm_clust_v1.tar.gz | - Older version of LMBM-Clust in compressed form (may be a little bit more accurate than version 2 but also more time consuming). |
---|---|
DC-Clustering Algorithms
DCD-Bundle [5,6] and DC-Clust [7] are optimization based clustering algorithms that explicitly utilizes the DC structure (difference of two convex functions) of the minimum sum-of-squares clustering (MSSC) problem. In order to obtain a global solution of a MSSC problem, clusters are computed incrementally. That is, in addition to the k-partition problem, DC-Clust and DCD-Bundle solve also all intermediate l-partition problems where l=1,…,k-1.
The software below includes both DCD-Bundle and DC-Clust algorithms. You can switch between the algorithms by changing parameter "optmet" in initclust.f95. The original clustering code with DC-Clust algorithm is developed by Adil Bagirov (2015). The Fortran 95 version of the code and optimization with DCD-Bundle algorithm is made by Napsu Karmitsa (2016). The software is free for academic teaching and research purposes but I ask you to refer the corresbonding reference given below if you use it.
Code
dcclustering.f95 | - Mainprogram for clustering software. |
---|---|
parameters.f95 | - Parameters. |
initclust.f95 | - Initialization of clustering parameters and optimization methods. |
clusteringmod.f95 | - Subroutines for clustering software. |
functionmod.f95 | - Computation of function and (sub)gradients values for clustering software. |
dcclust_method.f95 | - DC-Clust method. |
dcdb.f95 | - DCD-Bundle method. |
dcfun.f95 | - Computation of the function and (sub)gradients values for DCD-Bundle. |
subpro.f95 | - subprograms for DCD-Bundle. |
Makefile | - makefile. |
iris.txt | - Sample data set obtained from UCI machine learning repository [4]. |
dcclustering.tar.gz | - all the above in compressed form. |
read_me | - Instructions file. |
To use the software modify initclust.f95 (and dcclustering.f95) as needed and give your data set similar to iris.txt here.
Data Sets
Some real world data sets for testing and validation of clustering algorithms (see e.g. [1,2,5,6,7]) are given below. The brief description of these data sets are given in Tables 1 and 2. The more detailed description of these sets and many more real world data sets can be found in UCI machine learning repository [4] and TSPLIB — A library of travelling salesman and related problem instance [8]. All the data sets given here contain only numeric features and they do not have missing values. In case of incomplete data you may want to preprocess it with the IviaCLR (Imputation via Clusterwise Linear Regression).
Table 1: Big Data Sets
Data set | No. instances | No. attributes | References |
ISOLET | 7797 | 616 | UCI |
Gisette | 13500 | 5000 | [9], UCI |
Gas Sensor Array Drift | 13910 | 128 | [10], UCI |
EEG Eye State | 14980 | 14 | UCI |
D15112 | 15112 | 2 | [8], TSPLIB |
Online News Popularity | 39644 | 58 | [11], UCI |
KEGG Metabolic | 53413 | 20 | [12], UCI |
Shuttle Control | 58000 | 9 | Thanks to NASA, UCI |
Sensorless Drive Diagnosis | 58509 | 49 | UCI |
Pla85900 | 85900 | 2 | [8], TSPLIB |
MiniBooNE particle identification | 130064 | 50 | UCI |
Skin Segmentation | 245057 | 3 | [13], UCI |
3D Road Network | 434874 | 3 | [14], UCI |
Table 2: Small Data Sets
Data set | No. instances | No. attributes | References |
Iris Plants | 150 | 4 | UCI |
pcb3038 | 3038 | 2 | [8], TSPLIB |
Diabetes | 768 | 8 | UCI |
Image Segmentation | 2310 | 19 | UCI |
References
- N. Karmitsa, A. Bagirov, S. Taheri, "Clustering in large data sets with the limited memory bundle method" (author version), Pattern Recognition, Vol. 83, pp. 245-259, 2018. The publication is available online at www.sciencedirect.com.
- N. Karmitsa, A. Bagirov, and S. Taheri, "MSSC Clustering of Large Data using the Limited Memory Bundle Method", TUCS Technical Report, No. 1164, Turku Centre for Computer Science, Turku, 2016.
- N. Haarala, K. Miettinen, M.M. Mäkelä, "Globally Convergent Limited Memory Bundle Method for Large-Scale Nonsmooth Optimization" (author version), Mathematical Programming, Vol. 109, No. 1, pp. 181-205, 2007. DOI 10.1007/s10107-006-0728-2. The original publication is available online at www.springerlink.com.
- Lichman, UCI machine learning repository . Irvine, CA: University of California, School of Information and Computer Science, 2013.
- N. Karmitsa, A. Bagirov, S. Taheri, "New diagonal bundle method for clustering problems in large data sets" (author version), European Journal of Operational Research, Vol. 263, No. 2, pp. 367-379, 2017. DOI: 10.1016/j.ejor.2017.06.010. The publication is available online at www.sciencedirect.com.
- N. Karmitsa, A. Bagirov, and S. Taheri, "Diagonal Bundle Method for Solving the Minimum Sum-of-Squares Clustering Problems ", TUCS Technical Report, No. 1156, Turku Centre for Computer Science, Turku, 2016.
- A. Bagirov, S. Taheri, and J. Ugon, "Nonsmooth DC programming approach to the minimum sum-of-squares clustering problems", Pattern Recognition, Vol. 53, pp. 12–24, 2016.
- B. Bixby and G. Reinelt, TSPLIB — A library of travelling salesman and related problem instance,1995.
- I. Guyon, S.R. Gunn, A. Ben-Hur, G. Dror, "Result analysis of the NIPS 2003 feature selection challenge". In: NIPS (2004).
- A. Vergara, S. Vembu, T. Ayhan, M.A. Ryan, M.L. Homer, and R. Huerta, "Chemical gas sensor drift compensation using classifier ensembles", Sensors and Actuators B: Chemical (2012) DOI: 10.1016/j.snb.2012.01.074.
- K. Fernandes, P. Vinagre, and P. Cortez, "A Proactive Intelligent Decision Support System for Predicting the Popularity of Online News". Proceedings of the 17th EPIA 2015 - Portuguese Conference on Artificial Intelligence, September, Coimbra, Portugal (2015).
- M. Naeem and S. Asghar, Centre of Research in Data Engineering Islamabad Pakistan, naeems.naeem '@' gmail.com, sohail.asghar '@' gmail.com. 2011.
- R. Bhatt and A. Dhall, "Skin Segmentation Dataset" UCI Machine Learning Repository
- M. Kaul, B. Yang, and C.S. Jensen, "Building Accurate 3D Spatial Networks to Enable Next Generation Intelligent Transportation Systems", Proceedings of International Conference on Mobile Data Management (IEEE MDM), June 3-6 2013, Milan, Italy.
- N. Karmitsa, V.-P. Eronen, M.M. Mäkelä, T. Pahikkala, and A. Airola, "Stochastic Limited Memory Bundle Algorithm for Clustering in Big Data", submitted, 2024.
Acknowledgements
The work was financially supported by the Academy of Finland (Project No. 289500) and Australian Research Counsil’s Discovery Projects funding scheme (Project No. DP140103213).