Nonsmooth Clustering

"Without mathematics, there's nothing you can do. Everything around you is mathematics. Everything around you is numbers."

- Shakuntala Devi

LMBM-Clust Software

LMBM-Clust [1,2] is a nonsmooth optimization (NSO) based clustering algorithm for solving the minimum sum-of-squares clustering (MSSC) problem in large data sets. The LMBM-Clust -method consist of two different algorithms: an incremental algorithm is used to solve clustering problems globally and at each iteration of this algorithm the LMBM algorithm [3] is used to solve both the clustering and the auxiliary clustering problems with different starting points. In addition to the k-partition problem, LMBM-Clust solves also all intermediate l-partition problems where l=1,…,k-1 due to the incremental approach used.

The software is free for academic teaching and research purposes but I ask you to refer the corresbonding reference given below if you use it.

Code (version 2)

lmbmclust.f95	- Mainprogram for LMBM-Clust software.
parameters.f95	- Parameters.
initlmbmclust.f95	- Initialization of clustering parameters and LMBM method.
lmbmclustmod.f95	- Subroutines for clustering software.
functionmod.f95	- Computation of function and subgradient values for clustering software.
lmbm.f95	- LMBM.
objfun.f95	- Computation of the function and (sub)gradients values for LMBM.
subpro.f95	- subprograms for lmbm.
Makefile	- makefile.

iris.txt	- Sample data set obtained from UCI machine learning repository [4].

lmbm_clust_v2.tar.gz	- all the above in compressed form.
read_me	- Instructions file.

lmbmclust.f95

- Mainprogram for LMBM-Clust software.

parameters.f95

- Parameters.

initlmbmclust.f95

- Initialization of clustering parameters and LMBM method.

lmbmclustmod.f95

- Subroutines for clustering software.

functionmod.f95

- Computation of function and subgradient values for clustering software.

lmbm.f95

- LMBM.

objfun.f95

- Computation of the function and (sub)gradients values for LMBM.

subpro.f95

- subprograms for lmbm.

Makefile

- makefile.

iris.txt

- Sample data set obtained from UCI machine learning repository [4].

lmbm_clust_v2.tar.gz

- all the above in compressed form.

read_me

- Instructions file.

To use the software modify initlmbmclust.f95 as needed and give your data set similar to iris.txt here.

You can also try the older version of the code.

Code (version 1)

lmbm_clust_v1.tar.gz	- Older version of LMBM-Clust in compressed form (may be a little bit more accurate than version 2 but also more time consuming).

lmbm_clust_v1.tar.gz

- Older version of LMBM-Clust in compressed form (may be a little bit more accurate than version 2 but also more time consuming).

DC-Clustering Algorithms

DCD-Bundle [5,6] and DC-Clust [7] are optimization based clustering algorithms that explicitly utilizes the DC structure (difference of two convex functions) of the minimum sum-of-squares clustering (MSSC) problem. In order to obtain a global solution of a MSSC problem, clusters are computed incrementally. That is, in addition to the k-partition problem, DC-Clust and DCD-Bundle solve also all intermediate l-partition problems where l=1,…,k-1.

The software below includes both DCD-Bundle and DC-Clust algorithms. You can switch between the algorithms by changing parameter "optmet" in initclust.f95. The original clustering code with DC-Clust algorithm is developed by Adil Bagirov (2015). The Fortran 95 version of the code and optimization with DCD-Bundle algorithm is made by Napsu Karmitsa (2016). The software is free for academic teaching and research purposes but I ask you to refer the corresbonding reference given below if you use it.

Code

dcclustering.f95	- Mainprogram for clustering software.
parameters.f95	- Parameters.
initclust.f95	- Initialization of clustering parameters and optimization methods.
clusteringmod.f95	- Subroutines for clustering software.
functionmod.f95	- Computation of function and (sub)gradients values for clustering software.
dcclust_method.f95	- DC-Clust method.
dcdb.f95	- DCD-Bundle method.
dcfun.f95	- Computation of the function and (sub)gradients values for DCD-Bundle.
subpro.f95	- subprograms for DCD-Bundle.
Makefile	- makefile.

iris.txt	- Sample data set obtained from UCI machine learning repository [4].

dcclustering.tar.gz	- all the above in compressed form.
read_me	- Instructions file.

dcclustering.f95

- Mainprogram for clustering software.

parameters.f95

- Parameters.

initclust.f95

- Initialization of clustering parameters and optimization methods.

clusteringmod.f95

- Subroutines for clustering software.

functionmod.f95

- Computation of function and (sub)gradients values for clustering software.

dcclust_method.f95

- DC-Clust method.

dcdb.f95

- DCD-Bundle method.

dcfun.f95

- Computation of the function and (sub)gradients values for DCD-Bundle.

subpro.f95

- subprograms for DCD-Bundle.

Makefile

- makefile.

iris.txt

- Sample data set obtained from UCI machine learning repository [4].

dcclustering.tar.gz

- all the above in compressed form.

read_me

- Instructions file.

To use the software modify initclust.f95 (and dcclustering.f95) as needed and give your data set similar to iris.txt here.

Data Sets

Some real world data sets for testing and validation of clustering algorithms (see e.g. [1,2,5,6,7]) are given below. The brief description of these data sets are given in Tables 1 and 2. The more detailed description of these sets and many more real world data sets can be found in UCI machine learning repository [4] and TSPLIB — A library of travelling salesman and related problem instance [8]. All the data sets given here contain only numeric features and they do not have missing values. In case of incomplete data you may want to preprocess it with the IviaCLR (Imputation via Clusterwise Linear Regression).

Table 1: Big Data Sets

Data set	No. instances	No. attributes	References
ISOLET	7797	616	UCI
Gisette	13500	5000	[9], UCI
Gas Sensor Array Drift	13910	128	[10], UCI
EEG Eye State	14980	14	UCI
D15112	15112	2	[8], TSPLIB
Online News Popularity	39644	58	[11], UCI
KEGG Metabolic	53413	20	[12], UCI
Shuttle Control	58000	9	Thanks to NASA, UCI
Sensorless Drive Diagnosis	58509	49	UCI
Pla85900	85900	2	[8], TSPLIB
MiniBooNE particle identification	130064	50	UCI
Skin Segmentation	245057	3	[13], UCI
3D Road Network	434874	3	[14], UCI

Table 2: Small Data Sets

Data set	No. instances	No. attributes	References
Iris Plants	150	4	UCI
pcb3038	3038	2	[8], TSPLIB
Diabetes	768	8	UCI
Image Segmentation	2310	19	UCI

References

N. Karmitsa, A. Bagirov, S. Taheri, "Clustering in large data sets with the limited memory bundle method" (author version), Pattern Recognition, Vol. 83, pp. 245-259, 2018. The publication is available online at www.sciencedirect.com (free access before August 07, 2018).
Napsu Karmitsa, Adil Bagirov, and Sona Taheri, "MSSC Clustering of Large Data using the Limited Memory Bundle Method", TUCS Technical Report, No. 1164, Turku Centre for Computer Science, Turku, 2016.
Napsu Haarala, Kaisa Miettinen, Marko M. Mäkelä, "Globally Convergent Limited Memory Bundle Method for Large-Scale Nonsmooth Optimization" (author version), Mathematical Programming, Vol. 109, No. 1, pp. 181-205, 2007. DOI 10.1007/s10107-006-0728-2. The original publication is available online at www.springerlink.com.
Lichman, UCI machine learning repository . Irvine, CA: University of California, School of Information and Computer Science, 2013.
N. Karmitsa, A. Bagirov, S. Taheri, "New diagonal bundle method for clustering problems in large data sets" (author version), European Journal of Operational Research, Vol. 263, No. 2, pp. 367-379, 2017. DOI: 10.1016/j.ejor.2017.06.010. The publication is available online at www.sciencedirect.com.
Napsu Karmitsa, Adil Bagirov, and Sona Taheri, "Diagonal Bundle Method for Solving the Minimum Sum-of-Squares Clustering Problems ", TUCS Technical Report, No. 1156, Turku Centre for Computer Science, Turku, 2016.
Adil Bagirov, Sona Taheri, and Julien Ugon, "Nonsmooth DC programming approach to the minimum sum-of-squares clustering problems", Pattern Recognition, Vol. 53, pp. 12–24, 2016.
Bob Bixby and Gerd Reinelt, TSPLIB — A library of travelling salesman and related problem instance,1995.
Isabelle Guyon, Steve R. Gunn, Asa Ben-Hur, Gideon Dror, "Result analysis of the NIPS 2003 feature selection challenge". In: NIPS (2004).
Alexander Vergara,Shankar Vembu,Tuba Ayhan,Margaret A. Ryan,Margie L. Homer and Ramon Huerta, "Chemical gas sensor drift compensation using classifier ensembles", Sensors and Actuators B: Chemical (2012) DOI: 10.1016/j.snb.2012.01.074.
K. Fernandes, P. Vinagre and P. Cortez, "A Proactive Intelligent Decision Support System for Predicting the Popularity of Online News". Proceedings of the 17th EPIA 2015 - Portuguese Conference on Artificial Intelligence, September, Coimbra, Portugal (2015).
Muhammad Naeem and Sohail Asghar, Centre of Research in Data Engineering Islamabad Pakistan, naeems.naeem '@' gmail.com, sohail.asghar '@' gmail.com. 2011.
Rajen Bhatt and Abhinav Dhall, "Skin Segmentation Dataset" UCI Machine Learning Repository
Manohar Kaul, Bin Yang and Christian S. Jensen, "Building Accurate 3D Spatial Networks to Enable Next Generation Intelligent Transportation Systems", Proceedings of International Conference on Mobile Data Management (IEEE MDM), June 3-6 2013, Milan, Italy.

Acknowledgements

The work was financially supported by the Academy of Finland (Project No. 289500) and Australian Research Counsil’s Discovery Projects funding scheme (Project No. DP140103213).