# Nonsmooth Clustering

## LMBM-Clust Software

**LMBM-Clust** [1,2] is a nonsmooth optimization (NSO) based clustering algorithm for solving the minimum sum-of-squares clustering (MSSC) problem in large data sets. The LMBM-Clust -method consist of two different algorithms: an incremental algorithm is used to solve clustering problems globally and at each iteration of this algorithm the LMBM algorithm [3] is used to solve both the clustering and the auxiliary clustering problems with different starting points. In addition to
the *k*-partition problem, LMBM-Clust solves also all intermediate
*l*-partition problems where *l*=1,…,*k*-1 due to the incremental approach used.

The software is free for academic teaching and research purposes but I ask you to refer the corresbonding reference given below if you use it.

### Code

lmbmclust.f95 | - Mainprogram for LMBM-Clust software. |
---|---|

parameters.f95 | - Parameters. |

initlmbmclust.f95 | - Initialization of clustering parameters and LMBM method. |

lmbmclustmod.f95 | - Subroutines for clustering software. |

functionmod.f95 | - Computation of function and subgradient values for clustering software. |

lmbm.f95 | - LMBM. |

objfun.f95 | - Computation of the function and (sub)gradients values for LMBM. |

subpro.f95 | - subprograms for lmbm. |

Makefile | - makefile. |

iris.txt | - Sample data set obtained from UCI machine learning repository [4]. |

lmbm_clust.tar.gz | - all the above in compressed form. |

read_me | - Instructions file. |

To use the software modify initlmbmclust.f95 as needed and give your data set similar to iris.txt here.

## DC-Clustering Algorithms

**DCD-Bundle** [5,6] and **DC-Clust** [7] are optimization based clustering algorithms that explicitly utilizes the DC structure (difference of two convex functions) of the minimum sum-of-squares clustering (MSSC) problem. In order to obtain a global solution of a MSSC problem, clusters are computed incrementally. That is, in addition to
the *k*-partition problem, DC-Clust and DCD-Bundle solve also all intermediate
*l*-partition problems where *l*=1,…,*k*-1.

The software below includes both DCD-Bundle and DC-Clust algorithms. You can switch between the algorithms by changing parameter "optmet" in initclust.f95. The original clustering code with DC-Clust algorithm is developed by Adil Bagirov (2015). The Fortran 95 version of the code and optimization with DCD-Bundle algorithm is made by Napsu Karmitsa (2016). The software is free for academic teaching and research purposes but I ask you to refer the corresbonding reference given below if you use it.

### Code

dcclustering.f95 | - Mainprogram for clustering software. |
---|---|

parameters.f95 | - Parameters. |

initclust.f95 | - Initialization of clustering parameters and optimization methods. |

clusteringmod.f95 | - Subroutines for clustering software. |

functionmod.f95 | - Computation of function and (sub)gradients values for clustering software. |

dcclust_method.f95 | - DC-Clust method. |

dcdb.f95 | - DCD-Bundle method. |

dcfun.f95 | - Computation of the function and (sub)gradients values for DCD-Bundle. |

subpro.f95 | - subprograms for DCD-Bundle. |

Makefile | - makefile. |

iris.txt | - Sample data set obtained from UCI machine learning repository [4]. |

dcclustering.tar.gz | - all the above in compressed form. |

read_me | - Instructions file. |

To use the software modify initclust.f95 (and dcclustering.f95) as needed and give your data set similar to iris.txt here.

## Data Sets

Some real world data sets for testing and validation of clustering algorithms (see e.g. [1,2,5,6,7]) are given below. The brief description of these data sets are given in Tables 1 and 2. The more detailed description of these sets and many more real world data sets can be found in UCI machine learning repository [4] and TSPLIB — A library of travelling salesman and related problem instance [8]. All the data sets given here contain only numeric features and they do not have missing values.

### Table 1: Big Data Sets

Data set | No. instances | No. attributes | References |

ISOLET | 7797 | 616 | UCI |

Gisette | 13500 | 5000 | [9], UCI |

Gas Sensor Array Drift | 13910 | 128 | [10], UCI |

EEG Eye State | 14980 | 14 | UCI |

D15112 | 15112 | 2 | [8], TSPLIB |

Online News Popularity | 39644 | 58 | [11], UCI |

KEGG Metabolic | 53413 | 20 | [12], UCI |

Shuttle Control | 58000 | 9 | Thanks to NASA, UCI |

Sensorless Drive Diagnosis | 58509 | 49 | UCI |

Pla85900 | 85900 | 2 | [8], TSPLIB |

MiniBooNE particle identification | 130064 | 50 | UCI |

Skin Segmentation | 245057 | 3 | [13], UCI |

3D Road Network | 434874 | 3 | [14], UCI |

### Table 2: Small Data Sets

Data set | No. instances | No. attributes | References |

Iris Plants | 150 | 4 | UCI |

pcb3038 | 3038 | 2 | [8], TSPLIB |

Diabetes | 768 | 8 | UCI |

Image Segmentation | 2310 | 19 | UCI |

## References

- Napsu Karmitsa, Adil Bagirov and Sona Taheri, "Clustering in Large Data Sets with the Limited Memory Bundle Method", submitted, 2016.
- Napsu Karmitsa, Adil Bagirov, and Sona Taheri, "MSSC Clustering of Large Data using the Limited Memory Bundle Method", TUCS Technical Report, No. 1164, Turku Centre for Computer Science, Turku, 2016.
- Napsu Haarala, Kaisa Miettinen, Marko M. Mäkelä, "Globally Convergent Limited Memory Bundle Method for Large-Scale Nonsmooth Optimization" (author version), Mathematical Programming, Vol. 109, No. 1, pp. 181-205, 2007. DOI 10.1007/s10107-006-0728-2. The original publication is available online at www.springerlink.com.
- Lichman, UCI machine learning repository . Irvine, CA: University of California, School of Information and Computer Science, 2013.
- Napsu Karmitsa, Adil Bagirov, and Sona Taheri, "New diagonal bundle method for clustering problems in very large data sets", submitted, 2016.
- Napsu Karmitsa, Adil Bagirov, and Sona Taheri, "Diagonal Bundle Method for Solving the Minimum Sum-of-Squares Clustering Problems ", TUCS Technical Report, No. 1156, Turku Centre for Computer Science, Turku, 2016.
- Adil Bagirov, Sona Taheri, and Julien Ugon, "Nonsmooth DC programming approach to the minimum sum-of-squares clustering problems", Pattern Recognition, Vol. 53, pp. 12–24, 2016.
- Bob Bixby and Gerd Reinelt, TSPLIB — A library of travelling salesman and related problem instance,1995.
- Isabelle Guyon, Steve R. Gunn, Asa Ben-Hur, Gideon Dror, "Result analysis of the NIPS 2003 feature selection challenge". In: NIPS (2004).
- Alexander Vergara,Shankar Vembu,Tuba Ayhan,Margaret A. Ryan,Margie L. Homer and Ramon Huerta, "Chemical gas sensor drift compensation using classifier ensembles", Sensors and Actuators B: Chemical (2012) DOI: 10.1016/j.snb.2012.01.074.
- K. Fernandes, P. Vinagre and P. Cortez, "A Proactive Intelligent Decision Support System for Predicting the Popularity of Online News". Proceedings of the 17th EPIA 2015 - Portuguese Conference on Artificial Intelligence, September, Coimbra, Portugal (2015).
- Muhammad Naeem and Sohail Asghar, Centre of Research in Data Engineering Islamabad Pakistan, naeems.naeem '@' gmail.com, sohail.asghar '@' gmail.com. 2011.
- Rajen Bhatt and Abhinav Dhall, "Skin Segmentation Dataset" UCI Machine Learning Repository
- Manohar Kaul, Bin Yang and Christian S. Jensen, "Building Accurate 3D Spatial Networks to Enable Next Generation Intelligent Transportation Systems", Proceedings of International Conference on Mobile Data Management (IEEE MDM), June 3-6 2013, Milan, Italy.

## Acknowledgements

The work was financially supported by the Academy of Finland (Project No. 289500) and Australian Research Counsil’s Discovery Projects funding scheme (Project No. DP140103213).