SOS-SDP is an exact algorithm based on the branch-and-bound technique for solving the Minimum Sum-of-Squares Clustering (MSSC) problem described in the paper "SOS-SDP: an Exact Solver for Minimum Sum-of-Squares Clustering". This repository contains the C++ source code, the MATLAB scripts, and the datasets used for the experiments.
V. Piccialli, A. M. Sudoso, A. Wiegele (2022). SOS-SDP: An Exact Solver for Minimum Sum-of-Squares Clustering, INFORMS Journal on Computing, https://doi.org/10.1287/ijoc.2022.1166.
SOS-SDP calls the semidefinite programming solver SDPNAL+ by using the MATLAB Engine API for C++. It requires the MATLAB engine library libMatlabEngine and the Matlab Data Array library libMatlabDataArray. SOS-SDP uses the Armadillo library to handle matrices and linear algebra operations efficiently. Before installing Armadillo, first install OpenBLAS and LAPACK along with the corresponding development files. SOS-SDP implements a configurable thread pool of POSIX threads to speed up the branch-and-bound search.
Ubuntu and Debian instructions:
- Install MATLAB (>= 2016b).
- Install CMake, OpenBLAS, LAPACK and Armadillo:
sudo apt-get update
sudo apt-get install cmake libopenblas-dev liblapack-dev libarmadillo-dev
- Open the makefile
clustering_c++/Makefileand set the variablematlab_pathwith your MATLAB root folder. - Compile the code:
cd clustering_c++/
make
- Download SDPNAL+, move the folder
clustering_matlabcontaining the MATLAB source code of SOS-SDP in the SDPNAL+ main directory and set the parameterSDP_SOLVER_FOLDERof the configuration file accordingly. This folder and its subfolders will be automatically added to the MATLAB search path when SOS-SDP starts.
The code has been tested on Ubuntu Server 20.04 with MATLAB R2020b and Armadillo 10.2.
Various parameters used in SOS-SDP can be modified in the configuration file clustering_c++/config.txt:
BRANCH_AND_BOUND_TOL- optimality tolerance of the branch-and-boundBRANCH_AND_BOUND_PARALLEL- thread pool size: single thread (1), multi-thread (> 1)BRANCH_AND_BOUND_MAX_NODES- maximum number of nodesBRANCH_AND_BOUND_VISITING_STRATEGY- best first (0), depth first (1), breadth first (2)SDP_SOLVER_SESSION_THREADS_ROOT- number of threads for the MATLAB session at the rootSDP_SOLVER_SESSION_THREADS- number of threads for the MATLAB session for the ML and CL nodesSDP_SOLVER_FOLDER- full path of the SDPNAL+ folderSDP_SOLVER_TOL- accuracy of SDPNAL+SDP_SOLVER_VERBOSE- do not display log (0), display log (1)SDP_SOLVER_MAX_CP_ITER_ROOT- maximum number of cutting-plane iterations at the rootSDP_SOLVER_MAX_CP_ITER- maximum number of cutting-plane iterations for the ML and CL nodesSDP_SOLVER_CP_TOL- cutting-plane tolerance between two consecutive cutting-plane iterationsSDP_SOLVER_MAX_INEQ- maximum number of valid inequalities to addSDP_SOLVER_INHERIT_PERC- fraction of inequalities to inheritSDP_SOLVER_EPS_INEQ- tolerance for checking the violation of the inequalitiesSDP_SOLVER_EPS_ACTIVE- tolerance for detecting the active inequalitiesSDP_SOLVER_MAX_PAIR_INEQ- maximum number of pair inequalities to separateSDP_SOLVER_PAIR_PERC- fraction of the most violated pair inequalities to addSDP_SOLVER_MAX_TRIANGLE_INEQ- maximum number of triangle inequalities to separateSDP_SOLVER_TRIANGLE_PERC- fraction of the most violated triangle inequalities to addKMEANS_SDP_BASED- constrained k-means with k-means++ initialization (0), sdp-based initialization (1)KMEANS_MAX_ITER- maximum number of k-means iterationsKMEANS_N_START- number of times k-means is runKMEANS_VERBOSE- do not display log (0), display log (1)
cd clustering_c++/
./bb <DATASET> <K> <LOG>
DATASET- the path of the datasetK- the number of clustersLOG- the path of the log file
The dataset file contains the data points x_ij and the must include an header line with the problem size n and the dimension d:
n d
x_11 x_12 ... x_1d
x_21 x_22 ... x_2d
...
...
x_n1 x_n2 ... x_nd
The log file reports the progress of the algorithm:
N- size of the current nodeNODE_PAR- id of the parent nodeNODE- id of the current nodeLB_PAR- lower bound of the parent nodeLB- lower bound of the current nodeFLAG- termination flag of SDPNAL+0- SDP is solved to the required accuracy1- SDP is not solved successfully-1, -2, -3- SDP is partially solved successfully
TIME (s)- computational time in seconds of the current nodeCP_ITER- number of cutting-plane iterationsCP_FLAG- termination flag of the cutting-plane procedure-3- current bound is worse than the previous one-2- SDP is not solved successfully-1- maximum number of iterations0- no violated inequalities1- maximum number of inequalities2- node must be pruned3- cutting-plane tolerance
CP_INEQ- number of inequalities added in the last cutting-plane iterationPAIR TRIANGLE CLIQUE- average number of added cuts for each class of inequalitiesGUB- global upper boundI J- current branching decisionNODE_GAP- gap at the current nodeGAP- overall gapOPEN- number of open nodes
The log file prints the optimal minimum sum-of-squares (MSSC) objective and the cluster indicator matrix when the algorithm ends (point_id, cluster_id).
Log file example:
DATA_PATH, n, d, k: /home/ubuntu/SOS-SDP/datasets/glass.txt 214 9 6
LOG_PATH: /home/ubuntu/SOS-SDP/results/glass_6.txt
BRANCH_AND_BOUND_TOL: 1e-04
BRANCH_AND_BOUND_PARALLEL: 16
BRANCH_AND_BOUND_MAX_NODES: 100
BRANCH_AND_BOUND_VISITING_STRATEGY: 0
SDP_SOLVER_SESSION_THREADS_ROOT: 16
SDP_SOLVER_SESSION_THREADS: 1
SDP_SOLVER_FOLDER: /home/ubuntu/SOS-SDP/SDPNAL+/
SDP_SOLVER_TOL: 1e-05
SDP_SOLVER_VERBOSE: 0
SDP_SOLVER_MAX_CP_ITER_ROOT: 80
SDP_SOLVER_MAX_CP_ITER: 40
SDP_SOLVER_CP_TOL: 1e-05
SDP_SOLVER_MAX_INEQ: 100000
SDP_SOLVER_INHERIT_PERC: 1
SDP_SOLVER_EPS_INEQ: 0.0001
SDP_SOLVER_EPS_ACTIVE: 1e-06
SDP_SOLVER_MAX_PAIR_INEQ: 100000
SDP_SOLVER_PAIR_PERC: 0.05
SDP_SOLVER_MAX_TRIANGLE_INEQ: 100000
SDP_SOLVER_TRIANGLE_PERC: 0.05
KMEANS_SDP_BASED: 1
KMEANS_MAX_ITER: 100
KMEANS_N_START: 50
KMEANS_VERBOSE: 0
| N| NODE_PAR| NODE| LB_PAR| LB| FLAG| TIME (s)| CP_ITER| CP_FLAG| CP_INEQ| PAIR TRIANGLE CLIQUE| GUB| I J| NODE_GAP| GAP| OPEN|
| 214| -1| 0| -inf| 72.9391| 0| 151| 7| 3| 3014| 639.857 4448 8.71429| 72.9709*| -1 -1| 0.000436087| 0.000436087| 0|
| 214| 0| 1| 72.9391| 72.9644| -1| 12| 0| 2| 3014| 0 0 0| 72.9709| 107 201| 8.8639e-05| 8.8639e-05| 0|
| 213| 0| 2| 72.9391| 72.9487| 0| 31| 1| -3| 2485| 0 922 0| 72.9647*| 107 201| 0.000220183| 0.000220183| 0|
| 212| 2| 3| 72.9487| 72.9643| 0| 8| 0| 2| 2043| 0 0 0| 72.9647| 32 51| 5.25193e-06| 5.25193e-06| 0|
| 213| 2| 4| 72.9487| 72.9789| 0| 12| 0| 2| 2485| 0 0 0| 72.9647| 32 51| -0.00019452| -0.00019452| 0|
WALL_TIME: 199 sec
N_NODES: 5
AVG_INEQ: 1203.71
AVG_CP_ITER: 1.6
ROOT_GAP: 0.000436087
GAP: 0
BEST: 72.9647
V. Piccialli, A. Russo Russo, A. M. Sudoso (2022). An exact algorithm for semi-supervised minimum sum-of-squares clustering. Computers & Operations Research.
V. Piccialli, A. M. Sudoso (2023). Global optimization for cardinality-constrained minimum sum-of-squares clustering via semidefinite programming. Mathematical Programming.