Wednesday, July 16, 2014

Comparing localsolver with Rglpk on k-medoids example

Recently I have coauthored a new localsolver package that can be used to solve large scale optimization problems from R. It is a wrapper around commercial solver that is free for academia. If you are interested why it is worthwhile to give it a look - read on.
There is a host of packages on optimization CRAN Task View so why bother with a new one? There are two reasons: (1) flexibility and speed of LocalSolver optimization engine and (2) ease of use - which is often a problem with traditional packages like Rglpk - especially for large scale problems.
Today I will introduce the package and focus on ease of use. Next time I will provide some performance comparisons to existing solvers.

Before we begin

First you need to setup your environment. There are two steps.
You should start with downloading LocalSolver and installing it.
Next simply install localsolver package (if you have problems with installation on Mac use type="source" option of install.pacakges should solve them).
In case LocalSolver is not in search path of your system after installation either add it or remember its path as in such cases you can pass it to computational routines.
Now we are ready to go.

The test problem

As this is an example of ease of use we will focus on an elementary integer programming task. Suppose that we have 20 locations. Out of them we have to select 5 depots and we want to assign every location to nearest depot. We know distances between all stores (distance matrix does not have to be symmetric). Our objective is to minimize sum of all distances between depots and locations assigned to them. We can see that the problem is simply a k-medoids clustering task but without assumption of symmetry of distance matrix.

Let us start with code that generates test matrices for our examples:

kmedoids.generate.data <- function(seed, N, P) {
  if(!is.null(seed)) {
    set.seed(seed)  
  }
  InfiniteDist <- 10000.0
  Dist <- matrix(round(runif(n=N * N) * 100, digits = 0),
                 nrow = N)
  diag(Dist) <- 0
  
  return(list(
    N = as.integer(N),
    P = as.integer(P),    
    Dist = Dist,
    InfiniteDist = InfiniteDist))
}

data <- kmedoids.generate.data(seed = 1234, N = 20L, P = 5L)

One thing that is worth explaining is that in the code we define InfiniteDist which will serve us as surrogate for infinity (all actual distances are shorter than it).

Now let me present how the problem can be solved with localsolver and Rglpk.

localsolver approach

Here is the code:

library(localsolver)

lsp.model <- "function model() {
  x[1..N] <- bool() ; // point i is in P iff x[i] = 1

  constraint sum[i in 1..N](x[i]) == P ;
  
  minDist[i in 1..N] <- min[j in 1..N](
    x[j] ? Dist[j][i] : InfiniteDist);
  // minimize sum of distances
  objective <- sum[i in 1..N]( minDist[i] ) ;
  minimize objective;
}"

lsp <- ls.problem(model.text.lsp = lsp.model)
lsp <- set.params(lsp = lsp, lsTimeLimit = 1)
lsp <- set.params(lsp = lsp, lsNbThreads = 1)
lsp <- add.output.expr(lsp=lsp, expr.text.lsp = "x",
                       dimensions = data$N)
lsp <- add.output.expr(lsp=lsp, expr.text.lsp = "objective",
                       dimensions = 1)
lsp.solution <- ls.solve(lsp = lsp, data = data)

Notice that we can define the optimization problem using a simple domain specific language as a string. Interestingly in the string we can use all variables defined in data list as localsolver is able to interact with extrernal data source. Additionally using set.params and add.output.expr we can set up the parameters of optimization engine and optimization output we will want to collect.
As a result we get a list with optimization output. Importantly the contents of the list can be controlled programatically as mentioned above.

Rglpk approach

Now let us solve the same problem using Rglpk:

library(Rglpk)

objective <- c(as.numeric(data$Dist), rep(0, data$N))

types <- c(rep("C", length(data$Dist)),
           rep("B", data$N))

# Exactly P centroids
# sum[i in 1..N] y[i] == P
mat <- matrix(data = c(rep(0, length(data$Dist)),
              rep(1, data$N)), nrow=1)
rhs <- c(data$P)
dir <- c("==")

# Each datapoint only to one centroid
# x[i,j] - ith datapoint to jth centroid
# For each j in 1..N sum[i in 1..N] x[i,j] == 1
for(j in 1:data$N) {
  mat <- rbind(mat,
               c(
                 rep(0, (j-1)*data$N),
                 rep(1, data$N),
                 rep(0, ncol(mat) - j*data$N)))  
}
rhs <- c(rhs, rep(1, data$N))
dir <- c(dir, rep("==", data$N))

# Client cannot be assigned to notselected centroid
# x[i,j] <= y[j]
# x[i,j] - y[j] <= 0
for(i in 1:data$N) {    
  for(j in 1:data$N) {
    row <- rep(0, ncol(mat))    
    row[(i-1)*data$N + j] <- 1
    row[length(data$Dist) + j] <- -1
    mat <- rbind(mat, row)                 
  }
}
rhs <- c(rhs, rep(0, data$N*data$N))
dir <- c(dir, rep("<=", data$N*data$N))

glpk.sol <- Rglpk_solve_LP(
  obj = objective,
  mat = mat,
  dir = dir,
  rhs = rhs,
  types = types,
  max = FALSE,
  control = list(verbose = TRUE,
                 canonicalize_status = TRUE))

In my opinion the code looks cryptic in comparison to localsolver specification. There are two reasons for such a situation:

  1. we have to have two mix two types of decision variables: location of depots and assignment of locations to depots
  2. we have to construct a complex constraints matrix combining different types of conditions in one big data structure
In fact even in order to directly learn which locations are selected as depots we have to dig into Rglpk output with  tail(glpk.sol$solution, n = data$N).

Summary

We have tested localsolver on linear integer programming problem. I wanted to show that the package allows the user to solve optimization tasks:

  • using natural (mathematical) formulation of objective function and constraints
  • allowing to dynamically update reference data: that lsp.model string did not have any hard coded constants - everything is loaded from metadata provided as R list.
In my opinion it is not only more elastic in comparison to Rglpk solution presented above but also more robust. Coding Rglpk solution took much more time as it required much more thought in the design of Rglpk_solve_LP input variables and a lot of corrections of errors in the code.

What is more localsolver is not only more user friendly but is also not limited to mixed integer linear optimization - it can solve highly complex non-linear tasks as well.

Now a natural question arises: what is the performance of localsolver in terms of speed and the quality of obtained solution. I will tackle them in my next post.

No comments:

Post a Comment

Note: Only a member of this blog may post a comment.