Friday, June 26, 2015

Groptim: A Groovy DSL for Linear Programming

In this post I present how Groovy meta-programming features can be leveraged to implement a Linear Programming (hereafter called LP) DSL, called Groptim. Internally, Groptim uses the Apache Commons Mathematics Library and its Simplex implementation (see org.apache.commons.math3.optim.linear.SimplexSolver) to solve the LP instances.

Source code, Dependencies, License

Groptim is an open source project licensed under Apache License, Version 2.0
The code is hosted by github under the groptim repository.
Gradle is used for building and dependency management.
The only compile time dependency is the Apache Commons Mathematics Library, while testing is done with Spock.

Usage for Groovy Developers

Plenty of examples on how to use Groptim from Groovy code can be found at LPSolverSpec.groovy. In this paragraph, I pick some representative examples to present the flexibility and the limitations of the syntax.

Every LP instance can be written in the LP canonical form, which defines all the constrains to be "less than or equals to" and all the variables non-negative:

Some flexibility beyond the canonical form is offered:
  1. you can define min instead of max to minimize the objective function;
  2. constants are allowed in the objective function; 
  3. variables are allowed to be negative (if you want one to be strictly non-negative, you need to make this explicit with a constraint).
Groptim strictly adheres to the canonical form for constraints. All the constraints are deemed "less than or equals to", no matter what the comparison symbol is ('<=', '>=' or '='). The reason for this decision will be described in later sections.

Let's see how an LP instance looks like in Groptim:

The first line obviously creates an object of the main Groptim class, which is LPSolver.groovy. To define a maximization problem we write lp.max (for minimization write lp.min).

Immediately after the max (or min), we have to define the Objective Function (hereafter OF). in brackets {} (the space between max and the brackets is optional). In this case the OF is 15*x0 + 10*x1 + 7. Notice that we write 7.c to denote the constant number 7. This is needed as every term of the OF (and itself as a whole) has to be evaluated as an Expression object by Groovy. We will dive into implementation details in the next paragraph. Notice that constants are not needed in the OF. Finding a vector x that maximizes (or minimizes) 15*x0 + 10*x1 + 7 is equivalent with finding a vector x that maximizes (resp. minimizes) 15*x0 + 10*x1. This is why the canonical LP form does not allow for constants in the OF. For Groptim a constant in the OF is optional.

After the objective function block, the keyword 'subjectTo' has to follow to denote that the block that follows defines the constraints. Instead of 'subjectTo', one can write 'subject() to', but not  'subject to'. Looking at the constraints, the first two are straightforward, but the 3rd and 4rth are the result of canonizing the non-canonical constraint x0 + x1 = 4. This is equivalent with x0 + x1 <= 4 and x0 + x1 >= 4 that should both be satisfied. The first is canonical, the second can be easily canonized by multiplying -1 both sides of the inequality and therefore changing '>' with '<': -x0 - x1 <= -4.

That's it, this is valid Groovy code. After the subjectTo block you can access the lp fields to read the solution of the LP instance, e.g.:

which means that the OF is maximized at (x0, x1) = (2, 2) and its maximum value is 57.

Notice that any string (except for Groovy reserved keywords) can be used for variable names, e.g. instead of x0, x1, x2, one can write x, y, or myvar. Also coefficients can be arbitrary real numbers and standard arithmetic rules apply, e.g.
  • 0*x1 is equivalent with 0.c
  • 1.7*x0 - x1 is equivalent with 1.7*x0 + -x1
Also, non-linear OF, or constraints are not allowed in LP, therefore Groptim throws an IllegalStateException if you write a something like x0*x1 or (2*x0 + 5.c)*(4*x1-3.c).

Usage for Everyone

One of the many reasons why Groovy is "groovy", is that it can be used as a scripting language. In the script/ directory I have included the groptim script along with two input LP files written in Groptim:

Notice that the lp object is not needed in this case. I will explain in the following paragraph how the script silently injects it. You can run groptim script to solve the LP instances described in these files and get the solution printed in your console as:

To allow this magic to happen, you need to install Groovy and have the groovy command in your path. Notice that groptim script has 2 dependencies (managed by Grape), apache math ('org.apache.commons:commons-math3:3.5') and Groptim itself ('org.ytheohar:groptim:1.0'). Both will be fetched from the maven central repository. When groptim script runs it will have Ivy copy these artifacts under your .groovy/grapes/ directory.

How it works

The objective function as well as both the left and right part of each constraint is evaluated as an Expression or more precisely as one of its subclasses, i.e. NumberExpression, LPVar or BinaryExpression. Take for instance 15*x0 + 10*x1 + 7.c:
  1. x0 is evaluated as an LPVar
  2. 15*x0 as a BinaryExpression
  3. x1 as an LPVar
  4. 10*x1 as a BinaryExpression
  5. 7.c as a NumberExpression
  6.  15*x0 + 10*x1 as a BinaryExpression
  7.  15*x0 + 10*x1 + 7.c as a BinaryExpression
The simplest form of an expression is a NumberExpression. To evaluate (step 5) a constant into an expression I had to add a new method getC in the java.lang.Number class. This is done in the LPSolver constructor as follows: 
Number.metaClass.getC { new NumberExpression(delegate) }

Then, we can append any number with .c and Groovy will evaluate this by delegating to the method getC, e.g. 7.c is equivalent with 7.getC() which in turn is evaluated into new NumberExpression(7).

The evaluation of x0 and x1 (steps 1, 3) as LPVar objects is more interesting. This is a good point to introduce the LPSolver class which implements the core of the Groptim DSL:

The max method takes as argument the OF block, which is evaluated in a closure by Groovy and passes that closure to optimize method. This changes the delegate object from the object that owns the closure (i.e. the one it was defined in) to this (i.e. LPSolver instance). Then it calls the closure, which triggers the above evaluation of the OF block into a BinaryExpression.

On the first occurrence of x0 in the OF block, Groovy will search for a field x0 in the delegate object (which is an LPSolver instance). It will not find any, so it will call the propertyMissing method on the delegate object. The implementation of this method first looks if a variable with the name 'x0' has been registered before and if not it registers it. It is important to understand that if I had not set the delegate to be this in the optimize method, then Groovy would have called the propertyMissing method on the object that owns the LPSolver instance, possible throwing a MethodMissingException. This would be client's class, e.g. it wiould be LPSolverSpec, when we run a test.

Both LPVar and NumberExpression define a negative() method, to which Groovy delegates the symbol '-', e.g. -x0 (strictly no space between '-' and x0) is equivalent with x0.negative() and -7.c to 7.getC().negative().

A number multiplied with an LPVar is evaluated (steps 2 and 4) as a BinaryExpression. To do this, I had to add the multiply method into the Number.metaclass (see LPSolver constructor). This makes 15*x0 equivalent with 15.multiply(x0) which evaluates in a BinaryExpression.

Groovy delegates the symbol '+','-' (when there is one or more spaces between '-' and a variable, e.g. x0) and '*'  to plus, minus and multiply method respectively (steps 6, 7). It should be no surprise to you that Expression class defines all of them, to make possible the composition of Expressions (BinaryExpressions to be more precise) from other Expressions:

Finally, Groovy delegates all the comparison operators ('<','>','<=','>=') to the compareTo method, e.g. x0 + x1 <= 4.c is equivalent with (x0 + x1).compareTo(4.c) <= 0. Notice that the compareTo implementation creates a new constraint and registers it in the solver object. This poses a limitation for Groptim as any comparison operator will be delegated to the same method and there is no way to know which comparison operator triggered this call. Therefore, any constraint is considered as '<=' constraint from Groptim and this is the reason why Groptim strictly follows the canonical LP form for constraints.

It should be stressed that Groovy delegates the '==' operator into the equals method (e.g. x0 == 2.c is equivalent with x0.equals(2.c)), except if the delegate's class implements the Comparable interface, in which case it delegates to compareTo (e.g. x0 == 2.c is equivalent with x0.compareTo(2.c) == 0). Therefore, Groptim will consider a '==' constraint as an '<=' constraint too.

More precisely, if the delegate class implements the Comparable interface, the '==' operator is not directly delegated to the compareTo method. Take for instance, the constraint x0 == 2.c. If neither x0 (LPVar instance) is of a subtype of 2.c (NumberExpression instance) or 2.c is  of a subtype of x0, then Groovy will not call compareTo at all, as it can conclude that the two objects are not equal. This check is done in the Groovy SDK class org.codehaus.groovy.runtime.typehandling.DefaultTypeTransformation and its compareToWithEqualityCheck method:

Due to this reason, I defined both LPVar and BinaryExpression as subclasses of NumberExpression. Otherwise, compareTo would not be called for '==' constraints and I would loose them, since constraint objects are created in the compareTo method.

Another interesting point is the return type of the optimize (and therefore max and min) method. It is a map that maps the String 'subject' to the closure { it -> ['to': { Closure cTo-> subjectTo(cTo)}]} and the String 'subjectTo' to the closure { it -> subjectTo(it)}. As you may have realized, this allows for the two alternative Groptim syntaxes for constraints, i.e. 'subject() to {}' and 'subjectTo {}' respectively.

When you write 'subjectTo {}', effectively you ask for the value that 'subjectTo' maps to. This value is a closure that takes as argument another closure denoted by it (represents the constraints block) and passes that to the subjectTo method. Pure fun! Isn't it?

The case that you write 'subject() to {}' is a little more interesting (more fun). In this case, you effectively ask for the value that the String 'subject' maps to, which is a closure. By writing () you call this closure which returns the map ['to': { Closure cTo-> subjectTo(cTo)}]. Obviously, this maps the String 'to' to another closure {Closure cTo-> subjectTo(cTo)}. This closure is returned when you write 'to'. In turn this takes as argument another closure (denoted by cTo) which represents the constraints block. This is called command chain. In Groovy, when you call a method with arguments, the () can be omitted (e.g. 'subjectTo {}' is equivalent with subjectTo({})), but when the method has no arguments, () is mandatory. This is why you can not write 'subject to {}'.

Reducing Boilerplate when Running as a Script

Groovy offers a GroovyShell class with which you can run Groovy code as if it was a script. One can initialize a GroovyShell object with a Binding object, which implements what is mostly known as 'Environment' in programming languages, e.g. a set of bindings in key-value pairs. I used this mechanism to inject the LPSolver object to any input file, so that the groptim script user does not need to create the LPSolver object. Take a look at the groptim script code:


Whenever an input file writes max (min), this is delegated to lp.max (lp.min) method. This is achieved by binding the max (min) keyword with a method reference to the max (min) method of the lp object. By calling the evaluate method on the GroovyShell object you can then evaluate Groovy code, that can either be contained in a file passed as argument (as in this case) or directly passed as an argument in the form of a String.

Internal Simplex Engine

By default (see the no-argument LPSolver constructor) Groptim uses the Apache Math Simplex implementation to solve LP instances. However, one can replace it with any other implementation as long as it implements the SolverEngine interface.

Saturday, June 20, 2015

Not as 'final' as you think

The final keyword is used on methods to prevent subclasses from overriding them and on classes to prevent subclass definition. The idea behind it, is to prevent the class behavior modification. In many cases, this requirement comes to enforce security, e.g. with String objects passed to Class.forName(), or used as keys of a HashMap. But how 'final' these classes are? This posts shows they are not as 'final' as you think. More precisely, I show how you can:
  • add methods to final classes;
  • override/redefine methods of final classes;
  • override/redefine final methods.
To achieve these goals two different approaches are followed, namely a) exploiting the meta-programming Groovy features and b) exploiting the Byte Buddy tool in Java for byte code manipulation.

The Groovy Way

The code presented in this section is on github. With Groovy you can add or redefine methods, e.g. I am going to add methods on the final java.lang.String class and override its startsWith method.

Let's take a closer look at the addMethod method. String.metaclass is equivalent with String.class.getMetaClass(). In Groovy classes are first class citizens and so you don't need to write the .class ending. String.class is an object of type Class and each Groovy object has an associated Metaclass object. All Groovy classes implement the GroovyObject interface, which exposes a getMetaClass() method for each object.

With String.metaClass.someMethodName = {...} you add a new method called someMethodName into the Metaclass object of String class. Its implementation is included in the closure (the '=' symbol is optional). Also groovy allows for String variables to be used as method names, e.g. if I have a String s and another String name="length", then s.length() and s."$name"() are equivalent. So I believe you are now able to understand that addMethod adds a method on the String class that
  • is named after the name parameter;
  • has no parameters;
  • returns the String "funny method: String."+ name (the return keyword is optional in the last command of a method/closure).
From Groovy for Domain-Specific Languages:
"A method invocation on an object is always dispatched in the first place to the GroovyObject.invokeMethod() of the object. In the default case, this is relayed onto the MetaClass.invokeMethod() for the class and the MetaClass is responsible for looking up the actual method."

The same mechanism is also used in overriddeStartsWith method. But since a method startsWith: String => boolean already exists in String class this will be hidden. Instead the implementation we set in the closure will be called by the MetaClass.invokeMethod().

You can exercise the above code with Spock:

Run the tests with Gradle:  gradlew test

When people from the Groovy community present this short of power in public, they often get the question: "does Groovy open a security hole into Java?". This question is justified by the fact that Groovy code can be called by Java code as they both translate into JVM bytecode. Groovy guys, usually respond that groovy is entirely written in Java, so it does not add a security issue, but exposes what already existed in Java. Clever answer, certainly says the truth but not the whole truth, as is shown in the following paragraph.

Leverage Groovy Magic to Redefine Methods when Running Java?

The above paragraph, was about how to redefine methods of final Java classes, when you launch the jvm with the groovy command. This paragraph is about the same but when you launch the jvm with the java command. In this case the Metaclass mechanism is not there and therefore a method invocation invokes the class method directly.

I have prepared another test project to show this, clone it from github. This uses the Groovy code presented in the previous paragraph as a dependency. So you need to install it in your local maven repo. To do this run from the command line (assuming you have cloned the groovy project):
  • cd not-as-final-as-you-think
  • gradlew publishToMavenLocal
The following Java test calls the addMethod of the StringModifier class which was written in Groovy.

The test passes as a NoSuchMethodException is thrown when calling the getDeclaredMethod method on String.class.

Even if Groovy code can be called from Java, the Groovy meta-programming "magic" is lost. Groovy can't help in achieving our goal as long as we run java. Instead we need to use a byte code manipulation tool. I will use Byte Buddy, as this is the most easy and flexible byte code manipulation tool I am aware of.

With Byte Buddy

Let's say we have a final Java class with a final returnFalse method as follows:

Then we can use Byte Buddy, to redefine the returnFalse implementation, e.g. to make it return always true. This can be done as follows:

The next question is can we do the same for a method of String or any other of the final JDK classes? E.g. would the following work?

The above code attempts to redefine the isEmpty method of the String class so that it always returns true. Compare redefineMethodOfStringClass with redefineFinalMethodOfAclass. They are almost identical, but redefineMethodOfStringClass throws a NullPointerException in line:
.load(String.class.getClassLoader(), classReloadingStrategy);

This is due to the fact that JDK classes are loaded with the Bootstrap Classloader which is not exposed to the developer. Therefore String.class.getClassLoader() returns null. Byte Buddy reloads the class whose method is redefined with a custom classloader. To do this it needs the parent classloader so that all the class dependencies can be resolved by the custom classloader. Notice that this is not a Byte Buddy's limitation, this is inherent for any tool that reloads classes.

We conclude that in Java it is possible to change the behavior of final methods, or methods of final classes, for all classes except for those included in the JDK.

Saturday, June 13, 2015

Spring Batch with csv4j: The logical choice

In this post I show how spring batch can leverage the power of csv4j to initialize the database of a spring application. In this scenario, CSV input files with incomplete data are hydrated by csv4j into objects of a predefined domain type. Spring data JPA is then used to map these objects into tuples of a database table. For the shake of simplicity, I use H2 memory database, so that you don't need to install any database server to run the code. As usually, I let spring boot to do the orchestration and provide all the configuration in java classes (strictly no xml files). The code is on github.

What is Spring Batch?

Spring batch is a framework for batch processing of repeated, usually heavy jobs. Each job is separated into sequential steps, each of which is decomposed into read-process-write activities. Those activities are modeled by spring batch interfaces ItemReader, ItemProcessor and ItemWriter respectively as follows:

This should sound familiar to java 8 developers, as it has the stream-map-reduce flavor.  

ItemReader is the first node in the pipeline. It produces objects of the domain type to be consumed by the next node in the pipeline in an one-at-a-time fashion. This is analog to calling list.stream() in java 8 (given a list of domain objects).

ItemProcessor is the analog of the mapper Function object that is passed as argument in the map(mapper)method, as process method effectively maps an instance of I type into an instance of O type.

ItemWriter is the final node in the pipeline. It waits until all objects are processed and then consumes them by writing them either on the filesystem or a database. Although, this has some similarity with the java 8 reduce phase, it differs in that the write method does not return anything, as java 8 reduce Function implementations do. Instead, it has the side effect of persisting objects in some format.

If you want to learn more about spring batch, read the Spring Batch Tutorial.

Use Case

Let's say we have an app, whose database schema is as follows:

Spring batch loads any SQL script found in the classpath before running any job. To recognize them they have to end with '.sql'. The -all ending means this script should run against any database that may be in use. Alternatively, we could provide a different schema per database server.

Also, let's say we have the following 4 input CSV files.

Notice how abnormal and incomplete the above data is. We consider that the fields relevant to our app are field0, field1, field2 and field3. Also we may know that field3 in data2.csv refers to the same entity as field1 in other files. That sort of dataset is common when you deal with data compiled from different web sources. It is under this scenario that csv4j shines.

Finally, we need a domain type, that needs to be mapped to both CSV files (for csv4j to work) and database schema (for spring data jpa to work).

The @CsvFields annotation is used to map a java field to one or more CSV fields. JPA annotations are used to map the class to the csvdata relation. Notice that JPA requires an id field that curries no business semantics. This is why we had to create the hibernate_sequence in the schema-all.sql.

Spring Batch Configuration

The whole spring batch configuration class can be found on github: BatchConfiguration.java.
Here I focus on the ItemReader as I don't need an ItemProcessor and I use the standard JpaItemWriter for persisting objects in the database.

We make the assumption that CSV input files are located under the 'data' directory in the classpath. For each input file, csv4j produces a list of domain objects. An Iterator over all the objects derived from all input files is created and passed to IteratorItemReader (one of the spring batch provided implementations of ItemReader) constructor. And that's all.

Run and Verify

Spring batch offers an abstraction for pre- and post- job completion activities. All you have to do is to subclass JobExecutionListenerSupport class and override beforeJob or/and afterJob methods. In this example I override afterJob, with an implementation that queries the database and logs all the domain objects found there after the batch job is completed.

Run the app with
  • maven: mvn spring-boot:run
  • gradle: gradlew bootRun
and get the following log:

Saturday, June 6, 2015

Csv4j - Deserialize CSV Files into Java Objects

Have you ever read CSV files in order to create java objects of some domain type proper for your app? Boring and cumbersome task. And if data is incomplete, e.g. some CSV files have more or less fields than those you care of, it becomes more boring, more cumbersome. It should not be! Java offers the means to do this easily. All you have to do is create a java domain type (class) with the fields of interest and match java fields to CSV header fields. Leave the rest to csv4j!

Brief Summary

Csv4j is a standalone application written in java 8. Optionally annotations are used to match a java field into one or more CSV header fields. In the absence of annotations, csv4j matches java fields to CSV fields of the same name. Then, csv4j makes extensive use of reflection in order to set the proper values to proper fields. 
Basic assumptions follow:
  1. The first line of a CSV input file has to be the header line that defines the fields.
  2. The rest lines contain data and each line corresponds to an instance of a domain type.
  3. A java domain type has to be given as input, with:
    • a public, no-argument constructor;
    • non-final fields to be set with values from CSV data (additional fields are allowed and may or may not be final, it's irrelevant to csv4j);
    • a standard setter method per field that needs to be set;
    • each field's type has to define a static factory "valueOf" method. Primitive types are allowed as they are wrapped by csv4j, well by Guava :)

License and Source Code

Csv4j is an open source project licensed under Apache License, Version 2.0
The code is hosted by github under csv4j repository.
Comments, pull requests or any other sort of contribution is more than welcome.
Maven is used for building and dependency management.

Dependencies

The only csv4j's compile dependency is Guava. It is used for wrapping primitive types into wrapper types (see com.google.common.primitives.Primitives) and also for checking method preconditions. 
For testing, csv4j uses TestNG.

Release

Csv4j is released in the central maven repository. Add it as a dependency in your project with maven:

or gradle:

Use Cases

Use case 1

Let's say you have the 3 following CSV files.

Also let's say that the fields that are relevant for your app are field0, field1 and field2. Notice that only data.csv has the same set of fields. More precisely, additionalField.csv has a field more (i.e. field4), while missingField.csv has a field less (i.e. field1). This is a typical scenario when you cope with incomplete data. Additional fields are ignored by csv4j, while for missing fields the setter method is never called (consider a default value when declaring the field, if you wish).
You would model this info in a domain type, let's call it SimpleDomainType, as follows:

Mapping each CSV file into a list of SimpleDomainType objects can be done by:

Find the complete code for this example at SimpleDomainType.java and HydratorTest.java (simpleDomainType test).

Use case 2


Java fields are not restricted to instances of String and primitive types, as shown in use case 1, but can be of any type T as long as it defines a static factory method valueOf: String => T. This is needed to map the string read from the CSV file into an instance of type T. Fields field0 and field2 in use case 1 were auto-boxed into Integer, Double respectively, which both define a proper valueOf method. To make this clear you could write another domain type, let's call it ComplexDomainType as follows:

where MyInt is defined as follows:

The following code maps a CSV file into a list of ComplexDomainType objects:

Find the complete code for this example at ComplexDomainType.java and HydratorTest.java (complexDomainType test).

Use case 3

 

Consider now that you have the following 2 CSV files:

And let's say you have a domain type AnnotatedDomainType with fields:
  •  int field0;
  •  String att1;
  •  double att2;
Obviously java field field0 will be matched with CSV field field0. You want to match java field att2 with CSV field2. Finally, you know that both field1 in data.csv and field3 in data2.csv refer to the same entity, so you want java field att1 to match both field1 and field3. You can achieve this by defining AnnotatedDomainType as follows:

The following code maps a CSV file into a list of AnnotatedDomainType objects:
 
The current assumption is that each CSV file contains either field1 or field3. If a file contains both, setAtt1 method will be called twice and att1 will finally have the value of the last CSV field in the order they appear in the CSV header line.
Find the complete code for this example at AnnotatedDomainType.java and HydratorTest.java (annotatedDomainType test).

Outro

That is it, I wish you enjoy csv4j as much as I do. Until the next post have fun and take a look at Sane.java, but that deserves a separate post. Stay tuned.