Thursday, December 10, 2015

XML Transformation: XSLT vs Groovy

In this post I show how Groovy can be used for XML transformation instead of XSLT. The idea is to combine XmlSlurper with MarkupBuilder for a powerful and concise XML transformation syntax.
Everything presented in this post is uploaded on github.

XML Transformation Example

As a demo XML input, I use the Sky Sports Football News Feed. This contains a list of news items, as shown below.

Let's say we would like to transform this XML into an HTML file with a table of two columns, i.e. the timestamp (news:publication_date element) and the link (loc element) accompanied by the title (news:title element), like the following HTML snippet.

The XSLT Case

You could do this transformation with the following XSL stylesheet.

To check the effect of this XSL stylesheet yourself, I have downloaded the XML input into sitemap_news_football.xml and added the following line
<?xml-stylesheet type="text/xsl" href="skyFeed.xsl" ?>
so that it points to the skyFeed.xsl stylesheet. Clone the github repository and open sitemap_news_football.xml in the browser of your choice.

The Groovy Case

You can do the same transformation with the following Groovy script.

To run this script you need to have Groovy installed and the groovy binary in your path. Run:
> cd xslt-vs-groovy 
> skyfeed sitemap_news_football.xml

This will create the output file out.html in the same directory. Open it in your browser to verify that the result is the same as opening sitemap_news_football.xml directly. You can also run the script with no input file argument:
> skyfeed

In that case the script uses the current content of the live XML feed as input.

The magic happens by combining XmlSlurper with MarkupBuilder.

XmlSlurper's parse method returns a GPathResult object (e.g. the object referenced by root in this example), and you can then navigate over the XML elements inside your Groovy code with GPath. GPath is a path expression language integrated into Groovy (as a Groovy DSL). Its syntax is very similar to XPath, but it has an object-oriented flavor, e.g. instead of '/' it uses '.'. For instance, root matches the XML root element, which is the urlset element. Similarly, root.url matches all the url elements that are children of urlset. Groovy evaluates root.url into a NodeChildren object, which can then be iterated over with a Groovy closure, as shown in the script.

MarkupBuilder allows you to produce HTML (or XML) by using normal Groovy constructs, like methods (more precisely, missing, i.e. not implemented, methods whose calls are delegated to the methodMissing method) and closures. For more details on builders, you can read "Chapter 7: Building a Builder" of "Groovy for Domain-Specific Languages".
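To make the combination concrete, here is a minimal sketch in the spirit of the script above. It assumes the feed structure described earlier (url elements with loc, news:publication_date and news:title children); the actual script in the repository may differ in its details.

#!/usr/bin/env groovy
// minimal sketch: slurp the feed and emit an HTML table with MarkupBuilder
def input = args ? args[0] : 'sitemap_news_football.xml'
def root = new XmlSlurper(false, false).parse(new File(input))   // namespaceAware=false keeps the 'news:' prefixes

new File('out.html').withWriter { writer ->
    new groovy.xml.MarkupBuilder(writer).html {
        body {
            table {
                root.url.each { item ->
                    tr {
                        td(item.'news:news'.'news:publication_date'.text())
                        td {
                            a(href: item.loc.text(), item.'news:news'.'news:title'.text())
                        }
                    }
                }
            }
        }
    }
}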

One should keep in mind, though, that browsers do not run Groovy, while they do ship with XSLT processors. This means that XML transformation with Groovy can only happen on the server side, and the output then needs to be sent over HTTP to the client. If you need to do the XML transformation on the client side, you need to stick with XSLT.

Thursday, November 26, 2015

Hibernate: Good but not Enough

ORM is an abstraction over data (SQL/NoSQL) management systems that helps developers decouple the application code from the underlying data management system. Besides the benefits that decoupling any sort of components brings to software development, such as the ability to test components separately and the flexibility to change the implementation of one component without affecting the other, ORM tools also reduce the skillset developers need to build n-tier applications. For instance, a Java developer only needs to know how to use Hibernate. JPQL allows them to keep thinking in terms of objects even when accessing the data layer. It seems there is no need for knowledge of relational algebra, indexing or query optimization.

This perception can be true for many small/medium-sized companies with non data-intensive applications. If your database is relatively small and all the data, as well as the query computation on it, fits in the database server memory, then good Hibernate knowledge is clearly enough. In this setting, an inefficient mapping of JPQL into SQL queries by Hibernate, or an inefficient query evaluation strategy by the database, might not significantly harm application performance. But as the database size increases, and especially once the query computation no longer fits in the database server memory, these inefficiencies are revealed.

This post is not an argument against Hibernate (or ORM in general). The point I want to make is that the general perception that Hibernate allows developers to be oblivious to the underlying data storage model (e.g. the relational model) or to the data management system implementation (e.g. PostgreSQL) does not hold if you want to handle large datasets.

Setup

As the data management system I will use PostgreSQL v9.4. Postgres is an open-source object-relational database. Although it supports some functionality inherited from object-oriented databases (e.g. subtables and tuple identifiers), in practice it is commonly used as a purely relational database. One of the reasons for its popularity over other open-source relational databases is its advanced query optimizer.

It is important to mention the shared_buffers option of the postgresql.conf file. This is the memory Postgres allocates from the OS to run its computations. Postgres implements its own buffer replacement policy (LRU by default) for this memory area, on top of the OS buffer replacement policy. No matter how much memory your machine has, Postgres will only use the amount selected for this parameter. For the performance issues to be revealed, it is critical to select a value that is much smaller than the database size. For this demo, where I use a database of 1GB, I select shared_buffers=128MB.

As a dataset I will use a synthetic relational dataset generated with the well-known Dell DVD Store generator, which is also used (and extended) by VMware to benchmark database management systems (hereafter DBMS) on its VMs. The generator lets you specify the size of the database, the DBMS and the operating system. The steps to generate the dataset on Windows follow:
  1. Download ds21.tar.gz  and also ds21_postgresql.tar.gz on the same directory.
  2. Unzip (e.g. select 'Extract here' with WinRAR) both of them on the same directory.
  3. Go to the ds2/ directory and run perl Install_DVDStore.pl from the command line. You will be prompted to input some parameters; select:
    1. 1 GB (it first asks for the number and then whether the number denotes MB or GB)
    2. PGSQL
    3. WIN (change to LINUX if you run on Linux)
  4. Step (3) creates the dataset as csv files under the cust/, orders/ and prod/ directories inside ds2/data_files. These are loaded into the database in the next steps using the Postgres 'COPY' command. Unfortunately, every second line in the csv files is empty, and the 'COPY' command does not allow empty lines: it translates them into missing column values and fails. You will need to remove the empty lines. You can do this in many ways; I did it with a Groovy script (a sketch of which is shown right after this list). Just put the csv files in a directory, e.g. 'input/', and run removeBlankLines input from the command line (assuming you have Groovy installed). This will create a 'clean/' directory with the output csv files.
  5. Create a user ds2 and a database ds2 on Postgres, e.g. on pgAdmin3.
  6. Run the script ds2/pgsqlds2/pgsqlds2_create_all.sh, or open the script and run the commands one-by-one.
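
The removeBlankLines script referenced in step 4 is not reproduced here, but a minimal sketch of such a blank-line stripper could look like the following (the script in the repo may differ):

#!/usr/bin/env groovy
// usage: removeBlankLines <inputDir>  -- writes cleaned copies of the csv files under clean/
def inputDir = new File(args[0])
def outDir = new File('clean')
outDir.mkdirs()
inputDir.eachFileMatch(~/.*\.csv/) { csv ->
    new File(outDir, csv.name).withWriter { out ->
        csv.eachLine { line ->
            if (line.trim()) out.write(line + '\n')   // drop the empty lines COPY cannot handle
        }
    }
}
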
I have built a demo Spring web app on top of this database using the spring-data-jpa and spring-web-mvc components, and let spring-boot do the orchestration. I created a domain model for the customer and order entities and mapped them to the existing database schema with JPA annotations. For this demo I will not touch the rest of the entities. I use 'spring.jpa.hibernate.ddl-auto=validate' in the application.properties file to tell Spring to use the existing database, rather than recreating the schema from the JPA annotations every time the app starts. As my focus is on the data layer, I annotate the controller with @RestController, so that Spring produces JSON responses. The code is uploaded on github.
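
As an illustration of what such a mapping could look like (a sketch only; field names not mentioned in the post, such as orderid, are assumptions, and the entity classes on github may differ), the customer-to-orders association is mapped with a LAZY one-to-many:

import javax.persistence.*

@Entity
@Table(name = 'customers')
class Customer {
    @Id
    Integer customerid
    Integer age
    String gender
    Integer income
    String country

    // LAZY: orders are not fetched together with the customer; each getOrders()
    // call later triggers a separate SELECT on the orders table
    @OneToMany(mappedBy = 'customer', fetch = FetchType.LAZY)
    List<Order> orders
}

@Entity
@Table(name = 'orders')
class Order {
    @Id
    Integer orderid

    @ManyToOne
    @JoinColumn(name = 'customerid')
    Customer customer
}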

To log the SQL queries that reach the database I use P6spy. Notice that setting 'hibernate.show_sql=true' logs the generated SQL statements with '?' in place of the SQL parameter values. To log the actual SQL queries as they reach the database, you need a proxy like P6spy.

Run the application with:
mvn spring-boot:run

Information Need

Let's say you are asked to develop a REST service that returns the customers that have ordered at least once, along with their orders, based on age, gender, income and country. For instance, this could be done with a GET request on the URL http://localhost:8080/report1/18/M/20000/UK, which asks for customers who are 18 years old, male, have an income of 20000 and are UK residents.

For demo purposes I have implemented 3 versions of this REST service: report1, report2 and report3. They return the same results but are implemented in different ways. The controller, service and repository code follow.
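
The embedded code is not reproduced here, but the repository could look roughly like the following sketch. The method names are the ones used in the text; the JPQL is my own reconstruction, and the actual code on github may differ.

import org.springframework.data.jpa.repository.JpaRepository
import org.springframework.data.jpa.repository.Query

interface DVDStoreRepository extends JpaRepository<Customer, Integer> {

    // derived query: translated by Spring Data into a selection on the customers table
    List<Customer> findByAgeAndGenderAndIncomeAndCountry(int age, String gender, int income, String country)

    // explicit JPQL join, returning only the distinct matching customers
    @Query('select distinct c from Customer c join c.orders o where c.age = ?1 and c.gender = ?2 and c.income = ?3 and c.country = ?4')
    List<Customer> findByAgeAndGenderAndIncomeAndCountryJoin(int age, String gender, int income, String country)

    // same query with JOIN FETCH, so the orders are loaded eagerly in one SQL query
    @Query('select distinct c from Customer c join fetch c.orders o where c.age = ?1 and c.gender = ?2 and c.income = ?3 and c.country = ?4')
    List<Customer> findByAgeAndGenderAndIncomeAndCountryFetchJoin(int age, String gender, int income, String country)
}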


JPQL to SQL Query Translation

Report1

Request: GET http://localhost:8080/report1/18/M/20000/UK

The DVDStoreService.report1 method calls DVDStoreRepository.findByAgeAndGenderAndIncomeAndCountry. Hibernate first generates a selection SQL query on the customers table to get all the customers that match the given inputs, as follows.


This query returns all the matching customers, irrespective of whether they have placed an order or not. Therefore, the DVDStoreService.report1 method needs to filter them by calling the getOrders method on each one of them. The Customer objects of the returned list do not have their orders field populated, due to the LAZY strategy defined in the Customer class, so each getOrders call triggers a selection SQL query on the orders table to get the orders of that customer, if any. As 106 customers match the first query, 106 queries of the following form will run (1997369 is one of the 106 customerids).


Notice that only 3 out of the 106 customers have ordered, i.e. the remaining 103 queries return nothing.
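
To make this N+1 pattern concrete, the service side of report1 could look roughly like the following sketch (the actual code on github may differ; repository stands for the injected DVDStoreRepository):

List<Customer> report1(int age, String gender, int income, String country) {
    def candidates = repository.findByAgeAndGenderAndIncomeAndCountry(age, gender, income, country)
    // each access to orders lazily fires one more SELECT on the orders table
    candidates.findAll { !it.orders.isEmpty() }
}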

Report2

Request: GET http://localhost:8080/report2/18/M/20000/UK

The DVDStoreService.report2 method calls DVDStoreRepository.findByAgeAndGenderAndIncomeAndCountryJoin. As you can observe, this method is annotated with a JPQL query that joins customers with orders and returns all the distinct customers. Distinct is needed, as a customer may have ordered multiple times. This query computes exactly what the service needs to return, so no additional logic is needed in the report2 method. Hibernate generates an SQL join of the following form.

Although this SQL query computes the join of the customers and orders relations, it only returns the customers table attributes. Therefore, Hibernate does not populate the orders field of the matching Customer objects. However, during the JSON serialization phase, Jackson (the library Spring uses for JSON serialization by default) calls the getOrders method on each of the 3 matching Customer objects. This triggers Hibernate to send 3 more SQL queries of the orders_selection.sql form to the database, one per matching customerid.

Report3

Request: GET http://localhost:8080/report3/18/M/20000/UK

The DVDStoreService.report3 method calls DVDStoreRepository.findByAgeAndGenderAndIncomeAndCountryFetchJoin. This is annotated with a JPQL query very similar to that of findByAgeAndGenderAndIncomeAndCountryJoin, the only difference being that it uses JOIN FETCH instead of JOIN. This tells Hibernate to eagerly fetch all the information needed to reconstruct the matching Customer objects. This JPQL query is translated into a single SQL query, which follows.


If you compare this (join_fetch.sql) with the SQL join query (join.sql) generated for the report2 method, they differ only in the first line, i.e. in the attributes of the join relation they return (or project, in relational algebra terminology). join_fetch.sql returns all the attributes of the join relation, including the orders attributes, while join.sql returns only the customers relation attributes.

To conclude this section, I have shown how the same information need can result in 107 SQL queries (report1), 4 SQL queries (report2) or a single SQL query (report3). A large number of SQL queries can cause a significant performance problem, especially if each of them is slow and no caching is used. Minimizing this number is generally beneficial.

SQL Query Evaluation

In this section I focus on the Postgres evaluation plans for each SQL query referred to in the previous section. The EXPLAIN ANALYZE Postgres command is used to get the plans.

Report1

The evaluation plan for customer_selection.sql follows.


As one can observe, to evaluate this query Postgres sequentially scans the whole customers table to find the matching customers. Since only 106 out of 2,000,000 customers match, unnecessary I/Os take place to read all the rest from disk, resulting in a 2.1 sec execution time.

Each of the 106 orders_selection.sql queries is evaluated as follows.

This query is evaluated very fast (0.075 ms) thanks to a BTree index on the orders.customerid attribute. This index is created by the dvdstore benchmark. Postgres does a Bitmap Index Scan on the index to find the pointers to all the matching tuples and then a Bitmap Heap Scan on the table to fetch from disk the pages containing these tuples.

We conclude that customer_selection.sql should be a target for optimization.

Report2

The evaluation plan for join.sql follows.

Postgres accesses the two relations, does a Nested-Loop join and finally removes duplicates with a Hash Aggregate. The query is slow (1.966 sec), and to find out why we should focus on the access paths used to reach the two relations. For the orders relation, Postgres does not touch the table at all. The only attribute it needs is customerid, and the BTree index ix_order_custid is built on that attribute. Therefore, Postgres does an Index Only Scan, which means it accesses only the index, not the table. Since the index is much smaller than the table, fewer I/Os are done. Moreover, the index fits in main memory, so the Nested-Loop join algorithm can use it as the inner relation. This part of the plan is clearly optimal. Let's focus on the other access path, which is a Sequential Scan on the customers table. As I explained for the customer_selection.sql query, this is far from optimal. We keep join.sql as a second optimization target.

After join.sql, 3 orders_selection.sql queries are run. As explained in the section above, these are evaluated very fast.

Report3

The evaluation plan for join_fetch.sql follows.

As one can observe, this plan is very similar to the one produced for join.sql, except that join_fetch.sql needs to return all the attributes of the orders relation in addition to the customers attributes. So, instead of an Index Only Scan, it does a Bitmap Index Scan on the BTree index and then reads the pages containing the matching orders tuples from disk with a Bitmap Heap Scan. The response time is 2.023 sec, which is unacceptable. We have a third optimization target.

SQL Query Evaluation Optimization

In the section above we found 3 queries with bad execution times: customer_selection.sql, join.sql and join_fetch.sql. For all of them the bottleneck is the same, namely the Sequential Scan on the customers table, which does unnecessary I/Os because all 2,000,000 tuples are read from disk while only 106 (for customer_selection.sql) or 3 (for join.sql and join_fetch.sql) match. Therefore, we have a single optimization task: create an alternative access path to the customers relation that avoids, as much as possible, reading pages that contain only irrelevant tuples from disk.

The way to do this is to build an index on the customers table on one or more of the attributes for which we have filtering conditions, i.e. age, gender, income, country. But which of them? To answer this, we need to know the selectivity of each condition, i.e. the percentage of tuples each one matches. As one can easily verify by running count queries: 
  1. age=18 matches 27,303 tuples, i.e. 1.3%,  
  2. gender='M' matches 1,000,265 tuples, i.e. 50%, 
  3. income=20000 matches 398,642 tuples, i.e. 20%,
  4. country='UK' matches 100,169 tuples, i.e. 5%.
The lower the percentage of matching tuples, the higher the benefit from an index on the related attribute. In this case, age and country are two good candidates. On the other hand, gender and income are not, as they match too many tuples. It should be stressed that an Index Scan on income with the condition income=20000 would not touch only the 20% of the tuples that match; it would touch all the pages containing at least one matching tuple. Notice that a page typically contains many tuples (how many depends on the tuple size in bytes), so the total number of tuples read from disk would be much higher than 20%. Notice also that, compared with a Sequential Scan, an Index Scan has the additional cost of fetching the index from disk and following the pointers of the matching tuples to the table pages, let alone the cost of creating and maintaining the index during updates. Therefore, there is a threshold above which a Sequential Scan is preferred over an Index Scan, and although this threshold depends on many factors, it is generally estimated that there is no benefit if the matching tuples are more than 20%.

Having found age and country as the two candidate attributes for indexing, the next question is which type of index to use. In theory (and in practice for other DBMSs like Oracle), a Hash index is best for equality ('=') conditions, while a BTree index is best for range ('<', '>', '<=', '>=') conditions. In practice, however, BTrees are always preferred with Postgres, as the Hash index implementation does not perform very well and is not WAL-logged, meaning that in case of a DBMS failure it is lost and has to be rebuilt from scratch.

The final question is whether it is better to build a BTree index on age, on country, or on the combination (age, country). As previously discussed, an index comes with an overhead to create and maintain it during updates; e.g. the BTree is a balanced tree and may need to re-balance during an insert or a delete. Therefore, it is good practice to build an index that can benefit more than one query. In this demo example we only have 3-4 queries, but in a real app you typically have many. A compound index on the (age, country) combination seems to match our optimization need best, but it can only be used by queries with a condition on age, or on both age and country. It cannot be used by queries with a condition on country only. Whether it is better to create two separate indices, one on age and one on country, or a compound one on (age, country), depends on the query workload of the application. But creating all three indices is out of the question.

I proceed with selecting the compound index for this demo. To create it, run:
CREATE INDEX ix_cust_age_country ON customers USING btree (age, country);

The evaluation plan for customer_selection.sql follows.

As one can observe, the access path changes from a Sequential Scan to a Bitmap Index Scan on the compound index, and the response time is reduced from 2.1 sec to 4.15 ms.

Due to the same change in the access path, the response time for join.sql and join_fetch.sql is reduced from 1.96 sec to 6.12 ms and from 2.02 sec to 7.76 ms respectively, as shown by the following evaluation plans.

Monday, November 16, 2015

Integrate Cucumber with WebDriver: BDD in automation

In this post, I show how Cucumber can be integrated with Selenium WebDriver to bring BDD to end-to-end testing. As the system under test, I choose the BBC website and test basic navigation on the BBC news pages. The code is available on github.

Background

Make no mistake, Cucumber's main purpose is not to be a testing tool, but to encourage collaboration between scrum team members. The main objective is to ensure that all team members have the same understanding of the requirements of the feature they have to develop and test. Business analysts who, together with the product owner, have the basic idea of the feature requirements introduce it to the team, and the team as a whole finalizes the requirements in the form of Cucumber scenarios. These scenarios are written in Gherkin, a DSL very close to natural language. Scenarios can be created at different levels of granularity, e.g. for unit, integration or end-to-end tests. The Cucumber tests initially fail, as the feature has not yet been developed, and then development starts. Scenarios do not only guide the testing, but also constitute living documentation of the feature.

In this post, I focus on end-to-end tests (frequently called 'automation'). These are typically written using the Selenium WebDriver framework. As is well known, end-to-end tests have a high maintenance cost: when the UI changes, tests might fail, as some web elements may no longer be present or may have been moved. The work needed to fix them highly depends on whether your test code uses design patterns such as the Page Object and the Loadable Component patterns. In this post, I show the benefit of using them.

Setup

As a build and dependency management tool, I use maven. The pom file follows.

The Hamcrest library is optional (you can integrate Cucumber with Selenium WebDriver without it), but it is nice to have, as it provides an elegant way to write assertions. I also use 'cucumber-java8', which allows steps to be defined as lambdas instead of methods and thus reduces boilerplate.

BBC News Feature

Let's say we are in the team that develops/maintains the BBC website, and in particular the news part. The feature we need to develop is that the user should be able to navigate from the home page to the news page, then select a news category, e.g. 'Science', then select a video page (let's say the first one, for simplicity) from the 'Watch/Listen' section, and then share this video page on Facebook. As the feature is already implemented by the real BBC team, you can try it. The sequence of links could be the following, but always keep in mind that the Cucumber scenarios should be written before development starts: 
  1. BBC Home
  2. BBC News
  3. BBC News Science category
  4. BBC Science Video Page
  5. Facebook referral page
The Cucumber feature file follows.

The first two scenarios test that the user can navigate to the BBC News page by clicking either on the 'Latest news' link of the homepage or on the 'News' link on the navigation bar at the top of the home page.

The third scenario, which is more precisely a 'scenario outline', or a set of scenarios if you wish, tests the main feature: navigate from the home page to the news page, then to a news category page, then to the first video page of the category, and finally share it on Facebook.

Gherkin is very easy to follow: in short, Feature, Scenario, Given, When, Then and And are Cucumber reserved keywords, while the rest is English. Each sentence starting with Given, When or Then defines a step. For more info about Gherkin, steps and scenario outlines, you can read the post BDD with Cucumber vs Spock. The main thing to keep in mind is that a Cucumber scenario is the analog of a JUnit/TestNG test method, while a scenario outline is the analog of a set of calls to the same TestNG test method, taking an input whose value is provided by a TestNG data provider. The Cucumber analog of the data provider is the Examples table. In short, category is the input, and the scenario outline tests that the feature works for any BBC news category appearing in the Examples table.

The Page Object Pattern

The main goal of the Page Object pattern is to encapsulate the details of a page in a class and expose only methods relevant to the functionality the page offers. Take for instance the news category page: the locator of the first video page link in the 'Watch/Listen' section is an internal detail that should be hidden from tests. The news category page object only needs to provide a 'clickFirstVideo' method, and that's the only thing the test code needs to be aware of.

The benefit is that when the UI changes and the locator is no longer valid, you only have to fix it in a single place, the related page object. Experience shows that if you don't use the page object pattern, the maintenance cost of the end-to-end code will explode sooner or later.

The Loadable Component Pattern

While the Page Object pattern hides the details behind the functionality offered by a page, the Loadable Component pattern hides the details behind the navigation to the page. To apply it, WebDriver offers an abstract LoadableComponent class which your page object class has to extend. It has two abstract methods, 'isLoaded' and 'load', that the page object class has to implement, and a concrete method 'get' that calls 'isLoaded' to check if the page is already loaded; if not, it calls 'load' and then 'isLoaded' again to verify that the page has finally loaded. The idea is to pass a reference to the parent page object to the constructor of a page object class and, in the 'load' method of the child class, to first call 'get' on the parent object (to load the parent page) and then call some functionality method on the parent object that loads the child page.

The implementation of the NewsCategoryPage class that implements both the page object and the loadable component patterns follows.

Notice the implementation of the 'load' method. It first calls 'parent.get()' to load the news page and then calls the 'selectCategory' method on the news page object to load the news category page.
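
The class itself is on github; a condensed sketch of the shape of such a class (locators, URLs and assertions here are illustrative, not the real ones) is:

import org.openqa.selenium.By
import org.openqa.selenium.WebDriver
import org.openqa.selenium.support.ui.LoadableComponent

// minimal stub of the parent page object, just enough to show the pattern
class NewsPage extends LoadableComponent<NewsPage> {
    private final WebDriver driver
    NewsPage(WebDriver driver) { this.driver = driver }
    void selectCategory(String category) { driver.findElement(By.linkText(category)).click() }
    protected void load() { driver.get('http://www.bbc.co.uk/news') }
    protected void isLoaded() throws Error { assert driver.currentUrl.contains('/news') }
}

class NewsCategoryPage extends LoadableComponent<NewsCategoryPage> {
    private final WebDriver driver
    private final NewsPage parent
    private final String category

    NewsCategoryPage(WebDriver driver, NewsPage parent, String category) {
        this.driver = driver
        this.parent = parent
        this.category = category
    }

    // page object method: the locator stays hidden from the test code
    void clickFirstVideo() {
        driver.findElement(By.cssSelector('.illustrative-video-section a')).click()
    }

    // loadable component methods: the navigation details stay hidden as well
    protected void load() {
        parent.get()                    // make sure the parent page is loaded first
        parent.selectCategory(category) // then navigate from it to this page
    }

    protected void isLoaded() throws Error {
        assert driver.title.contains(category)
    }
}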

It should be stressed that the Loadable Component pattern does not necessarily tie a page object class to a single navigation path. Take for instance the News page: you have 2 ways to navigate to it from the BBC home page, by clicking either the 'Latest news' link in the main section or the 'News' link on the navigation bar. The relevant part of the news page object class follows (you can find the whole class code on github).

Steps Implementation

The Steps implementation follows.

You can run the scenarios either in Eclipse, by right-clicking on BBCNewsTest.java -> Run As -> JUnit Test, or with maven: mvn clean test

Further Improvement

A nice observation about the Loadable Component pattern is that if the page object class implements it and you call the 'get' method on it, you implicitly run assertions that verify the page has been 'correctly' loaded, where 'correctly' is defined by the 'isLoaded' method implementation. Despite its name starting with 'is', this method does not return a boolean value. Instead, its return type is void and it 'throws Error'. Its body typically contains all the assertions needed to verify that the page was loaded properly.

To understand this, recall how the get method is implemented in the LoadableComponent class.
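
Paraphrasing the Selenium sources (exact details vary between versions), get is essentially:

// paraphrased, not copied verbatim from Selenium
T get() {
    try {
        isLoaded()        // if this does not throw, the page is already loaded
        return (T) this
    } catch (Error e) {
        load()            // not loaded yet: run the navigation defined by the subclass
    }
    isLoaded()            // verify the load worked; throws Error otherwise
    return (T) this
}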

It should now be obvious that the second scenario is redundant. If the news page does not load when you click the 'News' link on the home page navigation bar, then the step "When the user clicks 'News' on the navigation bar", which is shared by the second scenario and the scenario outline, will fail due to the assertions in the NewsCategoryPage 'isLoaded' method. Therefore the second scenario can be safely omitted.

Saturday, November 7, 2015

Groovy Shell Scripting Meets Hadoop

In this post I show how Hadoop map and reduce functions can be written as Groovy shell scripts in a concise and elegant way.

The most common way to kick off a Hadoop map-reduce task from the command line is a Python script. Alternatively, one could write the map and reduce functions in Java, but then there is the additional cost of packaging them in a jar and running the jar from the command line. The difference between Python and Java is that Python is a scripting language, while Java is not. For instance, you can run Python directly as a script by adding "#!/usr/bin/env python" at the top of your files.

Groovy fills this gap, being the foremost scripting language in the family of JVM languages. Reasons to prefer Groovy over Python include:
  1. You are a java/groovy developer and you want to write the map reduce functions in a familiar language;
  2. You want to leverage the power of jvm as well as of a vast amount of libraries written in any jvm language;
  3. You find python's syntax pathetic, e.g. change indentation and you change the code semantics;
  4. Groovy goodies allow you to quickly write concise and elegant code.

Requirements

For running Hadoop, I use the Cloudera VM (you can download it here for free). Of course, you can use the Hadoop infrastructure of your choice. Hereafter I refer to the Cloudera VM, and you will need to adjust accordingly if you use an alternative.

Groovy needs to be installed on the machines that run Hadoop. It is not pre-installed on the Cloudera VM, so you need to install it yourself; read the instructions here.

The Wordcount Example

This is probably the most popular "hello world" Hadoop example. The problem is stated as follows: given a collection of files that contain words separated by blank spaces, compute the number of times each word occurs across the collection. The output should be a file with records of the form <word #occurrences>.

The mapper groovy script follows.


The logic is clearly communicated by the code: for each line read from standard input, trim it, get all the words (assuming they are separated by blank spaces) and, for each word, emit "word	1" (a tab separates the key from the value), meaning that we found one occurrence of the word. The first line, #!/usr/bin/env groovy, tells the OS to use Groovy to interpret this file.
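
Since the embedded gist is not shown here, a minimal sketch of such a mapper is (the script in the repo may differ slightly):

#!/usr/bin/env groovy
// minimal sketch of the mapper: read stdin, emit "word<TAB>1" per word occurrence
System.in.eachLine { line ->
    line.trim().split(/\s+/).each { word ->
        if (word) println "${word}\t1"
    }
}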

The reducer groovy script follows.

Again, the logic is straightforward: for each line read from standard input, trim it and extract the word and #occurrences fields. If the current word is different from the previous one, emit the record "word counter" and reset the counter. In any case, update the previous word with the current one and add the current value (#occurrences) to the total counter. At the end, emit the record for the last word.
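
Again as a sketch (Hadoop streaming sorts the mapper output by key, so all lines for a given word arrive consecutively; the script in the repo may differ slightly):

#!/usr/bin/env groovy
// minimal sketch of the reducer
def prevWord = null
def count = 0

System.in.eachLine { line ->
    def (word, value) = line.trim().split(/\t/)
    if (prevWord != null && word != prevWord) {
        println "${prevWord}\t${count}"   // emit the total for the previous word
        count = 0
    }
    prevWord = word
    count += value.toInteger()
}
if (prevWord != null) println "${prevWord}\t${count}"   // emit the last word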

You need to make these 2 scripts executable:
chmod +x wordcount_mapper.groovy
chmod +x wordcount_reducer.groovy


Assuming you have put the input files under the HDFS directory '/user/cloudera/wordcount_input', you can run the Hadoop job with:

hadoop jar /usr/lib/hadoop-mapreduce/hadoop-streaming.jar -input /user/cloudera/wordcount_input -output /user/cloudera/wordcount_output -mapper wordcount_mapper.groovy -reducer wordcount_reducer.groovy

If you get a failure including the message "Container killed on request. Exit code is 143", it means that Groovy is not visible to the data node (the Cloudera VM includes just one data node) that needs to run the map or reduce tasks. You may have installed Groovy on the cluster node, but you need to install it on all data nodes too, as Hadoop pushes the computation to them.

A workaround for the Cloudera VM is to replace "groovy" in the first line of your scripts with the absolute path of the groovy binary. To find where Groovy is installed, run: 
printenv GROOVY_HOME

For instance, if you have installed it with sdkman, groovy is located under '/home/cloudera/.sdkman/candidates/groovy/current/bin/' and you have to replace the first line of both scripts with:
#!/usr/bin/env /home/cloudera/.sdkman/candidates/groovy/current/bin/groovy

Although not needed for this problem, you can reuse any published JVM library. Grape will fetch the dependency from a repository, and you just add import statements as you normally do in Java. An example can be found here.
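
For illustration only (a hypothetical dependency and snippet, unrelated to the linked example), grabbing a library in a streaming script looks like this:

#!/usr/bin/env groovy
@Grab('org.apache.commons:commons-lang3:3.4')
import org.apache.commons.lang3.StringUtils

// the same word-count mapper, but using the grabbed library to do the splitting
System.in.eachLine { line ->
    StringUtils.split(line).each { word -> println "${word}\t1" }
}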

Thursday, October 29, 2015

LP-reduction: A Groovy Framework for Linear Programming Reductions

In this post I present a Groovy framework for Linear Programming (hereafter LP) reductions. The term 'framework' denotes that its expressive power covers the reduction of any problem P into LP.

Given a problem P with a domain D and a range (or solution domain, if you wish) R, we say that P reduces to LP if you can use an algorithm that solves LP to solve P. The reduction consists of 3 steps (a nice visual representation can be found at page 9/46 of this):
1. given an instance x of P, map it to an instance y of LP;
2. solve y by an algorithm (in this case Simplex) for LP, to derive a solution, let's denote it with LP(y);
3. map LP(y) to the solution of x.

The mapping functions needed by steps (1) and (3) are passed as inputs in the form of Groovy closures, while step (2) is delegated to the Apache Math3 Simplex implementation by default, although abstractions are provided for replacing it with the solver of your choice.

Of course, not every possible combination of mapping functions in steps (1) and (3) leads to a reduction. You need to prove that a certain combination is proper. The framework presented here does not verify that the combination is proper (there are fundamental limitations even if we wished to do so); it assumes that it is.

Background

Generally, algorithmic reductions are very useful for both theoretical and practical reasons. For instance, if A reduces to B polynomially (i.e. the mapping functions are polynomial) and B is a polynomial problem, then A is polynomial as well (for more details on the significance of reductions you can read this or this). In addition, if you have an efficient implementation of an algorithm for B, but not for A, you can leverage that implementation.

Specifically for B=LP:
1. LP is polynomial. The ellipsoid algorithm solves this polynomially, although the simplex algorithm (which is exponential in the worst case) is much faster in practice.
2. The expressive power of LP and its massive application in the financial sector (see e.g. operations research) have led to very efficient commercial simplex implementations offered by many business analytics vendors, like IBM.

Source code, Dependencies, License

The lp-reduction framework is an open source project licensed under the Apache License, Version 2.0.
The code is hosted on github under the lp-reduction repository.
Gradle is used for building and dependency management.
The only compile time dependency is the Apache Commons Mathematics Library, while testing is done with Spock.

How it works

The implementation is straightforward as the main class (LPReduction) suggests:


Step (1) is implemented by the 'mapToLP' method. It takes as input an instance x of P and returns the LP instance y. To do so, it calls the 'vars'/'objFunc'/'constraints' closures, which are defined as class fields, in order to create the LP variables/objective function/constraints respectively. How they are initialized will be clarified later with an example.

Notice that the LPInstance class has two more boolean fields, 'max' and 'nonNegative'. The former indicates whether this is a maximization (the default) or a minimization problem (set max=false for minimization), while the latter indicates whether all the variables are constrained to be non-negative.

Moreover, it is possible for an LP instance to have an empty (technically, any constant, usually 0) objective function, in which case it is often called a constraint satisfaction problem. This happens when you only need to know whether there is a solution that simultaneously satisfies all the constraints, without any quantity to maximize/minimize. This case is handled by the line:
objFunc ? objFunc(x) : new LPFunction([:], 0)
which means "if the closure objFunc is set (i.e. not null), call it, else use the empty objective function (i.e. f(x)=0)".

Step (2) is implemented by the 'solve' method, which takes as input an LP instance y and returns its solution. If no solver is set (the default), the Apache Math3 Simplex is used to solve y; otherwise the user-provided solver is used. To allow this, the LPSolver interface provides the proper abstraction:

The solution is modeled by the Solution class:


The whole reduction is implemented by the 'reduce' method. It runs the first two steps as explained above and then calls the 'result' closure to map the solution of the LP instance y to the solution of P for x.

The Max Flow Reduction Example

The code below exercises the lp-reduction framework for reducing the Max Flow problem to LP. You can read here what the Max Flow problem is and how an instance of it can be mapped to an LP instance.

As shown above, you can initialize the fields of an LPReduction object with Groovy's 'with' method. After defining the LPReduction object, you can exercise it as shown in MaxFlowSpec.groovy. Notice that the code inside the vars/objFunc/constraints/result closures can be written in Java instead of Groovy. However, you cannot replace an entire closure with a Java 8 lambda, as Groovy does not support the Java 8 lambda syntax.
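
As an aside, this is what 'with'-style initialization looks like on a toy class (a standalone illustration, not LPReduction code):

class Config {
    String name
    int retries
}

def cfg = new Config().with {
    name = 'demo'
    retries = 3
    it            // 'with' returns the closure's value, so return the configured object
}
assert cfg.name == 'demo' && cfg.retries == 3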

The full code can be found on github under the lp-reductions repository.

The Graph Realization Reduction Example

The following code exercises the lp-reduction framework for reducing the Graph Realization problem to LP.


Notice that in its original version Graph Realization is a decision problem, i.e. is there any graph with the given degree sequence, YES or NO? In practice, it is often useful to ask for a graph with the given sequence, if such a graph exists. To define the decision version you need to define the result closure as:
result = { Solution s-> s.state==LPIState.FEASIBLE }
while for the modified version you need to define it as
result = { Solution s -> s.state==LPIState.FEASIBLE ? s.at : null }

After defining the LPReduction object, you can exercise it as shown in GraphRealizationSpec.groovy.

Saturday, July 11, 2015

BDD with Cucumber vs Spock

In this post I compare Cucumber with Spock for BDD. To do this, I consider two simple test cases:
  • a book library that offers a 'search by publication year' feature, and
  • a salary manager that modifies the salary of one or more employees. 
I take the Cucumber code from this bitbucket repository. The Spock code can be found in the spock-example repository. The source code (the domain model) is the same in both cases; only the test code changes.

The Book Library Test Case


With Cucumber
Cucumber offers a DSL, called Gherkin, for writing feature specifications in English that are then associated with test code. A feature contains one or more scenarios, and each scenario consists of steps. A simple feature with one scenario for a book library could look like:

Feature, Scenario, Given, When, Then and And are Cucumber reserved keywords, while the rest is English. Each sentence starting with Given, When or Then defines a step. If you run this test, Cucumber will search for step method implementations associated with the steps defined in Gherkin. The association is done with annotations containing suitable regular expressions. If no such implementation exists, Cucumber will print to the console the signatures of the methods you have to implement, like:

You need to copy-paste this code into a test class and then provide an implementation for each of them. Notice, however, that some methods can be merged if we provide smarter regular expressions. For instance, the three methods annotated with @Given can be merged into one by providing regular expressions for the title, author and publication date. The most concise implementation is:

The class name ends with Steps to denote that it provides step implementations and starts with BookSearch to denote that these steps are associated with search_book.feature; however, the association is not done via the class name, but relies solely on the regular expressions with which the methods are annotated. Cucumber will search any class under the 'src/test' directory to find a method whose annotation matches a Gherkin step.

Cucumber also needs a test class that hooks into the Cucumber runner, like:


With Spock
Spock is a DSL for BDD written in Groovy. The search by publication year scenario can be written in Spock as:

This is very similar to the Gherkin syntax and equally easy to read. The difference is that Gherkin is a text document that is parsed by the Cucumber framework and associated with Java (or other language) code, while a Spock method is at the same time a natural-language specification and valid Groovy code. Therefore the implementation is done in the same Spock file, as follows:

The code that would implement the Cucumber step methods is mixed with the natural language specification in Spock. Spock allows the implementation to be written in Groovy or Java.

The Salary Manager Test Case


With Cucumber
A more complex test case is that of a salary manager that manages many employees and modifies their salaries. Although the original example considered the modification of a single employee's salary, I change the test to allow for different test input data, in the same manner as a DataProvider works in TestNG:

To run the same test with different input data, Cucumber offers the Scenario Outline and Examples keywords. Scenario Outline denotes that this scenario (test) needs to run multiple times, and the table after the Examples keyword provides the input data. Its first row contains column headers, which act as variables that can be used in the step text. The scenario runs once per data row. Notice that the first table contains initialization data, i.e. the employees that this manager manages. The same set of employees is created for each run.

The step method implementation follows:

Notice that the initialization data table is mapped to a List<Employee> object and passed by Cucumber as an argument to the the_salary_management_system_is_initialized_with_the_following_data method.

With Spock
The same feature specification could be written with Spock as follows:

Spock offers data tables for test input data (the analog of the Cucumber Examples table), but not for initialization data. This is a clear drawback. A workaround is to define the init data as a two-dimensional array. The implementation then needs to create a list of employees out of the array, as shown:
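
The embedded snippet is not reproduced here, but a self-contained sketch of the idea (a data table for the test inputs, a plain 2D array for the init data; names and numbers are made up) is:

import spock.lang.Specification

class SalaryAdjustmentSpec extends Specification {

    // init data as a 2D array: [name, salary] -- Spock has no table syntax for this part
    static initData = [['Alice', 1000.0], ['Bob', 2000.0]]

    def "increasing #name's salary by #percent percent gives #expected"() {
        given: 'a salary map built from the init array'
        def salaries = initData.collectEntries { [(it[0]): it[1]] }

        when: 'the salary is modified'
        salaries[name] = salaries[name] * (1 + percent / 100)

        then: 'the new salary is the expected one'
        salaries[name] == expected

        where:
        name    | percent || expected
        'Alice' | 10      || 1100.0
        'Bob'   | 20      || 2400.0
    }
}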


Conclusion

Both Cucumber and Spock tightly couple the natural language specification with the test code. This is a direct consequence of the BDD paradigm that both frameworks were created to support. But Cucumber does this in a stricter manner: changing the natural language can break the test code, e.g. with a missing step implementation if no method's regular expression matches the given step text. For Spock, on the other hand, the text is an arbitrary string after the ':' symbol; it is not validated against the implementation that follows.

Cucumber also offers a clear separation between the natural language specification and the test code. This is an advantage when people other than developers have to write or read the specs. In the end, the very essence of BDD is the close collaboration of product owners, BAs, QAs, architects and developers, so that the specification is agreed and understood by all before development starts.

On the other hand, Spock offers a fast, concise, single-file solution. Groovy's ability to use any string as a method name allows for expressive test case names. It gives developers a single place to read and understand the spec and the code that implements it, not to mention the goodies Spock comes with, like built-in mock support and advanced data table capabilities (e.g. using closures in table cells).

To conclude, Cucumber looks to be a better fit for integration or end-to-end tests, which involve the interaction of people with different skills and backgrounds, while Spock is a better fit for unit tests, which are more or less solely the developers' responsibility.

Friday, June 26, 2015

Groptim: A Groovy DSL for Linear Programming

In this post I show how Groovy meta-programming features can be leveraged to implement a Linear Programming (hereafter LP) DSL, called Groptim. Internally, Groptim uses the Apache Commons Mathematics Library and its Simplex implementation (see org.apache.commons.math3.optim.linear.SimplexSolver) to solve the LP instances.

Source code, Dependencies, License

Groptim is an open source project licensed under the Apache License, Version 2.0.
The code is hosted on github under the groptim repository.
Gradle is used for building and dependency management.
The only compile time dependency is the Apache Commons Mathematics Library, while testing is done with Spock.

Usage for Groovy Developers

Plenty of examples of how to use Groptim from Groovy code can be found in LPSolverSpec.groovy. In this section, I pick some representative examples to present the flexibility and the limitations of the syntax.

Every LP instance can be written in the LP canonical form, which requires all the constraints to be "less than or equal to" and all the variables to be non-negative:

Some flexibility beyond the canonical form is offered:
  1. you can define min instead of max to minimize the objective function;
  2. constants are allowed in the objective function; 
  3. variables are allowed to be negative (if you want one to be non-negative, you need to make this explicit with a constraint).
Groptim strictly adheres to the canonical form for constraints. All constraints are treated as "less than or equal to", no matter what the comparison symbol is ('<=', '>=' or '='). The reason for this decision is described in a later section.

Let's see what an LP instance looks like in Groptim:
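
The embedded snippet is not reproduced here; a reconstruction in the same style, consistent with the explanation and the solution quoted below (the first two constraints are my own assumption), is:

def lp = new LPSolver()

lp.max { 15*x0 + 10*x1 + 7.c } subjectTo {
    x0 <= 2.c              // assumed
    x1 <= 3.c              // assumed
    x0 + x1 <= 4.c         // canonical half of x0 + x1 = 4
    -x0 - x1 <= -4.c       // the other half, multiplied by -1
}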

The first line obviously creates an object of the main Groptim class, which is LPSolver.groovy. To define a maximization problem we write lp.max (for minimization write lp.min).

Immediately after the max (or min), we have to define the Objective Function (hereafter OF) in braces {} (the space between max and the braces is optional). In this case the OF is 15*x0 + 10*x1 + 7. Notice that we write 7.c to denote the constant number 7. This is needed because every term of the OF (and the OF as a whole) has to be evaluated by Groovy as an Expression object. We will dive into the implementation details in the next section. Notice that constants are not needed in the OF: finding a vector x that maximizes (or minimizes) 15*x0 + 10*x1 + 7 is equivalent to finding a vector x that maximizes (resp. minimizes) 15*x0 + 10*x1. This is why the canonical LP form does not allow constants in the OF. For Groptim, a constant in the OF is optional.

After the objective function block, the keyword 'subjectTo' has to follow, to denote that the block after it defines the constraints. Instead of 'subjectTo', one can write 'subject() to', but not 'subject to'. Looking at the constraints, the first two are straightforward, but the 3rd and 4th are the result of canonizing the non-canonical constraint x0 + x1 = 4. This is equivalent to x0 + x1 <= 4 and x0 + x1 >= 4, which should both be satisfied. The first is canonical; the second can easily be canonized by multiplying both sides of the inequality by -1 and therefore changing '>=' to '<=': -x0 - x1 <= -4.

That's it, this is valid Groovy code. After the subjectTo block you can access the lp fields to read the solution of the LP instance, e.g.:

which means that the OF is maximized at (x0, x1) = (2, 2) and its maximum value is 57.

Notice that any identifier (except for Groovy reserved keywords) can be used as a variable name, e.g. instead of x0, x1, x2, one can write x, y or myvar. Also, coefficients can be arbitrary real numbers and the standard arithmetic rules apply, e.g.
  • 0*x1 is equivalent to 0.c
  • 1.7*x0 - x1 is equivalent to 1.7*x0 + -x1
Also, non-linear OFs or constraints are not allowed in LP, so Groptim throws an IllegalStateException if you write something like x0*x1 or (2*x0 + 5.c)*(4*x1-3.c).

Usage for Everyone

One of the many reasons why Groovy is "groovy" is that it can be used as a scripting language. In the script/ directory I have included the groptim script along with two input LP files written in Groptim:

Notice that the lp object is not needed in this case; I will explain below how the script silently injects it. You can run the groptim script to solve the LP instances described in these files and get the solution printed in your console as:

For this magic to happen, you need to install Groovy and have the groovy command in your path. Notice that the groptim script has 2 dependencies (managed by Grape): Apache Math ('org.apache.commons:commons-math3:3.5') and Groptim itself ('org.ytheohar:groptim:1.0'). Both will be fetched from the Maven central repository. When the groptim script runs, it will have Ivy copy these artifacts under your .groovy/grapes/ directory.

How it works

The objective function, as well as the left and right part of each constraint, is evaluated as an Expression, or more precisely as one of its subclasses, i.e. NumberExpression, LPVar or BinaryExpression. Take for instance 15*x0 + 10*x1 + 7.c:
  1. x0 is evaluated as an LPVar
  2. 15*x0 as a BinaryExpression
  3. x1 as an LPVar
  4. 10*x1 as a BinaryExpression
  5. 7.c as a NumberExpression
  6.  15*x0 + 10*x1 as a BinaryExpression
  7.  15*x0 + 10*x1 + 7.c as a BinaryExpression
The simplest form of an expression is a NumberExpression. To evaluate a constant into an expression (step 5), I had to add a new method, getC, to the java.lang.Number class. This is done in the LPSolver constructor as follows:
Number.metaClass.getC { new NumberExpression(delegate) }

Then we can append .c to any number and Groovy will evaluate this by delegating to the getC method, e.g. 7.c is equivalent to 7.getC(), which in turn evaluates to new NumberExpression(7).

The evaluation of x0 and x1 (steps 1, 3) as LPVar objects is more interesting. This is a good point to introduce the LPSolver class which implements the core of the Groptim DSL:

The max method takes as its argument the OF block, which Groovy evaluates as a closure, and passes that closure to the optimize method. This method changes the delegate object from the object that owns the closure (i.e. the one it was defined in) to this (i.e. the LPSolver instance). Then it calls the closure, which triggers the evaluation described above of the OF block into a BinaryExpression.

On the first occurrence of x0 in the OF block, Groovy will search for a field x0 in the delegate object (which is the LPSolver instance). It will not find one, so it will call the propertyMissing method on the delegate object. The implementation of this method first checks whether a variable with the name 'x0' has been registered before, and if not it registers it. It is important to understand that if I had not set the delegate to this in the optimize method, Groovy would have called the propertyMissing method on the object that owns the closure, possibly throwing a MissingPropertyException. This would be the client's class, e.g. LPSolverSpec when we run a test.

Both LPVar and NumberExpression define a negative() method, to which Groovy delegates the unary '-' operator, e.g. -x0 (strictly no space between '-' and x0) is equivalent to x0.negative(), and -7.c to 7.getC().negative().

A number multiplied by an LPVar is evaluated (steps 2 and 4) as a BinaryExpression. To do this, I had to add a multiply method to Number's metaclass (see the LPSolver constructor). This makes 15*x0 equivalent to 15.multiply(x0), which evaluates to a BinaryExpression.
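
The same trick, in isolation on a toy class (a standalone illustration, not Groptim's actual code):

class Money {
    BigDecimal amount
    Money multiply(Number n) { new Money(amount: amount * n) }
}

// teach Number a multiply(Money) overload, so that 3 * money also works
Number.metaClass.multiply = { Money m -> m.multiply(delegate) }

assert (3 * new Money(amount: 2.50)).amount == 7.50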

Groovy delegates the symbols '+', '-' (when there is one or more spaces between '-' and a variable, e.g. x0) and '*' to the plus, minus and multiply methods respectively (steps 6, 7). It should be no surprise that the Expression class defines all of them, to make the composition of Expressions (BinaryExpressions, to be more precise) from other Expressions possible:

Finally, Groovy delegates all the comparison operators ('<', '>', '<=', '>=') to the compareTo method, e.g. x0 + x1 <= 4.c is equivalent to (x0 + x1).compareTo(4.c) <= 0. Notice that the compareTo implementation creates a new constraint and registers it in the solver object. This poses a limitation for Groptim, as any comparison operator is delegated to the same method and there is no way to know which comparison operator triggered the call. Therefore, Groptim treats every constraint as a '<=' constraint, and this is the reason why Groptim strictly follows the canonical LP form for constraints.

It should be stressed that Groovy delegates the '==' operator to the equals method (e.g. x0 == 2.c is equivalent to x0.equals(2.c)), unless the operand's class implements the Comparable interface, in which case it delegates to compareTo (e.g. x0 == 2.c is equivalent to x0.compareTo(2.c) == 0). Therefore, Groptim considers a '==' constraint as a '<=' constraint too.

More precisely, even if the operand's class implements the Comparable interface, the '==' operator is not directly delegated to the compareTo method. Take for instance the constraint x0 == 2.c. If x0 (an LPVar instance) is not of a subtype of 2.c's type (NumberExpression) and 2.c is not of a subtype of x0's type, then Groovy will not call compareTo at all, as it concludes that the two objects are not equal. This check is done in the Groovy SDK class org.codehaus.groovy.runtime.typehandling.DefaultTypeTransformation, in its compareToWithEqualityCheck method:

For this reason, I defined both LPVar and BinaryExpression as subclasses of NumberExpression. Otherwise, compareTo would not be called for '==' constraints and I would lose them, since constraint objects are created in the compareTo method.

Another interesting point is the return type of the optimize (and therefore max and min) method. It is a map that maps the String 'subject' to the closure { it -> ['to': { Closure cTo-> subjectTo(cTo)}]} and the String 'subjectTo' to the closure { it -> subjectTo(it)}. As you may have realized, this allows for the two alternative Groptim syntaxes for constraints, i.e. 'subject() to {}' and 'subjectTo {}' respectively.

When you write 'subjectTo {}', you effectively ask for the value that 'subjectTo' maps to. This value is a closure that takes as its argument another closure, denoted by it (representing the constraints block), and passes it to the subjectTo method. Pure fun, isn't it?

The case where you write 'subject() to {}' is a little more interesting (more fun). In this case, you effectively ask for the value that the String 'subject' maps to, which is a closure. By writing () you call this closure, which returns the map ['to': { Closure cTo -> subjectTo(cTo) }]. Obviously, this maps the String 'to' to another closure, { Closure cTo -> subjectTo(cTo) }. This closure is returned when you write 'to'. In turn, it takes as its argument another closure (denoted by cTo), which represents the constraints block. This is called a command chain. In Groovy, when you call a method with arguments, the () can be omitted (e.g. 'subjectTo {}' is equivalent to subjectTo({})), but when the method has no arguments, () is mandatory. This is why you cannot write 'subject to {}'.
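
Stripped of all the LP machinery, the trick looks like this (a standalone illustration, not Groptim's actual code):

class MiniSolver {
    def subjectTo(Closure c) { println "constraints block evaluated to: ${c()}" }

    def max(Closure objective) {
        println "objective block evaluated to: ${objective()}"
        ['subject'  : { -> ['to': { Closure cTo -> subjectTo(cTo) }] },
         'subjectTo': { Closure c -> subjectTo(c) }]
    }
}

def lp = new MiniSolver()

// both spellings end up calling MiniSolver.subjectTo with the same closure
lp.max { 1 + 2 } subjectTo { 3 + 4 }
lp.max { 1 + 2 } subject() to { 3 + 4 }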

Reducing Boilerplate when Running as a Script

Groovy offers a GroovyShell class with which you can run Groovy code as if it were a script. One can initialize a GroovyShell object with a Binding object, which implements what is commonly known as an 'environment' in programming languages, i.e. a set of bindings as key-value pairs. I used this mechanism to inject the LPSolver object into any input file, so that the groptim script user does not need to create the LPSolver object. Take a look at the groptim script code:


Whenever an input file calls max (or min), the call is delegated to the lp.max (lp.min) method. This is achieved by binding the name max (min) to a method reference to the max (min) method of the lp object. By calling the evaluate method on the GroovyShell object you can then evaluate Groovy code, which can either be contained in a file passed as an argument (as in this case) or be passed directly as a String.
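
The mechanism in isolation (hypothetical names, not the groptim script itself) looks like this:

def binding = new Binding()
// bind the name 'greet' to a closure, so that evaluated scripts can call greet(...) directly
binding.setVariable('greet', { name -> println "Hello, $name!" })

def shell = new GroovyShell(binding)
shell.evaluate("greet 'world'")      // prints: Hello, world!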

Internal Simplex Engine

By default (see the no-argument LPSolver constructor), Groptim uses the Apache Math Simplex implementation to solve LP instances. However, one can replace it with any other implementation, as long as it implements the SolverEngine interface.