Thursday, November 26, 2015

Hibernate: Good but not Enough

ORM is an abstraction over data (SQL/NoSQL) management systems that helps developers decouple application code from the underlying data management system. Besides the benefits that decoupling of any sort of components brings to software development, such as the ease of testing components separately and the flexibility to change the implementation of one component without affecting the other, ORM tools also reduce the developer skillset required to build n-tier applications. For instance, a Java developer only needs to know how to use Hibernate. JPQL allows them to keep thinking in terms of objects even when accessing the data layer. It seems there is no need for knowledge of relational algebra, indexing or query optimization.

This perception can be true for many small/medium-sized companies with non-data-intensive applications. If your database is relatively small and all the data, as well as the query computation on that data, fits in the database server's memory, then good Hibernate knowledge is clearly enough. In this setting, an inefficient mapping of JPQL into SQL queries by Hibernate, or an inefficient query evaluation strategy by the database, might not significantly harm application performance. But as the database grows, and especially once the query computation no longer fits in the database server's memory, these inefficiencies are revealed.

This post is not an argument against Hibernate (or ORM in general). The point I want to make is that the general perception that Hibernate allows developers to be oblivious to the underlying data storage model (e.g. the relational model) or to the data management system implementation (e.g. PostgreSQL) does not hold once you need to handle datasets of significant size.

Setup

As the data management system I will use PostgreSQL v9.4. Postgres is an open-source object-relational database. Although it supports some functionality inherited from object-oriented databases (e.g. subtables and tuple identifiers), in practice it is commonly used as a purely relational database. One of the reasons for its popularity over other open-source relational databases is its advanced query optimizer.

It is critical to mention the shared_buffers option in the postgresql.conf file. This is the memory Postgres allocates from the OS for its shared buffer cache, i.e. the pages it keeps in memory to run its computations. Postgres implements its own buffer replacement policy (a clock-sweep approximation of LRU) for this memory fragment, on top of the OS buffer replacement policy. It does not matter how much memory your machine has; Postgres will only use the amount selected for this parameter for its buffer cache. It is critical to select a number that is much smaller than the database size for the performance issues to be revealed. For this demo, where I use a database of 1GB, I select shared_buffers=128MB.
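For reference, this is the relevant line in postgresql.conf for this demo (a server restart is needed for the change to take effect):

shared_buffers = 128MB        # memory Postgres uses for its shared buffer cache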

As a dataset I will use a synthetic relational dataset generated with the well-known Dell DVD Store generator, which is also used (and extended) by VMware to benchmark database management systems (hereafter DBMS) on its VMs. The generator allows you to specify the size of the database, the DBMS and the operating system. Steps to generate the dataset on Windows follow:
  1. Download ds21.tar.gz and ds21_postgresql.tar.gz into the same directory.
  2. Unzip both of them (e.g. select 'Extract here' with WinRAR) in the same directory.
  3. Go to the ds2/ directory and run from the command line: perl Install_DVDStore.pl. You will be prompted to input some parameters; select:
    1. 1 GB (it first asks for the number and then for specifying if the number denotes MB or GB)
    2. PGSQL
    3. WIN (change to LINUX if you run on Linux)
  4. Step (3) creates the dataset as csv files under the cust/, orders/ and prod/ directories inside ds2/data_files. These are loaded into the database in the next steps using the Postgres 'COPY' command. Unfortunately, every second line in the csv files is empty, and the 'COPY' command does not allow empty lines: it translates them into missing column values and fails. You will need to remove the empty lines. You can do this in many ways; I did it with a groovy script (a sketch of such a script follows this list). Just put the csv files in a directory, e.g. 'input/', and run from the command line (assuming you have groovy installed): removeBlankLines input. This will create a 'clean/' directory with the output csv files.
  5. Create a user ds2 and a database ds2 on Postgres, e.g. on pgAdmin3.
  6. Run the script ds2/pgsqlds2/pgsqlds2_create_all.sh, or open the script and run the commands one-by-one.
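The clean-up script referenced in step 4 could be as simple as the following sketch (the script and directory names follow the description above; the actual script may differ):

#!/usr/bin/env groovy
// removeBlankLines.groovy: copy every csv file of the given directory into clean/, dropping empty lines
def inputDir = new File(args[0])
def outputDir = new File('clean')
outputDir.mkdirs()
inputDir.eachFileMatch(~/.*\.csv/) { file ->
    new File(outputDir, file.name).withWriter { writer ->
        file.eachLine { line ->
            if (line.trim()) {
                writer.writeLine(line)
            }
        }
    }
}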
I have built a demo Spring web app on top of this database using the spring-data-jpa and spring-web-mvc components, and let spring-boot do the orchestration. I create a domain model for the customer and order entities and map them to the existing database schema with JPA annotations. For this demo, I do not touch the rest of the entities. I use 'spring.jpa.hibernate.ddl-auto=validate' in the application.properties file to tell Spring to use the existing database, rather than creating the schema from the JPA annotations every time the app starts. As my focus is on the data layer, I annotate the controller with @RestController, to tell Spring to produce JSON responses. The code is uploaded on github.
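A minimal sketch of the Customer mapping follows (the attribute list is abbreviated and the name of the order entity is an assumption; the actual classes are on github). The detail that matters later is the lazy one-to-many association to the orders:

import java.util.List;
import javax.persistence.*;

@Entity
@Table(name = "customers")
public class Customer {

    @Id
    private Integer customerid;

    private Integer age;
    private String gender;
    private Integer income;
    private String country;

    // LAZY: the orders are fetched only when getOrders() is actually called
    @OneToMany(mappedBy = "customer", fetch = FetchType.LAZY)
    private List<Order> orders;

    public List<Order> getOrders() {
        return orders;
    }

    // remaining fields, getters and setters omitted
}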

To log the SQL queries that reach the database I use P6spy. Notice that setting 'hibernate.show_sql=true' shows all generated SQL statements with '?' in place of SQL parameter values. If you want to log the actual SQL queries as they reach the database, you need to use a proxy, like P6spy.
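For reference, the standard way to wire P6spy in is to route the JDBC URL through its driver; something along these lines in application.properties (the credentials and exact values are assumptions for this demo, and P6spy's own logging options live in spy.properties):

# route JDBC traffic through P6spy so the actual SQL is logged
spring.datasource.driver-class-name=com.p6spy.engine.spy.P6SpyDriver
spring.datasource.url=jdbc:p6spy:postgresql://localhost:5432/ds2
spring.datasource.username=ds2
spring.datasource.password=ds2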

Run the application with:
mvn spring-boot:run

Information Need

Let's say you are asked to develop a REST service that returns the customers that have ordered at least once, along with their orders, filtered by age, gender, income and country. For instance, this could be done with a GET request on the URL http://localhost:8080/report1/18/M/20000/UK, which asks for male customers aged 18, with an income of 20000, who are UK residents.

For demo purposes I have implemented 3 versions of this REST service: report1, report2 and report3. They return the same results, but are implemented in different ways. The controller, service and repository code follow.
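Only the repository is sketched below, since it is what the rest of the post refers to. The method names are the ones used in the text; the JPQL strings are my reconstruction of what such queries would look like (the actual code is on github):

import java.util.List;

import org.springframework.data.jpa.repository.JpaRepository;
import org.springframework.data.jpa.repository.Query;

public interface DVDStoreRepository extends JpaRepository<Customer, Integer> {

    // derived query: a plain selection on the customers table
    List<Customer> findByAgeAndGenderAndIncomeAndCountry(
            int age, String gender, int income, String country);

    // JPQL join: returns only the distinct customers that have at least one order
    @Query("select distinct c from Customer c join c.orders o "
            + "where c.age = ?1 and c.gender = ?2 and c.income = ?3 and c.country = ?4")
    List<Customer> findByAgeAndGenderAndIncomeAndCountryJoin(
            int age, String gender, int income, String country);

    // JPQL fetch join: additionally tells Hibernate to fetch the orders eagerly
    @Query("select distinct c from Customer c join fetch c.orders o "
            + "where c.age = ?1 and c.gender = ?2 and c.income = ?3 and c.country = ?4")
    List<Customer> findByAgeAndGenderAndIncomeAndCountryFetchJoin(
            int age, String gender, int income, String country);
}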


JPQL to SQL Query Translation

Report1

Request: GET http://localhost:8080/report1/18/M/20000/UK

DVDStoreService.report1 method calls the DVDStoreRepository.findByAgeAndGenderAndIncomeAndCountry method. Hibernate first generates a selection SQL query on the customers table to get all the customers that match the given inputs, as follows.
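The original listing (customer_selection.sql) is not reproduced here; its shape is roughly the following (Hibernate generates aliases and lists every mapped column, abbreviated here):

select customer0_.customerid, customer0_.age, customer0_.gender,
       customer0_.income, customer0_.country   -- full column list abbreviated
from customers customer0_
where customer0_.age=18 and customer0_.gender='M'
  and customer0_.income=20000 and customer0_.country='UK'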


This query returns all the matching customers, irrespective of whether they have placed an order or not. Therefore, the DVDStoreService.report1 method needs to filter them by calling the getOrders method on each one of them. Due to the LAZY strategy defined in the Customer class, the Customer objects in the returned list do not have the orders field populated, so each getOrders call triggers a selection SQL query on the orders table to get the orders of that customer, if any. As 106 customers match the first query, 106 queries of the following form will run (1997369 is one of the 106 customerids).
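The shape of these queries (orders_selection.sql) is roughly:

select orders0_.orderid, orders0_.orderdate, orders0_.customerid,
       orders0_.netamount, orders0_.totalamount   -- full column list abbreviated
from orders orders0_
where orders0_.customerid=1997369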


Notice that only 3 out of the 106 customers have ordered, i.e. the remaining 103 queries return nothing.

Report2

Request: GET http://localhost:8080/report2/18/M/20000/UK

DVDStoreService.report2 method calls the DVDStoreRepository.findByAgeAndGenderAndIncomeAndCountryJoin method. As you can observe, this method is annotated with a JPQL query that joins customers with orders and returns all the distinct matching customers. Distinct is needed, as a customer may have ordered multiple times. This query computes exactly what the service needs to return, so no additional logic is needed in the report2 method. Hibernate first generates an SQL join of the following form.
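A sketch of this join query (join.sql); note that only customers columns are projected:

select distinct customer0_.customerid, customer0_.age, customer0_.gender,
                customer0_.income, customer0_.country   -- customers columns only, abbreviated
from customers customer0_
inner join orders orders1_ on customer0_.customerid=orders1_.customerid
where customer0_.age=18 and customer0_.gender='M'
  and customer0_.income=20000 and customer0_.country='UK'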

Although this SQL query computes the join of the customers and orders relations, it only returns the customers table attributes. Therefore, Hibernate does not populate the orders field of the matching Customer objects. However, during the JSON serialization phase, Jackson (the library Spring uses for JSON serialization by default) calls the getOrders method on each of the 3 matching Customer objects. This triggers Hibernate to send 3 more SQL queries of the orders_selection.sql form to the database, one per matching customerid.

Report3

Request: GET http://localhost:8080/report3/18/M/20000/UK

DVDStoreService.report3 method calls the DVDStoreRepository.findByAgeAndGenderAndIncomeAndCountryFetchJoin method. This is annotated with a JPQL query very similar to findByAgeAndGenderAndIncomeAndCountryJoin, with the only difference that instead of JOIN it uses JOIN FETCH. This tells Hibernate to eagerly fetch all the information needed to reconstruct the matching Customer objects. This JPQL query is translated into a single SQL query, which follows.
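A sketch of its shape (join_fetch.sql); compared to join.sql, the projection now also includes the orders columns:

select distinct customer0_.customerid, customer0_.age, customer0_.gender,
                customer0_.income, customer0_.country,
                orders1_.orderid, orders1_.orderdate, orders1_.customerid,
                orders1_.netamount, orders1_.totalamount   -- column lists abbreviated
from customers customer0_
inner join orders orders1_ on customer0_.customerid=orders1_.customerid
where customer0_.age=18 and customer0_.gender='M'
  and customer0_.income=20000 and customer0_.country='UK'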


If you compare this (join_fetch.sql) with the SQL join query (join.sql) generated for the report2 method, they differ only in the first line, i.e. in the attributes of the join relation they return (or project, in relational algebra terminology). join_fetch.sql returns all the attributes of the join relation, including the orders attributes, while join.sql returns only the customers relation attributes.

Concluding, I have shown how the same information need can result in 107 SQL queries (report1), 4 SQL queries (report2) or a single SQL query (report3). Generally, a large number of SQL queries can cause a significant performance problem, especially if each of them is slow and no caching is used. Minimizing this number is generally beneficial.

SQL Query Evaluation

In this section I focus on the Postgres evaluation plans for each SQL query referred to in the previous section. The EXPLAIN ANALYZE Postgres command is used to get the plans.
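For example, you get the plan for customer_selection.sql (as sketched above) by prefixing it with the command in psql or pgAdmin:

explain analyze
select customer0_.customerid, customer0_.age, customer0_.gender,
       customer0_.income, customer0_.country
from customers customer0_
where customer0_.age=18 and customer0_.gender='M'
  and customer0_.income=20000 and customer0_.country='UK';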

Report1

The evaluation plan for customer_selection.sql follows.


As one can observe, to evaluate this query Postgres sequentially scans the whole customers table to find the matching customers. Since only 106 out of 2,000,000 customers match, unnecessary I/Os take place to read all the rest from disk, resulting in a 2.1 sec execution time.

Each one of the 106 orders_selection.sql queries is evaluated as follows.

This query is evaluated very fast (0.075 ms) thanks to a BTree index on the orders.customerid attribute. This index is created by the dvdstore benchmark. Postgres does a Bitmap Index Scan on the index to find the pointers to all the matching tuples, and then a Bitmap Heap Scan on the table to fetch from disk the pages containing these tuples.

We conclude that customer_selection.sql should be a target for optimization.

Report2

The evaluation plan for join.sql follows.

Postgres accesses the two relations, does a Nested Loop join, and finally removes duplicates with a Hash Aggregate. The query is slow (1.966 sec) and to find out why, we should focus on the access paths used to read the two relations. For the orders relation Postgres does not touch the table at all. The only attribute it needs is customerid, and the BTree index ix_order_custid is built on that attribute. Therefore, Postgres does an Index Only Scan, which means it accesses only the index, not the table. Since the index is much smaller than the table, fewer I/Os are done. Moreover, the index fits in main memory, so the Nested Loop join algorithm can use it as the inner relation. This part of the plan is clearly optimal. Let's focus on the other access path, which is a Sequential Scan on the customers table. As I explained for the customer_selection.sql query, this is far from optimal. We keep join.sql as a second optimization target.

After join.sql, 3 orders_selection.sql queries run. As explained in the section above, these are evaluated very fast.

Report3

The evaluation plan for join_fetch.sql follows.

As one can observe, this plan is very similar to the one produced for join.sql, except that join_fetch.sql needs to return all the attributes of the orders relation in addition to the customers attributes. So, instead of an Index Only Scan, it does a Bitmap Index Scan on the BTree index and then reads the pages containing the matching orders tuples from disk with a Bitmap Heap Scan. The response time is 2.023 sec, which is unacceptable. We have a third optimization target.

SQL Query Evaluation Optimization

In the above section we found 3 queries with bad execution times: customer_selection.sql, join.sql and join_fetch.sql. For all of them the bottleneck is the same: the Sequential Scan on the customers table does unnecessary I/Os, because all 2,000,000 tuples are read from disk while only 106 (for customer_selection.sql) or 3 (for join.sql and join_fetch.sql) match. Therefore, we have a single optimization task: create an alternative access path to the customers relation that avoids, as much as possible, reading from disk pages that contain only irrelevant tuples.

The way to do this is to build an index on the customers table on one or more of the attributes on which we have filtering conditions, i.e. age, gender, income, country. But which of them? To answer this, we need to know the selectivity of each condition, i.e. the percentage of tuples each one matches. As one can easily verify by running count queries (an example follows the list):
  1. age=18 matches 27,303 tuples, i.e. 1.3%
  2. gender='M' matches 1,000,265 tuples, i.e. 50%
  3. income=20000 matches 398,642 tuples, i.e. 20%
  4. country='UK' matches 100,169 tuples, i.e. 5%
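For example, the selectivity of the age condition comes from the following query (and similarly for gender, income and country):

select count(*) from customers where age = 18;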
The lower the percentage of matching tuples, the higher the benefit from an index on the related attribute. In this case, age and country are two good candidates. On the other hand, gender and income are not, as they match too many tuples. It should be stressed that an Index Scan on income with the condition income=20000 would not touch only the 20% of tuples that match; it would touch all the pages containing at least one matching tuple. And notice that a page typically contains many tuples (how many depends on the tuple size). Therefore, the fraction of the table read from disk would be much higher than 20%. Notice also that, compared to a Sequential Scan, an Index Scan has the additional cost of reading the index from disk and following the pointers of the matching tuples to the table pages, let alone the cost to create and maintain the index during updates. Therefore, there is a point above which a Sequential Scan is preferred to an Index Scan, and although this point depends on many factors, it is generally estimated that there is no benefit if the matching tuples are more than about 20%.

Having found age and country as the two candidate attributes for indexing, the next question is which type of index to use. In theory (and in practice for other DBMSs like Oracle) a Hash index is best for equality ('=') conditions, while a BTree is best for range ('<', '>', '<=', '>=') conditions. In practice, however, BTrees are always preferred with Postgres, as the Hash index implementation does not perform very well and is not WAL-logged, meaning that in case of a DBMS failure it is lost and you have to rebuild it from scratch.

The final question is whether it is better to build a BTree index on age, on country, or on the combination (age, country). As previously discussed, an index comes with an overhead to create and maintain it during updates, e.g. the BTree is a balanced tree and may need to re-balance after an insert or a delete. Therefore, it is good practice to build an index that can benefit more than one query. In this demo we only have 3-4 queries, but in a real app you typically have many. A compound index on the (age, country) combination seems to match our optimization need best, but it can only be used by queries with a condition on age, or on both age and country; it cannot be used by queries with a condition on country only. Whether it is better to create two separate indices, one on age and one on country, or a compound one on (age, country), depends on the query workload of the application. But creating all three indices is out of the question.

I proceed with selecting the compound index for this demo. To create it, run:
CREATE INDEX ix_cust_age_country ON customers USING btree (age, country);

The evaluation plan for customer_selection.sql follows.

As one can observe, the access path has changed from a Sequential Scan to a Bitmap Index Scan on the compound index, and the response time is reduced from 2.1 sec to 4.15 ms.

Due to the same change in the access path, the response times for join.sql and join_fetch.sql are reduced from 1.96 sec to 6.12 ms and from 2.02 sec to 7.76 ms respectively, as shown by the following evaluation plans.

Monday, November 16, 2015

Integrate Cucumber with WebDriver: BDD in automation

In this post, I show how Cucumber can be integrated with Selenium WebDriver to bring BDD to end-to-end testing. As the system under test, I choose the BBC website and test basic navigation on the BBC news pages. The code is available on github.

Background

Make no mistake, Cucumber's main purpose is not to be a testing tool, but to enforce collaboration between scrum team members. The main objective is to ensure that all team members have the same understanding of the requirements of the feature they have to develop and test. Business analysts, who together with the product owner have the basic idea of the feature requirements, introduce it to the team, and the team as a whole finalizes the requirements in the form of Cucumber scenarios. These scenarios are written in Gherkin, a DSL very close to natural language. Scenarios can be created at different levels of granularity, e.g. for unit, integration or end-to-end tests. Cucumber initially runs the tests as failing, since the feature has not yet been developed, and then development starts. Scenarios do not only guide the testing, they also constitute living documentation of the feature.

In this post, I focus on end-to-end tests (frequently called 'automation'). These are typically written using the Selenium WebDriver framework. As is well known, end-to-end tests have a high maintenance cost: when the UI changes, tests might fail, as some web elements may no longer be present or may have been moved. The work you need to do to fix them depends heavily on the presence or absence in your test code of design patterns such as the Page Object and the Loadable Component patterns. In this post, I show the benefit of using them.

Setup

As a build and dependency management tool, I use maven. The pom file follows.
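The actual pom is on github; a sketch of the dependencies it needs follows (group/artifact ids are the standard ones for Cucumber-JVM 1.x, Selenium and Hamcrest at the time; the versions are indicative assumptions):

<dependencies>
  <dependency>
    <groupId>info.cukes</groupId>
    <artifactId>cucumber-java8</artifactId>
    <version>1.2.4</version>
    <scope>test</scope>
  </dependency>
  <dependency>
    <groupId>info.cukes</groupId>
    <artifactId>cucumber-junit</artifactId>
    <version>1.2.4</version>
    <scope>test</scope>
  </dependency>
  <dependency>
    <groupId>junit</groupId>
    <artifactId>junit</artifactId>
    <version>4.12</version>
    <scope>test</scope>
  </dependency>
  <dependency>
    <groupId>org.seleniumhq.selenium</groupId>
    <artifactId>selenium-java</artifactId>
    <version>2.48.2</version>
    <scope>test</scope>
  </dependency>
  <dependency>
    <groupId>org.hamcrest</groupId>
    <artifactId>hamcrest-all</artifactId>
    <version>1.3</version>
    <scope>test</scope>
  </dependency>
</dependencies>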

The Hamcrest library is optional; you can integrate Cucumber with Selenium WebDriver without it, but it is nice to have, as it provides an elegant way to write assertions. I also use 'cucumber-java8', as this allows steps to be defined as lambdas instead of methods, which reduces the boilerplate.

BBC News Feature

Let's say we are in the team that develops/maintains the BBC website, and in particular the news part. The feature we need to develop is that the user should be able to navigate from the home page to the news page, then select a news category, e.g. 'Science', then select a video page (let's say the first one, for simplicity) from the 'Watch/Listen' section, and then share this video page on Facebook. As the feature is already implemented by the real BBC team, you can try it. The sequence of the links could be the following, but always keep in mind that the Cucumber scenarios should be written before the development:
  1. BBC Home
  2. BBC News
  3. BBC News Science category
  4. BBC Science Video Page
  5. Facebook referral page
The Cucumber feature file follows.
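The actual feature file is on github; a sketch along the lines described below could look like this (the exact step wording and the extra category in the Examples table are assumptions):

Feature: BBC news navigation

  Scenario: Navigate to the news page via the 'Latest news' link
    Given the user is on the BBC home page
    When the user clicks the 'Latest news' link
    Then the news page is displayed

  Scenario: Navigate to the news page via the navigation bar
    Given the user is on the BBC home page
    When the user clicks 'News' on the navigation bar
    Then the news page is displayed

  Scenario Outline: Share the first video of a news category on Facebook
    Given the user is on the BBC home page
    When the user clicks 'News' on the navigation bar
    And the user selects the '<category>' category
    And the user clicks the first video of the 'Watch/Listen' section
    And the user shares the video on Facebook
    Then the Facebook referral page is displayed

    Examples:
      | category   |
      | Science    |
      | Technology |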

The first two scenarios test that the user can navigate to the BBC News page by clicking either on the 'Latest news' link of the homepage or on the 'News' link on the navigation bar at the top of the home page.

The third scenario, which more precisely is a 'scenario outline' (or a set of scenarios, if you wish), tests the main feature: navigating from the home page to the news page, then to a news category page, then to the first video page of that category, and finally sharing it on Facebook.

Gherkin is very easy to follow: in short, Feature, Scenario, Given, When, Then and And are Cucumber reserved keywords, while the rest is plain English. Each sentence starting with Given, When or Then defines a step. For more info about Gherkin, steps and scenario outlines, you can read the post BDD with Cucumber vs Spock. The main thing to keep in mind is that a Cucumber scenario is the analog of a junit/testng test method, while a scenario outline is the analog of a set of calls to the same testng test method taking an input whose value is provided by a testng data provider. The Cucumber analog of the data provider is the Examples table. In short, category is the input, and the scenario outline tests that the feature works for any BBC news category appearing in the Examples table.

The Page Object Pattern

The main goal of the Page Object pattern is to encapsulate the details of a page in a class and expose only methods relevant to the functionality this page offers. Take for instance the news category page: the locator of the first video page link in the 'Watch/Listen' section is an internal detail that should be hidden from the tests. The news category page object only needs to provide a 'clickFirstVideo' method, and that is the only thing the test code needs to be aware of.

The benefit is that when the UI changes and the locator is no longer valid, you will have to fix it in a single place: the related page object. Experience shows that if you don't use the Page Object pattern, the maintenance cost of the end-to-end code will explode sooner or later.

The Loadable Component Pattern

While the Page Object pattern hides the details behind the functionality offered by a page, the Loadable Component pattern hides the details behind the navigation to the page object. To apply it, WebDriver offers an abstract LoadableComponent class which your page object class has to extend. It has two abstract methods, 'isLoaded' and 'load', that the page object class has to implement, and a concrete method 'get' that calls 'isLoaded' to check if the page is already loaded and, if not, calls 'load' and then 'isLoaded' again to verify it has finally loaded. The idea is to pass to the constructor of a page object class a reference to the parent page object, and in the 'load' method of the child class to first call the 'get' method on the parent object (to load the parent page) and then call some functionality method on the parent object that loads the child page.

The implementation of the NewsCategoryPage class that implements both the page object and the loadable component patterns follows.
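The full class is on github; the sketch below captures its structure (the locator and the title check in isLoaded are illustrative assumptions, and NewsPage is the parent page object):

import static org.hamcrest.MatcherAssert.assertThat;
import static org.hamcrest.Matchers.containsString;

import org.openqa.selenium.By;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.support.ui.LoadableComponent;

public class NewsCategoryPage extends LoadableComponent<NewsCategoryPage> {

    private final WebDriver driver;
    private final NewsPage parent;
    private final String category;

    public NewsCategoryPage(WebDriver driver, NewsPage parent, String category) {
        this.driver = driver;
        this.parent = parent;
        this.category = category;
    }

    @Override
    protected void load() {
        // load the parent (news) page first, then navigate to this category page
        parent.get();
        parent.selectCategory(category);
    }

    @Override
    protected void isLoaded() throws Error {
        // illustrative assertion verifying the category page is displayed
        assertThat(driver.getTitle(), containsString(category));
    }

    // page functionality: open the first video of the 'Watch/Listen' section
    public void clickFirstVideo() {
        driver.findElement(By.cssSelector("#comp-media-asset a")).click(); // locator is illustrative
    }
}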

Notice the implementation of the 'load' method. It first calls 'parent.get()' to load the news page and then calls the 'selectCategory' method on the news page object to load the news category page.

It should be stressed that the Loadable Component pattern does not necessarily tie a page object class to a single navigation path. Take for instance the News page: there are 2 ways to navigate to it from the BBC home page, by clicking either the 'Latest news' link in the main section or the 'News' link on the navigation bar. The relevant part of the news page object class follows (you can find the whole class code on github).

Steps Implementation

The Steps implementation follows.
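The full step definitions are on github; the sketch below shows only a subset of them in the cucumber-java8 lambda style (HomePage and NewsPage are the assumed page object classes):

import cucumber.api.java8.En;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.firefox.FirefoxDriver;

public class BBCNewsSteps implements En {

    // a fresh browser per scenario; page objects follow the Loadable Component pattern
    private final WebDriver driver = new FirefoxDriver();
    private HomePage homePage;
    private NewsPage newsPage;

    public BBCNewsSteps() {
        Given("^the user is on the BBC home page$", () -> {
            homePage = new HomePage(driver);
            homePage.get();
        });

        When("^the user clicks 'News' on the navigation bar$", () -> {
            newsPage = new NewsPage(driver, homePage);
            newsPage.get();
        });

        Then("^the news page is displayed$", () -> {
            // nothing to assert explicitly: get() already ran the isLoaded() assertions
        });
    }
}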

You can run the scenarios either in eclipse, by right-clicking on BBCNewsTest.java -> Run As -> JUnit test, or with maven: mvn clean test

Further Improvement

A nice observation about the Loadable Component pattern is that if the page object class implements it and you call the 'get' method on it, you implicitly run assertions that verify the page has been 'correctly' loaded, where 'correctly' is defined by the 'isLoaded' method implementation. Despite its name starting with 'is', this method does not return a boolean value. Instead, its return type is void, while it 'throws Error'. It typically contains all the assertions needed to verify the page was loaded properly.

To understand this, recall how the get method is implemented in the LoadableComponent class.
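Its core logic is roughly the following (simplified from the Selenium source):

public T get() {
    try {
        isLoaded();          // already loaded? then we are done
        return (T) this;
    } catch (Error e) {
        load();              // not loaded yet: load it...
    }
    isLoaded();              // ...and verify it is now loaded (throws Error otherwise)
    return (T) this;
}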

It should now be obvious that the second scenario is redundant. If the news page does not load when you click the 'News' link on the home page navigation bar, then the step "When the user clicks 'News' on the navigation bar", which is shared by the second scenario and the scenario outline, will fail, due to the assertions in the NewsCategoryPage 'isLoaded' method. Therefore the second scenario can be safely omitted.

Saturday, November 7, 2015

Groovy Shell Scripting Meets Hadoop

In this post I show how Hadoop map and reduce functions can be written as Groovy shell scripts in a concise and elegant way.

The most common way to kick off a hadoop map-reduce task from the command line is a python script. Alternatively, one could write the map and reduce functions in java, but then one has to pay the additional cost of packaging them in a jar and running the jar from the command line. The difference between python and java is that python is a scripting language, while java is not. For instance, you can run python directly as a script by adding "#!/usr/bin/env python" at the top of your files.

Groovy fills this gap, being the first scripting language among the family of jvm languages. Reasons to prefer groovy over python include:
  1. You are a java/groovy developer and you want to write the map reduce functions in a familiar language;
  2. You want to leverage the power of jvm as well as of a vast amount of libraries written in any jvm language;
  3. You find python's syntax pathetic, e.g. change indentation and you change the code semantics;
  4. Groovy goodies allow you to quickly write concise and elegant code.

Requirements

For running hadoop, I use the cloudera vm (you can download it for free here). Of course, you can use the hadoop infrastructure of your choice. Hereafter, I refer to the cloudera vm, and you will need to adjust accordingly if you use an alternative.

Groovy needs to be installed on the machines that run hadoop. It is not on the cloudera vm, so you need to install it; read the instructions here.

The Wordcount Example

This is probably the most popular "hello world" hadoop example. The problem is stated as follows: given a collection of files that contain strings separated by blank spaces, compute the number of times each word occurs across the collection. The output should be a file with records of the form <word #occurrences>.

The mapper groovy script follows.
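The original script is not reproduced here; a minimal sketch matching the description below is:

#!/usr/bin/env groovy
// wordcount_mapper.groovy: for every word read from stdin emit "word<TAB>1"
System.in.eachLine { line ->
    line.trim().split(/\s+/).each { word ->
        if (word) {
            println "${word}\t1"
        }
    }
}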


The logic is clearly communicated by the code: for each line read from the standard input, trim it, get all the words (assuming they are separated from each other by blank spaces) and for each word emit "word 1" (a tab separates the key from the value), meaning that we found one occurrence of the word. The first line, #!/usr/bin/env groovy, tells the OS to use groovy to interpret this file.

The reducer groovy script follows.
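Again, a minimal sketch matching the description below (hadoop streaming feeds the reducer with lines sorted by key):

#!/usr/bin/env groovy
// wordcount_reducer.groovy: sum the occurrences of each word
def previous = null
def counter = 0
System.in.eachLine { line ->
    def parts = line.trim().split(/\t/)
    def word = parts[0]
    def count = parts[1].toInteger()
    if (previous != null && word != previous) {
        println "${previous}\t${counter}"   // a new word begins: emit the previous one
        counter = 0
    }
    previous = word
    counter += count
}
if (previous != null) {
    println "${previous}\t${counter}"       // emit the record for the last word
}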

Again, the logic is straightforward: for each line read from the standard input, trim it and extract the word and #occurrences fields from it. If the current word is different from the previous one, emit the record "word counter" and reset the counter. In any case, update the previous word with the current one and add the current value (#occurrences) to the total counter. At the end, emit the record for the last word.

You need to make these 2 scripts executable:
chmod +x wordcount_mapper.groovy
chmod +x wordcount_reducer.groovy


Assuming you have put the input files under the hdfs directory '/user/cloudera/wordcount_input', you can run hadoop with:

hadoop jar /usr/lib/hadoop-mapreduce/hadoop-streaming.jar -input /user/cloudera/wordcount_input -output /user/cloudera/wordcount_output -mapper wordcount_mapper.groovy -reducer wordcount_reducer.groovy

If you get a failure including the message "Container killed on request. Exit code is 143", this means that groovy is not visible to the datanode (the cloudera vm includes just one data node) that needs to run the map or reduce jobs. You may have installed groovy on the cluster node, but you need to install it on all the data nodes too, as hadoop pushes the computation to them.

A workaround for the cloudera vm is to replace "groovy" in the first line of your scripts with the absolute path of the groovy executable. To find where groovy is installed, run:
printenv GROOVY_HOME

For instance, if you have installed it with sdkman, groovy is located under '/home/cloudera/.sdkman/candidates/groovy/current/bin/' and you have to replace the first line of both scripts with:
#!/usr/bin/env /home/cloudera/.sdkman/candidates/groovy/current/bin/groovy

Although not needed for this problem, you can reuse any published jvm library. Grape will fetch the dependency from a repository, and you need to add import statements as you normally do in java. An example can be found here.
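For illustration, a tiny script that pulls a published library with Grape could look like this (the chosen library and version are arbitrary examples):

#!/usr/bin/env groovy
// Grape fetches commons-lang3 from Maven Central the first time the script runs
@Grab('org.apache.commons:commons-lang3:3.4')
import org.apache.commons.lang3.StringUtils

println StringUtils.capitalize('groovy meets hadoop')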