Thursday, November 26, 2015

Hibernate: Good but not Enough

ORM is an abstraction of data (sql/nosql) management systems that helps developers decouple the application code from the underlying data management system. Besides the benefits that decoupling of any sort of components brings to software development, like the ease of testing components separately and the flexibility to change the implementation of one component with no effect on the other, ORM tools also decrease the developer skillset required to build n-tier applications. For instance, a java developer only needs to know how to use Hibernate; JPQL allows him to carry on thinking in terms of objects even when he accesses the data layer. It seems there is no need for knowledge of relational algebra, indexing or query optimization.

This perception can be true for many small/medium sized companies with non data intensive applications. If your database is relatively small and all the data, as well as the query computation on the data, can fit in the database server memory, then a good knowledge of Hibernate is clearly enough. In this setting, an inefficient mapping of JPQL into SQL queries by Hibernate, or an inefficient query evaluation strategy by the database, might not significantly harm the application performance. But as the database size increases, and especially if the query computation cannot fit in the database server memory, these inefficiencies are revealed.

This post is not an argument against Hibernate (or ORM in general). The point I want to make is that the general perception that Hibernate allows developers to be oblivious of the underlying data storage model (e.g. the relational one) or the data management system implementation (e.g. PostgreSQL) does not hold if you want to handle large datasets.

Setup

As the data management system I will use PostgreSQL, v9.4. Postgres is an open-source object-relational database. Although it supports some functionality inherited from object-oriented databases (e.g. subtables and tuple identifiers), in practice it is commonly used merely as a relational database. One of the reasons explaining its popularity against other open-source relational databases is its advanced query optimizer.

It is critical to refer to the shared_buffers option of the postgresql.conf file. This is the memory Postgres allocates from the OS to run all its computations. Postgres implements its own buffer replacement policy (LRU by default) for this memory fragment, on top of the OS buffer replacement policy. It does not matter how much memory your machine has, Postgres will only use the amount selected for this parameter. It is critical to select a value that is much smaller than the database size, for the performance issues to be revealed. For this demo, where I use a database of 1GB, I select shared_buffers=128MB.
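For reference, the setting used for this demo is the following line in postgresql.conf (you can verify the active value from psql with 'SHOW shared_buffers;'):
shared_buffers = 128MB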

As a dataset I will use a synthetic relational dataset generated with the well-known DELL dvdstore generator, which is also used (and extended) by VMWare to benchmark database management systems (hereafter DBMS) on its VMs. The generator allows you to specify the size of the database, the DBMS and the operating system. The steps to generate the dataset on Windows follow:
  1. Download ds21.tar.gz and also ds21_postgresql.tar.gz into the same directory.
  2. Unzip both of them in the same directory (e.g. select 'Extract here' with WinRAR).
  3. Go to the /ds2 directory and run perl Install_DVDStore.pl from the command line. You will be prompted to input some parameters; select:
    1. 1 GB (it first asks for the number and then asks whether the number denotes MB or GB)
    2. PGSQL
    3. WIN (change to LINUX if you run on Linux)
  4. Step (3) creates the dataset as csv files under the directories cust/, orders/ and prod/ of the ds2/data_files directory. These are loaded in the next steps into the database using the 'COPY' Postgres command. Unfortunately, every second line in the csv files is empty, and the 'COPY' command does not allow empty lines; it translates them into missing column values and fails. You will need to remove the empty lines. You can do it in many ways; I did it with a groovy script. Just put the csv files in a directory, e.g. 'input/', and run (assuming you have groovy installed) removeBlankLines input from the command line. This will create a directory 'clean/' with the output csv files (a sketch of the 'COPY' command that loads them is shown after this list).
  5. Create a user ds2 and a database ds2 on Postgres, e.g. on pgAdmin3.
  6. Run the script ds2/pgsqlds2/pgsqlds2_create_all.sh, or open the script and run the commands one-by-one.
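For illustration, loading one of the cleaned csv files with 'COPY' looks roughly as follows; the file path and file name here are hypothetical and should be adjusted to the location of your 'clean/' directory (the actual load commands are part of the scripts of step 6):
COPY customers FROM 'C:/ds2/data_files/cust/clean/us_cust.csv' WITH (FORMAT csv);
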
I have built a demo spring web app on top of this database using the spring-data-jpa and spring-web-mvc components, letting spring-boot do the orchestration. I create a domain model for the customer and order entities and map them to the existing database schema with JPA annotations. For this demo, I will not touch the rest of the entities. I use 'spring.jpa.hibernate.ddl-auto=validate' in the application.properties file to tell spring to use the existing database, rather than creating the schema based on the JPA annotations every time the app starts. As my focus is on the data layer, I annotate the controller with @RestController, to tell spring to produce JSON responses. The code is uploaded on github.

To log the SQL queries that reach the database I use P6spy. Notice that setting 'hibernate.show_sql=true' shows all generated SQL statements with '?' in place of SQL parameter values. If you want to log the actual SQL queries as they reach the database, you need to use a proxy, like P6spy.
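A minimal way to wire P6spy in, assuming its dependency is on the classpath, is to route the JDBC connection through the p6spy driver in application.properties; the exact datasource properties below are an assumption, adjust them to your own setup:
spring.datasource.driver-class-name=com.p6spy.engine.spy.P6SpyDriver
spring.datasource.url=jdbc:p6spy:postgresql://localhost:5432/ds2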

Run the application with:
mvn spring-boot:run

Information Need

Let's say you are asked to develop a REST service that returns the customers that have ordered at least once, along with their orders, filtered by age, gender, income and country. For instance, this could be done with a GET request on the url http://localhost:8080/report1/18/M/20000/UK, which asks for customers that are 18 years old, male, have an income of 20000 and are UK residents.

For demo purposes I have implemented 3 versions of this rest service: report1, report2 and report3. They return the same results, but are implemented in different ways. The controller, service and repository code can be found in the github repository mentioned above.


JPQL to SQL Query Translation

Report1

Request: GET http://localhost:8080/report1/18/M/20000/UK

DVDStoreService.report1 method calls the DVDStoreRepository.findByAgeAndGenderAndIncomeAndCountry method. Hibernate first generates a selection SQL query on the customers table to get all the customers that match the given inputs, as follows.
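A sketch of the shape of this query (customer_selection.sql), with the full column list and Hibernate's generated aliases omitted:
select c.customerid, c.age, c.gender, c.income, c.country /* plus the rest of the customers columns */ from customers c where c.age = 18 and c.gender = 'M' and c.income = 20000 and c.country = 'UK'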


This query returns all the matching customers, irrespective of whether they have placed an order or not. Therefore, the DVDStoreService.report1 method needs to filter them by calling the getOrders method on each one of them. Since the Customer objects of the returned list do not have the orders field populated, due to the LAZY strategy defined in the Customer class, each getOrders call triggers a selection SQL query on the orders table to get the orders of each customer, if any. As 106 customers match the first query, 106 queries of the following form will run (1997369 is one of the 106 customerids).
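A sketch of the shape of these queries (orders_selection.sql), again with the column list abbreviated:
select o.orderid, o.customerid /* plus the rest of the orders columns */ from orders o where o.customerid = 1997369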


Notice that only 3 out of the 106 customers have ordered, i.e. the remaining 103 queries return nothing.

Report2

Request: GET http://localhost:8080/report2/18/M/20000/UK

The DVDStoreService.report2 method calls the DVDStoreRepository.findByAgeAndGenderAndIncomeAndCountryJoin method. As you can observe, this method is annotated with a JPQL query that joins customers with orders and returns all the distinct customers. Distinct is needed, as a customer may have ordered multiple times. This query computes exactly what the service needs to return, so no additional logic is needed in the report2 method. Hibernate first generates an SQL join of the following form.
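A sketch of the shape of this query (join.sql); note that only customers attributes are projected:
select distinct c.customerid, c.age, c.gender, c.income, c.country /* plus the rest of the customers columns, but no orders columns */ from customers c inner join orders o on c.customerid = o.customerid where c.age = 18 and c.gender = 'M' and c.income = 20000 and c.country = 'UK'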

Although this SQL query computes the join of the customers and orders relations, it only returns the customers table attributes. Therefore, Hibernate does not populate the orders field of the matching Customer objects. However, during the JSON serialization phase, Jackson (the library Spring uses for JSON serialization by default) calls the getOrders method on each one of the 3 matching Customer objects. This triggers Hibernate to send to the database 3 more SQL queries of the orders_selection.sql form, one per matching customerid.

Report3

Request: GET http://localhost:8080/report3/18/M/20000/UK

The DVDStoreService.report3 method calls the DVDStoreRepository.findByAgeAndGenderAndIncomeAndCountryFetchJoin method. This is annotated with a JPQL query very similar to that of findByAgeAndGenderAndIncomeAndCountryJoin, the only difference being that it uses JOIN FETCH instead of JOIN. This notifies Hibernate to eagerly fetch all the information needed to reconstruct the matching Customer objects. This JPQL query is translated into a single SQL query, which follows.
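A sketch of its shape (join_fetch.sql); it differs from join.sql only in the projected attributes, which now also include the orders columns:
select distinct c.customerid, c.age, c.gender, c.income, c.country, o.orderid, o.customerid /* plus the rest of the customers and orders columns */ from customers c inner join orders o on c.customerid = o.customerid where c.age = 18 and c.gender = 'M' and c.income = 20000 and c.country = 'UK'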


If you compare this (join_fetch.sql) with the SQL join query (join.sql) generated for the report2 method, they only differ in the attributes of the join relation they return (or project, in relational algebra terminology). Join_fetch.sql returns all the attributes of the join relation, including the orders attributes, while join.sql returns only the customers relation attributes.

Concluding, I have shown how the same information need can result in 107 SQL queries (report1), 4 SQL queries (report2) or a single SQL query (report3). Generally, a large number of SQL queries can cause a significant performance problem, especially if each one of them is slow and no caching is used. Minimizing this number is generally beneficial.

SQL Query Evaluation

In this section I focus on the Postgres evaluation plans for each SQL query referred to in the previous section. The EXPLAIN ANALYZE Postgres command is used to get the plans.
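For example, the plan and actual execution time of the customers selection can be obtained by prefixing an equivalent query with EXPLAIN ANALYZE:
EXPLAIN ANALYZE select * from customers where age = 18 and gender = 'M' and income = 20000 and country = 'UK';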

Report1

The evaluation plan for customer_selection.sql follows.


As one can observe, to evaluate this query Postgres sequentially scans the whole customers table to find the matching customers. Since only 106 out of 2,000,000 customers match, unnecessary I/Os take place to read all the rest from disk, resulting in a 2.1 sec execution time.

Each one of the 106 orders_selection.sql queries is evaluated as follows.

This query is evaluated very fast (i.e. in 0.075 ms) due to a BTree index on the orders.customerid attribute. This index is created by the dvdstore benchmark. Postgres does a Bitmap Index Scan on the index to find the pointers to all the matching tuples and then does a Bitmap Heap Scan on the table, to fetch from disk the pages containing these tuples.

We conclude that customer_selection.sql should be a target for optimization.

Report2

The evaluation plan for join.sql follows.

Postgres accesses the two relations, does a Nested-Loop join and finally removes duplicates with a Hash Aggregate. The query is slow (1.966 sec) and to find out why, we should focus on the access paths used to touch the two relations. For the orders relation Postgres does not touch the table at all. The only attribute it needs to touch is customerid, and the BTree index ix_order_custid is built on that attribute. Therefore, Postgres does an Index Only Scan, which means it accesses only the index, not the table. Since the index is much smaller than the table, fewer I/Os are done. Moreover, the index fits in main memory, so the Nested-Loop join algorithm can use it as the inner relation. This part of the plan is clearly optimal. Let's focus on the other access path, which is a Sequential Scan on the customers table. As I explained for the customer_selection.sql query, this is far from optimal. We keep join.sql as a second optimization target.

After join.sql, 3 orders_selection.sql queries are run. As explained in the section above, these are evaluated very fast.

Report3

The evaluation plan for join_fetch.sql follows.

As one can observe, this plan is very similar to the one produced for join.sql, except for the fact that join_fetch.sql needs to return all the attributes of the orders relation in addition to the customers attributes. So, instead of an Index Only Scan, it does a Bitmap Index Scan on the BTree index and then reads the pages containing the matching orders tuples from disk with a Bitmap Heap Scan. The response time is 2.023 sec, which is unacceptable. We have a third optimization target.

SQL Query Evaluation Optimization

In the above section we found 3 queries with bad execution times, i.e. customer_selection.sql, join.sql and join_fetch.sql. For all of them the bottleneck is the same, i.e. the Sequential Scan on the customers table, which does unnecessary I/Os because all 2,000,000 tuples are read from disk, while only 106 (for customer_selection.sql) or 3 (for join.sql and join_fetch.sql) of them match. Therefore, we have a single optimization task: to create an alternative access path to the customers relation that avoids, as much as possible, reading from disk pages that contain only irrelevant tuples.

The way to do this is to build an index on the customers table on one or more of the attributes for which we have filtering conditions, i.e. age, gender, income, country. But on which of them? To answer this, we need to know the selectivity of each condition, i.e. the percentage of tuples each one matches. As one can easily verify by running count queries (of the form sketched after this list):
  1. age=18 matches 27,303 tuples, i.e. 1.3%,
  2. gender='M' matches 1,000,265 tuples, i.e. 50%,
  3. income=20000 matches 398,642 tuples, i.e. 20%,
  4. country='UK' matches 100,169 tuples, i.e. 5%.
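The counts above can be obtained with queries of the following form, one per condition:
select count(*) from customers where age = 18;
select count(*) from customers where gender = 'M';
select count(*) from customers where income = 20000;
select count(*) from customers where country = 'UK';
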
The lower the percentage of matching tuples, the higher the benefit from an index on the related attribute. In this case, age and country are two good candidates. On the other hand, gender and income are not, as they match too many tuples. It should be stressed that an Index Scan on income with the condition income=20000 would not touch only the 20% of the tuples that match; it would touch all the pages containing at least one matching tuple. Notice that a page typically contains many tuples; how many depends on the tuple size in bytes. Therefore, the total number of tuples read from disk would be much higher than 20%. Notice also that, compared with a Sequential Scan, an Index Scan has the additional cost of fetching the index from disk and following the pointers of the matching tuples to the table pages, let alone the cost of creating and maintaining the index during updates. Therefore, there is a point above which a Sequential Scan is preferred over an Index Scan, and although this point depends on many factors, it is generally estimated that there is no benefit if more than 20% of the tuples match.

Having found age and country as the two candidate attributes for indexing, the next question is which type of index to use. In theory (and in practice for other DBMS like Oracle) a Hash index is best for equality ('=') conditions, while a BTree one is best for range ('<', '>', '<=', '>=') conditions. In practice however, BTrees are always preferred with Postgres, as the Hash index implementation does not perform very well and is not WAL-logged, meaning that in case of a DBMS failure it is lost and you have to rebuild it from scratch.

The final question is whether it is better to build a BTree index on age, on country, or on the (age, country) combination. As previously discussed, an index comes with an overhead to create and maintain it during updates, e.g. the BTree is a balanced tree and may need to re-balance during an insert or a delete. Therefore, it is good practice to build an index that can benefit more than one query. In this demo example we only have 3-4 queries, but in a real app you typically have many. A compound index on the (age, country) combination seems to match our optimization need best, but it can only be used by queries with a condition on age, or on both age and country. It cannot be used for queries with a condition on country only. Whether it is better to create two separate indices, one on age and one on country, or a compound one on (age, country), depends on the query workload of the application. But creating all three indices is out of the question.
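For reference, the two single-attribute alternatives would be created as follows (the index names here are arbitrary):
CREATE INDEX ix_cust_age ON customers USING btree (age);
CREATE INDEX ix_cust_country ON customers USING btree (country);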

I proceed with selecting the compound index for this demo. To create it, run:
CREATE INDEX ix_cust_age_country ON customers USING btree (age, country);

The evaluation plan for customer_selection.sql follows.

As one can observe, the access path has changed from a Sequential Scan to a Bitmap Index Scan on the compound index, and the response time is reduced from 2.1 sec to 4.15 ms.

Due to the same change in the access path, the response time for join.sql and join_fetch.sql is reduced from 1.96 sec to 6.12 ms and from 2.02 sec to 7.76 ms respectively, as shown by the following evaluation plans.
