Saturday, June 13, 2015

Spring Batch with csv4j: The logical choice

In this post I show how Spring Batch can leverage the power of csv4j to initialize the database of a Spring application. In this scenario, CSV input files with incomplete data are hydrated by csv4j into objects of a predefined domain type. Spring Data JPA is then used to map these objects to rows of a database table. For the sake of simplicity, I use the H2 in-memory database, so you don't need to install any database server to run the code. As usual, I let Spring Boot do the orchestration and provide all the configuration in Java classes (strictly no XML files). The code is on GitHub.

What is Spring Batch?

Spring Batch is a framework for batch processing of repetitive, usually heavy, jobs. Each job is split into sequential steps, and each step is decomposed into read-process-write activities. These activities are modeled by the Spring Batch interfaces ItemReader, ItemProcessor and ItemWriter respectively.
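In simplified form, the three interfaces look like this (the real read() method also declares a few more checked exceptions):

    public interface ItemReader<T> {
        T read() throws Exception; // returns null when the input is exhausted
    }

    public interface ItemProcessor<I, O> {
        O process(I item) throws Exception; // maps an instance of I to an instance of O
    }

    public interface ItemWriter<T> {
        void write(List<? extends T> items) throws Exception; // consumes a chunk of items
    }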

This should sound familiar to Java 8 developers, as it has the stream-map-reduce flavor.

ItemReader is the first node in the pipeline. It produces objects of the domain type, to be consumed by the next node in the pipeline in a one-at-a-time fashion. This is analogous to calling list.stream() in Java 8 (given a list of domain objects).

ItemProcessor is the analogue of the mapper Function object passed as an argument to the map(mapper) method, since its process method effectively maps an instance of type I to an instance of type O.

ItemWriter is the final node in the pipeline. It receives the processed objects and consumes them by writing them either to the filesystem or to a database. Although this has some similarity with the Java 8 reduce phase, it differs in that the write method does not return anything, whereas a Java 8 reduce returns an accumulated value; instead, it has the side effect of persisting the objects in some format.
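To make the analogy concrete, here is a plain Java 8 sketch of a read-process-write pipeline (illustrative only, not Spring Batch code):

    import java.util.Arrays;
    import java.util.List;

    public class PipelineAnalogy {
        public static void main(String[] args) {
            List<String> lines = Arrays.asList("1", "2", "3");
            lines.stream()                        // "ItemReader": emits items one at a time
                 .map(Integer::parseInt)          // "ItemProcessor": maps I (String) to O (Integer)
                 .forEach(System.out::println);   // "ItemWriter": consumes items as a side effect
        }
    }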

If you want to learn more about Spring Batch, read the Spring Batch Tutorial.

Use Case

Let's say we have an app whose database schema is as follows.
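Here is a sketch of the schema; the exact column set is my assumption, and the actual schema-all.sql script is in the GitHub repo:

    -- schema-all.sql (sketch)
    DROP TABLE IF EXISTS csvdata;
    CREATE TABLE csvdata (
        id BIGINT NOT NULL PRIMARY KEY,
        field0 VARCHAR(255),
        field1 VARCHAR(255),
        field2 VARCHAR(255)
    );
    -- JPA needs a surrogate id; Hibernate draws its values from this sequence
    CREATE SEQUENCE hibernate_sequence;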

Spring Boot executes any schema SQL script found on the classpath at startup, before any job runs; to be picked up, the scripts have to end with '.sql'. The -all suffix means this script runs against any database platform that may be in use. Alternatively, we could provide a different schema script per database server.

Also, let's say we have the following four input CSV files.
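The actual files are in the GitHub repo; purely as an illustration of their shape (these are not the real contents), two of them might look like this:

    data1.csv (hypothetical):
        field0,field1,field2,irrelevantField
        alpha,beta,,x
        gamma,delta,epsilon,

    data2.csv (hypothetical; note field3 where the other files say field1):
        field0,field3,field2
        zeta,eta,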

Notice how abnormal and incomplete the above data is. We consider that the fields relevant to our app are field0, field1, field2 and field3. Also, we may know that field3 in data2.csv refers to the same entity as field1 in the other files. This sort of dataset is common when you deal with data compiled from different web sources, and it is under this scenario that csv4j shines.

Finally, we need a domain type that maps to both the CSV files (for csv4j to work) and the database schema (for Spring Data JPA to work).

The @CsvFields annotation is used to map a Java field to one or more CSV fields. JPA annotations are used to map the class to the csvdata relation. Notice that JPA requires an id field that carries no business semantics; this is why we had to create the hibernate_sequence in schema-all.sql.
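A sketch of the domain type follows; the class and field names and the exact @CsvFields syntax are assumptions on my part, and the real class is in the repo:

    import javax.persistence.*;

    @Entity
    @Table(name = "csvdata")
    public class CsvData {

        @Id
        @GeneratedValue(strategy = GenerationType.SEQUENCE) // values drawn from hibernate_sequence
        private long id; // required by JPA, carries no business semantics

        @CsvFields({"field0"})
        private String field0;

        @CsvFields({"field1", "field3"}) // field3 in data2.csv means field1
        private String field1;

        @CsvFields({"field2"})
        private String field2;

        protected CsvData() {} // JPA needs a no-arg constructor

        @Override
        public String toString() {
            return String.format("CsvData[%s, %s, %s]", field0, field1, field2);
        }
    }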

Spring Batch Configuration

The whole Spring Batch configuration class can be found on GitHub: BatchConfiguration.java.
Here I focus on the ItemReader, as I don't need an ItemProcessor and I use the standard JpaItemWriter to persist objects in the database.

We make the assumption that the CSV input files are located under the 'data' directory on the classpath. For each input file, csv4j produces a list of domain objects. An Iterator over all the objects derived from all the input files is created and passed to the constructor of IteratorItemReader (one of the ItemReader implementations that Spring Batch provides). And that's all.
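A sketch of the reader wiring; the csv4j calls below (CsvProcessor and its process method) are hypothetical placeholders, so see BatchConfiguration.java in the repo for the real code:

    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;
    import org.springframework.batch.item.ItemReader;
    import org.springframework.batch.item.support.IteratorItemReader;
    import org.springframework.core.io.Resource;
    import org.springframework.core.io.support.PathMatchingResourcePatternResolver;

    @Bean
    public ItemReader<CsvData> reader() throws IOException {
        List<CsvData> items = new ArrayList<>();
        CsvProcessor<CsvData> csv4j = new CsvProcessor<>(CsvData.class); // hypothetical csv4j entry point
        // hydrate every CSV file under data/ into domain objects
        for (Resource csv : new PathMatchingResourcePatternResolver()
                .getResources("classpath:data/*.csv")) {
            items.addAll(csv4j.process(csv.getFile()));
        }
        // IteratorItemReader hands the objects to the step one at a time
        return new IteratorItemReader<>(items.iterator());
    }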

Run and Verify

Spring Batch offers an abstraction for pre- and post-job-completion activities. All you have to do is subclass the JobExecutionListenerSupport class and override the beforeJob and/or afterJob methods. In this example I override afterJob with an implementation that, once the batch job has completed, queries the database and logs all the domain objects found there.
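A sketch of such a listener; CsvDataRepository here is a hypothetical Spring Data JPA repository, and the real listener is in the repo:

    import org.slf4j.Logger;
    import org.slf4j.LoggerFactory;
    import org.springframework.batch.core.BatchStatus;
    import org.springframework.batch.core.JobExecution;
    import org.springframework.batch.core.listener.JobExecutionListenerSupport;
    import org.springframework.beans.factory.annotation.Autowired;
    import org.springframework.stereotype.Component;

    @Component
    public class JobCompletionNotificationListener extends JobExecutionListenerSupport {

        private static final Logger log =
                LoggerFactory.getLogger(JobCompletionNotificationListener.class);

        @Autowired
        private CsvDataRepository repository; // hypothetical Spring Data JPA repository

        @Override
        public void afterJob(JobExecution jobExecution) {
            if (jobExecution.getStatus() == BatchStatus.COMPLETED) {
                log.info("Job finished! Verifying the database contents:");
                for (CsvData item : repository.findAll()) {
                    log.info("Found <{}> in the database.", item);
                }
            }
        }
    }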

Run the app with
  • Maven: mvn spring-boot:run
  • Gradle: gradlew bootRun
and watch the log: once the job completes, the afterJob listener prints every domain object it finds in the database, confirming that the CSV records were persisted.
