A Composite Reader for Batch Processing

N+1 selects are a big deal when it comes to user-facing applications. This is due, in fact, to visible latency introduced by multiple database calls, which are almost always remote.

But what about batch applications?

You may either be able to get away with it, given the larger processing window, or write stored procedures to localize the selects within the database. But what if you had to support multiple databases, and you have a narrow batch window? Maybe you belong to one of those organizations where installing a stored procedure requires prolonged consultations with the dreaded database team. Or just perhaps, you are one of those people (a trained C programmer, maybe?) that revels in extracting every ounce of efficiency from the system. Well, look no further.

Spring Batch is the de-facto framework when it comes to writing batch applications in Java. It has all the design patterns codified and you have pretty much everything available out of the box. You have plain readers, processors, and writers, and if things get more complicated, you have composite processors and writers. Wait, what about composite readers? Well, the reader API deals with retrieving one item at a time.

T read() throws Exception, ...;

I guess this is the common denominator across various data sources like message queues that allow pulling only one message. So, though you might be retrieving a page of rows in the reader select, you only get access to one row at a time.

Enough background, let me give a concrete example of the problem. Say I have these two tables:

Order table with columns (id, order_dt) 
Order_Item table with columns (id, order_id, product_id, quantity, cost)

I need to process orders and their associated items together. Why? Say I want to collapse the product ids of all order items for an order into a single field as a comma-separated String. A join in the paginated reader won't work because there is no guarantee that I will get all the items for the last order on the same page. If I were to read the orders in the reader and the order items in the processor, then I will be making an extra select per order. So, if I had a page size of 100, I would be making 100+1 selects.

select * from Order order by order_dt;
for each id IN Order:
select * from Order_Item where order_id = :id;

If I could get access to the 100 order ids, I could get all the order items for these orders with a single select using an IN clause — that is 2 selects vs 101 per page. When you consider that a batch process could operate on thousands of pages, this is nothing to sniff at.

select * from Order order by order_dt;
select * from Order_Item where order_id in (:ids) order by order_id;

The only way to achieve this is to extend the paging reader, specifically the JdbcPagingItemReader . This reader extends the AbstractPagingItemReader , which contains the following property:

protected volatile List results;

The JdbcPagingItemReader sets the results in this method:

protected void doReadPage()

What we need is to get the results and pass it as input to our custom method, had the above method returned a List<T> instead, extending it would have been clean. Instead, we have to bank on implicit knowledge that doReadPage sets the results property. In the spirit of Spring, let's use the strategy pattern and create an interface for our custom method. I give you the PageProcessor :

public interface PageProcessor<T> {
void process(List<T> page);
}

Putting it all together, we get:

public class CompositeJdbcPagingItemReader<T> extends JdbcPagingItemReader<T> {
private PageProcessor<T> pageProcessor;

public void setPageProcessor(PageProcessor<T> pageProcessor) {
this.pageProcessor = pageProcessor;
}

@Override
public void afterPropertiesSet() throws Exception {
super.afterPropertiesSet();
Assert.notNull(pageProcessor, "Page processor may not be null");
}

@Override
protected void doReadPage() {
super.doReadPage();
if (!results.isEmpty()) {
pageProcessor.process(results);
}
}
}

I have provided a sample Spring Boot project for your reading pleasure. The OrderJob reads orders from a database and writes to a flat-file. The OrderPageProcessor contains the single order_item query for a page of orders, and the logic to collapse the product_ids . You can see that the trade-off for a single query is the additional logic to split the items by order. It uses an embedded h2 database for convenience, so we are not really doing remote database calls. You could easily change it to use your favorite production database and seed it with a lot many rows to compare the benefits of this approach.

In conclusion, efficient batch processing in the middle tier is mostly about reducing remote calls. While you don't find a lot of people clamoring to add this feature to Spring Batch, I have used it in production and gained measurably better throughput. While the above is a made-up example, I have used this approach in various domains like feeding patient information to a rules engine and bulk indexing workflow form content and metadata in Apache Solr. Enough talk. How efficient? Well, YMMV.

But what about batch applications?

Recommend

What’s New in Swift 5.1? [FREE]

技术|精通 Linux 的 ls 命令

Doka.js找到利基市场每月3千美元

Shinguz: Schulung MariaDB/MySQL für Fortgeschrittene im August 2019 in Köln

MySQL/MariaDB: use SQL properly to run less queries

IntelliJ IDEA 2019.2最新解读：性能更好，体验更优，细节处理更完美！

Rustlings: Small exercises to get you used to reading and writing Rust code

istio 庖丁解牛(五) 多集群网格分析

GIEC2019进入倒计时，BAT同台竞逐数字商业

为什么杨紫可以拿到这么多一线资源，而且都能大爆，只是因为她在同龄小花里演技比较好...

About Joyk