I wrote this post in November 2021. At that time, the latest Qlik Replicate release was 2021.5. If you wanted to adapt the Chunk Size in the Salesforce endpoint in Qlik Replicate, you needed to do it as shown here. For later Replicate versions, check if there is new information in the online documentation about Chunk Size optimisation before adjusting it.
This document focuses on the optimisation opportunities that a Salesforce feature, PK Chunking, offers when replicating data with Qlik Replicate. This optimisation is critical when the customer extracts large amounts of data and Qlik Replicate has to compete with other applications for resources on the network, in Salesforce, and so on.
As an example, we will use the high-level architecture below in this post.
Salesforce: PK Chunking
Salesforce has a feature called PK Chunking in their Bulk API. It improves the extraction of large data sets. Qlik Replicate relies on PK Chunking to improve performance.
Salesforce explains in its blog:
«By default the Bulk API will split the query into 100,000 record chunks – you can use the ‘chunkSize’ header field to configure smaller chunks or larger ones up to 250,000. Larger chunk sizes will use up fewer Bulk API batches, but may not perform as well. For each object you are extracting, you might need to experiment a bit to determine the optimal chunk size.»
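To make the quoted ‘chunkSize’ header concrete, below is a minimal sketch of creating a Bulk API (1.0) query job with PK Chunking enabled, using Python's requests library. This is not how Replicate calls the API internally; the instance URL, session ID, and the Account object are placeholders you would replace with your own values.

```python
import requests

# Placeholders: obtain these from a normal Salesforce login (SOAP or OAuth).
INSTANCE_URL = "https://yourInstance.my.salesforce.com"
SESSION_ID = "<session id>"

# A Bulk API (1.0) query job definition for an example object.
job_xml = """<?xml version="1.0" encoding="UTF-8"?>
<jobInfo xmlns="http://www.force.com/2009/06/asyncapi/dataload">
  <operation>query</operation>
  <object>Account</object>
  <contentType>CSV</contentType>
</jobInfo>"""

headers = {
    "X-SFDC-Session": SESSION_ID,
    "Content-Type": "application/xml; charset=UTF-8",
    # Enable PK Chunking and override the 100,000-record default chunk size.
    "Sforce-Enable-PKChunking": "chunkSize=50000",
}

# Create the job; Salesforce then splits the query into PK-bounded chunks.
resp = requests.post(f"{INSTANCE_URL}/services/async/52.0/job",
                     data=job_xml, headers=headers)
resp.raise_for_status()
print(resp.text)  # jobInfo XML containing the new job id
```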
In short: the smaller the chunkSize, the better each chunk tends to perform, but the more Bulk API batches you consume, and therefore the higher the cost of extracting the data from Salesforce.
Note that PK Chunking is not just an optimisation. In the Bulk API 2.0 and Bulk API Developer Guide, v52.0, Salesforce explains that the Bulk API does not work well in some cases with large tables (>= 1 million rows), and it recommends using PK Chunking for those large tables.
«If the query succeeds, Salesforce attempts to retrieve the results. If the results exceed the 1-GB file size limit or take longer than 10 minutes to retrieve, the completed results are cached and another attempt is made. After 15 attempts, the job fails and the error message “Retried more than fifteen times” is returned. In this case, consider using the PK Chunking header to split the query results into smaller chunks. If the attempts succeed, the results are returned and stored for seven days.»
How to leverage PK Chunking with Qlik Replicate
You can adjust chunkSize in Replicate at the endpoint level. By default, the Chunk Size in Replicate is 200k, whereas the chunkSize default value in Salesforce is 100k.
As Salesforce recommends, you should test different chunkSize values for the tables you replicate and find the sweet spot for each of them; a rough test harness for that experiment is sketched below.
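As a starting point for such tests, here is a small timing harness. It is a sketch only: the extract callable is a hypothetical stand-in for whatever you use to pull a full table with a given chunk size (for example, a throwaway Replicate task per setting, or a direct Bulk API query script).

```python
import time
from typing import Callable

def benchmark(table: str, chunk_sizes: list[int],
              extract: Callable[[str, int], int]) -> None:
    """Time a full extraction of `table` once per chunk size and print the results.

    `extract(table, chunk_size)` is a hypothetical callable that performs the
    extraction and returns the number of rows fetched.
    """
    for size in chunk_sizes:
        start = time.perf_counter()
        rows = extract(table, size)
        elapsed = time.perf_counter() - start
        print(f"{table}: chunkSize={size:>7,} -> {rows:,} rows in {elapsed:.1f}s")

# Example: compare the Salesforce default, the Replicate default, and the maximum.
# benchmark("Account", [100_000, 200_000, 250_000], extract=my_bulk_query)
```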
You may need to use one chunk size for some of the tables you replicate from Salesforce and a different one for others. Possible reasons include:
- Achieving better replication performance,
- Reducing or eliminating errors caused by large data packages (e.g., network errors, connection resets, problems for Replicate when closing the Bulk jobs, etc.),
- Limiting the cost of retries: if a task fails while executing (for example, because of a connection reset, a suspended table, or insufficient space in the target), you will need to reload the tables again, which implies calling the Salesforce Bulk API more times.
The diagram below explains how to set up Replicate to use two different Chunk Sizes and optimise the workload. You need to create two Salesforce endpoints, each with a different Chunk Size, and split the tasks between the two endpoints so that each Salesforce object is associated with the Chunk Size that works best for it.
To explain the setup, let’s assume that you must set up the Chunk Size in your configuration as below – again, you must test the best Chunk Size values for your environment:
- Chunk Size = 40k for three tables, and
- Chunk Size = 250k for all the rest.
So you could set up endpoint 1 with Chunk Size = 40k and endpoint 2 with Chunk Size = 250k.
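If it helps to keep track of the split, you can document the mapping alongside your task definitions. The snippet below is only an illustration with made-up endpoint and object names; the actual assignment is done in the Replicate tasks themselves.

```python
# Illustrative only: which endpoint (and thus which Chunk Size) each object uses.
endpoint_chunk_size = {
    "SF_endpoint_40k": 40_000,    # for the three large/problematic objects
    "SF_endpoint_250k": 250_000,  # for everything else
}

tables_on_40k_endpoint = {"ObjectA", "ObjectB", "ObjectC"}  # hypothetical names
# All other replicated objects go through "SF_endpoint_250k".
```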
How to set up the Salesforce endpoints in Replicate
There are several screenshots in this section to guide you through setting up two Salesforce endpoints in Replicate.