Delta load
A delta load can significantly reduce the amount of data that has to be fetched from a source DB during the data model load.
To set up a delta load, Celonis 4 allows partitioning of the input tables. Having split the input tables into partitions, Celonis 4 then only fetches those partitions from the DB that it identified as changed since the last load. For identifying if the data inside partitions changed, Celonis 4 currently provides two strategies: "Timestamp Delta Load" and "Checksum Delta Load". For each table in the data model, the activation and selection of the partitioning strategy can be done in the table configuration of the data model editor:
The "Delta Load table" checkbox is available if the table is loaded from a database. As soon as it is checked, the "Delta Load" tab in the menu bar on the top appears:
In the Delta Load configuration tab, one of the available strategies can be chosen:
Timestamp Delta Load
This strategy partitions a table along a timestamp column. When a data model load is triggered in Celonis 4, this strategy checks for each partition if the number of entries, the maximum value or both changed. The retrieved values are called checksums and compared to the values from the previous load. If a checksum changes, the corresponding partition is reloaded from the source database.
Pro | Con |
---|---|
+ Easy to configure |
|
+ Applicable in many cases |
|
Example
In the below example, the column "Eventtime" is used as Timestamp Delta Load column. Here, the partitioning is based on month and we use the count of timestamps for the checksum.
First load
We first load this table from the source DB.
During the first data model load, checksum labels are created in the background (not visible for the user). The checksum here is the count of rows per month.
Second load
Now the input data on the source DB changed with an additional entry.
When we now trigger the data model load again, the checksums in the background (not visible for the user) are recalculated. The count per month changes for the month October from 2 to 3. The checksum for November stays the same as no value for November was added.
Now, only the data of the table that is in the month October is updated in Celonis, the data for which the Eventtime is in November is not updated. Even if the activity name in November changed in the source DB from "C" to "F", this value would not be updated with the defined Delta Load because the row count did not change.
Checksum delta load
To be more flexible, the checksum strategy allows you to define yourself which partitions are reloaded. The strategy relies on a separate checksum table. In this table you define the partitions and its checksums.
If you want a partition from the source table to be reloaded during the data model load, you only have to change the checksum value from the partition table. A data model reload will only load those partitions for which the checksum value of the partition key changed. The exact value of the checksum is not important.
Example
In this example the source table is partitioned by country code:
Source table | ||
Customer | City | Country |
---|---|---|
Customer A | Bristol | GB |
Customer B | Paris | FR |
Co ustumer C | London | GB |
Partition table | |
Partition key | Checksum |
---|---|
GB | 1 |
FR | 1 |
If in this example the checksum for FR is set to 2 and a reload from source is triggered, all customers from France are reloaded. The customers from Great Britain will not be updated.
The partition table is not managed by Celonis 4 and has to be maintained by the user. Also it is noteworthy that partitions which are not mentioned in the partition table are not fetched from the source database and are therefore not available in Celonis 4. In the example above, customers from Spain would not be loaded because there is no partition key ES in the partition table.
Pro | Con |
---|---|
+ Very flexible |
|
Queries sent to the source database
In Celonis 4.6, two queries are executed against the source database for loading a table. The first query is used to identify the partitions which have changed. The second query is used to load the changed partitions from the database.
Identify changed partitions
The query to identify the changed partitions depends on the chosen strategy.
Checksum:
SELECT $CHECKSUM_TABLE.$CHECKSUM_COLUMN, $CHECKSUM_TABLE.$KEY_COLUMN FROM $CHECKSUM_TABLE
The query fetches the KEY and the CHECKSUM column from the CHECKSUM_TABLE. It checks for each partition/KEY if the checksum has changed. To be able to perform this check, the checksums of the last load are stored on disk.
Timestamp:
SELECT <ROUNDED_TIMESTAMP>, <CHECKSUM_AGGREGATOR> FROM $SOURCE_TABLE GROUP BY <ROUNDED_TIMESTAMP> <ROUNDED_TIMESTAMP> = [EXTRACT(MINUTE FROM $TIMESTAMP_COLUMN) | EXTRACT(MINUTE FROM $TIMESTAMP_COLUMN) || EXTRACT(HOUR FROM $TIMESTAMP_COLUMN) | EXTRACT(MINUTE FROM $TIMESTAMP_COLUMN) || EXTRACT(HOUR FROM $TIMESTAMP_COLUMN) || EXTRACT(DAY FROM $TIMESTAMP:COLUMN) | ...] // This is database specific <CHECKSUM_AGGREGATOR> = [COUNT($TIMESTAMP_COLUMN) | MAX($TIMESTAMP_COLUMN) | COUNT($TIMESTAMP_COLUMN) || MAX($TIMESTAMP_COLUMN)]
This query groups the SOURCE_TABLE by rounding the timestamps. It calculates a checksum by using MAX , COUNT, or both combined. The part for rounding the timestamps is database specific because there is no way which works on all databases. The shown example uses the syntax which works on most databases, but there is for example a special syntax for SAP HANA in place.
An example for such a query is:
SELECT EXTRACT(MINUTE FROM Activity_Table.Timestamp) || EXTRACT(HOUR FROM Activity_Table.Timestamp), COUNT(Activity_Table.Timestamp) FROM Activity_Table GROUP BY EXTRACT(MINUTE FROM Activity_Table.Timestamp) || EXTRACT(HOUR FROM Activity_Table.Timestamp)
Load changed partitions
After the changed partitions are identified, the delta load executes a query to load them. All changed partitions are loaded in by one query:
SELECT $SOURCE_TABLE.$SOURCE_COLUMN1, $SOURCE_TABLE.$SOURCE_COLUMN2, ... FROM $SOURCE_TABLE WHERE $SOURCE_TABLE.$KEY_COLUMN IN($PARTITION_KEY1, $PARTITION_KEY2, ...) ORDER BY $SOURCE_TABLE.$KEY_COLUMN
The load is filtered down to the changed partitions by the IN clause. The result needs to be sorted so that Celonis 4 can process it.
Comparison against standard load
A standard load executes a single query, which looks like this:
SELECT $SOURCE_TABLE.$SOURCE_COLUMN1, $SOURCE_TABLE.$SOURCE_COLUMN2, ... FROM $SOURCE_TABLE WHERE $SOURCE_TABLE.$KEY_COLUMN
This query is simpler and also no additional query to identify the changed partitions is necessary. It is crucial for the performance of the delta load that the query optimizer of the source database is able to create a query plan which efficiently makes use of the IN where clause. If that is not the case, the delta load can take longer then a standard load even though less data is loaded from the source database. Especially for complex views, it can be hard for the query optimizer to make the right decisions. Therefore in case of a bad performance of the delta load, we advise consulting your database expert.