Historical loads, pipeline concurrency and data checkpoints

Historical loads

During the life of a data pipeline, you may need to run historical loads for several reasons, including:

  • initial load of the data
  • backdated data is added to the system that needs to be loaded
  • destination data gets corrupted and the entire data needs to be synced again with the source system
  • new fields are added to the destination dataset, requiring a complete load of historical data

Additional context and decision points for a historical data load (say for the last 24 months):

  • some source applications don't let you fetch more than 30/x days of data at a time. In such a case you have to submit 24/y separate pipeline runs with different date parameter values and monitor the execution of each individual run.

  • some source applications let you fetch the entire data in one go, but then you must ensure that the pipeline doesn't time out or stall for any reason; otherwise you will have to rerun the entire process.

  • choose whether the multiple pipeline executions (24/y) should run concurrently or sequentially.

  • ensure that your historical data loads do not interfere with the scheduled pipeline runs.

  • check that your source system is geared to handle a large number of execution requests in a short period of time.
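As a concrete illustration of the first point, a 24-month backfill against a source that caps each request at one month of data can be split into one run per calendar month. The sketch below is illustrative only; `month_windows` and the run-submission step are hypothetical, not part of ELT Data's API.

```python
from datetime import date, timedelta

def month_windows(end: date, months: int):
    """Split a backfill of `months` months ending at `end` into
    one (start, end) date window per calendar month, oldest first."""
    windows = []
    cursor = end
    for _ in range(months):
        first_of_month = cursor.replace(day=1)
        windows.append((first_of_month, cursor))
        cursor = first_of_month - timedelta(days=1)  # step into previous month
    return list(reversed(windows))

# One pipeline run per window, so no single request exceeds the
# source application's per-call date-range limit.
for start, end in month_windows(date(2024, 12, 31), 24):
    print(f"submit pipeline run for {start} .. {end}")  # hypothetical submission step
```

Each window becomes one parameterized run to submit and monitor, which is exactly the 24/y-run bookkeeping described above.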

Historical loads are resource-intensive processes. In addition to proper resource allocation, historical loads require careful planning and thorough monitoring.

Pipeline concurrency

If not properly managed, concurrent runs of a pipeline with different parameters can corrupt your data. This can cause continuity issues or downtime in your operations and reporting.

By default, ELT Data runs pipelines with a concurrency of 1, i.e. only one instance of a given pipeline runs at any time. ELT Data allows concurrent runs (concurrency > 1) of the same pipeline under supervision.

Data checkpoints and rollback

Since data and business requirements are dynamic, destination data may occasionally become corrupted during a data load. To address this, ELT Data creates automated checkpoints to which users can roll back the warehouse tables as required. This enables business continuity (with some data delay), version control, and data debugging to diagnose and fix the destination data.
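Conceptually, a checkpoint is a versioned snapshot of a table taken before a risky load, and a rollback restores the last good version. The toy class below illustrates the idea with in-memory rows; the class name and methods are hypothetical and stand in for warehouse-level snapshots.

```python
import copy

class CheckpointedTable:
    """Illustrative only: keeps snapshots of a table's rows so a bad
    load can be rolled back to an earlier, known-good version."""

    def __init__(self, rows=None):
        self.rows = list(rows or [])
        self._checkpoints = []          # one snapshot per version

    def checkpoint(self):
        """Snapshot the current rows; returns a version id."""
        self._checkpoints.append(copy.deepcopy(self.rows))
        return len(self._checkpoints) - 1

    def rollback(self, version):
        """Restore the rows saved at `version`."""
        self.rows = copy.deepcopy(self._checkpoints[version])

table = CheckpointedTable([{"id": 1, "amount": 100}])
v0 = table.checkpoint()                          # automated checkpoint before the load
table.rows.append({"id": 2, "amount": -999})     # a load that corrupts the data
table.rollback(v0)                               # restore the last good state
```

Keeping multiple versions is what makes the debugging workflow possible: you can diff the corrupted version against a checkpoint to diagnose the bad load before re-running it.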