Skip to main content

Handling schema evolution in ELT Data

Data schema evolution in Extract, Load, Transform (ELT) pipelines is a critical aspect of managing data as it grows and changes over time. Source systems and business requirements are dynamic leading to schema updates. Consider the following:

  • Source systems - Assume your source application added or removed a field from the API. This shouldn't cause your pipelines to come to a stall.

  • Business requirements - Your business wants to report on new KPIs and requests you to add a few more columns to the data pipelines.

Both the above scenarios are extremely common during the lifecycly of data pipliens and handling schema updates is necessary for operational continuity and data integrity.

Schema evolution in ELT Data

ELT Data automatically evolves your schema as the data or the business requirements change. Consider the following scenarios:

1. Adding new columns:

ELT Data adds the new columns to your dataset. For the existing rows, the data is set to null.

  • Consider the following original data:
order_idcustomer_idorder_datetotal_amount
11012024-07-01100.00
21022024-07-02150.00
  • New data:
order_idcustomer_idorder_datecustomer_emailtotal_amount
31032024-07-03unknown@example.com200.00
  • Final dataset: New column added from the new data, col value for old data set to null.
order_idcustomer_idorder_datecustomer_emailtotal_amount
11012024-07-01null100.00
21022024-07-02null150.00
31032024-07-03unknown@example.com200.00

2. Dropping columns

ELT Data will not drop the column from your dataset but will set the value to null for the rows where the column is missing.

  • Consider the following original data:
order_idcustomer_idorder_datecustomer_emailtotal_amount
11012024-07-01unknown@example.com100.00
21022024-07-02unknown@example.com150.00
  • New data drops "total_amount" column
order_idcustomer_idorder_datecustomer_email
31032024-07-03unknown@example.com
41042024-07-04customer@domain.com
  • Final dataset
order_idcustomer_idorder_datecustomer_emailtotal_amount
11012024-07-01unknown@example.com100.00
21022024-07-02unknown@example.com150.00
31032024-07-03unknown@example.comnull
41042024-07-04customer@domain.comnull

3. Renaming Columns

ELT Data will handle the renaming of the columns in the following method:

  • Consider the following original data:
order_idcustomer_idorder_datecustomer_emailtotal_amount
11012024-07-01unknown@example.com100.00
21022024-07-02unknown@example.com150.00
  • New data renames "total_amount" to "sales_amount" column
order_idcustomer_idorder_datecustomer_emailsales_amount
31032024-07-03unknown@example.com200.00
41042024-07-04customer@domain.com250.00
  • Final dataset
order_idcustomer_idorder_datecustomer_emailtotal_amountsales_amount
11012024-07-01unknown@example.com100.00null
21022024-07-02unknown@example.com150.00null
31032024-07-03unknown@example.comnull200.00
41042024-07-04customer@domain.comnull250.00

4. Changing Data Types

! ELT Data will raise an error and stop the pipeline in case of changing data types.

Changing data types can cause further downstream chaos and we require manual intervention to resolve the changing data types.

Documenting the schema evolution

  • Version controlled schema: ELT Data automatically documents each schema update and you can view how the schema has evolved over a period of time.

  • Rollback schema changes: In addition, if you want to roll back to a previous version of the data (prior to schema evolution), ELT Data enables you to do that.

  • Monitoring schema changes: ELT Data alerts you in case of schema updates.