Handling schema evolution in ELT Data

July 23, 2024

Data schema evolution in Extract, Load, Transform (ELT) pipelines is a critical aspect of managing data as it grows and changes over time. Source systems and business requirements are dynamic leading to schema updates. Consider the following:

Source systems - Assume your source application added or removed a field from the API. This shouldn't cause your pipelines to come to a stall.
Business requirements - Your business wants to report on new KPIs and requests you to add a few more columns to the data pipelines.

Both the above scenarios are extremely common during the lifecycly of data pipliens and handling schema updates is necessary for operational continuity and data integrity.

Schema evolution in ELT Data

ELT Data automatically evolves your schema as the data or the business requirements change. Consider the following scenarios:

1. Adding new columns:

ELT Data adds the new columns to your dataset. For the existing rows, the data is set to null.

Consider the following original data:

order_id	customer_id	order_date	total_amount
1	101	2024-07-01	100.00
2	102	2024-07-02	150.00

New data:

order_id	customer_id	order_date	customer_email	total_amount
3	103	2024-07-03	unknown@example.com	200.00

Final dataset: New column added from the new data, col value for old data set to null.

order_id	customer_id	order_date	customer_email	total_amount
1	101	2024-07-01	null	100.00
2	102	2024-07-02	null	150.00
3	103	2024-07-03	unknown@example.com	200.00

2. Dropping columns

ELT Data will not drop the column from your dataset but will set the value to null for the rows where the column is missing.

Consider the following original data:

order_id	customer_id	order_date	customer_email	total_amount
1	101	2024-07-01	unknown@example.com	100.00
2	102	2024-07-02	unknown@example.com	150.00

New data drops "total_amount" column

order_id	customer_id	order_date	customer_email
3	103	2024-07-03	unknown@example.com
4	104	2024-07-04	customer@domain.com

Final dataset

order_id	customer_id	order_date	customer_email	total_amount
1	101	2024-07-01	unknown@example.com	100.00
2	102	2024-07-02	unknown@example.com	150.00
3	103	2024-07-03	unknown@example.com	null
4	104	2024-07-04	customer@domain.com	null

3. Renaming Columns

ELT Data will handle the renaming of the columns in the following method:

Consider the following original data:

order_id	customer_id	order_date	customer_email	total_amount
1	101	2024-07-01	unknown@example.com	100.00
2	102	2024-07-02	unknown@example.com	150.00

New data renames "total_amount" to "sales_amount" column

order_id	customer_id	order_date	customer_email	sales_amount
3	103	2024-07-03	unknown@example.com	200.00
4	104	2024-07-04	customer@domain.com	250.00

Final dataset

order_id	customer_id	order_date	customer_email	total_amount	sales_amount
1	101	2024-07-01	unknown@example.com	100.00	null
2	102	2024-07-02	unknown@example.com	150.00	null
3	103	2024-07-03	unknown@example.com	null	200.00
4	104	2024-07-04	customer@domain.com	null	250.00

4. Changing Data Types

! ELT Data will raise an error and stop the pipeline in case of changing data types.

Changing data types can cause further downstream chaos and we require manual intervention to resolve the changing data types.

Documenting the schema evolution

Version controlled schema: ELT Data automatically documents each schema update and you can view how the schema has evolved over a period of time.
Rollback schema changes: In addition, if you want to roll back to a previous version of the data (prior to schema evolution), ELT Data enables you to do that.
Monitoring schema changes: ELT Data alerts you in case of schema updates.

Schema evolution in ELT Data​

1. Adding new columns:​

2. Dropping columns​

3. Renaming Columns​

4. Changing Data Types​

Documenting the schema evolution​