Data schema evolution in Extract, Load, Transform (ELT) pipelines is a critical aspect of managing data as it grows and changes over time. Source systems and business requirements are dynamic leading to schema updates. Consider the following:
Source systems - Assume your source application added or removed a field from the API. This shouldn't cause your pipelines to come to a stall.
Business requirements - Your business wants to report on new KPIs and requests you to add a few more columns to the data pipelines.
Both the above scenarios are extremely common during the lifecycly of data pipliens and handling schema updates is necessary for operational continuity and data integrity.
Schema evolution in ELT Data
ELT Data automatically evolves your schema as the data or the business requirements change. Consider the following scenarios:
1. Adding new columns:
ELT Data adds the new columns to your dataset. For the existing rows, the data is set to null.
- Consider the following original data:
order_id | customer_id | order_date | total_amount |
---|---|---|---|
1 | 101 | 2024-07-01 | 100.00 |
2 | 102 | 2024-07-02 | 150.00 |
- New data:
order_id | customer_id | order_date | customer_email | total_amount |
---|---|---|---|---|
3 | 103 | 2024-07-03 | unknown@example.com | 200.00 |
- Final dataset: New column added from the new data, col value for old data set to null.
order_id | customer_id | order_date | customer_email | total_amount |
---|---|---|---|---|
1 | 101 | 2024-07-01 | null | 100.00 |
2 | 102 | 2024-07-02 | null | 150.00 |
3 | 103 | 2024-07-03 | unknown@example.com | 200.00 |
2. Dropping columns
ELT Data will not drop the column from your dataset but will set the value to null for the rows where the column is missing.
- Consider the following original data:
order_id | customer_id | order_date | customer_email | total_amount |
---|---|---|---|---|
1 | 101 | 2024-07-01 | unknown@example.com | 100.00 |
2 | 102 | 2024-07-02 | unknown@example.com | 150.00 |
- New data drops "total_amount" column
order_id | customer_id | order_date | customer_email |
---|---|---|---|
3 | 103 | 2024-07-03 | unknown@example.com |
4 | 104 | 2024-07-04 | customer@domain.com |
- Final dataset
order_id | customer_id | order_date | customer_email | total_amount |
---|---|---|---|---|
1 | 101 | 2024-07-01 | unknown@example.com | 100.00 |
2 | 102 | 2024-07-02 | unknown@example.com | 150.00 |
3 | 103 | 2024-07-03 | unknown@example.com | null |
4 | 104 | 2024-07-04 | customer@domain.com | null |
3. Renaming Columns
ELT Data will handle the renaming of the columns in the following method:
- Consider the following original data:
order_id | customer_id | order_date | customer_email | total_amount |
---|---|---|---|---|
1 | 101 | 2024-07-01 | unknown@example.com | 100.00 |
2 | 102 | 2024-07-02 | unknown@example.com | 150.00 |
- New data renames "total_amount" to "sales_amount" column
order_id | customer_id | order_date | customer_email | sales_amount |
---|---|---|---|---|
3 | 103 | 2024-07-03 | unknown@example.com | 200.00 |
4 | 104 | 2024-07-04 | customer@domain.com | 250.00 |
- Final dataset
order_id | customer_id | order_date | customer_email | total_amount | sales_amount |
---|---|---|---|---|---|
1 | 101 | 2024-07-01 | unknown@example.com | 100.00 | null |
2 | 102 | 2024-07-02 | unknown@example.com | 150.00 | null |
3 | 103 | 2024-07-03 | unknown@example.com | null | 200.00 |
4 | 104 | 2024-07-04 | customer@domain.com | null | 250.00 |
4. Changing Data Types
! ELT Data will raise an error and stop the pipeline in case of changing data types.
Changing data types can cause further downstream chaos and we require manual intervention to resolve the changing data types.
Documenting the schema evolution
Version controlled schema: ELT Data automatically documents each schema update and you can view how the schema has evolved over a period of time.
Rollback schema changes: In addition, if you want to roll back to a previous version of the data (prior to schema evolution), ELT Data enables you to do that.
Monitoring schema changes: ELT Data alerts you in case of schema updates.