Data dedupe
When ELT Data fetches your application data, it carries out the following transformations:
- Dedupes data, i.e. eliminates duplicate copies of data
- Flattens data - the API response might be a nested or a complex JSON. ELT Data flattens the JSON to make it easier for users to consume.
ELT Data supports the following data dedupe strategies:
- Full Refresh
- Full Refresh and Append
- Incremental Dedupe
- Incremental Drop and Load
Based on the data and your business requirement, you can select one of these dedupe strategies for each API or table. The strategies are detailed below.
Full Refresh
A full refresh data strategy is a data synchronization approach where the entire dataset is completely refreshed or reloaded from the source system into the destination system on a regular basis.
This strategy is often used when the source data changes infrequently or when it's more efficient to reload the entire dataset rather than tracking individual changes.
Benefits of full refresh data strategy
Simplicity: It's a straightforward approach where the entire dataset is replaced, making it easier to manage and troubleshoot data synchronization processes.
Data Consistency: The reporting database is guaranteed to be consistent with the source data after each refresh, eliminating the need to track and apply incremental updates.
Considerations
Data Volume: Full refreshes can be resource-intensive, so it's essential to consider the volume of data being processed and the impact on system resources.
Frequency: This strategy is most suitable for data sources that change infrequently, as frequent full refreshes can lead to unnecessary resource usage.
Historical Data: You'll need to decide whether to retain historical data in the reporting database or replace it with each refresh.
Example
- [Initial State, Data in DB] Consider the following data in database before full refresh deduplication.
1. Name: John Doe, Email: john@example.com, Phone: 123-456-7890
2. Name: Jane Smith, Email: jane@example.com, Phone: 987-654-3210
3. Name: Alice Johnson, Email: alice@example.com, Phone: 555-555-5555
4. Name: John Doe, Email: john@example.com, Phone: 123-456-7890 (Duplicate)
5. Name: Bob Johnson, Email: bob@example.com, Phone: 111-222-3333
- [New Data] Following is the incoming data:
1. Name: John Doe, Email: john@example.com, Phone: 123-456-7890
2. Name: Jane Smith, Email: jane@example.com, Phone: 987-654-3210
3. Name: Alice Johnson, Email: alice@example.com, Phone: 555-555-5555
4. Name: Bob Johnson, Email: bob@example.com, Phone: 111-222-3333
- [Final State, Data in DB] Final state of the DB after the full refresh dedupe
1. Name: John Doe, Email: john@example.com, Phone: 123-456-7890
2. Name: Jane Smith, Email: jane@example.com, Phone: 987-654-3210
3. Name: Alice Johnson, Email: alice@example.com, Phone: 555-555-5555
4. Name: Bob Johnson, Email: bob@example.com, Phone: 111-222-3333
Full Refresh And Append
This is same as Full Refresh data strategy.
The main difference between a Full Refresh data strategy and a Full Refresh And Append data strategy is in how they handle existing data. Full refresh replaces the entire dataset, while full refresh with append retains existing data and appends the new data. The choice between these strategies depends on your specific data management needs and whether you need to maintain historical data while keeping the dataset up to date.
Example
- [Initial State, Data in DB] Consider the following data in database before full refresh and append deduplication.
1. Name: John Doe, Email: john@example.com, Phone: 123-456-7890, ingestion_date: 2022-01-01
2. Name: Jane Smith, Email: jane@example.com, Phone: 987-654-3210, ingestion_date: 2022-01-01
3. Name: Alice Johnson, Email: alice@example.com, Phone: 555-555-5555, ingestion_date: 2022-01-01
4. Name: John Doe, Email: john@example.com, Phone: 123-456-7890 (Duplicate), ingestion_date: 2022-01-01
5. Name: Bob Johnson, Email: bob@example.com, Phone: 111-222-3333, ingestion_date: 2022-01-01
- [New Data] Following is the incoming data:
1. Name: John Doe, Email: john@example.com, Phone: 123-456-7890, ingestion_date: 2022-01-02
2. Name: Jane Smith, Email: jane@example.com, Phone: 987-654-3210, ingestion_date: 2022-01-02
3. Name: Alice Johnson, Email: alice@example.com, Phone: 555-555-5555, ingestion_date: 2022-01-02
4. Name: Bob Johnson, Email: bob@example.com, Phone: 111-222-3333, ingestion_date: 2022-01-02
- [Final State, Data in DB] Final state of the DB after the full refresh dedupe and append
1. Name: John Doe, Email: john@example.com, Phone: 123-456-7890, ingestion_date: 2022-01-01
2. Name: Jane Smith, Email: jane@example.com, Phone: 987-654-3210, ingestion_date: 2022-01-01
3. Name: Alice Johnson, Email: alice@example.com, Phone: 555-555-5555, ingestion_date: 2022-01-01
4. Name: John Doe, Email: john@example.com, Phone: 123-456-7890 (Duplicate), ingestion_date: 2022-01-01
5. Name: Bob Johnson, Email: bob@example.com, Phone: 111-222-3333, ingestion_date: 2022-01-01
6. Name: John Doe, Email: john@example.com, Phone: 123-456-7890, ingestion_date: 2022-01-02
7. Name: Jane Smith, Email: jane@example.com, Phone: 987-654-3210, ingestion_date: 2022-01-02
8. Name: Alice Johnson, Email: alice@example.com, Phone: 555-555-5555, ingestion_date: 2022-01-02
9. Name: Bob Johnson, Email: bob@example.com, Phone: 111-222-3333, ingestion_date: 2022-01-02
Incremental Dedupe
Incremental deduplication is a strategy for identifying and eliminating duplicate records from a dataset by focusing only on the new or changed data that has been added since the last deduplication process. This approach is useful for minimizing processing overhead. To run incremental dedupe the following fields are required:
- Unique Key - In order to identify if a record is new or existing, i.e. should a record be inserted or updated in teh DB
- Sort Key / Cursor - If we get a duplicate record, sort key helps us identify its latest version.
Example
Suppose you maintain a dataset of customer records for your business, and you want to perform incremental deduplication.
- Initial Data Deduplication: Initially, you perform a full deduplication of the data from the source application:
Initial Deduplicated Dataset:
- Customer ID: 1, Name: John Smith, Created Date: 2023-01-15, Last Modified Date: 2023-02-20
- Customer ID: 2, Name: Jane Doe, Created Date: 2023-02-10, Last Modified Date: 2023-03-25
- Customer ID: 3, Name: Alice Johnson, Created Date: 2023-03-05, Last Modified Date: 2023-04-15
New Customer Records and Updates: Over time, you receive new customer records and updates to existing records. Let's consider the following changes:
- New Customer: "Bob Johnson" is added with "Created Date" as 2023-04-01.
- Existing Customer Update: "John Smith" has some updated information, and the "Last Modified Date" is changed to 2023-04-05.
- New Customer: "David Lee" is added with "Created Date" as 2023-04-10.
- Customer ID: 4, Name: Bob Johnson, Created Date: 2023-04-01, Last Modified Date: 2023-04-01
- Customer ID: 1, Name: John Smith, Created Date: 2023-01-15, Last Modified Date: 2023-04-01
- Customer ID: 5, Name: David Lee, Created Date: 2023-04-01, Last Modified Date: 2023-04-01
- Incremental Dedupe Process: Perform incremental deduplication based on the "Customer ID" and "Last Modified Date."
For each new or updated record in the incoming dataset, compare it to existing records and:
- If no record with the same Customer ID (4) exists in the "existing records", then insert the record in the existing dataset.
- If a record with the same Customer ID (1) exists then check the latest record using the last modified date. Using the latest record update the ID=1 in the "existing records".
[Final dataset after dedupe]
- Customer ID: 2, Name: Jane Doe, Created Date: 2023-02-10, Last Modified Date: 2023-03-25 (Retained from the last fetch)
- Customer ID: 3, Name: Alice Johnson, Created Date: 2023-03-05, Last Modified Date: 2023-04-15 (Retained from the last fetch)
- Customer ID: 4, Name: Bob Johnson, Created Date: 2023-04-01, Last Modified Date: 2023-04-01 (Inserted record)
- Customer ID: 1, Name: John Smith, Created Date: 2023-01-15, Last Modified Date: 2023-04-01 (Updated record)
- Customer ID: 5, Name: David Lee, Created Date: 2023-04-01, Last Modified Date: 2023-04-01 (Inserted record)
Deleted Records
The deletes in incremental mode can be handled only by using Soft deleted approach, i.e. the rows have a marker indicating that the record has been deleted and is not in use anymore.
- Customer ID: 1, Name: John Smith, Status: Active
- Customer ID: 2, Name: Jane Doe, Status: Deleted
- Customer ID: 3, Name: Alice Johnson, Status: Active
Hard Delete - If a record is permanently delete from the source system, then it will not appear in the next data fetch. In this case it will continue appearing in the database. To accomodate hard deletes, occasionally run a FULL REFRESH of the dataset or consider the Incremental Drop And Load data strategy.
Incremental Drop and Load
Consider the following data scenarios:
- Records are being deleted from your dataset.
- You don't have a unique key or a sort key in your dataset to identify the latest records.
Incremental drop and load works in the following way:
- Pick up the sort key and drop all the values for the specific sort key in the existing dataset.
- Append the new dataset to the existing dataset.