
Concepts

Source

Source is the system from which you want to ingest data. The system might make your data available via an API, email reports, or an SFTP folder.

Destination

Destination is the system where the ingested data will be written. By default, ELT Data writes data to your cloud storage - S3, Azure Blob Storage, and Google Cloud Storage. ELT Data can also write the same data to your databases such as Postgres, MySQL, Azure SQL, etc.

Integration

A connection between the source system and the destination is defined as an integration. For example, Netsuite -> Azure SQL can be an integration. An integration only establishes a connection between your source and destination; copying data is covered in the next section (Data Pipeline).

Data Pipeline

A data pipeline is a component that copies specific data from the source to the destination using an integration. For example: copy GL Details to Azure SQL using my Netsuite -> Azure SQL integration.

Components of a data pipeline

Data Dedupe

While ingesting data from the source to the destination, ELT Data makes sure that no data is duplicated. To ensure the uniqueness of data records, ELT Data implements several dedupe strategies, including Full Refresh and Incremental (a short sketch follows the list). These strategies are:

  • Full Refresh - Sync all records from the source and replace data in the destination by overwriting it.
  • Full Refresh And Append - Sync all records from the source and add them to the destination without deleting any data.
  • Incremental Dedupe - Sync new records from the source and add them to the destination without deleting any data.
  • Incremental Drop and Load - Sync new records from the source where unique keys are not available.
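
As a rough illustration only (a generic pandas sketch, not ELT Data's internal implementation), the first three strategies can be pictured as follows, assuming an existing destination table and a batch of newly synced records:

  import pandas as pd

  # Hypothetical data: what already sits in the destination and what was just synced
  destination = pd.DataFrame({"employee_id": [1, 2], "salary": [100, 200]})
  new_records = pd.DataFrame({"employee_id": [2, 3], "salary": [250, 300]})

  # Full Refresh: the destination is replaced with the latest sync
  full_refresh = new_records.copy()

  # Full Refresh And Append: every sync is appended, nothing is deleted
  full_refresh_append = pd.concat([destination, new_records], ignore_index=True)

  # Incremental Dedupe: new records are appended, then duplicates on the
  # unique key (employee_id) are resolved so each key appears only once
  incremental_dedupe = (
      pd.concat([destination, new_records], ignore_index=True)
        .drop_duplicates(subset="employee_id", keep="last")
  )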

Unique Keys

The columns in your dataset that define a record as unique. For example: employee_id, transaction_id, etc.

info

Unique keys define the uniqueness of the data.

Sort Keys

The columns in your dataset that identify the latest record among records sharing the same unique key. For example: last_modified_date. While running dedupe, the last_modified_date column can be used to pick the latest record out of two records having the same employee_id.

info

Sort keys define the recency of the data.
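
For example, a minimal pandas sketch (hypothetical data, not ELT Data's exact logic) of how a sort key picks the latest record between two rows sharing the same unique key:

  import pandas as pd

  records = pd.DataFrame({
      "employee_id": [101, 101],
      "salary": [100, 120],
      "last_modified_date": ["2024-01-01", "2024-03-01"],
  })

  # Sort by the sort key so the most recent row comes last, then keep that row
  latest = (
      records.sort_values("last_modified_date")
             .drop_duplicates(subset="employee_id", keep="last")
  )
  print(latest)  # keeps the row with last_modified_date 2024-03-01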

Pipeline Variables

The dynamic input to be passed to your pipelines. For example, to run a pipeline with yesterday's date as an input, we would define a pipeline variable with the following value:

(datetime.date.today() - datetime.timedelta(days=1)).strftime('%m/%d/%Y')

info

Pipeline variables are Python functions, and for dates you can use the datetime library.
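
For reference, the same expression run as standalone Python needs only the datetime import:

  import datetime

  # Yesterday's date formatted as MM/DD/YYYY
  yesterday = (datetime.date.today() - datetime.timedelta(days=1)).strftime('%m/%d/%Y')
  print(yesterday)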

Pipeline Schedule

This defines the schedule at which the pipeline will run. The pipelines can be run on demand as well.

info

The schedules are time zone and daylight saving time aware.
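
To illustrate what this means in practice (using Python's zoneinfo for the illustration, not ELT Data's scheduler itself), a run scheduled for 09:00 in America/New_York resolves to different UTC offsets across the year:

  from datetime import datetime
  from zoneinfo import ZoneInfo  # Python 3.9+

  tz = ZoneInfo("America/New_York")

  # The same wall-clock time falls on different UTC offsets across DST
  winter_run = datetime(2024, 1, 15, 9, 0, tzinfo=tz)
  summer_run = datetime(2024, 7, 15, 9, 0, tzinfo=tz)

  print(winter_run.strftime("%Y-%m-%d %H:%M %Z%z"))  # 2024-01-15 09:00 EST-0500
  print(summer_run.strftime("%Y-%m-%d %H:%M %Z%z"))  # 2024-07-15 09:00 EDT-0400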

Data Flattening / Normalization

ELT Data flattens the ingested data. If the response is JSON, it is converted into a flat table to make it easy to consume.
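
For instance, a nested JSON record can be flattened into dot-separated columns. The sketch below uses pandas and hypothetical data, not necessarily ELT Data's internal mechanism:

  import pandas as pd

  # A nested JSON response (hypothetical shape)
  response = [
      {"employee_id": 101,
       "name": {"first": "Ada", "last": "Lovelace"},
       "address": {"city": "London"}},
  ]

  # json_normalize turns nested objects into dot-separated flat columns
  flat = pd.json_normalize(response)
  print(flat.columns.tolist())
  # ['employee_id', 'name.first', 'name.last', 'address.city']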

Data Schema Management

ELT Data automatically manages the schema of the data. If a new field is added to the response or report, ELT Data will automatically include it in the final dataset. These schema changes are tracked and can be used to monitor schema evolution.
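
As a rough sketch of the idea (hypothetical column names, not ELT Data's implementation), a new field can be detected by comparing incoming columns against the known schema:

  import pandas as pd

  known_schema = {"employee_id", "name"}
  incoming = pd.DataFrame({"employee_id": [2], "name": ["B"], "department": ["Sales"]})

  # Any column not seen before is a schema change worth tracking
  new_fields = set(incoming.columns) - known_schema
  print(new_fields)  # {'department'}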

Data Backload/Historical load

ELT Data provides features to help you load historical data. While loading historical data, you can define how much data should be loaded in a single API call; ELT Data will internally generate multiple API calls to load data for the entire date range. For example, you can choose to load the last 2 years of data, 1 month at a time. Internally, ELT Data will generate 24 executions and load the data.

The amount of data that can be loaded in a single API call is defined by the limitations imposed by the source application. For example, an API call may not be able to return more than 50,000 records, or the response size may be capped at 5 MB.
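
A minimal sketch of the chunking idea behind this (generic Python with assumed monthly windows, not ELT Data's actual scheduler):

  import datetime

  def monthly_windows(start: datetime.date, end: datetime.date):
      # Split [start, end) into month-long (window_start, window_end) pairs
      current = start
      while current < end:
          # Jump to the first day of the next month
          next_month = (current.replace(day=1) + datetime.timedelta(days=32)).replace(day=1)
          yield current, min(next_month, end)
          current = next_month

  # Load the last 2 years of data, 1 month at a time -> 24 executions
  start = datetime.date(2022, 1, 1)
  end = datetime.date(2024, 1, 1)
  print(len(list(monthly_windows(start, end))))  # 24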

Data QA

ELT Data runs automated test cases on your pipelines and raises alerts if the test cases fail. ELT Data runs the following test cases on all the incoming data (a rough sketch follows the list):

  • Uniqueness test: Check if all the primary keys are unique
  • Not null test: Check if primary keys do not contain a null value
  • Data freshness test: Check if the data is getting refreshed on schedule.
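
These checks can be expressed roughly as follows (a generic pandas sketch with hypothetical sample data, assuming employee_id is the primary key and last_modified_date drives freshness):

  import pandas as pd

  data = pd.DataFrame({
      "employee_id": [1, 2, 3],
      "last_modified_date": pd.to_datetime(["2024-06-01", "2024-06-02", "2024-06-02"]),
  })

  # Uniqueness test: check that all primary keys are unique
  print("uniqueness:", data["employee_id"].is_unique)

  # Not null test: check that primary keys contain no null values
  print("not null:", data["employee_id"].notna().all())

  # Data freshness test: check that the newest record falls within the refresh window
  age = pd.Timestamp.now() - data["last_modified_date"].max()
  print("freshness:", age <= pd.Timedelta(days=1))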

Data Documentation

ELT Data creates automated documentation for all your pipelines. The documentation is automatically kept in sync with your pipelines.

Data Rollback

ELT Data provides a way to roll back to a previous state of the data. With ELT Data you can also discover how the data has evolved.

Pipeline Retries and Failure Notifications

In case of a failure, three retry attempts are made; if they all fail, the pipeline is marked as failed and the users are notified.
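
A simplified sketch of this behavior (run_pipeline and notify_users are hypothetical callables, and the backoff between attempts is an assumption, not documented ELT Data behavior):

  import time

  MAX_RETRIES = 3

  def run_with_retries(run_pipeline, notify_users):
      # Attempt the run up to MAX_RETRIES times before marking it as failed
      for attempt in range(1, MAX_RETRIES + 1):
          try:
              return run_pipeline()
          except Exception as error:
              if attempt == MAX_RETRIES:
                  notify_users(f"pipeline failed after {attempt} attempts: {error}")
                  raise
              time.sleep(2 ** attempt)  # brief pause before the next attempt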

danger

In case of pipeline failures, the failed runs need to be re-run manually.

Pipeline Concurrency

Concurrency is defined as the number of concurrent runs of a single data pipeline. By default, the pipeline concurrency is set to 1 to avoid race conditions.