

The content hub includes guides and case studies on data ingestion, going deep into the topic from a business, product and technology point of view.


Data organization in ELT Data

In ELT Data we organize data using the classic three-layer strategy, a pattern now commonly referred to as the Medallion architecture.

Medallion Architecture

Three layers for data organization:

Bronze layer: Raw data

This is where the raw data lands from APIs, emails and SFTP servers. The data is stored in the form in which it is received from the sources and can be in any format (JSON, XML, CSV, XLSX). The data in this layer should be used with the most caution, as it can contain duplicates, inconsistencies and errors.

The data in this layer can be used for debugging and traceability in case any errors are reported in the downstream layers.

The technical expertise required to handle this layer is high.

Silver layer: Clean data

This is where data is read from the Bronze layer, cleaned, flattened, schema-checked, deduped and checked for basic quality. The data in this layer is usually stored in Parquet or Delta format. The data in this layer is well suited for consumption by data analysts and data analysis tools.

The technical expertise required to handle the data in this layer is moderate.

Gold layer: Business data

The data in this layer is read from the Silver layer and transformed as per your business requirements. For example, you may want to join the data coming from Netsuite and ServiceTitan, or perform a data aggregation. The data format in this layer can be anything that your business requires (Excel, CSV, Parquet).

You can have business rules in this layer to test the data for accuracy of KPIs and Metrics. The data in this layer is most suited for your reporting tools or reporting requirements.

The technical expertise required to handle the data in this layer is moderate.
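To make the three layers concrete, here is a minimal sketch of a bronze → silver → gold flow using pandas. The file paths, column names and the aggregation are illustrative assumptions, not a prescribed implementation.

```python
import pandas as pd

# Bronze: raw API/CSV/JSON payloads are stored exactly as received.
bronze = pd.read_json("bronze/invoices_2024-07-01.json")  # hypothetical landing file

# Silver: flatten, type-cast, dedupe and run basic quality checks, then store as Parquet.
silver = (
    bronze
    .drop_duplicates(subset=["invoice_id"])            # dedupe on the primary key
    .astype({"total_amount": "float64"})                # enforce a schema
    .dropna(subset=["invoice_id", "invoice_date"])      # basic quality check
)
silver.to_parquet("silver/invoices.parquet", index=False)

# Gold: apply business logic, e.g. revenue per branch, in the format reporting needs.
gold = silver.groupby("branch_id", as_index=False)["total_amount"].sum()
gold.to_csv("gold/revenue_per_branch.csv", index=False)
```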

Databases and SQL Query Engines

The data for the above three layers is created and stored in cloud storage (AWS S3, Azure Blob Storage or Google Cloud Storage).

With the data available in the Silver and Gold layers for consumption, why do you still need a database? Note that the data in cloud storage is kept in the form of files, and you still need a query engine to run analysis on it. This is why the data from the Silver or Gold layer needs to either be pushed to a database, or be queried through tools that offer a querying mechanism directly on top of those layers.

Examples of databases: Snowflake, Azure SQL, Postgres. Examples of tools operating directly on your data in storage: the SQL analytics endpoint in Microsoft Fabric (internally it, too, is a warehouse!).
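As an illustration of querying files in place, here is a minimal sketch using DuckDB. The bucket path and column names are assumptions, and S3 credentials would need to be configured separately.

```python
import duckdb

con = duckdb.connect()
con.execute("INSTALL httpfs")  # extension needed for s3:// paths
con.execute("LOAD httpfs")

# Query Silver-layer Parquet files in place (hypothetical bucket/prefix).
df = con.execute("""
    SELECT branch_id, SUM(total_amount) AS revenue
    FROM read_parquet('s3://my-lakehouse/silver/invoices/*.parquet')
    GROUP BY branch_id
""").df()
print(df)
```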

Other ways of organizing the data

Medallion is not the only option; there are other data organization architectures too, e.g. Lambda, Kappa, Data Marts and Data Mesh. Your choice of architecture is dictated by your use case.

At ELT Data we find that the Medallion architecture serves our customers' use cases best. In addition, Medallion is easy to understand, defines the ownership of each layer cleanly, makes it easier to establish lineage, and is generic enough to support a broad set of use cases.

API Authentication methods

APIs are used for machine-to-machine data exchange. When a machine tries to access data from an application, it needs to identify itself. Machines can identify themselves in broadly one of the following ways:

  • API Key
  • Basic Authentication
  • OAuth2 Authentication

API Key

The source application provides a static API key, and the caller has to present this key while fetching the information. Leaking an API key is equivalent to leaking credentials.

Basic Authentication

In Basic Authentication, the caller needs to present the username and password to access the data.

OAuth2 Authentication

In OAuth2, the caller first generates a temporary access token and then passes it to the application to fetch the data. This is one of the most secure methods of fetching data.

OAuth2 is a standard spec, and security can be further enhanced with additional properties (refer to High Security OAuth in the OAuth2 spec).

In addition to security, OAuth2 also allows fine-grained access control over your resources.
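A rough sketch of what the three methods look like from the caller's side with Python's requests library; the URLs, key names, credentials and scopes are placeholders.

```python
import requests

BASE = "https://api.example.com"  # placeholder API

# 1. API key: a static key, usually sent as a header or query parameter.
r = requests.get(f"{BASE}/orders", headers={"X-API-Key": "my-static-key"})

# 2. Basic Authentication: username and password on every call (always over HTTPS).
r = requests.get(f"{BASE}/orders", auth=("username", "password"))

# 3. OAuth2 (client credentials flow): exchange client id/secret for a short-lived
#    access token, then present the token on each data call.
token_resp = requests.post(
    f"{BASE}/oauth/token",
    data={
        "grant_type": "client_credentials",
        "client_id": "my-client-id",
        "client_secret": "my-client-secret",
        "scope": "orders.read",
    },
)
access_token = token_resp.json()["access_token"]
r = requests.get(f"{BASE}/orders", headers={"Authorization": f"Bearer {access_token}"})
```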

Comparison of Authentication types

| Authentication Method | Pros | Cons |
|---|---|---|
| API Key | Simple to implement and use; minimal overhead, fast performance; compatible with most HTTP clients and servers | Less secure, keys can be easily shared or exposed; difficult to provide fine-grained access control and permissions; managing and revoking API keys can be cumbersome |
| Basic Authentication | Simple to implement and understand; supported by most HTTP clients and servers | Transmits credentials (username and password) in Base64 encoding, which is not secure unless over HTTPS; does not provide a way to handle token expiration or revocation; not suitable for complex, multi-user systems |
| OAuth2 | Highly secure, supports granular access control with scopes; provides token expiration and revocation mechanisms; suitable for third-party integrations and delegated access | Complex to implement and requires more setup; additional overhead due to token exchange process; requires client and server to manage tokens and refresh tokens |

Which Auth method should I use?

Depending on your application, one or more auth methods might be available. Our preference, in order of security, is:

  • OAuth2
  • Basic Auth
  • API Key

Custom Authentication

We have often seen applications implement their own authentication methods, or be only partially compliant with the OAuth2 spec. With custom auth, users need to be clear about the security implications and the best practices for such methods.

Data pipelines provided by SaaS applications

The ETL/ELT providers in the market offer syncing of data between business applications. Each provider has its own USP; broadly, the providers vary along the dimensions of pricing, security, number of connectors, services, and added support for the tools that cover the rest of the workflows in the data ecosystem.

However, most of the above dimensions would become meaningless if SaaS applications started providing data pipelines of their own. What's stopping SaaS owners from offering data pipelines to sync their business data with the customer's cloud? It would be a new revenue stream and would increase user stickiness. In addition, data pipelines provided by the app owners would be more robust, and users would have a better experience than with the pipelines provided by the ELT providers.

Of late we have seen the larger players move in this direction, with Stripe and Salesforce being the most prominent examples.

Why don't application owners provide data pipelines as a service?

We can think of the following reasons why an app owner might not be interested in providing data pipelines as a service:

  1. Data pipelines are brittle and it takes a lot to keep them running and consistent - retries, failures, historical loads, data quality and schema checks, etc. can suddenly increase support costs and distract the owners from their core offering.

  2. ELT pipelines are not a core business - for mid-scale apps, the product and engineering investment required to provide ELT pipelines might not be worth it, and they would rather focus on their core product.

  3. Incomplete offering and a delayed/extended sale without the rest of the connectors - let's assume users want to join Stripe and Netsuite data. The Stripe pipelines are of no use without the Netsuite data coming in.

  4. Varied infrastructure stacks and customer requirements - customers might have their infra hosted on AWS, Azure or GCP. In addition, users can request output in various formats (CSV, Parquet, Delta, etc.) and in different destinations (Azure SQL, Redshift, BigQuery, Fabric Lakehouse, etc.).

  5. Added technical debt and complexity - if ETL pipelines are added to the product suite, core product rollouts can become slow and difficult. In addition, in the case of breaking changes, the app owners would have to make sure their ETL customers are well supported.

What's next

The larger products, backed by resources and a large customer base, might take the same path as Stripe and Salesforce. For them the investment might be worth it, and the data might flow back into their ecosystem (Salesforce + Tableau, Google Analytics + Google Cloud + BigQuery + Looker). However, for smaller applications, providing ELT services is still a formidable challenge.

Historical loads

During the life of a data pipeline, you may need to run historical loads for different reasons, including:

  • initial load of the data
  • backdated data is added to the system that needs to be loaded
  • destination data gets corrupted and the entire data needs to be synced again with the source system
  • new fields are added to the destination dataset, requiring a complete load of historical data

Additional context and decision points for a historical data load (say for the last 24 months):

  • some source applications don't let you fetch more than a fixed window (say 30 days) of data at a time. In such a case you would have to submit roughly 24 separate pipeline runs with different date parameter values and monitor the execution of all the individual runs (see the sketch at the end of this section).

  • some source applications let you fetch the entire dataset in one go, but then you have to ensure that the pipeline doesn't time out or stall for any reason, or else you will have to rerun the entire process.

  • choose whether the multiple pipeline executions should run concurrently or sequentially.

  • ensure that your historical data loads do not interfere with the scheduled pipeline runs.

  • check that your source system is geared to handle a large number of execution requests in a short period of time.

Historical loads are resource-intensive processes. In addition to proper resource allocation, historical loads require planning and thorough monitoring.
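A minimal sketch of chunking a 24-month backfill into monthly windows and submitting one run per window. run_pipeline is a hypothetical function standing in for your orchestrator's trigger call.

```python
from datetime import date
from dateutil.relativedelta import relativedelta


def month_windows(months_back: int, today: date = date.today()):
    """Yield (start, end) date windows, one per month, going back `months_back` months."""
    for i in range(months_back, 0, -1):
        start = (today - relativedelta(months=i)).replace(day=1)
        end = start + relativedelta(months=1) - relativedelta(days=1)
        yield start, end


def run_pipeline(start: date, end: date):
    # Hypothetical: submit one pipeline run for this window via your orchestrator's API.
    print(f"Submitting run for {start} .. {end}")


# Submit 24 runs, one per month. Run them sequentially (concurrency = 1) unless the
# source API and the destination are known to handle parallel loads safely.
for start, end in month_windows(24):
    run_pipeline(start, end)
```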

Pipeline concurrency

If not properly managed, concurrent runs of a pipeline with different parameters can corrupt your data. This can cause continuity issues or downtime in your operations and reporting.

By default, ELT Data runs pipelines with a concurrency of 1, i.e. only one instance of a pipeline will run at any given time. ELT Data allows concurrent runs (concurrency > 1) of the same pipeline under supervision.
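One simple way to enforce a concurrency of 1 is an exclusive lock around the run. The sketch below uses a local lock file and is only an illustration of the idea; in practice the scheduler or orchestrator normally handles this.

```python
import fcntl  # POSIX-only advisory file locking


def run_exclusive(lock_path: str, pipeline):
    """Run `pipeline` only if no other instance currently holds the lock."""
    with open(lock_path, "w") as lock_file:
        try:
            fcntl.flock(lock_file, fcntl.LOCK_EX | fcntl.LOCK_NB)  # non-blocking acquire
        except BlockingIOError:
            print("Another instance is already running; skipping this run.")
            return
        pipeline()


run_exclusive("/tmp/orders_pipeline.lock", lambda: print("running pipeline ..."))
```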

Data checkpoints and rollback

Since data and business requirements are dynamic, the destination data may occasionally get corrupted during a data load. To address this, ELT Data creates automated checkpoints to which users can roll back the warehouse tables as required. This enables business continuity (with some data delay), version control and data debugging to diagnose and fix the destination data.
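If your warehouse tables are stored in a versioned table format such as Delta Lake, a checkpoint/rollback can be as simple as restoring a previous table version. The sketch below assumes delta-spark is installed and a Delta-enabled Spark session; the table path and version number are hypothetical, and this is an illustration of the idea rather than ELT Data's internal mechanism.

```python
from delta.tables import DeltaTable
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()  # assumes Delta Lake is configured on the session

table = DeltaTable.forPath(spark, "s3://my-lakehouse/silver/invoices")  # hypothetical path

# Inspect the table history to pick a known-good checkpoint ...
table.history().select("version", "timestamp", "operation").show()

# ... and roll the table back to that version.
table.restoreToVersion(12)  # hypothetical version number
```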

Data schema evolution in Extract, Load, Transform (ELT) pipelines is a critical aspect of managing data as it grows and changes over time. Source systems and business requirements are dynamic, leading to schema updates. Consider the following:

  • Source systems - Assume your source application added or removed a field from its API. This shouldn't bring your pipelines to a stall.

  • Business requirements - Your business wants to report on new KPIs and requests you to add a few more columns to the data pipelines.

Both of the above scenarios are extremely common during the lifecycle of data pipelines, and handling schema updates is necessary for operational continuity and data integrity.

Schema evolution in ELT Data

ELT Data automatically evolves your schema as the data or the business requirements change. Consider the following scenarios:

1. Adding new columns:

ELT Data adds the new columns to your dataset. For the existing rows, the data is set to null.

  • Consider the following original data:

| order_id | customer_id | order_date | total_amount |
|---|---|---|---|
| 1 | 101 | 2024-07-01 | 100.00 |
| 2 | 102 | 2024-07-02 | 150.00 |

  • New data:

| order_id | customer_id | order_date | customer_email | total_amount |
|---|---|---|---|---|
| 3 | 103 | 2024-07-03 | unknown@example.com | 200.00 |

  • Final dataset: the new column is added from the new data, and its value for the old rows is set to null.

| order_id | customer_id | order_date | customer_email | total_amount |
|---|---|---|---|---|
| 1 | 101 | 2024-07-01 | null | 100.00 |
| 2 | 102 | 2024-07-02 | null | 150.00 |
| 3 | 103 | 2024-07-03 | unknown@example.com | 200.00 |

2. Dropping columns

ELT Data will not drop the column from your dataset but will set the value to null for the rows where the column is missing.

  • Consider the following original data:

| order_id | customer_id | order_date | customer_email | total_amount |
|---|---|---|---|---|
| 1 | 101 | 2024-07-01 | unknown@example.com | 100.00 |
| 2 | 102 | 2024-07-02 | unknown@example.com | 150.00 |

  • New data drops the "total_amount" column:

| order_id | customer_id | order_date | customer_email |
|---|---|---|---|
| 3 | 103 | 2024-07-03 | unknown@example.com |
| 4 | 104 | 2024-07-04 | customer@domain.com |

  • Final dataset:

| order_id | customer_id | order_date | customer_email | total_amount |
|---|---|---|---|---|
| 1 | 101 | 2024-07-01 | unknown@example.com | 100.00 |
| 2 | 102 | 2024-07-02 | unknown@example.com | 150.00 |
| 3 | 103 | 2024-07-03 | unknown@example.com | null |
| 4 | 104 | 2024-07-04 | customer@domain.com | null |

3. Renaming Columns

ELT Data handles the renaming of columns as follows:

  • Consider the following original data:

| order_id | customer_id | order_date | customer_email | total_amount |
|---|---|---|---|---|
| 1 | 101 | 2024-07-01 | unknown@example.com | 100.00 |
| 2 | 102 | 2024-07-02 | unknown@example.com | 150.00 |

  • New data renames the "total_amount" column to "sales_amount":

| order_id | customer_id | order_date | customer_email | sales_amount |
|---|---|---|---|---|
| 3 | 103 | 2024-07-03 | unknown@example.com | 200.00 |
| 4 | 104 | 2024-07-04 | customer@domain.com | 250.00 |

  • Final dataset:

| order_id | customer_id | order_date | customer_email | total_amount | sales_amount |
|---|---|---|---|---|---|
| 1 | 101 | 2024-07-01 | unknown@example.com | 100.00 | null |
| 2 | 102 | 2024-07-02 | unknown@example.com | 150.00 | null |
| 3 | 103 | 2024-07-03 | unknown@example.com | null | 200.00 |
| 4 | 104 | 2024-07-04 | customer@domain.com | null | 250.00 |

4. Changing Data Types

caution

ELT Data will raise an error and stop the pipeline if a column's data type changes.

Changing data types can cause chaos downstream, so we require manual intervention to resolve data type changes.
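A minimal pandas sketch of the add/drop/rename behaviour described above: columns are unioned across the old and new batches, and missing values become null. This is an illustration of the behaviour, not ELT Data's internal implementation.

```python
import pandas as pd

existing = pd.DataFrame({
    "order_id": [1, 2],
    "customer_id": [101, 102],
    "order_date": ["2024-07-01", "2024-07-02"],
    "total_amount": [100.00, 150.00],
})

# New batch renames total_amount -> sales_amount and adds customer_email.
new_batch = pd.DataFrame({
    "order_id": [3],
    "customer_id": [103],
    "order_date": ["2024-07-03"],
    "customer_email": ["unknown@example.com"],
    "sales_amount": [200.00],
})

# Union of columns: added/renamed columns are kept, missing values become NaN (null).
evolved = pd.concat([existing, new_batch], ignore_index=True)
print(evolved)

# A data type change, by contrast, should fail loudly rather than be silently coerced.
if existing["total_amount"].dtype != evolved["total_amount"].dtype:
    raise TypeError("Data type of total_amount changed; manual intervention required.")
```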

Documenting the schema evolution

  • Version controlled schema: ELT Data automatically documents each schema update and you can view how the schema has evolved over a period of time.

  • Rollback schema changes: In addition, if you want to roll back to a previous version of the data (prior to schema evolution), ELT Data enables you to do that.

  • Monitoring schema changes: ELT Data alerts you in case of schema updates.

Problem statement

Let's assume we are running a field services business and we need to identify the revenue generated per branch. Our financial information is managed in Netsuite and we use ServiceTitan to manage operations.

Reporting requirements

In order to accomplish the above:

  • We need to fetch the revenue data (invoices) from Netsuite, then identify which Netsuite APIs constitute the invoices data.

  • We need to fetch the operations data per branch (jobs) from ServiceTitan, then identify which ServiceTitan APIs constitute the jobs data.

  • Find the common link between Netsuite invoices and ServiceTitan jobs so that revenue can be tied to branch operations.

  • Clearly identify the definition of "revenue". Do we exclude certain invoices or skip certain branches?

  • Clearly identify how you want to report on the data - daily, weekly or multiple times a day.

  • Finalize the report design (look and feel) and identify who the report should be shared with. Executives should be able to see all branches, while individual branch managers should be restricted to their own branches.

Teams involved in implementation

The above activities can be broadly split across the following team members:

  • Business User - The user driving the reporting requirements.

  • Product Manager - Bridge between the technology and business user.

  • Data Engineers - Responsible for fetching the data from Netsuite, ServiceTitan and for joining the data from the two.

  • BI Developer - Responsible for developing the BI reports and sharing them with the correct audience.

  • Optional - You may have to interact with Netsuite and ServiceTitan experts to map the correct APIs.

Tools required to implement the above

  • Product Manager - Project Management tools

  • Data Engineers -

    • ETL tools for fetching the data (Airbyte, Fivetran, ELT Data, ADF, AWS Glue, etc.) and for making sure the data is not duplicated.
    • Transformation tools to join the Netsuite and ServiceTitan data. Commonly used tools - Spark, ADF Dataflows, etc.
    • Orchestration tools - To schedule, manage and coordinate the data ingestion and transformation processes. Commonly used tools - Airflow, Prefect, Dagster, etc.
  • BI Developer - BI Tools for data visualization. Commonly used tools - Power BI, Tableau, Qlik, etc.

Total costs

The total cost of running the above reporting setup includes people, technology and cloud costs. The annual cost of keeping the reporting up can easily exceed $50K.

Overview

In this article we look at all the stages of the data journey, from raw input data to analysis and output, and briefly explore the process flows at each stage.

We classify the data journey from source application to reporting output into the following stages:

  1. Data Ingestion - Get raw data from the source application
  2. Data Transformation - Apply business logic / rules to convert the data into a form that supports business analysis
  3. Data Analysis / Reporting - Analyze and present the transformed data in a manner that it can be consumed by the end-user

Integration options

Source - https://a16z.com/emerging-architectures-for-modern-data-infrastructure/

Data Ingestion

This stage deals with reading data from source applications and moving it to a central storage location. For example, ingest accounting data from NetSuite and store it in AWS S3. While doing this, ELT Data ensures that:

  1. Raw data is available in a consistent and reliable manner.
  2. A copy of deduped data is made in addition to the raw data.
  3. Input data meets basic quality parameters, e.g., no null records, fresh data check, primary key check.
  4. The data to be ingested is standardized and documented.

The above steps ensure input data quality and freshness and do not involve business logic. This stage is implemented by the data engineering team of ELT Data.

Popular data connectors include ELT Data, Fivetran and Airbyte.
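As an illustration of the basic quality checks in step 3 above, here is a minimal sketch with pandas; the file path, column names and freshness threshold are assumptions.

```python
import pandas as pd


def basic_quality_checks(df: pd.DataFrame, key: str, date_col: str, max_lag_days: int = 2):
    """Raise if the ingested batch fails basic quality checks."""
    # No null records in critical columns.
    if df[[key, date_col]].isna().any().any():
        raise ValueError("Null values found in key or date columns.")

    # Primary key check: no duplicate keys.
    if df[key].duplicated().any():
        raise ValueError(f"Duplicate values found in primary key '{key}'.")

    # Freshness check: the newest record should be recent enough.
    newest = pd.to_datetime(df[date_col]).max()
    if (pd.Timestamp.now() - newest).days > max_lag_days:
        raise ValueError(f"Data is stale: newest record is {newest}.")


df = pd.read_parquet("bronze/netsuite_invoices.parquet")  # hypothetical landing file
basic_quality_checks(df, key="invoice_id", date_col="created_at")
```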

Data Transformation

In this stage we transform the ingested raw data into business data, focusing on:

  1. Transforming the input data into the form required by the business.
  2. Defining relationships between the entities in the data warehouse.
  3. Writing data checks on business KPIs. E.g. sales this month > $10,000 OR NumofJobPerTechnician > 3
  4. Documenting and writing the business-transformed data to a warehouse table for further consumption.
info

While running data transformations, it is essential to ensure that the definitions of business KPIs are consistent across pipelines. For example, we may write one pipeline for sales by month and another for sales per employee per month. It is important that we keep the definition of sales consistent across the two transformation phases.

This stage is jointly implemented by the ELT Data engineering team (which provides the platform to run transformations) and the client's business analysts (who define the business rules and transformations).

Popular data transformation tools include DBT and Apache Spark.
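A small sketch of the KPI checks mentioned in step 3 of this stage; the table path, thresholds and column names are assumptions.

```python
import pandas as pd

sales = pd.read_parquet("gold/sales_by_month.parquet")  # hypothetical Gold-layer table
current = sales[sales["month"] == "2024-07"]

# Business-rule checks on KPIs; failures should alert the owners rather than pass silently.
assert current["sales_amount"].sum() > 10_000, "Sales this month below expected threshold"
assert (current["jobs_per_technician"] >= 3).all(), "Jobs per technician below expected threshold"
```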

Data Analysis / Reporting

The transformed data is finally used for business analysis and reporting. This stage involves presenting the business-transformed data to different end users in forms they can consume. For example, financial reports for management often differ from those required by the operations team, even though the underlying data for both is the same.

This stage is implemented by BI developers and consultants in the client organization.

Popular analytics and reporting tools include Power BI, Tableau, R and Python.

Supporting Stages/Tools

To ensure that data processes including the above three stages run in a consistent, predictable and manageable manner, ELT Data uses numerous supporting technologies and stages. In this section we briefly describe the key tools and the process stages in which they are used.

Workflow managers

Workflow managers are critical to managing complex data pipelines. Their primary functions include scheduling, automation, task dependency management, monitoring and logging, scalability, error handling and retries.

Popular workflow management tools include Airflow and Prefect.

Data quality checks

To make sure that our data pipelines are running and fetching data as expected, we can add tooling that runs data quality checks as the data is fetched and transformed.

Popular data quality check tools include Great Expectations and AccelData.

Data discovery

As the data and the number of people working on it grow, keeping track of all the data assets becomes an issue. To keep our datasets fresh and documented, we can add tools that help with data organization and discovery.

Popular tools to aid discovery include Amundsen and Atlan.

Data governance

With increasing data volume and consumption, we may need to define access controls on the data assets. To help us with this, we can add tools around governance.

info

Based on the maturity and needs of your organization, you can decide which of the supporting stages and tools to add.

Problem statement

Continuing from the data integration article, we explore the different mediums through which data can be fetched from an application.

Integration options

Data Fetch options

Depending on the application, it will share data with you in one of the following ways:

  1. API
  2. Database
  3. Webhook
  4. Email reports
  5. Data in SFTP/One Drive/Google Drive/S3
  6. Custom scripts in customers' app environment

The above sequence reflects our order of preference for a data fetch medium; the pros and cons of each medium are described below.

Pros and Cons

| Data Fetch Method | Pros | Cons |
|---|---|---|
| API | Standardized format. High level of automation. Decoupled from the underlying application changes. Users can control when, what and how much data to fetch. | Dependency on external service availability. More technical and less business-user friendly. Requires API integration and maintenance. |
| Database | Direct access to raw data. High performance for complex queries. Can be highly secure if properly managed. Users can control when, what and how much data to fetch. | Tightly coupled with the underlying application changes. Potential security risks if not properly secured. |
| Webhook | Real-time data delivery. Can be configured for specific events/data. | Requires endpoint setup and maintenance. Limited by the capabilities of the webhook provider. Handling high volumes of data can be challenging. No way of fetching historical records. The data is pushed and the user cannot control when to fetch data. |
| Email Reports | Simple to set up and use. Familiar format for most users. No special infrastructure required. | Data is not real-time. Limited automation potential. Handling and processing email data is cumbersome. Challenges with historical load or backload of the data. |
| Data in SFTP/OneDrive/Google Drive/S3 | Easy to set up file sharing and storage. Accessible from multiple locations. Providers offer robust security features. | Not suitable for real-time data needs. Manual process to update and maintain files. Data format can vary, requiring additional processing. |
| Custom Scripts in Customer's App Environment | Highly customizable to specific needs. Direct integration with the app environment. Can be optimized for performance. | Requires development and maintenance resources. Potential security risks if not properly managed. Dependent on the customer's infrastructure. |
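As a small example of the API option (our first preference), here is a sketch of a paginated fetch; the endpoint, key name and pagination parameters are placeholders, and real APIs vary.

```python
import requests


def fetch_all(base_url: str, api_key: str, page_size: int = 100):
    """Fetch all records from a paginated endpoint (placeholder pagination scheme)."""
    records, page = [], 1
    while True:
        resp = requests.get(
            f"{base_url}/orders",
            headers={"X-API-Key": api_key},
            params={"page": page, "page_size": page_size},
            timeout=30,
        )
        resp.raise_for_status()
        batch = resp.json().get("results", [])
        if not batch:
            break
        records.extend(batch)
        page += 1
    return records


orders = fetch_all("https://api.example.com", api_key="my-static-key")
```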

Problem statement

Let's assume you are a Product Manager at a fintech organization. You are speaking to one of your clients and you request their general ledger and sales data, which you need to feed into your product.

Your client specifies that the data is in Netsuite and that you are free to take it from there. However, before committing to any timelines, you need answers on the data, the technology and the business use case.

The following sections document the steps you will need to take to accomplish the integration:

Integration outline

The integration activity can be split into the following sections:

  • Data discovery - Identify the entities in the source that constitute the general ledger and sales data.
  • Data pipelines - Fetch the data from the source system.
  • Data integration - Model/transform the data so that it works with your product.

1. Data discovery

[Ownership - Product Manager and Business Analyst]

In data discovery you identify the data elements that will constitute the data you require.

Netsuite exposes 800+ data entities.

Can you identify the tables required to get general ledger and sales data?

In order to work through the Netsuite records, you need to answer the following:

  • Are you familiar with the application, or will you need to ask your customer to bring in a subject matter expert or external consultant for data discovery and other product-related questions?
  • Which entities, APIs or DB tables will constitute the complete GL details and the sales ledger?
  • Can you define the unique records in the above dataset? This is used to dedupe the data so that only unique records are kept.
  • What should be the frequency of ingesting the data, and does the source application support the desired data refresh frequency?
tip

It is always preferable to ask the customer for sample data from production.

It helps tremendously with data discovery and validation.

2. Data pipelines

[Ownership - Product Manager, Data Engineering Team]

In this stage we need to gather requirements for writing the data pipelines to fetch the data from the source app. The tech team will need the answers to the following to get started:

  • Which version of Netsuite/SAP/Macola is being used? Is the version cloud-based or on-prem?
  • How is the data exposed by the application - via APIs, databases, emails, or SFTP folders?
  • Does the customer have the right licenses to share the data via one of the above-defined mediums?
  • Which entities/tables/APIs in the source application will constitute the sales data and general ledger?
  • How often do you want to sync the data from the application to your system? Every 5 mins or 24 hours?
  • Do you have any estimate of the data volume?
  • How will you define the unique records when fetching the data so that we don't duplicate the records?
  • Once ingested how will you verify/reconcile the data against the source application?
  • How much data do you want to retain in the system and does the customer have any security/data compliance requirements?
tip

Here's a simple flowchart you can use to get basic answers about data ingestion.

Integration options

** #1, #2, ... indicate our order of preference for the medium through which data is exposed.

info

Pros and Cons of data exposed via different mediums listed in this article.

3. Data integration / modelling / transformation

[Ownership - Product Manager, App Software Team, Data Engineering Team]

With stages #1 and #2 we have identified and decided what data we want to fetch and how we will fetch it.

In this stage we need to transform the data so that it has the data structure and metrics required by your application.

For example, you might want to club the finance data (general ledger) with the ops data and then push the merged data into your application, from where your users can consume the combined ops + finance data (a minimal sketch follows the list below).

The business specific transformations will be carried out in this stage.

  • The Data Engineering Team will help you with merging the finance and the ops data.
  • The App Software Team will pick up the merged data and push it into the application.
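A minimal sketch of the merge described above, assuming both datasets share a hypothetical branch_id key and live in hypothetical Silver-layer paths:

```python
import pandas as pd

general_ledger = pd.read_parquet("silver/netsuite_general_ledger.parquet")  # hypothetical path
jobs = pd.read_parquet("silver/servicetitan_jobs.parquet")                   # hypothetical path

# Join finance and operations data on the common key before pushing it to the app.
ops_finance = jobs.merge(general_ledger, on="branch_id", how="left")
ops_finance.to_parquet("gold/ops_finance.parquet", index=False)
```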

Integration hygiene

While the integration is running, you also have to make sure that it remains in a healthy state. A few challenges that you might encounter once the integration is live:

  1. Disconnect between the Data Engineering and app software teams - If the data pipelines fail, the app won't have the latest data. The app and DE teams need to stay in sync with respect to the health of the data pipelines.

  2. Data validation/reconciliation between the source application and your application - Let's assume there is an issue with data duplication in your pipelines. The pipelines continue to work, but the data in the source and destination will diverge very quickly. It is better to have checks in place to catch these errors before your customer reports them (see the reconciliation sketch after this list).

  3. The data source goes down.

  4. The data source changes the schema of the data fetched from the APIs.
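A sketch of the source-vs-destination reconciliation mentioned in point 2 above: compare row counts and a control total per day and flag divergence. The file paths and column names are assumptions.

```python
import pandas as pd

source = pd.read_parquet("bronze/netsuite_invoices.parquet")   # hypothetical source extract
destination = pd.read_parquet("gold/invoices.parquet")         # hypothetical destination table

# Aggregate both sides to a per-day row count and control total.
src = source.groupby("invoice_date").agg(rows=("invoice_id", "count"), total=("total_amount", "sum"))
dst = destination.groupby("invoice_date").agg(rows=("invoice_id", "count"), total=("total_amount", "sum"))

diff = src.join(dst, lsuffix="_src", rsuffix="_dst", how="outer").fillna(0)
mismatches = diff[(diff["rows_src"] != diff["rows_dst"]) | (diff["total_src"] != diff["total_dst"])]

if not mismatches.empty:
    # In practice this would raise an alert rather than just printing.
    print("Reconciliation mismatches found:\n", mismatches)
```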

ELT Data

ELT Data is a data ingestion tool (#2) and helps you reduce the time to go live with built-in automations for data ingestion.