Why Zero-ETL? Understanding the shift in Data Integration as a Beginner
Zero-ETL allows data from other data warehouses, data lakes or databases to be accessed directly in cloud systems: zero copy, without extract, transform and load.
When I was preparing for the Salesforce Data Cloud certification, I came across the term Zero-ETL. The Data Cloud offers the possibility to access data directly from other systems such as data warehouses or data lakes, or to share data with these systems, without the data being copied. Salesforce describes this as Bring Your Own Lake (BYOL), a nod to the term Bring Your Own Device (BYOD). I wanted to better understand the concept of Zero-ETL and illustrate it in an understandable way.
In this article, I’ll show you how you can create a simplified ETL process with Python to better understand this concept, what Zero-ETL or Zero-Copy means and how this new approach to data integration is implemented in the Salesforce Data Cloud.
1) Traditional ETL Process: Step-by-Step Guide with Python for Beginners
If you are already familiar with the ETL and ELT processes, you can skip this section. If you are new to this topic, take a look at the super simplified example to better understand the Extract — Transform — Load process. Or even better, build it yourself — by applying it, you will usually understand the concepts better.
1 — Extract
In a traditional ETL data processing pipeline, the data is collected from a source such as a database, an API, a JSON file, an XML file or another data warehouse.
For our example, we will first create a CSV file containing customer data. I have put together a file with sample data that contains the columns First name, Last name, Email address, Purchased product and Paid price. You can find the CSV file and the code on GitHub.
We then read the CSV file with pandas and display the first 5 lines:
import pandas as pd
import sqlite3
import matplotlib.pyplot as plt
# Step 1: Extract
# Reading data from the csv file
file_path = 'YOURPATH' # On Windows, use forward slashes (/) or escaped backslashes in the path
data = pd.read_csv(file_path)
print("Extracted data:")
print(data.head())
If you need help setting up a Python environment, it is best to read the steps in the article Python Data Analysis Ecosystem — A Beginner’s Roadmap. For the code to work, you need to have pandas and matplotlib installed; sqlite3 is part of Python’s standard library. If you are using Anaconda, you can enter the command ‘conda install pandas matplotlib’ in your Anaconda Prompt terminal.
2 — Transform
As soon as the data has been extracted, data transformations follow in the traditional ETL process. This can mean that column values are combined, calculations are performed, tables are merged or unnecessary information is removed.
For our example, we will carry out two simple transformations in this step. First, we create a new column that stores the full name, built from the first name and last name. Then we want to distinguish the customers who have spent a high amount from those who have spent less. To do this, we create a second new column, High_Payment, which contains Yes for all rows with an amount over 20 and No otherwise.
# Step 2: Transform
# Creating a new column Full Name by combining First Name and Last Name
data['Full Name'] = data['First Name'] + ' ' + data['Last Name']
# Creating a new column High_Payment with "Yes" if Price_Paid > 20, otherwise "No"
data['High_Payment'] = data['Price_Paid'].apply(lambda x: 'Yes' if x > 20 else 'No')
We display the first 5 rows again to check whether these transformations have been carried out successfully (two new columns, Full Name and High_Payment):
# Displaying the 5 first rows
print("Transformed data:")
print(data.head())
3 — Load
After transformation, the traditional ETL process involves loading the data into a platform for further analyses. For example, machine learning methods can be applied to the data or the data can be visualised for dashboards and reports.
For our example, we load the transformed data into an SQLite database in this step. SQLite is MySQL’s little sister, so to speak, and is well-suited for simple projects with small to medium-sized data volumes. Here we also carry out a small analysis on the data and visualise it.
# Step 3: Load
# Connecting to the SQLite database (or create it if it doesn't exist)
conn = sqlite3.connect('output_database.db')
# Loading the DataFrame into a new table in the SQLite database
data.to_sql('transformed_data', conn, if_exists='replace', index=False)
# Analysis: Identifying how many customers made high payments
high_payment_count = (data['High_Payment'] == 'Yes').sum()
print("Number of High Payments:", high_payment_count)
# Close the database connection
conn.close()
print("ETL process completed. Transformed data saved to 'output_database.db'.")
# Visualizing the data
data['Price_Paid'].hist(bins=10)
plt.title('Distribution of Prices Paid')
plt.xlabel('Price')
plt.ylabel('Frequency')
plt.show()
As you can see, the example is very simplified. Of course, much larger amounts of data are extracted in real projects, the transformations are usually much more complex and the data is typically loaded into systems such as other databases, data warehouses, data lakes or data visualisation tools.
So, what are some challenges of this traditional ETL process?
With this process, the data is not available in real time but is usually processed and copied in batches. Furthermore, the process consumes additional resources and therefore incurs higher costs. This is where the term Zero-ETL comes into play.
2) So what is Zero-ETL?
We live in an age of instant. Every message, every movie, every song must be available immediately at any time — thanks, of course, to the success of WhatsApp, Netflix and Spotify, to name just a few examples.
This is exactly the goal that cloud providers such as Amazon Web Services, Google Cloud and Microsoft Azure have set themselves: data should be processed and analysed almost in real time and without major delays.
Zero-ETL is a concept from data integration. Instead of requiring the explicit extraction, transformation and loading of data in separate steps, as is traditionally the case, data should flow seamlessly between different systems. The term was introduced by AWS in 2022 for the integration of Amazon Aurora into Amazon Redshift. What is new about this concept is that the technology makes it possible to use or analyse data directly in its original format and almost in real-time. There is no need to move data. Data latency is minimised. Data can be transformed and analysed within a single platform.
Imagine that traditional ETL is like collecting water in a bucket outside your house and then carrying it to your shower. Not only does this process take time and energy, but you may also spill water on the way. Okay, that’s how I showered when I spent 4 months in Tanzania 10 years ago ;) But normally you would most probably prefer to shower the way Zero-ETL would transport the water in our metaphor: Zero-ETL, on the other hand, means that you stand in the shower, turn on the tap and fresh water flows straight away. Instead of the water — or the data — having to be transported somewhere, it is available right there, even if it is stored in a different location.
If you would like to find out more about the technologies that make Zero-ETL possible, I recommend the blog article from DataCamp. The terms database replication, federated querying, data streaming and in-place data analytics are well explained there.
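Before diving into those terms, a tiny analogy in code may help. The following minimal sketch uses DuckDB purely as an illustration of in-place querying: SQL runs directly against the CSV file from the ETL example above, with no separate extract or load step. DuckDB is not one of the Zero-ETL integrations discussed here, and the column names are assumed to match the sample data.
import duckdb
# Illustration only: DuckDB queries the file where it lives, without an explicit
# extract-and-load step. Column names are assumed to match the sample CSV above.
high_payments = duckdb.sql("""
    SELECT "First Name", "Last Name", Price_Paid
    FROM 'customers.csv'
    WHERE Price_Paid > 20
""").df()
print(high_payments)
The same engine could also join a Parquet file or an attached database within the same query, which is the basic idea behind federated querying: the computation comes to the data instead of the data being copied to the computation.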
3) Why Zero-ETL? Advantages and Disadvantages
Companies want to minimise the time it takes for data to be available for analyses and further processing (e.g. in marketing or sales). Zero-ETL or zero-copy makes it possible for a system to access data from several different databases at the same time. Access to current data is particularly important — or at least very helpful — for machine learning features to precisely train models and achieve relevant predictions.
Advantages
No data replication: The data does not have to be physically copied from one system to another.
Possibility of using real-time data: What does this mean in concrete terms? For example, in the Data Lakehouse, data on shopping basket abandonments in the last 24 hours can be used directly in the marketing automation tool to send these customers personalised offers. This is certainly an advantage, especially in fast-moving industries such as e-commerce.
Fewer errors: As the data is no longer copied or transformed in a separate step, the risk of errors is lower. The system accesses the data where it lives, and the data remains in the source system.
Disadvantages
Network connections can lead to problems: If connections fail, are unstable or slow, delays can occur. It would then be more convenient to have a copy of the data in the system.
Complexity of data management: Managing data across multiple systems and possibly across multiple departments could be more difficult.
Dependence on cloud solutions: Many companies, especially medium-sized ones, have not necessarily implemented cloud technologies in their architectures. For such companies, it could be a relatively large effort to integrate Zero-ETL.
4) What does Zero-ETL look like in the Salesforce Data Cloud?
The Data Cloud is a customer data platform (CDP) from Salesforce that has integrated the Zero-ETL concept. This means that the Data Cloud can access data stored in different databases without having to move, copy or reformat this data. Conversely, data warehouses such as Snowflake or Google BigQuery can view and use data from the Data Cloud.
Data from the Data Cloud in the data warehouse
Let’s imagine that an electronics company stores all its lead and contact data in the Salesforce CRM. This CRM is linked to the Data Cloud so that the data can then be used in a marketing tool. The company stores data on online behaviour, customer service interactions and logistics in the Data Cloud. It also uses Calculated Insights to calculate the customer lifetime value (CLV), for example. The company also uses a data warehouse such as Snowflake. In it, the company stores transaction data on all products sold on the website, information on the products sold and data on deliveries and stock levels.
Instead of having to physically copy the data from the Data Cloud, the company can now create a virtual table in Snowflake that points directly to the data in the Salesforce Data Cloud. This means that the company can run queries on the data directly in Snowflake, even though it remains in the Data Cloud.
How can you do this?
Within the Data Cloud, you must first define the Data Lake Objects, Data Model Objects or Calculated Insights you want to share. You then set up data shares, which means linking these objects to the data share target, in our example Snowflake. In Snowflake, you create virtual tables that contain the structure and a reference to the actual data in the Data Cloud. You can then run queries in Snowflake on these virtual tables as if the data were stored locally.
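To make this more concrete, here is a minimal sketch of what querying such a shared object could look like from the Snowflake side with the Snowflake Python connector. The account details, database and object names below are hypothetical placeholders; the actual names depend on how the data share was set up in the Data Cloud.
import snowflake.connector
# Hypothetical connection details; replace them with your own account and credentials
conn = snowflake.connector.connect(
    account='my_account',
    user='my_user',
    password='my_password',
    warehouse='COMPUTE_WH',
    database='DATA_CLOUD_SHARE_DB',  # placeholder name for the database created from the share
    schema='PUBLIC'
)
try:
    cur = conn.cursor()
    # Query the shared object as if it were a local table; the rows themselves stay in the Data Cloud
    cur.execute("SELECT * FROM CUSTOMER_CLV_INSIGHT LIMIT 10")  # hypothetical shared object name
    for row in cur.fetchall():
        print(row)
finally:
    conn.close()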
Data from the data warehouse in the Data Cloud
To illustrate the other direction, let’s imagine a company that sells household appliances. Its data warehouse stores records of all purchases made online or in physical shops, information on all products in the range and data on the logistics chain. The company uses the Data Cloud to combine the data from Snowflake with the data from the attached CRM in the marketing product and to personalise its marketing campaigns more precisely with this data.
How can you do this?
Within the Data Cloud, you must first establish a connection to Snowflake. This is where the term ‘mounting’ comes into play: the Data Cloud can mount Snowflake tables as external data objects. Simply put, this means that the Data Cloud creates a virtual link to this data. Once the data is available as external objects in the Data Cloud, you can continue to use it as if you had ingested it via data streams (the normal way to load data into the Data Cloud).
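As a rough illustration of the idea, the sketch below queries such a mounted external object with plain SQL over HTTP. Please treat it as a conceptual sketch only: the endpoint path, payload shape and object name are assumptions rather than the documented Data Cloud Query API, so check the Salesforce documentation for the real interface.
import requests
# Conceptual sketch only: endpoint path, payload shape and object name are assumptions (placeholders)
INSTANCE_URL = 'https://your-data-cloud-instance.example.com'  # placeholder
ACCESS_TOKEN = 'YOUR_ACCESS_TOKEN'  # placeholder
# The mounted Snowflake table is queried alongside native Data Cloud data,
# even though the rows physically remain in Snowflake
sql = """
SELECT product_name, stock_level
FROM snowflake_products__dlm  -- hypothetical mounted external object
WHERE stock_level < 10
"""
response = requests.post(
    f"{INSTANCE_URL}/api/v2/query",  # assumed endpoint path
    headers={'Authorization': f'Bearer {ACCESS_TOKEN}'},
    json={'sql': sql}
)
response.raise_for_status()
print(response.json())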
This certainly allows a company to increase efficiency, reduce costs and access the latest data in real time. However, it can be more difficult to keep an overview of the data sources across the various systems. For example, I wonder which users should have which permissions and which data they should be able to access. Delays can also occur if network connections fail or if complex queries have to run across several data sources.

5) Final Thoughts
ETL, ELT, ETLT or Zero-ETL: We could discuss whether it is necessary for companies to have even faster access to all of a person’s data in order to be able to send even more personalised and up-to-date marketing emails, for example. But regardless of this sociological conclusion, companies want to be able to access their data from different systems with as little delay as possible. Zero-ETL technology makes this possible in cloud solutions, giving companies a competitive advantage, at least for the time being.
Where can you continue learning?