Anyone working with business intelligence, data science, data analysis or cloud computing will have come across SQL at some point. We can use it to extract, manipulate and analyze data — in relational databases as well as in modern cloud environments. For a Salesforce Data Cloud implementation, I had to brush up on my SQL and data modeling knowledge.
Fun Fact: Do you know which is the most commonly used SQL command?
Read to the end of the article and you’ll find out ;)
Content
1 — What are data lakes, data warehouses and data lakehouses?
2 — What is the difference to a business intelligence tool & cloud storages?
3 — Why SQL & Data Modeling is important for Data Lakehouses or specific tools such as Salesforce Data Cloud
4 — Basics of SQL & Data Modeling for Cloud Applications
5 — Differences between Salesforce Data Cloud and other Cloud Tools
6 — Key Use Cases of Data Lakehouses for Data Scientists
7. Final Thoughts
1 — What are data lakes, data warehouses and data lakehouses?
When we talk about data platforms or data architectures, we need to understand what data lakes, data warehouses and data lakehouses are. Since we live in a world in which there is more and more data — a few years ago it was always said that ‘data is the new gold’ — we consequently also need systems that can store, process and utilize large volumes of data. Before the emergence of these three terms, data was mainly stored in relational databases (for structured data). The world of data in companies has not only increased in volume but also in variety. Think about the data formats you encounter in your day-to-day work: structured data such as customer data, stock levels or sales figures, unstructured data such as emails, social media posts or support tickets, semi-structured data such as JSON and XML files. There are also sensor or IoT data from machines and devices, often recorded in real-time. Companies need flexible storage and processing tools for this diversity.
The concept of the data warehouse emerged in the late 1980s. The first warehouse is considered to be the ‘Information Warehouse’ developed by IBM. However, Bill Inmon is considered the actual father of the concept. Data lakes then emerged from around the 2010s because more flexible storage for unstructured data became necessary. And for a few years now, there have been data lakehouses that combine the flexibility of data lakes with the structure of a data warehouse.
So what exactly does what mean?
Data Warehouse
As the term suggests, this is a warehouse for data. You store structured and organized data in it. The data is already available in clearly defined formats, such as Excel tables. One example in a business intelligence tool is a company’s sales data for various products, which are processed every week for analysis.
Data Lake
This concept represents a large, open space (or lake…) for raw data. In addition to structured data, you can also store unstructured data here. For example, social media posts and images can be stored alongside sales data.
Data Lakehouse
And now we have a data lake in a warehouse. This concept combines the strengths of both previous data platforms. You can store structured data and perform quick analyses, but also store unstructured data at the same time. For example, in addition to sales figures, a company can also store customer conversations and photos in the system and prepare them for analysis.
The need for this data architecture has arisen primarily because warehouses are often too restrictive, while lakes are difficult to search. I think that data lakehouses are also needed, especially as AI features are increasingly being integrated into tools.

Where can you continue learning?
2 — What is the difference to…
… a business intelligence tool?
Data platforms are used to store and process data. BI tools such as Tableau or Power BI are used to analyze and visualize this data. For example, you can use a BI tool to access data from a data platform, but you mainly use it to create reports and dashboards. You use the data platforms to manage and structure the data.
… cloud storage?
Cloud storage is purely a storage space that is used to store data. Such a tool does not offer any direct processing tools, whereas a data platform has integrated tools for data processing, analysis, ETL processes and machine learning in addition to storage. Examples include Amazon S3, Google Cloud Storage and Microsoft Azure Blog Storage. However, these cloud storages offers features for data processing in connection with other tools.
3 — Why SQL & Data Modeling is important for Data Lakehouses or specific tools such as the Salesforce Data Cloud
The Salesforce Data Cloud is one such data lakehouse. The platform combines various data sources to give you a comprehensive view of the customer. You can use Data Streams to load data from the Salesforce CRM, from cloud storage such as Amazon S3 or via API into the Data Cloud. You can also connect marketing solutions such as the Marketing Cloud or the latest Salesforce product Marketing Cloud Growth in order to use segmentation and personalization in marketing. SQL is central in this solution to process data effectively, enabling structured queries and transformations. In addition to queries for writing, there is also a builder with which you can “click together” SQL.
SQL is the de facto standard for accessing relational databases and has been established since the 1970s. The syntax is descriptive (declarative) and relatively simple. In addition, the language is extremely efficient when querying and manipulating (structured data).
Are there alternatives?
Yes, there are. Alternatives are NoSQL databases such as MongoDB or Cassandra for unstructured data or GraphQL for flexible, optimized data queries in APIs. But SQL is still necessary for many applications.
Besides SQL, the second important term is data modeling. This means that you structure the data in the system in such a way that it is logically organized and easily accessible. In the Data Cloud (or also in BI tools), it is important to define the data relationships in order to subsequently enable efficient queries and analyses. If you use the Salesforce Data Cloud for Marketing Cloud Growth, another component is added. To create segments of customers to whom the company can send emails, you must also consider in the data modeling how the contact data is grouped into a unified profile and how you can then access the information from the other entities.
4 — Basics of SQL & Data Modeling for Cloud Applications
Structured Query Language (SQL) is essential for retrieving, editing and managing relational databases. It is also the language that is practically always used in cloud applications.
SQL-Basics: Query
SELECT: With this command we can retrieve data from one or more tables. To improve performance, only select the columns that you really need in the query.
SELECT column1, column2 FROM table;
WHERE: With this command we can specify conditions so that only certain data records are selected.
SELECT * FROM table WHERE condition;
JOIN: With a join, we can link two tables on the basis of a common column (e.g. ID). There are three different types: INNER JOIN, LEFT JOIN, RIGHT JOIN.
SELECT * FROM table1 JOIN table2 ON table1.id = table2.id;
SQL-Basics: Grouping
GROUP BY: With this command we can group data records and use aggregate functions such as COUNT, SUM, AVG. This allows us to categorize data according to certain criteria. You could use this command, for example, if you want to display the number of orders per customer.
SELECT column, COUNT(*) FROM table GROUP BY column;
HAVING: With this command we can filter groups according to a condition. HAVING works in a similar way to WHERE — but we use it for filters based on aggregates (COUNT, SUM).
SELECT column, COUNT(*) FROM table GROUP BY column HAVING COUNT(*) > 1;
SQL-Basics: Data Modification
INSERT: We use this command to insert new data records into a table. It is important that the data types of the inserted values correspond to the data types of the columns.
INSERT INTO table (column1, column2) VALUES (value1, value2);
UPDATE: This allows us to change existing data records. Based on the WHERE condition, the data records that fulfill this condition are updated. Without WHERE you update all data records — so use this command with caution.
UPDATE table SET column1 = value1 WHERE condition;
DELETE: Here we delete the data records from the table that fulfill the condition (without WHERE all data records are deleted).
DELETE FROM table WHERE condition;
SQL in the Salesforce Data Cloud
In the Data Cloud, SQL is used in the features in the foreground as well as in the background, e.g. for data preparation and integration. Features that use SQL include the following:
Data Streams: This feature is used to load data from the CRM (Sales Cloud / Service Cloud) or from Amazon S3 into the Data Cloud. The data is imported and transformed using SQL.

Data Explorer: Here you can visualize the data of individual data objects such as leads, contacts, individuals or all custom objects. You can also set filters to analyze the data in real-time, which is also done with SQL. Although you define filters with a mouse click, an SQL SELECT with a WHERE condition actually takes place in the background.
Segments: You use filters to create segments. This also works with SQL-like queries, but you drag and drop in fields from different data objects.

Data Modeling in the Salesforce Data Cloud
Before you load data into such a tool, it is best to consider the following points first:
Understand the customer’s requirements and the way in which the data has been structured and stored so far.
Analyze the data sources and consider where the data comes from (data streams / ingestion) and how it needs to be processed.
Plan the data structure by creating tables and defining relationships. For example, create an ERD model with Lucidchart and a data inventory.
Normalization: Organize the data in such a way that redundancies are minimized but the data can still be queried quickly. Explain more and picture
On top: Ensure data quality by integrating validation rules and cleansing processes. This will help you avoid incorrect or incomplete data.

Where can you continue learning?
What is an Entity Relationshop Diagram (ERD)? — Lucidchart Blog (no affiliate link)
5 — Differences between Salesforce Data Cloud and other Cloud Tools
The Salesforce Data Cloud is a data lakehouse. This platform focuses heavily on CRM data and real-time integration from other sources such as the Amazon Cloud. As described above, SQL-like queries are often used in the background for data preparation or data filtering. The Data Cloud offers tools for identity resolution and segmentation, which makes it valuable for marketing teams (in combination with a Marketing Cloud).
Alternatives to the Salesforce Data Cloud
Snowflake is a data warehouse solution that is widely used and is also used for queries and real-time data processing. It is more flexible for different data sources.
Databricks is a data lakehouse that integrates data from various sources. This data platform is mainly used for data engineering, machine learning and business intelligence.
Google BigQuery is a serverless data warehouse solution. This data platform is particularly optimized for large volumes of data and is suitable for highly scalable requirements.
Amazon Redshift is a data warehouse for efficiently storing large amounts of data and querying it via SQL. Amazon S3 is used as a data lake and supports many lakehouse applications. For example, you can set up a data stream from Amazon S3 to the Data Cloud.
6 — Key Use Cases of Data Lakehouses for Data Scientists
One main use case is ETL/ELT processes, in which data is extracted from various sources, cleansed and brought into a standardized structure. For example, data on sales documents, from social media and IoT devices can be integrated for analysis. Lakehouses are also suitable for machine learning as they provide large data sets for training predictive models. Einstein Analytics, for example, uses CRM data to forecast lead conversions or customer churn. Finally, business intelligence tools such as Tableau or Power BI can access real-time data from lakehouses and allow us data scientists to combine different data sets for reports and dashboards.
Where can you continue learning?
What is ETL (Extract, Transform, Load)? — YouTube Video from IBM Technology
ETL vs ELT: Understanding the Differences and Making the Right Choice — DataCamp Blog
The Definititve Guide to Salesforce Einstein AI — SF BEN Blog
7. Final Thoughts
And here is now the resolution for the fun fact:
The most common SQL command is “SELECT”.