Data Lakes may sound like the new Data Warehouses but there are some important differences and you need to choose the right one…
Data Warehouses and Data Lakes
You’ve no doubt heard the terms Data Warehouse and Data Lake thrown around, probably along with “Big Data” but what are they, what’s the difference and most importantly – which is right for you?
I’ll set out to answer that below, specifically through the lens of the Microsoft Azure environment, because we’re a Microsoft Partner and it’s our technology of choice but the concepts are applicable regardless of technology and I’ll try to point out some other options for each.
What is a Data Warehouse?
A data warehouse is a store for holding data from a variety of sources. There may be individual databases within the data warehouse holding data about different domains (for example marketing, production, compliance etc).
The warehouse itself holds structured data which can be organised and codified to follow standard rules so that data from different sources looks the same when it is queried (this is called “conforming” the data). Where performance is important, it may also hold a presentation layer where the data has been further transformed for fast and efficient consumption by an end-use application.
Applying structure like this tends to mean that a data warehouse holds more than one copy of the data, with increasing levels of structure applied. This is because it is generally desirable to hold the raw, untransformed data, in case there is a need to revise of adjust the business logic in future, plus a conformed set, and sometimes a further optimised presentation layer.
Within Azure, the Azure SQL, Cosmos DB and Azure Data Factory platforms or the relational database elements of Azure Synapse platform are examples of data warehouse components (although we’ll talk more about Synapse later) – holding data in relational tables that can be accessed via SQL.
Other mature data warehouse technologies include Oracle, Teradata, MySQL and Postgres SQL as well as to No SQL technologies such as MongoDB.
The main organisational approach in a data warehouse is to apply structure on load and hence it follows the Extract, Transform, Load (ETL) model where raw data is extracted from source, transformed to fit the database model and then loaded to it.
What is a Data Lake?
A data lake can be thought of as a semi-organised store for unstructured data. This may include image or video files, text files, data from sensor telemetry, log files or any other form of data that does not lend itself to being organised into tables of columns and rows or is too large or too ephemeral to be efficient to do so.
Within Azure, the main data lake offering is Azure Synapse, which includes BLOB (Binary Large Object – that’s files to you and I!) storage for files, images and videos and Azure Streaming Hubs. A data lake differs from a simple file storage area in that it allows the contents to be interrogated using a query language (SQL / HQL or same variant). For example, it allows values to be extracted from JSON files and worked with directly without having to load the data into a database first.
Examples of other data lake technologies include Hadoop and Amazon S3.
The main organisational approach in a data lake is to apply structure on query and hence it follows the Extract, Load, Transform (ELT) paradigm where the data is extracted from source, loaded as is, and transformed to fit the user’s requirements on query.
Data lakes are very efficient for storing large volumes of unstructured or raw data because they tend not to duplicate data – instead holding everything once in raw form. For this reason, you will often hear about them in the context of Big Data, because these datasets are typically too large to efficiently store in a structured manner (and in the case of realtime streaming data, it may take too long to insert each data point into a warehouse)
Data Lakes are also fantastic for Data Scientists, who will want to explore a lot of raw data, before deciding which elements are useful.
However, the trade-off is that with transformation being required on read, it makes data lakes generally less performant for reading the data and places the onus on application developers or business users to apply the correct business logic to interpret the data – which can lead to differing answers if applied inconsistently.
Which do I need?
There are clearly pros and cons of each approach and for that reason sometimes raw data is landed in a data lake and has a subset of the most critical and frequently used data organised and structured into a warehouse. Data Lake query languages support the ability to query across both the structured and unstructured parts of the data in a single step. This hybrid model is sometimes called a Data Lakehouse. This is where Azure Synapse works particularly well since it contains both warehouse and lake components, and advanced analytics and modelling tools all under one hood.
There’s a temptation to infer that Data Lakes are modern and Data Warehouses are old-fashioned, but in reality the both solve different problems. If you need to store vast quantities of unstructured and potentially undocumented data which can later be used to inform business strategy, or otherwise used then a Data Lake is the best solution. On the other hand, if your main concern is storing data that can be used for business decisions today then a Data Warehouse will serve you better.
Increasingly, we are seeing a demand from our clients for a Data Lakehouse type solution, with their core business data curated and structured into a properly governed data warehouse, and a larger set of data being more loosely collected and stored in a Data Lake so that it can be used in future, and potentially brought into the Warehouse as it grows.
That said, from a commercial perspective, if you can only afford one thing, then our advice remains to build the Data Warehouse first. That’s the platform that is going to most immediately serve your business with reporting and analysis capabilities, and provide you with a stable and optimised dataset that will support your Power BI dashboards and business applications. It also remains the case that for most loads Azure SQL will work out lower cost than Azure Synapse for the same set of structured data.
There are very few use cases where building a Data Lake instead of a Warehouse is the right approach – over the long term the inefficiencies around usage, lack of governance and need to push all the business logic to the end user is likely to catch on with you if you are taking this approach for production data. However, if you can afford both, then we’d always recommend keeping the broader set of unstructured data available for future R&D (subject to GDPR considerations of course – but that’s a different topic!)
Find Out More
Do you want to know more about data warehouses and data lakes? Get in touch.
Find out how to make your business more efficient and the trends and tech that we see emerging in the market.
Why is Excel so hard to kill when it comes to reporting processes? There are 5 main reasons, but a solution is in sight…
Many industries including banking and financial services, aviation, food and pharmaceuticals have strict mandatory requirements to produce and provide data or reporting to their regulatory bodies. Beyond that, many more organisations have other statutory reporting to shareholders, certification bodies and others.
Biodiversity reporting is on the way but many business haven’t yet thought beyond their climate related reporting obligations. The data challenges of biodiversity reporting are far more complex and here’s why…