Data Warehouse vs. Data Lake - What's the Difference?
Dibyendu Dasgupta
Technical Consultant Data & AI
June 3rd, 2021
As a technical consultant working on data and analytics projects, I often talk to clients who want to know more about the finer details of their solution, including how and where data will be stored. One of the top questions I'm asked by these non-technical stakeholders is "What's the difference between a data warehouse and a data lake?", closely followed by "Which is better?" It's true that data lakes and data warehouses are both places for storing data. But each has a unique purpose, as well as distinct strengths and weaknesses. In this blog post, I'll detail the role and capabilities of data warehouses and data lakes, explain how they differ, and explore when to use one over the other.
What is a data warehouse?
A data warehouse stores structured data, meaning data that is highly organised and searchable, such as Excel spreadsheets or customer contact information. Before data can be stored in a data warehouse, you have to know exactly what you're storing, where it's being stored, and how to retrieve or search that data when you need it. The information stored in a data warehouse is already cleaned and transformed. It can be easily accessed by anyone in the organisation - provided they know what they're looking for. Data warehouses have been around for a long time. They're predictable, reliable, and a familiar concept in most business environments.
Strengths
- Fast indexing, cataloguing, and search
- Easy to run and maintain
- Accessible to business users
Weaknesses
- High data storage costs, so not ideal for storing large data volumes
- Not suitable for storing semi-structured and unstructured data
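To make the "know exactly what you're storing" point concrete, here is a minimal sketch in Python. It uses an in-memory SQLite database purely as a stand-in for a warehouse, and the table, column names and sample rows are hypothetical. The idea it illustrates is that the structure is declared before any data is loaded, which is what makes search straightforward and what causes badly shaped records to be rejected.

```python
import sqlite3

# In-memory SQLite stands in for a data warehouse (an assumption for illustration only).
conn = sqlite3.connect(":memory:")

# The structure is declared up front, before any row is stored.
conn.execute(
    """
    CREATE TABLE customer_contacts (
        customer_id INTEGER PRIMARY KEY,
        name        TEXT NOT NULL,
        email       TEXT NOT NULL
    )
    """
)
# An index on the column we expect to search by.
conn.execute("CREATE INDEX idx_contacts_email ON customer_contacts (email)")

# Hypothetical, already-cleaned rows that match the declared structure.
conn.executemany(
    "INSERT INTO customer_contacts (customer_id, name, email) VALUES (?, ?, ?)",
    [(1, "Asha Rao", "asha@example.com"), (2, "Ben Lee", "ben@example.com")],
)

# Searching is simple because we know where everything lives.
found = conn.execute(
    "SELECT name FROM customer_contacts WHERE email = ?",
    ("ben@example.com",),
).fetchall()

# A row that does not fit the declared structure (no email) is rejected.
try:
    conn.execute(
        "INSERT INTO customer_contacts (customer_id, name) VALUES (3, 'No Email')"
    )
    rejected = False
except sqlite3.IntegrityError:
    rejected = True

print(found, rejected)
```

A production warehouse would differ in scale and tooling, but the same pattern applies: define the structure first, then load data that conforms to it.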
What is a data lake?
Compared to data warehouses, data lakes are much less fussy about what data they store. A data lake can store structured data, as well as unstructured and semi-structured data. Unstructured data includes things like videos, sensor data or social media posts, while semi-structured data is loosely organised data, such as XML and JSON files. If you have an internal data science team, chances are you already have a data lake. Designed specifically to handle big data volumes, data lakes retain all data and allow for deep analysis like predictive modelling and statistical analysis. (For more about data lakes and big data, you might like to read my previous blog post, Data Lake Analytics.)
Strengths
- Lower storage costs
- Can store all types of data
- Easy to scale
- Enables deep analysis
Weaknesses
- Difficult and inefficient for understanding relationships between data sets
- Lacks a built-in query engine, so it's not optimised for query performance
- Relies on a good folder structure, which is critical for both generating reports directly from the data lake and querying data across multiple files
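The folder structure point can be sketched in a few lines of Python. This is only an illustration under stated assumptions: a temporary local folder stands in for lake storage, and the `events/year=.../month=...` layout, file names and sensor records are all hypothetical. It shows raw JSON records with differing fields being stored as-is, and how a predictable folder convention lets you read one slice of the data or everything across multiple files.

```python
import json
import tempfile
from pathlib import Path

# A temporary local folder stands in for data lake storage (assumption).
lake = Path(tempfile.mkdtemp())

# Hypothetical raw events; records do not need to share the same fields.
raw_events = {
    "events/year=2021/month=05/part-0.json": [
        {"device": "sensor-a", "temp_c": 21.5},
    ],
    "events/year=2021/month=06/part-0.json": [
        {"device": "sensor-a", "temp_c": 23.0},
        {"device": "sensor-b", "temp_c": 19.0, "battery": 0.8},
    ],
}

# Store the data as-is; no structure is declared before writing.
for rel_path, records in raw_events.items():
    target = lake / rel_path
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_text(json.dumps(records))


def read_partition(root, year, month):
    """Load every record stored under one year/month folder."""
    folder = root / "events" / f"year={year}" / f"month={month:02d}"
    rows = []
    for file in sorted(folder.glob("*.json")):
        rows.extend(json.loads(file.read_text()))
    return rows


# Read a single slice by going straight to its folder.
june = read_partition(lake, 2021, 6)
june_avg = sum(row["temp_c"] for row in june) / len(june)

# Read across multiple files by relying on the same folder convention.
all_rows = []
for file in sorted(lake.glob("events/year=*/month=*/*.json")):
    all_rows.extend(json.loads(file.read_text()))

print(len(june), june_avg, len(all_rows))
```

Without a consistent layout like this, every report would have to open and inspect each file to work out what it contains, which is where much of the inefficiency comes from.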