Article by Ankit Sharma, Big Data and Cloud Senior Solution Architect
In the world of data management platform architecture, we are all familiar with the Data Warehouse, popular since the era of the 1980s, and the Data Lake, which gained a lot of traction over the last decade, starting around 2011. In early 2020, Databricks described a new data management paradigm: the Lakehouse. As the name suggests, this concept combines the best of both worlds, Data Lake + Data Warehouse.
In this blog we will discuss what is good and bad about the Data Warehouse and the Data Lake, and how the Lakehouse achieves the best of these two approaches.
A data warehouse is a central source of information that can be analyzed to make better-informed business decisions. It stores data from many heterogeneous sources in a single place, transformed to fit a decision support system that contributes to business decision making. Modern data warehousing has evolved into a multi-platform, distributed approach to meet enterprise business needs, and MPP (massively parallel processing) architectures help organizations handle large data sets.
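The MPP idea mentioned above, splitting a large scan across many nodes and combining partial results, can be sketched in a few lines of plain Python. This is a toy illustration only; the partition layout and "node" workers are hypothetical and do not reflect any particular engine's API.

```python
# Toy sketch of MPP-style aggregation: each "node" scans only its own
# partition, and a coordinator combines the partial results.
from concurrent.futures import ThreadPoolExecutor

sales = list(range(1, 101))                    # fact rows: amounts 1..100
partitions = [sales[i::4] for i in range(4)]   # hash-style split over 4 "nodes"

def node_partial_sum(partition):
    """Each node computes a partial aggregate over its partition."""
    return sum(partition)

with ThreadPoolExecutor(max_workers=4) as pool:
    partials = list(pool.map(node_partial_sum, partitions))

total = sum(partials)                          # coordinator merges partials
print(total)                                   # 5050, same as a serial scan
```

The key property is that no node ever needs to see the whole data set, which is what lets MPP warehouses scale scans with the number of nodes.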
But while data warehouses are great for structured data, in the world of IoT many enterprises have to deal with unstructured and semi-structured data of high variety, velocity, and volume. And with flexibility limited to SQL-like tools, warehouses offer no data science or machine learning capabilities. Supporting schema-on-write makes the warehouse less efficient for schema evolution, and storage cost adds another drawback. Modern cloud-based EDWs address most of these challenges, but some still remain, the obvious one being the lack of unstructured data support.
The idea of a data lake is basically a dumping ground for raw data in a variety of formats in a lower-cost storage environment. Schema-on-read provides high agility and ease of data capture, and support for unstructured and semi-structured data allows enterprises to avoid rejecting any data inflow. Compatibility with data science and machine learning tools gives data scientists more power to extract insights and build models from it.
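The schema-on-write versus schema-on-read contrast in the last two paragraphs can be sketched in plain Python. This is an illustrative toy, not any product's API: a warehouse-style write validates records against a fixed schema at ingest time and rejects mismatches, while a lake-style write accepts any payload and applies the schema only when the data is read.

```python
# Toy contrast of schema-on-write (warehouse) vs schema-on-read (lake).
# All names here are hypothetical; no warehouse or lake API is implied.
import json

SCHEMA = {"device_id": str, "temperature": float}

def write_to_warehouse(table, record):
    """Schema-on-write: validate at ingest time, reject on mismatch."""
    if set(record) != set(SCHEMA) or not all(
        isinstance(record[k], t) for k, t in SCHEMA.items()
    ):
        raise ValueError(f"record does not match schema: {record}")
    table.append(record)

def write_to_lake(store, raw_bytes):
    """Schema-on-read: accept any raw payload as-is, never reject."""
    store.append(raw_bytes)

def read_from_lake(store):
    """Apply the schema only at query time; skip unparseable rows."""
    rows = []
    for blob in store:
        try:
            rec = json.loads(blob)
            rows.append({k: t(rec[k]) for k, t in SCHEMA.items()})
        except (ValueError, KeyError, TypeError):
            continue  # malformed data stays in the lake but is skipped
    return rows

warehouse, lake = [], []
write_to_warehouse(warehouse, {"device_id": "d1", "temperature": 21.5})
write_to_lake(lake, b'{"device_id": "d2", "temperature": "19.0"}')
write_to_lake(lake, b"not json at all")  # accepted; rejected nowhere
print(len(warehouse), read_from_lake(lake))
```

The agility and the risk of the lake come from the same property: the malformed record above is captured without complaint, and it is the reader's problem at query time.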
But while it can store data of any variety and volume, the data lake lacks some critical features:
- No support for transactions (ACID properties).
- Weak data quality guarantees and poor support for BI.
- Lack of consistency makes appends and reads hard.
- With multiple sources dumping huge volumes of data, ungoverned and uncatalogued data turns the data lake into a data swamp.
Although the data lake is good for data science and machine learning, since it enables deep analytics, it misses operational use cases where data must be structured to produce key metrics and reports. Hence, to get business intelligence, enterprises need to load data from the data lake into a data warehouse.
Enterprises need the BI and reporting capabilities of the warehouse, so they should be able to bring data, in an open format, into one place easily and cost-effectively.
The Data Lakehouse is a hybrid concept that offers the key features of both a data lake and a data warehouse. It enables a new system design: implementing data structures and data management features similar to a warehouse directly on the kind of low-cost storage used for data lakes. This approach saves a lot of operational cost, as it eliminates the ETL/ELT workloads that move data from the data lake into the data warehouse; the query engine now queries the data lake directly. Along with this aspect, the Lakehouse offers some other key features:
- Transaction support: ACID transactions, since multiple data pipelines may need to read and write concurrently without compromising data integrity.
- Schema enforcement and governance: the Lakehouse should support schema enforcement and evolution, and provide robust governance and auditing mechanisms.
- BI tools directly over source data: the Lakehouse enables BI tools to work directly on source data, reducing the time taken from the raw zone to visualization.
- Compute-storage decoupling: compute and storage use separate clusters, enabling more concurrent users and larger data sets.
- Openness: open, standardized storage formats with APIs, so a variety of tools and engines can access the data directly.
- Support for diverse workloads: SQL and analytics alongside data science and machine learning.
- Support for diverse data types: access to a broad range of data, including files, video, audio, system logs, semi-structured data, and text.
- End-to-end streaming: support for streaming analytics, which enables real-time reporting.
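The first two bullets, ACID transactions and schema enforcement on cheap file storage, can be made concrete with a minimal sketch of the idea behind formats like Delta Lake and Apache Hudi: data files are immutable, and a single append-only transaction log decides which files belong to the table. This is a hypothetical toy, not either project's actual protocol or API.

```python
# Toy log-structured table: immutable data files plus an append-only
# commit log. Readers only see files referenced by committed log entries,
# so a writer that fails mid-write leaves no partial data visible.

class LakehouseTable:
    def __init__(self, schema):
        self.schema = schema   # column name -> expected type
        self.files = {}        # "storage": file name -> list of rows
        self.log = []          # append-only commit log

    def commit(self, file_name, rows):
        """Enforce the schema, then make the file visible atomically."""
        for row in rows:
            if set(row) != set(self.schema) or not all(
                isinstance(row[k], t) for k, t in self.schema.items()
            ):
                raise ValueError(f"schema violation: {row}")
        self.files[file_name] = rows         # write the data file
        self.log.append({"add": file_name})  # single atomic log append

    def read(self):
        """Readers replay the log; uncommitted files are invisible."""
        rows = []
        for entry in self.log:
            rows.extend(self.files[entry["add"]])
        return rows

table = LakehouseTable({"device_id": str, "temperature": float})
table.commit("part-0001", [{"device_id": "d1", "temperature": 21.5}])

# An orphaned data file with no log entry never appears in a read,
# which is what keeps a half-finished job from corrupting the table.
table.files["part-9999"] = [{"device_id": "dx", "temperature": 0.0}]
print(table.read())
```

Because the commit is a single log append, concurrent readers see either the whole file or none of it, which is the essence of the transaction support the Lakehouse adds on top of plain object storage.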
The Data Lakehouse is an emerging architecture that combines the benefits of the data warehouse and the data lake. Microsoft's Azure Synapse Analytics service with Databricks enables a similar Lakehouse pattern. BigQuery and Redshift Spectrum also provide some Lakehouse features, but primarily for BI and SQL applications. Delta Lake and Apache Hudi provide transactional features on open storage formats. The next step would be to leverage Delta Engine with Delta Lake, which can eventually give more power to the Lakehouse concept. These are early examples, as the Lakehouse concept itself is still at an early stage; with the contributions of data engineers, this innovation will become available to a larger audience.